Zejun Ma
38 papers · 2021–2026 · 11 top CS/AI conferences
Achievements
Keyword Pioneer
Taxonomy Completionist (21)
Renaissance Researcher (6)
Interdisciplinary Bridge
Conference Polyglot (10)
Conference Loyalist (21)
Dynamic Duo (11)
Keyword Champion (2)
Keyword Collector (59)
Unstoppable (5)
Prolific Year (5)
The Questioner
Century Club (37)
Conferences
INTERSPEECH (21)
ICLR (5)
ICML (3)
IJCAI (2)
AAAI (1)
ACL (1)
CVPR (1)
ECCV (1)
EMNLP (1)
ICCV (1)
NAACL (1)
Research topics
Keywords
automatic speech recognition (6)
speech recognition (3)
domain adaptation (3)
non-native speech (2)
internal language model (2)
large language model (2)
shallow fusion (2)
connectionist temporal classification (2)
word error rate (2)
end-to-end speech recognition (2)
visual question answering (2)
zero-shot learning (2)
voice conversion (2)
attention mechanism (2)
data augmentation (2)
video understanding (2)
end-to-end model (2)
word timing (2)
knowledge distillation (1)
self-supervised learning (1)
Papers
MMSearch-R1: Incentivizing LMMs to Search
ACL 2026
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
CVPR 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
EMNLP 2025
LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
ICLR 2025
Improving LLM Video Understanding with 16 Frames Per Second
ICML 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
ICML 2025
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
ICML 2024
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
ICLR 2024
PolyVoice: Language Models for Speech to Speech Translation
ICLR 2024
SALMONN: Towards Generic Hearing Abilities for Large Language Models
ICLR 2024
Can Large Language Models Understand Spatial Audio?
INTERSPEECH 2024
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
INTERSPEECH 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
ICLR 2024
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency
ECCV 2024
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer
INTERSPEECH 2023
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring
INTERSPEECH 2023
Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment
INTERSPEECH 2023
Knowledge Distillation Approach for Efficient Internal Language Model Estimation
INTERSPEECH 2023
S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion
INTERSPEECH 2023
Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition
INTERSPEECH 2023
Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition
INTERSPEECH 2023
AudioQR: Deep Neural Audio Watermarks For QR Code
IJCAI 2023
GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech
INTERSPEECH 2023
Virtual Try-On with Pose-Garment Keypoints Guided Inpainting
ICCV 2023
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation
INTERSPEECH 2023
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition
INTERSPEECH 2023
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR
INTERSPEECH 2022
BiFSMN: Binary Neural Network for Keyword Spotting
IJCAI 2022
Zero-Shot Audio Source Separation through Query-Based Learning from Weakly-Labeled Data
AAAI 2022
Bring dialogue-context into RNN-T for streaming ASR
INTERSPEECH 2022
Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire
INTERSPEECH 2022
Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding
INTERSPEECH 2022
A Transfer and Multi-Task Learning based Approach for MOS Prediction
INTERSPEECH 2022
Improving Contextual Representation with Gloss Regularized Pre-training
NAACL 2022
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition
INTERSPEECH 2021
HMM-Free Encoder Pre-Training for Streaming RNN Transducer
INTERSPEECH 2021
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams
INTERSPEECH 2021
Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation
INTERSPEECH 2021