Zejun Ma

38 papers · 2021–2026 · 11 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (21) 🌈 Renaissance Researcher (6) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (10) 🏠 Conference Loyalist (21) 🤝 Dynamic Duo (11) 🏆 Keyword Champion (2) 🗃️ Keyword Collector (59) 🔥 Unstoppable (5) ⚡ Prolific Year (5) ❓ The Questioner 💎 Century Club (37)

Conferences

INTERSPEECH (21) ICLR (5) ICML (3) IJCAI (2) AAAI (1) ACL (1) CVPR (1) ECCV (1) EMNLP (1) ICCV (1) NAACL (1)

Top co-authors

Wei Li (12) Lu Lu (10) Xiang Yin (9) Changli Tang (6) Guangzhi Sun (6) Jun Zhang (6) Chao Zhang (6) Chunfeng Wang (5) Xianzhao Chen (5) Yi He (5)

Research topics

Speech & Audio (1)

Keywords

automatic speech recognition (6) speech recognition (3) domain adaptation (3) non-native speech (2) internal language model (2) large language model (2) shallow fusion (2) connectionist temporal classification (2) word error rate (2) end-to-end speech recognition (2) visual question answering (2) zero-shot learning (2) voice conversion (2) attention mechanism (2) data augmentation (2) video understanding (2) end-to-end model (2) word timing (2) knowledge distillation (1) self-supervised learning (1)

Papers

MMSearch-R1: Incentivizing LMMs to Search · ACL 2026
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale · CVPR 2025
Audio-centric Video Understanding Benchmark without Text Shortcut · EMNLP 2025
LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models · ICLR 2025
Improving LLM Video Understanding with 16 Frames Per Second · ICML 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model · ICML 2025
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models · ICML 2024
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis · ICLR 2024
PolyVoice: Language Models for Speech to Speech Translation · ICLR 2024
SALMONN: Towards Generic Hearing Abilities for Large Language Models · ICLR 2024
Can Large Language Models Understand Spatial Audio? · INTERSPEECH 2024
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning · INTERSPEECH 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis · ICLR 2024
RePOSE: 3D Human Pose Estimation via Spatio-Temporal Depth Relational Consistency · ECCV 2024
Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer · INTERSPEECH 2023
Phonetic and Prosody-aware Self-supervised Learning Approach for Non-native Fluency Scoring · INTERSPEECH 2023
Disentangling the Contribution of Non-native Speech in Automated Pronunciation Assessment · INTERSPEECH 2023
Knowledge Distillation Approach for Efficient Internal Language Model Estimation · INTERSPEECH 2023
S2CD: Self-heuristic Speaker Content Disentanglement for Any-to-Any Voice Conversion · INTERSPEECH 2023
Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition · INTERSPEECH 2023
Language-specific Boundary Learning for Improving Mandarin-English Code-switching Speech Recognition · INTERSPEECH 2023
AudioQR: Deep Neural Audio Watermarks For QR Code · IJCAI 2023
GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech · INTERSPEECH 2023
Virtual Try-On with Pose-Garment Keypoints Guided Inpainting · ICCV 2023
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation · INTERSPEECH 2023
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition · INTERSPEECH 2023
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR · INTERSPEECH 2022
BiFSMN: Binary Neural Network for Keyword Spotting · IJCAI 2022
Zero-Shot Audio Source Separation through Query-Based Learning from Weakly-Labeled Data · AAAI 2022
Bring dialogue-context into RNN-T for streaming ASR · INTERSPEECH 2022
Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire · INTERSPEECH 2022
Towards high-fidelity singing voice conversion with acoustic reference and contrastive predictive coding · INTERSPEECH 2022
A Transfer and Multi-Task Learning based Approach for MOS Prediction · INTERSPEECH 2022
Improving Contextual Representation with Gloss Regularized Pre-training · NAACL 2022
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition · INTERSPEECH 2021
HMM-Free Encoder Pre-Training for Streaming RNN Transducer · INTERSPEECH 2021
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams · INTERSPEECH 2021
Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation · INTERSPEECH 2021