Rongjie Huang

48 papers · 2021–2025 · 10 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌍 Conference Polyglot (10) 🗺️ Taxonomy Completionist (10) 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird

🤝 Dynamic Duo (44) 👑 Triple Crown 🏆 Grand Slam 👥 Mega-Team (20) 🔬 Deep Specialist (15) 🗃️ Keyword Collector (167) 📈 Trend Setter ⚡ Prolific Year (22) ❓ The Questioner 💎 Century Club (48) 🔥 Unstoppable (5)

Conferences

ACL (17) NIPS (9) ICLR (7) ICML (5) AAAI (3) EMNLP (3) ICCV (1) IJCAI (1) INTERSPEECH (1) NAACL (1)

Top co-authors

Zhou Zhao (44) Zehan Wang (14) Yi Ren (14) Xize Cheng (14) Jinglin Liu (13) Ruiqi Li (13) Zhenhui Ye (12) Jinzheng He (11) Yongqi Wang (10) Ziyue Jiang (10)

Keywords

speech synthesis (11) singing voice synthesis (9) diffusion model (6) zero-shot learning (4) self-supervised learning (4) style transfer (4) voice conversion (4) contrastive learning (4) speech-to-speech translation (3) diffusion transformer (3) generative model (3) multimodal learning (3) cross-modal learning (3) discrete representation (2) speech-to-singing conversion (2) music generation (2) cross-modal alignment (2) flow matching (2) audio-visual speech (2) denoising diffusion probabilistic model (2)

Papers

Versatile Framework for Song Generation with Prompt-based Control · EMNLP 2025
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization on Multi-party Conversation · ACL 2025
FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation · ACL 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video · ICML 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching · AAAI 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling · ICLR 2025
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation · ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup · ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces · ICLR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? · ICLR 2025
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment · ACL 2024
Robust Singing Voice Transcription Serves Synthesis · ACL 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners · ACL 2024
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer · ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing · ACL 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion · ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation · ACL 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt · NAACL 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes · NIPS 2024
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner · NIPS 2024
Extending Multi-modal Contrastive Representations · NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers · NIPS 2024
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence · NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching · NIPS 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT · NIPS 2024
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis · AAAI 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head · AAAI 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control · EMNLP 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis · ICLR 2024
InstructSpeech: Following Speech Editing Instructions via Large Language Models · ICML 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion · ICML 2024
UniAudio: Towards Universal Audio Generation with Large Language Models · ICML 2024
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech · ACL 2023
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation · ICLR 2023
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models · ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation · ACL 2023
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer · EMNLP 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition · ICCV 2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment · ACL 2023
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis · ACL 2023
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis · ACL 2023
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training · ACL 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation · ACL 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models · ICML 2023
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech · NIPS 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis · IJCAI 2022
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus · NIPS 2022
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model · INTERSPEECH 2021