Rongjie Huang
48 papers · 2021–2025 · 10 conferences · across top CS/AI conferences
Achievements
Keyword Pioneer
Conference Polyglot (10)
Taxonomy Completionist (10)
Interdisciplinary Bridge
Hot Topic Early Bird
Dynamic Duo (44)
Triple Crown
Grand Slam
Mega-Team (20)
Deep Specialist (15)
Keyword Collector (167)
Trend Setter
Prolific Year (22)
The Questioner
Century Club (48)
Unstoppable (5)
Conferences
ACL (17)
NIPS (9)
ICLR (7)
ICML (5)
AAAI (3)
EMNLP (3)
ICCV (1)
IJCAI (1)
INTERSPEECH (1)
NAACL (1)
Keywords
speech synthesis (11)
singing voice synthesis (9)
diffusion model (6)
zero-shot learning (4)
self-supervised learning (4)
style transfer (4)
voice conversion (4)
contrastive learning (4)
speech-to-speech translation (3)
diffusion transformer (3)
generative model (3)
multimodal learning (3)
cross-modal learning (3)
discrete representation (2)
speech-to-singing conversion (2)
music generation (2)
cross-modal alignment (2)
flow matching (2)
audio-visual speech (2)
denoising diffusion probabilistic model (2)
Papers
Versatile Framework for Song Generation with Prompt-based Control
EMNLP 2025
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization on Multi-party Conversation
ACL 2025
FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation
ACL 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video
ICML 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
AAAI 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
ICLR 2025
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation
ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
ICLR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
ICLR 2025
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment
ACL 2024
Robust Singing Voice Transcription Serves Synthesis
ACL 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
ACL 2024
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing
ACL 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
ACL 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
NAACL 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes
NIPS 2024
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
NIPS 2024
Extending Multi-modal Contrastive Representations
NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
NIPS 2024
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
NIPS 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
NIPS 2024
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
AAAI 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
AAAI 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
EMNLP 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
ICLR 2024
InstructSpeech: Following Speech Editing Instructions via Large Language Models
ICML 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
ICML 2024
UniAudio: Towards Universal Audio Generation with Large Language Models
ICML 2024
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech
ACL 2023
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
ICLR 2023
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation
ACL 2023
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
EMNLP 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
ACL 2023
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis
ACL 2023
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
ACL 2023
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training
ACL 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
ACL 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
ICML 2023
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
NIPS 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
IJCAI 2022
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus
NIPS 2022
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
INTERSPEECH 2021