Joon Son Chung

40 papers · 2017–2026 · 9 top CS/AI conferences

Achievements

🌍 Conference Polyglot (9) 🏃 Academic Marathon (8) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (10) 🌈 Renaissance Researcher (7) 🗺️ Taxonomy Completionist (44) 🏠 Conference Loyalist (23) 🏆 Keyword Champion (3) 👥 Mega-Team (34) 🤝 Dynamic Duo (10) 🔬 Deep Specialist (11) 🧬 Topic Evolution ⚡ Prolific Year (7) 🔥 Unstoppable (9) 📈 Trend Setter 💎 Century Club (39) 🗃️ Keyword Collector (137) ❓ The Questioner (3)

Conferences

INTERSPEECH (23) CVPR (6) AAAI (2) ECCV (2) ICCV (2) ICLR (2) EMNLP (1) ICML (1) WACV (1)

Top co-authors

Andrew Zisserman (10) Bong-Jin Lee (7) Triantafyllos Afouras (7) Hee-soo Heo (7) Arda Senocak (6) Jee-weon Jung (6) Ji-Hoon Kim (5) Youngki Kwon (4) Jaehun Kim (3) You Jin Kim (3)

Keywords

speaker verification (8) self-supervised learning (5) speaker recognition (5) cross-modal learning (4) multimodal learning (4) lip reading (4) speaker diarization (3) speech synthesis (3) convolutional neural network (3) speaker diarisation (3) visual speech recognition (2) sound source localization (2) audio classification (2) curriculum learning (2) cross-modal retrieval (2) face recognition (2) flow matching (2) representation learning (2) video generation (2) embedding learning (2)

Papers

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence · AAAI 2026
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models · ICLR 2025
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing · EMNLP 2025
High-Quality Joint Image and Video Tokenization with Causal VAE · ICLR 2025
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech · CVPR 2025
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes · CVPR 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models · ICCV 2025
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos · AAAI 2024
Scaling Up Video Summarization Pretraining with Large Language Models · CVPR 2024
Faces that Speak: Jointly Synthesising Talking Face and Speech from Text · CVPR 2024
Towards Automated Movie Trailer Generation · CVPR 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning · ICML 2024
Lightweight Audio Segmentation for Long-form Speech Translation · INTERSPEECH 2024
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching · INTERSPEECH 2024
VoxSim: A perceptual voice similarity dataset · INTERSPEECH 2024
To what extent can ASV systems naturally defend against spoofing attacks? · INTERSPEECH 2024
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions · INTERSPEECH 2024
Can CLIP Help Sound Source Localization? · WACV 2024
Sound Source Localization is All about Cross-Modal Alignment · ICCV 2023
FlexiAST: Flexibility is What AST Needs · INTERSPEECH 2023
Curriculum Learning for Self-supervised Speaker Verification · INTERSPEECH 2023
Disentangled Representation Learning for Multilingual Speaker Recognition · INTERSPEECH 2023
Pushing the limits of raw waveform speaker recognition · INTERSPEECH 2022
Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network · INTERSPEECH 2021
Adapting Speaker Embeddings for Speaker Diarisation · INTERSPEECH 2021
Look Who’s Talking: Active Speaker Detection in the Wild · INTERSPEECH 2021
Self-Supervised Learning of Audio-Visual Objects from Video · ECCV 2020
Spot the Conversation: Speaker Diarisation in the Wild · INTERSPEECH 2020
Now You’re Speaking My Language: Visual Language Identification · INTERSPEECH 2020
In Defence of Metric Learning for Speaker Recognition · INTERSPEECH 2020
FaceFilter: Audio-Visual Speech Separation Using Still Images · INTERSPEECH 2020
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision · INTERSPEECH 2020
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues · ECCV 2020
Who Said That?: Audio-Visual Speaker Diarisation of Real-World Meetings · INTERSPEECH 2019
My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions · INTERSPEECH 2019
The Conversation: Deep Audio-Visual Speech Enhancement · INTERSPEECH 2018
VoxCeleb2: Deep Speaker Recognition · INTERSPEECH 2018
Deep Lip Reading: A Comparison of Models and an Online Application · INTERSPEECH 2018
Lip Reading Sentences in the Wild · CVPR 2017
VoxCeleb: A Large-Scale Speaker Identification Dataset · INTERSPEECH 2017