David Harwath

41 papers · 2016–2026 · 12 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (10) 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (6) 🌍 Conference Polyglot (11) 🤝 Dynamic Duo (15) 🏆 Keyword Champion 👥 Mega-Team (76) 🏆 Grand Slam 🔬 Deep Specialist (20) 🔥 Unstoppable (10) ❓ The Questioner (2) ⚡ Prolific Year (10) 🗃️ Keyword Collector (166) 💎 Century Club (40) 📈 Trend Setter 🚀 Conference Pioneer

Conferences

INTERSPEECH (16) ACL (5) CVPR (4) EMNLP (4) ICLR (3) ECCV (2) ICCV (2) AAAI (1) ICML (1) IJCNLP (1) NIPS (1) WACV (1)

Top co-authors

James Glass (15) Puyuan Peng (14) Hilde Kuehne (6) Andrew Rouditchenko (6) Anuj Diwan (6) Eunsol Choi (5) Brian Kingsbury (5) Rogerio Feris (5) Samuel Thomas (5) Antonio Torralba (4)

Keywords

self-supervised learning (12) multimodal learning (11) speech synthesis (7) zero-shot learning (5) video retrieval (5) neural codec (4) video understanding (3) visual grounding (3) contrastive learning (3) image captioning (3) video captioning (3) speech processing (3) model architecture (2) image retrieval (2) audio-visual learning (2) transfer learning (2) speech recognition (2) unsupervised learning (2) cross-lingual transfer (2) action recognition (2)

Papers

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence · AAAI 2026
Scaling Rich Style-Prompted Text-to-Speech Datasets · EMNLP 2025
SyllableLM: Learning Coarse Semantic Units for Speech Language Models · ICLR 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models · ICCV 2025
Temporally Streaming Audio-Visual Synchronization for Real-World Videos · WACV 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks · ICLR 2025
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing · EMNLP 2025
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation · INTERSPEECH 2024
Multimodal Contextualized Semantic Parsing from Speech · ACL 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild · ACL 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos · CVPR 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos · ECCV 2024
Textless Speech-to-Speech Translation With Limited Parallel Data · EMNLP 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models · ICML 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion · INTERSPEECH 2024
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals · INTERSPEECH 2024
Interface Design for Self-Supervised Speech Models · INTERSPEECH 2024
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization · INTERSPEECH 2023
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos · INTERSPEECH 2023
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages · INTERSPEECH 2023
When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants · ACL 2023
Contrastive Audio-Visual Masked Autoencoder · ICLR 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model · INTERSPEECH 2023
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval · CVPR 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer · INTERSPEECH 2022
Word Discovery in Visually Grounded, Self-Supervised Speech Models · INTERSPEECH 2022
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech · INTERSPEECH 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality · EMNLP 2022
Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos · ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units · ACL 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units · IJCNLP 2021
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos · INTERSPEECH 2021
Cascaded Multilingual Audio-Visual Learning from Videos · INTERSPEECH 2021
Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions · CVPR 2021
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets · INTERSPEECH 2020
Learning Words by Drawing Images · CVPR 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio · INTERSPEECH 2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition · INTERSPEECH 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input · ECCV 2018
Learning Word-Like Units from Joint Audio-Visual Analysis · ACL 2017
Unsupervised Learning of Spoken Language with Visual Context · NIPS 2016