David Harwath
41 papers · 2016–2026 · 12 top CS/AI conferences
Achievements
Keyword Pioneer
Renaissance Researcher (6)
Interdisciplinary Bridge
Conference Polyglot (11)
Taxonomy Completionist (10)
Dynamic Duo (15)
Keyword Champion
Mega-Team (76)
Grand Slam
Deep Specialist (20)
Unstoppable (10)
The Questioner (2)
Prolific Year (10)
Keyword Collector (166)
Century Club (40)
Trend Setter
Conference Pioneer
Conferences
INTERSPEECH (16)
ACL (5)
CVPR (4)
EMNLP (4)
ICLR (3)
ECCV (2)
ICCV (2)
AAAI (1)
ICML (1)
IJCNLP (1)
NIPS (1)
WACV (1)
Keywords
self-supervised learning (12)
multimodal learning (11)
speech synthesis (7)
zero-shot learning (5)
video retrieval (5)
neural codec (4)
video understanding (3)
visual grounding (3)
contrastive learning (3)
image captioning (3)
video captioning (3)
speech processing (3)
model architecture (2)
image retrieval (2)
audio-visual learning (2)
transfer learning (2)
speech recognition (2)
unsupervised learning (2)
cross-lingual transfer (2)
action recognition (2)
Papers
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
AAAI 2026
Scaling Rich Style-Prompted Text-to-Speech Datasets
EMNLP 2025
SyllableLM: Learning Coarse Semantic Units for Speech Language Models
ICLR 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
ICCV 2025
Temporally Streaming Audio-Visual Synchronization for Real-World Videos
WACV 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
ICLR 2025
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
EMNLP 2025
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation
INTERSPEECH 2024
Multimodal Contextualized Semantic Parsing from Speech
ACL 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
ACL 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
CVPR 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
ECCV 2024
Textless Speech-to-Speech Translation With Limited Parallel Data
EMNLP 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models
ICML 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion
INTERSPEECH 2024
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals
INTERSPEECH 2024
Interface Design for Self-Supervised Speech Models
INTERSPEECH 2024
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
INTERSPEECH 2023
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos
INTERSPEECH 2023
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages
INTERSPEECH 2023
When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants
ACL 2023
Contrastive Audio-Visual Masked Autoencoder
ICLR 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
INTERSPEECH 2023
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval
CVPR 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
INTERSPEECH 2022
Word Discovery in Visually Grounded, Self-Supervised Speech Models
INTERSPEECH 2022
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech
INTERSPEECH 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
EMNLP 2022
Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos
ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
ACL 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
IJCNLP 2021
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
INTERSPEECH 2021
Cascaded Multilingual Audio-Visual Learning from Videos
INTERSPEECH 2021
Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions
CVPR 2021
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets
INTERSPEECH 2020
Learning Words by Drawing Images
CVPR 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio
INTERSPEECH 2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition
INTERSPEECH 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
ECCV 2018
Learning Word-Like Units from Joint Audio-Visual Analysis
ACL 2017
Unsupervised Learning of Spoken Language with Visual Context
NIPS 2016