conftrace_

David Harwath

41 papers · 2016–2026 · 12 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (10) 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (6) 🌍 Conference Polyglot (11) 🤝 Dynamic Duo (15) 🏆 Keyword Champion 👥 Mega-Team (76) 🏆 Grand Slam 🔬 Deep Specialist (20) 🔥 Unstoppable (10) ❓ The Questioner (2) ⚡ Prolific Year (10) 🗃️ Keyword Collector (166) 💎 Century Club (40) 📈 Trend Setter 🚀 Conference Pioneer

Conferences

INTERSPEECH (16) ACL (5) CVPR (4) EMNLP (4) ICLR (3) ECCV (2) ICCV (2) AAAI (1) ICML (1) IJCNLP (1) NIPS (1) WACV (1)

Papers

MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence · AAAI 2026
Scaling Rich Style-Prompted Text-to-Speech Datasets · EMNLP 2025
SyllableLM: Learning Coarse Semantic Units for Speech Language Models · ICLR 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models · ICCV 2025
Temporally Streaming Audio-Visual Synchronization for Real-World Videos · WACV 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks · ICLR 2025
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing · EMNLP 2025
Improving Audio Classification with Low-Sampled Microphone Input: An Empirical Study Using Model Self-Distillation · INTERSPEECH 2024
Multimodal Contextualized Semantic Parsing from Speech · ACL 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild · ACL 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos · CVPR 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos · ECCV 2024
Textless Speech-to-Speech Translation With Limited Parallel Data · EMNLP 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models · ICML 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion · INTERSPEECH 2024
Direct Speech Synthesis from Non-Invasive, Neuromagnetic Signals · INTERSPEECH 2024
Interface Design for Self-Supervised Speech Models · INTERSPEECH 2024
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization · INTERSPEECH 2023
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos · INTERSPEECH 2023
Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages · INTERSPEECH 2023
When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants · ACL 2023
Contrastive Audio-Visual Masked Autoencoder · ICLR 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model · INTERSPEECH 2023
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval · CVPR 2022
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer · INTERSPEECH 2022
Word Discovery in Visually Grounded, Self-Supervised Speech Models · INTERSPEECH 2022
Exploring Few-Shot Fine-Tuning Strategies for Models of Visually Grounded Speech · INTERSPEECH 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality · EMNLP 2022
Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos · ICCV 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units · ACL 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units · IJCNLP 2021
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos · INTERSPEECH 2021
Cascaded Multilingual Audio-Visual Learning from Videos · INTERSPEECH 2021
Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions · CVPR 2021
Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets · INTERSPEECH 2020
Learning Words by Drawing Images · CVPR 2019
Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio · INTERSPEECH 2019
Transfer Learning from Audio-Visual Grounding to Speech Recognition · INTERSPEECH 2019
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input · ECCV 2018
Learning Word-Like Units from Joint Audio-Visual Analysis · ACL 2017
Unsupervised Learning of Spoken Language with Visual Context · NIPS 2016