Puyuan Peng

14 papers · 2022–2025 · 8 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (6) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (8) 🧭 Keyword Pioneer 🌈 Renaissance Researcher (6) 👥 Mega-Team (76) 🤝 Dynamic Duo (14) 🏆 Keyword Champion (4) 📈 Trend Setter ⚡ Prolific Year (5) 🗃️ Keyword Collector (55) 💎 Century Club (14)

Conferences

INTERSPEECH (6) ICLR (2) ACL (1) ECCV (1) EMNLP (1) ICCV (1) ICML (1) WACV (1)

Top co-authors

David Harwath (14) Alan Baade (3) Zhisheng Zheng (2) Abdelrahman Mohamed (2) Anuj Diwan (2) Shang-Wen Li (2) Wei-Cheng Tseng (2) Shinji Watanabe (2) Shao-Xiang Yuan (1) Fabian Alejandro Ritter Gutierrez (1)

Keywords

neural codec (4) self-supervised learning (3) zero-shot learning (3) speech synthesis (3) multimodal learning (2) speech editing (2) video understanding (2) voice conversion (1) autoregressive generation (1) action recognition (1) prompt engineering (1) cross-lingual transfer (1) word segmentation (1) multilingual processing (1) video captioning (1) deep learning (1) visual grounding (1) model architecture (1) weakly-supervised learning (1) speech recognition (1)

Papers

SyllableLM: Learning Coarse Semantic Units for Speech Language Models · ICLR 2025
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing · EMNLP 2025
Temporally Streaming Audio-Visual Synchronization for Real-World Videos · WACV 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models · ICCV 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks · ICLR 2025
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild · ACL 2024
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos · ECCV 2024
BAT: Learning to Reason about Spatial Sounds with Large Language Models · ICML 2024
Neural Codec Language Models for Disentangled and Textless Voice Conversion · INTERSPEECH 2024
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model · INTERSPEECH 2023
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization · INTERSPEECH 2023
Style-transfer based Speech and Audio-visual Scene understanding for Robot Action Sequence Acquisition from Videos · INTERSPEECH 2023
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer · INTERSPEECH 2022
Word Discovery in Visually Grounded, Self-Supervised Speech Models · INTERSPEECH 2022