Zehan Wang

30 papers · 2016–2025 · 8 conferences · across top CS/AI conferences

Achievements

🏃 Academic Marathon (9) 🌍 Conference Polyglot (8) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (12)

🌈 Renaissance Researcher (8) 🗺️ Taxonomy Completionist (47) 🤝 Dynamic Duo (24) 👥 Mega-Team (20) 🔬 Deep Specialist (11) ❓ The Questioner 💎 Century Club (30) ⚡ Prolific Year (10) 🗃️ Keyword Collector (127)

Conferences

NIPS (7) ICLR (6) ACL (5) CVPR (5) ICML (3) ICCV (2) EMNLP (1) NAACL (1)

Top co-authors

Zhou Zhao (24) Xize Cheng (15) Rongjie Huang (14) Tao Jin (12) Haifeng Huang (10) Ziang Zhang (9) Yang Zhao (8) Luping Liu (8) Shengpeng Ji (7) Zhenhui Ye (6)

Keywords

multi-modal learning (3) vision-language model (3) speech synthesis (3) representation learning (3) speech translation (2) contrastive learning (2) point cloud (2) semantic alignment (2) cross-modal alignment (2) convolutional neural network (2) zero-shot learning (2) scene understanding (2) 3d visual grounding (2) sub-pixel convolution (2) image restoration (2) multimodal learning (2) video super-resolution (2) embedding learning (1) object detection (1) domain generalization (1)

Papers

Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception NAACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback ACL 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors CVPR 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language CVPR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces ICLR 2025
Improving Long-Text Alignment for Text-to-Image Diffusion Models ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup ICLR 2025
Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision ICLR 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling ICLR 2025
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models ICML 2025
InstructSpeech: Following Speech Editing Instructions via Large Language Models ICML 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes NIPS 2024
Action Imitation in Common Action Space for Customized Action Image Synthesis NIPS 2024
Extending Multi-modal Contrastive Representations NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching NIPS 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT NIPS 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation ACL 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion ICML 2024
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition ICCV 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding EMNLP 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding ICCV 2023
Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations ACL 2023
Connecting Multi-modal Contrastive Representations NIPS 2023
Real-Time Video Super-Resolution With Spatio-Temporal Networks and Motion Compensation CVPR 2017
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network CVPR 2017
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network CVPR 2016