Zehan Wang
30 papers · 2016–2025 · 8 top CS/AI conferences
Achievements
Academic Marathon (9)
Conference Polyglot (8)
Interdisciplinary Bridge
Keyword Pioneer
Cross-Pollinator (12)
Renaissance Researcher (8)
Taxonomy Completionist (47)
Dynamic Duo (24)
Mega-Team (20)
Deep Specialist (11)
The Questioner
Century Club (30)
Prolific Year (10)
Keyword Collector (127)
Conferences
NIPS (7)
ICLR (6)
ACL (5)
CVPR (5)
ICML (3)
ICCV (2)
EMNLP (1)
NAACL (1)
Keywords
multi-modal learning (3)
vision-language model (3)
speech synthesis (3)
representation learning (3)
speech translation (2)
contrastive learning (2)
point cloud (2)
semantic alignment (2)
cross-modal alignment (2)
convolutional neural network (2)
zero-shot learning (2)
scene understanding (2)
3d visual grounding (2)
sub-pixel convolution (2)
image restoration (2)
multimodal learning (2)
video super-resolution (2)
embedding learning (1)
object detection (1)
domain generalization (1)
Papers
Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception
NAACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
ACL 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
CVPR 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
CVPR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
ICLR 2025
Improving Long-Text Alignment for Text-to-Image Diffusion Models
ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
ICLR 2025
Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision
ICLR 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
ICLR 2025
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
ICML 2025
InstructSpeech: Following Speech Editing Instructions via Large Language Models
ICML 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes
NIPS 2024
Action Imitation in Common Action Space for Customized Action Image Synthesis
NIPS 2024
Extending Multi-modal Contrastive Representations
NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
NIPS 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
NIPS 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
ACL 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
ICML 2024
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
EMNLP 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
ICCV 2023
Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations
ACL 2023
Connecting Multi-modal Contrastive Representations
NIPS 2023
Real-Time Video Super-Resolution With Spatio-Temporal Networks and Motion Compensation
CVPR 2017
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
CVPR 2017
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
CVPR 2016