Zhou Zhao
195 papers · 2015–2026 · 15 conferences · across top CS/AI conferences
Achievements
Taxonomy Completionist (22)
Keyword Pioneer
Interdisciplinary Bridge
Renaissance Researcher (7)
Conference Polyglot (15)
Hot Topic Early Bird
Conference Loyalist (21)
Keyword Trendsetter Combo (6)
Dynamic Duo (44)
Triple Crown
Keyword Champion (4)
Grand Slam
Mega-Team (29)
Deep Specialist (52)
Topic Evolution
Unstoppable (11)
The Questioner
Conference Pioneer
Century Club (189)
Prolific Year (14)
Keyword Collector (75)
Trend Setter
Conferences
ACL (52)
NIPS (23)
AAAI (22)
IJCAI (21)
CVPR (18)
ICLR (12)
ICML (12)
EMNLP (11)
ICCV (7)
MICCAI (5)
IJCNLP (3)
INTERSPEECH (3)
AACL (2)
COLING (2)
NAACL (2)
Research topics
Keywords
speech synthesis (25)
multimodal learning (19)
video understanding (16)
singing voice synthesis (15)
contrastive learning (13)
zero-shot learning (13)
diffusion model (12)
attention mechanism (12)
multi-modal learning (11)
representation learning (8)
generative model (8)
prosody modeling (7)
object detection (7)
voice conversion (6)
cross-modal learning (6)
generative adversarial network (6)
self-supervised learning (6)
style transfer (6)
knowledge distillation (6)
visual grounding (6)
Papers
Rectifying the Emotional Flow: Aligning Priors and Dynamic Guidance for High-Arousal Text-to-Speech
ACL 2026
Unified Thinker: A General Reasoning Core for Image Generation
ACL 2026
VoxMind: An End-to-End Agentic Spoken Dialogue System
ACL 2026
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
ACL 2026
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
ACL 2026
F.A.C.U.L.: Language-Based Interaction with AI Companions in Gaming
AAAI 2026
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching
ACL 2025
CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling
ACL 2025
FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation
ACL 2025
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
ACL 2025
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
ACL 2025
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
ACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
ACL 2025
CodeSync: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
ICML 2025
Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception
NAACL 2025
ASAudio: A Survey of Advanced Spatial Audio Research
IJCNLP 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches
IJCNLP 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video
ICML 2025
Dataflow-Guided Neuro-Symbolic Language Models for Type Inference
ICML 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
ICLR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
ICLR 2025
EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation
ICLR 2025
Open-set Cross Modal Generalization via Multimodal Unified Representation
ICCV 2025
Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
ICCV 2025
InteractSpeech: A Speech Dialogue Interaction Corpus for Spoken Dialogue Model
EMNLP 2025
Versatile Framework for Song Generation with Prompt-based Control
EMNLP 2025
RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation
EMNLP 2025
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
EMNLP 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
CVPR 2025
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance
CVPR 2025
ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation
IJCAI 2025
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
ICML 2025
IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models
ICML 2025
MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities
AAAI 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
AAAI 2025
Speech Watermarking with Discrete Intermediate Representations
AAAI 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches
AACL 2025
ASAudio: A Survey of Advanced Spatial Audio Research
AACL 2025
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
CVPR 2025
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
CVPR 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors
CVPR 2025
Sign2Vis: Automated Data Visualization from Sign Language
ACL 2025
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation
COLING 2025
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
ACL 2025
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
ACL 2025
Enhancing Multimodal Unified Representations for Cross Modal Generalization
ACL 2025
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
ACL 2025
InstructSpeech: Following Speech Editing Instructions via Large Language Models
ICML 2024
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks
NIPS 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes
NIPS 2024
Action Imitation in Common Action Space for Customized Action Image Synthesis
NIPS 2024
Extending Multi-modal Contrastive Representations
NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
NIPS 2024
$E^3$: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset
NIPS 2024
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
NIPS 2024
Classifier-guided Gradient Modulation for Enhanced Multimodal Learning
NIPS 2024
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
AAAI 2024
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
AAAI 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
AAAI 2024
Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition
ACL 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
ACL 2024
Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment
ACL 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment
ACL 2024
Robust Singing Voice Transcription Serves Synthesis
ACL 2024
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation
ACL 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners
ACL 2024
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech
ACL 2024
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing
ACL 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
ACL 2024
AntCritic: Argument Mining for Free-Form and Visually-Rich Financial Comments
COLING 2024
MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization
CVPR 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
EMNLP 2024
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
ICLR 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis
ICLR 2024
Non-confusing Generation of Customized Concepts in Diffusion Models
ICML 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
ICML 2024
UniAudio: Towards Universal Audio Generation with Large Language Models
ICML 2024
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis
INTERSPEECH 2024
MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation
MICCAI 2024
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays
MICCAI 2024
Prompting Segment Anything Model with Domain-Adaptive Prototype for Generalizable Medical Image Segmentation
MICCAI 2024
Spatial-aware Attention Generative Adversarial Network for Semi-supervised Anomaly Detection in Medical Image
MICCAI 2024
WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising
MICCAI 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
NAACL 2024
ART: rule bAsed futuRe-inference deducTion
EMNLP 2023
DATE: Domain Adaptive Product Seeker for E-Commerce
CVPR 2023
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding
CVPR 2023
Gloss Attention for Gloss-Free Sign Language Translation
CVPR 2023
ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos
CVPR 2023
Video-Audio Domain Generalization via Confounder Disentanglement
AAAI 2023
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories
AAAI 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
EMNLP 2023
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
EMNLP 2023
Open-Vocabulary Object Detection With an Open Corpus
ICCV 2023
Exploring Group Video Captioning with Efficient Relational Approximation
ICCV 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
ICCV 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
ICLR 2023
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis
ICLR 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
ICML 2023
Achieving Cross Modal Generalization with Multimodal Unified Representation
NIPS 2023
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks
NIPS 2023
PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios
NIPS 2023
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech
ACL 2023
Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation
ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation
ACL 2023
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
ACL 2023
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect
ACL 2023
Connecting Multi-modal Contrastive Representations
NIPS 2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment
ACL 2023
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis
ACL 2023
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis
ACL 2023
Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations
ACL 2023
TAVT: Towards Transferable Audio-Visual Text Generation
ACL 2023
Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
ACL 2023
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training
ACL 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
ACL 2023
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
ACL 2023
Multi-modal Action Chain Abductive Reasoning
ACL 2023
Revisiting Over-Smoothness in Text to Speech
ACL 2022
End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding
ACL 2022
Parallel and High-Fidelity Text-to-Lip Generation
AAAI 2022
SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech
IJCAI 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
IJCAI 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
AAAI 2022
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
NIPS 2022
Flow-Based Unconstrained Lip to Speech Generation
AAAI 2022
Pseudo Numerical Methods for Diffusion Models on Manifolds
ICLR 2022
EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling
IJCAI 2022
Prior Knowledge and Memory Enriched Transformer for Sign Language Translation
ACL 2022
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus
NIPS 2022
Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization
NIPS 2022
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
NIPS 2022
Cross-Modal Background Suppression for Audio-Visual Event Localization
CVPR 2022
Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks
CVPR 2022
MLSLT: Towards Multilingual Sign Language Translation
CVPR 2022
Fine-Grained Predicates Learning for Scene Graph Generation
CVPR 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier
ACL 2022
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
NIPS 2022
Cortical Surface Shape Analysis Based on Alexandrov Polyhedra
ICCV 2021
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
ICLR 2021
PortaSpeech: Portable and High-Quality Generative Text-to-Speech
NIPS 2021
Learning to Rehearse in Long Sequence Memorization
ICML 2021
Generalizable Multi-linear Attention Network
NIPS 2021
FedSpeech: Federated Text-to-Speech with Continual Learning
IJCAI 2021
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
INTERSPEECH 2021
Cascaded Prediction Network via Segment Tree for Temporal Video Grounding
CVPR 2021
Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
CVPR 2021
Modeling High-order Interactions across Multi-interests for Micro-video Recommendation (Student Abstract)
AAAI 2021
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution
INTERSPEECH 2021
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization
AAAI 2020
A Study of Non-autoregressive Model for Sequence Generation
ACL 2020
SimulSpeech: End-to-End Simultaneous Speech to Text Translation
ACL 2020
Interactive Dual Generative Adversarial Networks for Image Captioning
AAAI 2020
Be Relevant, Non-Redundant, and Timely: Deep Reinforcement Learning for Real-Time Event Summarization
AAAI 2020
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
AAAI 2020
Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation
IJCAI 2020
Multi-Speaker Video Dialog with Frame-Level Temporal Localization
AAAI 2020
Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
NIPS 2020
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding
IJCAI 2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
CVPR 2020
FastSpeech: Fast, Robust and Controllable Text to Speech
NIPS 2019
Almost Unsupervised Text to Speech and Automatic Speech Recognition
ICML 2019
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction
CVPR 2019
Video Dialog via Progressive Inference and Cross-Transformer
IJCNLP 2019
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering
AAAI 2019
Location-Based End-to-End Speech Recognition with Multiple Language Models
AAAI 2019
Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval
IJCAI 2019
Weak Supervision Enhanced Generative Network for Question Generation
IJCAI 2019
Video Dialog via Progressive Inference and Cross-Transformer
EMNLP 2019
Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
IJCAI 2019
Localizing Unseen Activities in Video via Image Query
IJCAI 2019
Exploring Human-Like Reading Strategy for Abstractive Text Summarization
AAAI 2019
Answer Identification from Product Reviews for User Questions by Multi-Task Attentive Networks
AAAI 2019
Multilingual Neural Machine Translation with Knowledge Distillation
ICLR 2019
Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference
ACL 2018
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks
IJCAI 2018
Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network
IJCAI 2018
MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models
NIPS 2018
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding
IJCAI 2018
Investigating Capsule Networks with Dynamic Routing for Text Classification
EMNLP 2018
A Multi-task Learning Approach for Image Captioning
IJCAI 2018
Attentional Image Retweet Modeling via Multi-Faceted Ranking Network Learning
IJCAI 2018
Link Prediction via Ranking Metric Dual-Level Attention Network Learning
IJCAI 2017
Identifying and Tracking Sentiments and Topics from Social Media Texts during Natural Disasters
EMNLP 2017
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
IJCAI 2017
Microblog Sentiment Classification via Recurrent Random Walk Network Learning
IJCAI 2017
Expert Finding for Community-Based Question Answering via Ranking Metric Network Learning
IJCAI 2016
Mobile Query Recommendation via Tensor Function Learning
IJCAI 2015