Zhou Zhao

195 papers · 2015–2026 · 15 top CS/AI conferences

Achievements

🗺️ Taxonomy Completionist (22) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (7) 🌍 Conference Polyglot (15)

🐣 Hot Topic Early Bird 🏠 Conference Loyalist (21) 🌟 Keyword Trendsetter Combo (6) 🤝 Dynamic Duo (44) 👑 Triple Crown 🏆 Keyword Champion (4) 🏆 Grand Slam 👥 Mega-Team (29) 🔬 Deep Specialist (52) 🧬 Topic Evolution 🔥 Unstoppable (11) ❓ The Questioner 🚀 Conference Pioneer 💎 Century Club (189) ⚡ Prolific Year (14) 🗃️ Keyword Collector (75) 📈 Trend Setter

Conferences

ACL (52) NIPS (23) AAAI (22) IJCAI (21) CVPR (18) ICLR (12) ICML (12) EMNLP (11) ICCV (7) MICCAI (5) IJCNLP (3) INTERSPEECH (3) AACL (2) COLING (2) NAACL (2)

Top co-authors

Rongjie Huang (44) Tao Jin (36) Yi Ren (35) Xize Cheng (30) Jinglin Liu (25) Ziyue Jiang (24) Zehan Wang (24) Fei Wu (22) Shengpeng Ji (21) Zhenhui Ye (17)

Research topics

Education (1)

Keywords

speech synthesis (25) multimodal learning (19) video understanding (16) singing voice synthesis (15) contrastive learning (13) zero-shot learning (13) diffusion model (12) attention mechanism (12) multi-modal learning (11) representation learning (8) generative model (8) prosody modeling (7) object detection (7) voice conversion (6) cross-modal learning (6) generative adversarial network (6) self-supervised learning (6) style transfer (6) knowledge distillation (6) visual grounding (6)

Papers

Rectifying the Emotional Flow: Aligning Priors and Dynamic Guidance for High-Arousal Text-to-Speech ACL 2026
Unified Thinker: A General Reasoning Core for Image Generation ACL 2026
VoxMind: An End-to-End Agentic Spoken Dialogue System ACL 2026
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness ACL 2026
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models ACL 2026
F.A.C.U.L.: Language-Based Interaction with AI Companions in Gaming AAAI 2026
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching ACL 2025
CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling ACL 2025
FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation ACL 2025
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models ACL 2025
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models ACL 2025
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use ACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control ACL 2025
CodeSync: Synchronizing Large Language Models with Dynamic Code Evolution at Scale ICML 2025
Data-Efficiently Learn Large Language Model for Universal 3D Scene Perception NAACL 2025
ASAudio: A Survey of Advanced Spatial Audio Research IJCNLP 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches IJCNLP 2025
OmniAudio: Generating Spatial Audio from 360-Degree Video ICML 2025
Dataflow-Guided Neuro-Symbolic Language Models for Type Inference ICML 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces ICLR 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? ICLR 2025
EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation ICLR 2025
Open-set Cross Modal Generalization via Multimodal Unified Representation ICCV 2025
Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations ICCV 2025
InteractSpeech: A Speech Dialogue Interaction Corpus for Spoken Dialogue Model EMNLP 2025
Versatile Framework for Song Generation with Prompt-based Control EMNLP 2025
RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation EMNLP 2025
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment EMNLP 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language CVPR 2025
Towards Transformer-Based Aligned Generation with Self-Coherence Guidance CVPR 2025
ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation IJCAI 2025
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models ICML 2025
IRBridge: Solving Image Restoration Bridge with Pre-trained Generative Diffusion Models ICML 2025
MergeNet: Knowledge Migration Across Heterogeneous Models, Tasks, and Modalities AAAI 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching AAAI 2025
Speech Watermarking with Discrete Intermediate Representations AAAI 2025
Synthetic Singers: A Review of Deep-Learning-based Singing Voice Synthesis Approaches AACL 2025
ASAudio: A Survey of Advanced Spatial Audio Research AACL 2025
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders CVPR 2025
FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation CVPR 2025
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors CVPR 2025
Sign2Vis: Automated Data Visualization from Sign Language ACL 2025
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation COLING 2025
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation ACL 2025
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis ACL 2025
Enhancing Multimodal Unified Representations for Cross Modal Generalization ACL 2025
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback ACL 2025
InstructSpeech: Following Speech Editing Instructions via Large Language Models ICML 2024
GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks NIPS 2024
MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes NIPS 2024
Action Imitation in Common Action Space for Customized Action Image Synthesis NIPS 2024
Extending Multi-modal Contrastive Representations NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers NIPS 2024
$E^3$: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset NIPS 2024
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence NIPS 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching NIPS 2024
Classifier-guided Gradient Modulation for Enhanced Multimodal Learning NIPS 2024
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations AAAI 2024
StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis AAAI 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head AAAI 2024
Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition ACL 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension ACL 2024
Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment ACL 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment ACL 2024
Robust Singing Voice Transcription Serves Synthesis ACL 2024
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation ACL 2024
Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners ACL 2024
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech ACL 2024
Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing ACL 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion ACL 2024
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation ACL 2024
AntCritic: Argument Mining for Free-Form and Visually-Rich Financial Comments COLING 2024
MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization CVPR 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control EMNLP 2024
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis ICLR 2024
Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis ICLR 2024
Non-confusing Generation of Customized Concepts in Diffusion Models ICML 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion ICML 2024
UniAudio: Towards Universal Audio Generation with Large Language Models ICML 2024
MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis INTERSPEECH 2024
MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation MICCAI 2024
Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays MICCAI 2024
Prompting Segment Anything Model with Domain-Adaptive Prototype for Generalizable Medical Image Segmentation MICCAI 2024
Spatial-aware Attention Generative Adversarial Network for Semi-supervised Anomaly Detection in Medical Image MICCAI 2024
WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising MICCAI 2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt NAACL 2024
ART: rule bAsed futuRe-inference deducTion EMNLP 2023
DATE: Domain Adaptive Product Seeker for E-Commerce CVPR 2023
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding CVPR 2023
Gloss Attention for Gloss-Free Sign Language Translation CVPR 2023
ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos CVPR 2023
Video-Audio Domain Generalization via Confounder Disentanglement AAAI 2023
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories AAAI 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding EMNLP 2023
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer EMNLP 2023
Open-Vocabulary Object Detection With an Open Corpus ICCV 2023
Exploring Group Video Captioning with Efficient Relational Approximation ICCV 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding ICCV 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition ICCV 2023
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation ICLR 2023
GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis ICLR 2023
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models ICML 2023
Achieving Cross Modal Generalization with Multimodal Unified Representation NIPS 2023
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks NIPS 2023
PTADisc: A Cross-Course Dataset Supporting Personalized Learning in Cold-Start Scenarios NIPS 2023
Prosody-TTS: Improving Prosody with Masked Autoencoder and Conditional Diffusion Model For Expressive Text-to-Speech ACL 2023
Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation ACL 2023
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models ACL 2023
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect ACL 2023
Connecting Multi-modal Contrastive Representations NIPS 2023
AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment ACL 2023
FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis ACL 2023
RMSSinger: Realistic-Music-Score based Singing Voice Synthesis ACL 2023
Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations ACL 2023
TAVT: Towards Transferable Audio-Visual Text Generation ACL 2023
Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning ACL 2023
CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training ACL 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation ACL 2023
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment ACL 2023
Multi-modal Action Chain Abductive Reasoning ACL 2023
Revisiting Over-Smoothness in Text to Speech ACL 2022
End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding ACL 2022
Parallel and High-Fidelity Text-to-Lip Generation AAAI 2022
SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech IJCAI 2022
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis IJCAI 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism AAAI 2022
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech NIPS 2022
Flow-Based Unconstrained Lip to Speech Generation AAAI 2022
Pseudo Numerical Methods for Diffusion Models on Manifolds ICLR 2022
EditSinger: Zero-Shot Text-Based Singing Voice Editing System with Diverse Prosody Modeling IJCAI 2022
Prior Knowledge and Memory Enriched Transformer for Sign Language Translation ACL 2022
M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus NIPS 2022
Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization NIPS 2022
Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech NIPS 2022
Cross-Modal Background Suppression for Audio-Visual Event Localization CVPR 2022
Wnet: Audio-Guided Video Object Segmentation via Wavelet-Based Cross-Modal Denoising Networks CVPR 2022
MLSLT: Towards Multilingual Sign Language Translation CVPR 2022
Fine-Grained Predicates Learning for Scene Graph Generation CVPR 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier ACL 2022
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models NIPS 2022
Cortical Surface Shape Analysis Based on Alexandrov Polyhedra ICCV 2021
FastSpeech 2: Fast and High-Quality End-to-End Text to Speech ICLR 2021
PortaSpeech: Portable and High-Quality Generative Text-to-Speech NIPS 2021
Learning to Rehearse in Long Sequence Memorization ICML 2021
Generalizable Multi-linear Attention Network NIPS 2021
FedSpeech: Federated Text-to-Speech with Continual Learning IJCAI 2021
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model INTERSPEECH 2021
Cascaded Prediction Network via Segment Tree for Temporal Video Grounding CVPR 2021
Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval CVPR 2021
Modeling High-order Interactions across Multi-interests for Micro-video Reommendation (Student Abstract) AAAI 2021
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution INTERSPEECH 2021
Convolutional Hierarchical Attention Network for Query-Focused Video Summarization AAAI 2020
A Study of Non-autoregressive Model for Sequence Generation ACL 2020
SimulSpeech: End-to-End Simultaneous Speech to Text Translation ACL 2020
Interactive Dual Generative Adversarial Networks for Image Captioning AAAI 2020
Be Relevant, Non-Redundant, and Timely: Deep Reinforcement Learning for Real-Time Event Summarization AAAI 2020
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network AAAI 2020
Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation IJCAI 2020
Multi-Speaker Video Dialog with Frame-Level Temporal Localization AAAI 2020
Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding NIPS 2020
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding IJCAI 2020
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences CVPR 2020
FastSpeech: Fast, Robust and Controllable Text to Speech NIPS 2019
Almost Unsupervised Text to Speech and Automatic Speech Recognition ICML 2019
Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction CVPR 2019
Video Dialog via Progressive Inference and Cross-Transformer IJCNLP 2019
ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering AAAI 2019
Location-Based End-to-End Speech Recognition with Multiple Language Models AAAI 2019
Beyond Product Quantization: Deep Progressive Quantization for Image Retrieval IJCAI 2019
Weak Supervision Enhanced Generative Network for Question Generation IJCAI 2019
Video Dialog via Progressive Inference and Cross-Transformer EMNLP 2019
Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks IJCAI 2019
Localizing Unseen Activities in Video via Image Query IJCAI 2019
Exploring Human-Like Reading Strategy for Abstractive Text Summarization AAAI 2019
Answer Identification from Product Reviews for User Questions by Multi-Task Attentive Networks AAAI 2019
Multilingual Neural Machine Translation with Knowledge Distillation ICLR 2019
Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference ACL 2018
Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks IJCAI 2018
Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network IJCAI 2018
MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models NIPS 2018
Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding IJCAI 2018
Investigating Capsule Networks with Dynamic Routing for Text Classification EMNLP 2018
A Multi-task Learning Approach for Image Captioning IJCAI 2018
Attentional Image Retweet Modeling via Multi-Faceted Ranking Network Learning IJCAI 2018
Link Prediction via Ranking Metric Dual-Level Attention Network Learning IJCAI 2017
Identifying and Tracking Sentiments and Topics from Social Media Texts during Natural Disasters EMNLP 2017
Video Question Answering via Hierarchical Spatio-Temporal Attention Networks IJCAI 2017
Microblog Sentiment Classification via Recurrent Random Walk Network Learning IJCAI 2017
Expert Finding for Community-Based Question Answering via Ranking Metric Network Learning IJCAI 2016
Mobile Query Recommendation via Tensor Function Learning IJCAI 2015