Xize Cheng

33 papers · 2023–2026 · 9 top CS/AI conferences

Achievements

🌍 Conference Polyglot (9) 🐝 Cross-Pollinator (10) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌈 Renaissance Researcher (7)

🤝 Dynamic Duo (29) 🏆 Grand Slam 🔬 Deep Specialist (17) ⚡ Prolific Year (10) 🗃️ Keyword Collector (146) ❓ The Questioner 💎 Century Club (32)

Conferences

ACL (16) ICLR (4) EMNLP (3) ICCV (3) ICML (2) NIPS (2) AAAI (1) COLING (1) CVPR (1)

Top co-authors

Zhou Zhao (30) Tao Jin (20) Zehan Wang (15) Rongjie Huang (14) Linjun Li (13) Shengpeng Ji (11) Wang Lin (9) Ye Wang (7) Jialong Zuo (7) Xiaoda Yang (7)

Keywords

multimodal learning (6) contrastive learning (5) speech synthesis (4) visual speech recognition (3) zero-shot learning (3) point cloud (2) video generation (2) 3d visual grounding (2) audio-visual speech (2) vision-language model (2) speech-to-speech translation (2) domain adaptation (2) transfer learning (2) generative model (2) cross-modal learning (2) speech translation (2) video captioning (1) preference learning (1) object detection (1) domain generalization (1)

Papers

SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness · ACL 2026
A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter · AAAI 2025
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching · ACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control · ACL 2025
CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling · ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback · ACL 2025
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation · COLING 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language · CVPR 2025
PACHAT: Persona-Aware Speech Assistant for Multi-party Dialogue · EMNLP 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? · ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces · ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup · ICLR 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling · ICLR 2025
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation · ACL 2024
Extending Multi-modal Contrastive Representations · NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers · NIPS 2024
InstructSpeech: Following Speech Editing Instructions via Large Language Models · ICML 2024
AudioVSR: Enhancing Video Speech Recognition with Audio Data · EMNLP 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion · ICML 2024
Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment · ACL 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment · ACL 2024
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation · ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing · ACL 2024
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment · ACL 2023
Exploring Group Video Captioning with Efficient Relational Approximation · ICCV 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding · ICCV 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation · ACL 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition · ICCV 2023
Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning · ACL 2023
TAVT: Towards Transferable Audio-Visual Text Generation · ACL 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding · EMNLP 2023
Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation · ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation · ACL 2023