Xize Cheng
33 papers · 2023–2026 · 9 conferences · across top CS/AI conferences
Achievements
🌍 Conference Polyglot (9)
🐝 Cross-Pollinator (10)
🌉 Interdisciplinary Bridge
🧭 Keyword Pioneer
🌈 Renaissance Researcher (7)
🤝 Dynamic Duo (29)
🏆 Grand Slam
🔬 Deep Specialist (17)
⚡ Prolific Year (10)
🗃️ Keyword Collector (146)
❓ The Questioner
💎 Century Club (32)
Conferences
ACL (16)
ICLR (4)
EMNLP (3)
ICCV (3)
ICML (2)
NIPS (2)
AAAI (1)
COLING (1)
CVPR (1)
Keywords
multimodal learning (6)
contrastive learning (5)
speech synthesis (4)
visual speech recognition (3)
zero-shot learning (3)
point cloud (2)
video generation (2)
3d visual grounding (2)
audio-visual speech (2)
vision-language model (2)
speech-to-speech translation (2)
domain adaptation (2)
transfer learning (2)
generative model (2)
cross-modal learning (2)
speech translation (2)
video captioning (1)
preference learning (1)
object detection (1)
domain generalization (1)
Papers
SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
ACL 2026
A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter
AAAI 2025
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching
ACL 2025
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
ACL 2025
CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling
ACL 2025
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
ACL 2025
VoxpopuliTTS: a large-scale multilingual TTS corpus for zero-shot speech generation
COLING 2025
SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language
CVPR 2025
PACHAT: Persona-Aware Speech Assistant for Multi-party Dialogue
EMNLP 2025
VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?
ICLR 2025
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
ICLR 2025
OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup
ICLR 2025
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
ICLR 2025
TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation
ACL 2024
Extending Multi-modal Contrastive Representations
NIPS 2024
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
NIPS 2024
InstructSpeech: Following Speech Editing Instructions via Large Language Models
ICML 2024
AudioVSR: Enhancing Video Speech Recognition with Audio Data
EMNLP 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
ICML 2024
Rethinking the Multimodal Correlation of Multimodal Sequential Learning via Generalizable Attentional Results Alignment
ACL 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment
ACL 2024
Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation
ACL 2024
Wav2SQL: Direct Generalizable Speech-To-SQL Parsing
ACL 2024
OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
ACL 2023
Exploring Group Video Captioning with Efficient Relational Approximation
ICCV 2023
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
ICCV 2023
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
ACL 2023
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
ICCV 2023
Weakly-Supervised Spoken Video Grounding via Semantic Interaction Learning
ACL 2023
TAVT: Towards Transferable Audio-Visual Text Generation
ACL 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
EMNLP 2023
Semantic-conditioned Dual Adaptation for Cross-domain Query-based Visual Segmentation
ACL 2023
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation
ACL 2023