Yuhang Cao

23 papers · 2017–2026 · 9 conferences · across top CS/AI conferences

Achievements

🏃 Academic Marathon (9) 🌍 Conference Polyglot (9) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (5)

🌈 Renaissance Researcher (6) 🗺️ Taxonomy Completionist (51) 🤝 Dynamic Duo (19) 👥 Mega-Team (24) 🧬 Topic Evolution ❓ The Questioner (2) 💎 Century Club (23) ⚡ Prolific Year (14) 🗃️ Keyword Collector (126)

Conferences

CVPR (6) ICCV (6) ACL (2) ICML (2) INTERSPEECH (2) NIPS (2) ECCV (1) ICLR (1) WACV (1)

Top co-authors

Jiaqi Wang (19) Yuhang Zang (16) Dahua Lin (16) Xiaoyi Dong (15) Pan Zhang (15) Haodong Duan (8) Kai Chen (6) Conghui He (5) Tong Wu (4) Wenwei Zhang (4)

Keywords

object detection (4) vision-language model (3) multimodal learning (3) diffusion model (2) video understanding (2) temporal consistency (2) vision language model (2) reinforcement learning (2) video language model (2) instruction following (2) multimodal large language model (2) direct preference optimization (1) vision transformer (1) attention mechanism (1) preference learning (1) sampling strategy (1) benchmark evaluation (1) 3d reconstruction (1) multi-modal learning (1) speech separation (1)

Papers

OMeGa: Joint Optimization of Explicit Meshes and Gaussian Splats for Robust Scene-Level Surface Reconstruction WACV 2026
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model ACL 2025
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings ACL 2025
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree ICCV 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion ICCV 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding? ICML 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? CVPR 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction CVPR 2025
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way CVPR 2025
Conical Visual Concentration for Efficient Large Vision-Language Models CVPR 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation ICML 2025
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models ICLR 2025
MM-IFEngine: Towards Multimodal Instruction Following ICCV 2025
Visual-RFT: Visual Reinforcement Fine-Tuning ICCV 2025
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate ICCV 2025
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD NIPS 2024
V3Det: Vast Vocabulary Visual Detection Dataset ICCV 2023
Few-Shot Object Detection via Association and DIscrimination NIPS 2021
Seesaw Loss for Long-Tailed Instance Segmentation CVPR 2021
Side-Aware Boundary Localization for More Precise Object Detection ECCV 2020
Prime Sample Attention in Object Detection CVPR 2020
Investigation of Cost Function for Supervised Monaural Speech Separation INTERSPEECH 2019
Speaker Direction-of-Arrival Estimation Based on Frequency-Independent Beampattern INTERSPEECH 2017