Shuhuai Ren

20 papers · 2019–2026 · 9 conferences · across top CS/AI conferences

Achievements

🌈 Renaissance Researcher (7) 🌉 Interdisciplinary Bridge 🏃 Academic Marathon (6) 🌍 Conference Polyglot (8) 🗺️ Taxonomy Completionist (44)

🧭 Keyword Pioneer 🐣 Hot Topic Early Bird 🏆 Keyword Champion (2) 🧬 Topic Evolution 👥 Mega-Team (21) 🤝 Dynamic Duo (14) 🗃️ Keyword Collector (90) ❓ The Questioner (2) ⚡ Prolific Year (5) 💎 Century Club (19)

Conferences

ACL (5) EMNLP (5) CVPR (3) NIPS (2) AAAI (1) ECCV (1) ICCV (1) IJCNLP (1) NAACL (1)

Top co-authors

Xu Sun (15) Lei Li (9) Shicheng Li (6) Lu Hou (5) Yuanxin Liu (5) Jie Zhou (3) Guangxiang Zhao (3) Linli Yao (3) Rundong Gao (3) Yuchi Wang (3)

Keywords

multimodal large language model (4) video understanding (3) multimodal learning (3) autoregressive model (3) image generation (3) relation alignment (2) pre-trained language model (2) video large language model (2) benchmark evaluation (2) vision-language model (2) representation learning (2) image-text retrieval (2) semantic alignment (2) knowledge distillation (2) image captioning (2) text classification (2) zero-shot learning (2) cross-modal retrieval (2) instruction tuning (2) data augmentation (1)

Papers

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment AAAI 2026
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation ICCV 2025
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction EMNLP 2025
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis CVPR 2025
Parallelized Autoregressive Visual Generation CVPR 2025
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models ECCV 2024
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain ACL 2024
TempCompass: Do Video LLMs Really Understand Videos? ACL 2024
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding CVPR 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? NAACL 2024
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition NIPS 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding EMNLP 2023
FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation NIPS 2023
Delving into the Openness of CLIP ACL 2023
Learning Relation Alignment for Calibrated Cross-modal Retrieval IJCNLP 2021
Learning Relation Alignment for Calibrated Cross-modal Retrieval ACL 2021
Dynamic Knowledge Distillation for Pre-trained Language Models EMNLP 2021
Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification EMNLP 2021
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade EMNLP 2021
Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency ACL 2019