Licheng Yu

37 papers · 2015–2026 · 8 top CS/AI conferences

Achievements


🌈 Renaissance Researcher (9) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (8) 🏃 Academic Marathon (10) 🗺️ Taxonomy Completionist (67)

🧭 Keyword Pioneer 🐣 Hot Topic Early Bird 🔬 Deep Specialist (13) 🧬 Topic Evolution 💎 Century Club (36) 📈 Trend Setter ❓ The Questioner 🔥 Unstoppable (9) 🗃️ Keyword Collector (161) ⚡ Prolific Year (7)

Conferences

CVPR (19) ECCV (6) EMNLP (5) ACL (2) ICCV (2) EACL (1) ICLR (1) NAACL (1)

Top co-authors

Mohit Bansal (9) Ning Zhang (7) Tamara L. Berg (6) Yu Cheng (5) Zhe Gan (5) Jingjing Liu (5) Tamara Berg (4) Bichen Wu (4) Jie Lei (4) Peter Vajda (3)

Keywords

multimodal learning (7) diffusion model (6) image captioning (5) video generation (4) video understanding (4) video question answering (4) reinforcement learning (3) video editing (3) temporal coherence (2) vision-language pre-training (2) temporal consistency (2) cross-modal retrieval (2) vision language model (2) visual question answering (2) image synthesis (2) attention mechanism (2) image generation (2) temporal reasoning (2) visual grounding (2) vision-language model (2)

Papers

AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following · ACL 2026
ROICtrl: Boosting Instance Control for Visual Generation · CVPR 2025
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction · CVPR 2025
Apollo: An Exploration of Video Understanding in Large Multimodal Models · CVPR 2025
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs · CVPR 2025
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis · CVPR 2024
Layout-Agnostic Scene Text Image Synthesis with Diffusion Models · CVPR 2024
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence · CVPR 2024
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis · CVPR 2024
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression · ECCV 2024
Ameli: Enhancing Multimodal Entity Linking with Fine-Grained Attributes · EACL 2024
AVID: Any-Length Video Inpainting with Diffusion Model · CVPR 2024
Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations · CVPR 2023
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation · CVPR 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks · CVPR 2023
RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data · ICLR 2023
CiT: Curation in Training for Effective Vision-Language Data · ICCV 2023
Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment · CVPR 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning · EMNLP 2022
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval · ECCV 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning · ECCV 2022
Connecting What To Say With Where To Look by Modeling Human Attention Traces · CVPR 2021
BachGAN: High-Resolution Image Synthesis From Salient Object Layout · CVPR 2020
TVQA+: Spatio-Temporal Grounding for Video Question Answering · ACL 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models · ECCV 2020
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval · ECCV 2020
UNITER: UNiversal Image-TExt Representation Learning · ECCV 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training · EMNLP 2020
What is More Likely to Happen Next? Video-and-Language Future Event Prediction · EMNLP 2020
Violin: A Large-Scale Dataset for Video-and-Language Inference · CVPR 2020
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout · NAACL 2019
Multi-Target Embodied Question Answering · CVPR 2019
MAttNet: Modular Attention Network for Referring Expression Comprehension · CVPR 2018
TVQA: Localized, Compositional Video Question Answering · EMNLP 2018
Hierarchically-Attentive RNN for Album Summarization and Storytelling · EMNLP 2017
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions · CVPR 2017
Visual Madlibs: Fill in the Blank Description Generation and Question Answering · ICCV 2015