Licheng Yu
37 papers · 2015–2026 · 8 top CS/AI conferences
Achievements
Renaissance Researcher (9)
Interdisciplinary Bridge
Conference Polyglot (8)
Academic Marathon (10)
Taxonomy Completionist (67)
Keyword Pioneer
Hot Topic Early Bird
Deep Specialist (13)
Topic Evolution
Century Club (36)
Trend Setter
The Questioner
Unstoppable (9)
Keyword Collector (161)
Prolific Year (7)
Conferences
CVPR (19)
ECCV (6)
EMNLP (5)
ACL (2)
ICCV (2)
EACL (1)
ICLR (1)
NAACL (1)
Keywords
multimodal learning (7)
diffusion model (6)
image captioning (5)
video generation (4)
video understanding (4)
video question answering (4)
reinforcement learning (3)
video editing (3)
temporal coherence (2)
vision-language pre-training (2)
temporal consistency (2)
cross-modal retrieval (2)
vision language model (2)
visual question answering (2)
image synthesis (2)
attention mechanism (2)
image generation (2)
temporal reasoning (2)
visual grounding (2)
vision-language model (2)
Papers
AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following
ACL 2026
ROICtrl: Boosting Instance Control for Visual Generation
CVPR 2025
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
CVPR 2025
Apollo: An Exploration of Video Understanding in Large Multimodal Models
CVPR 2025
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
CVPR 2025
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
CVPR 2024
Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
CVPR 2024
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
CVPR 2024
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
CVPR 2024
Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression
ECCV 2024
Ameli: Enhancing Multimodal Entity Linking with Fine-Grained Attributes
EACL 2024
AVID: Any-Length Video Inpainting with Diffusion Model
CVPR 2024
Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations
CVPR 2023
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation
CVPR 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
CVPR 2023
RoPAWS: Robust Semi-supervised Representation Learning from Uncurated Data
ICLR 2023
CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023
Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment
CVPR 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
EMNLP 2022
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
ECCV 2022
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
ECCV 2022
Connecting What To Say With Where To Look by Modeling Human Attention Traces
CVPR 2021
BachGAN: High-Resolution Image Synthesis From Salient Object Layout
CVPR 2020
TVQA+: Spatio-Temporal Grounding for Video Question Answering
ACL 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
ECCV 2020
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
ECCV 2020
UNITER: UNiversal Image-TExt Representation Learning
ECCV 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
EMNLP 2020
What is More Likely to Happen Next? Video-and-Language Future Event Prediction
EMNLP 2020
Violin: A Large-Scale Dataset for Video-and-Language Inference
CVPR 2020
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
NAACL 2019
Multi-Target Embodied Question Answering
CVPR 2019
MAttNet: Modular Attention Network for Referring Expression Comprehension
CVPR 2018
TVQA: Localized, Compositional Video Question Answering
EMNLP 2018
Hierarchically-Attentive RNN for Album Summarization and Storytelling
EMNLP 2017
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
CVPR 2017
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
ICCV 2015