Linjie Li
56 papers · 2019–2026 · 12 top CS/AI conferences
Achievements
Conference Polyglot (12)
Academic Marathon (7)
Keyword Pioneer
Interdisciplinary Bridge
Cross-Pollinator (14)
Renaissance Researcher (5)
Taxonomy Completionist (73)
Grand Slam
Dynamic Duo (40)
Keyword Champion (3)
Triple Crown
Deep Specialist (18)
Century Club (54)
Prolific Year (13)
Keyword Collector (202)
Unstoppable (8)
Trend Setter
The Questioner (2)
Conferences
CVPR (15)
ICLR (8)
NIPS (7)
ICCV (5)
EMNLP (4)
ICML (4)
ACL (3)
ECCV (3)
AAAI (2)
IJCAI (2)
WACV (2)
NAACL (1)
Keywords
multimodal learning (14)
vision-language model (6)
diffusion model (5)
video understanding (5)
large language model (4)
video question answering (4)
zero-shot learning (4)
multi-modal learning (4)
video generation (4)
transfer learning (4)
text-to-video retrieval (3)
masked language modeling (3)
image generation (3)
video captioning (3)
image segmentation (3)
visual question answering (3)
text-to-image generation (3)
in-context learning (3)
image-text retrieval (3)
representation learning (2)
Papers
TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
AAAI 2026
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
WACV 2026
Shanks: Simultaneous Hearing and Thinking for Spoken Language Models
ACL 2026
Synthetic Visual Genome
CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
ICML 2025
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
ICML 2025
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
ICLR 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025
GenXD: Generating Any 3D and 4D Scenes
ICLR 2025
CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
ICLR 2025
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
ICLR 2025
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
Audio-Aware Large Language Models as Judges for Speaking Styles
EMNLP 2025
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
EMNLP 2025
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
NIPS 2024
Interfacing Foundation Models' Embeddings
NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
NIPS 2024
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
ECCV 2024
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
ECCV 2024
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
ICLR 2024
The Generative AI Paradox: "What It Can Create, It May Not Understand"
ICLR 2024
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
Bring Metric Functions into Diffusion Models
IJCAI 2024
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
ACL 2023
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images
IJCAI 2023
Generalized Decoding for Pixel, Image, and Language
CVPR 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023
LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
CVPR 2023
ReCo: Region-Controlled Text-to-Image Generation
CVPR 2023
An Empirical Study of Multimodal Model Merging
EMNLP 2023
Segment Everything Everywhere All at Once
NIPS 2023
An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
CVPR 2023
Adaptive Human Matting for Dynamic Videos
CVPR 2023
Playing Lottery Tickets with Vision and Language
AAAI 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NIPS 2022
SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning
CVPR 2022
Cross-Modal Representation Learning for Zero-Shot Action Recognition
CVPR 2022
Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
CVPR 2021
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models
ICCV 2021
Meta Module Network for Compositional Visual Reasoning
WACV 2021
UC2: Universal Cross-Lingual Cross-Modal Vision-and-Language Pre-Training
CVPR 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
NAACL 2021
UNITER: UNiversal Image-TExt Representation Learning
ECCV 2020
Graph Optimal Transport for Cross-Domain Alignment
ICML 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
NIPS 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
EMNLP 2020
Relation-Aware Graph Attention Network for Visual Question Answering
ICCV 2019
Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog
ACL 2019