Lijuan Wang
82 papers · 2019–2026 · 11 top CS/AI conferences
Achievements
Academic Marathon (7)
Conference Polyglot (11)
Keyword Pioneer
Interdisciplinary Bridge
Cross-Pollinator (14)
Renaissance Researcher (7)
Taxonomy Completionist (97)
Conference Loyalist (29)
Dynamic Duo (43)
Triple Crown
Keyword Champion (2)
Grand Slam
Deep Specialist (24)
Century Club (80)
Prolific Year (19)
Keyword Collector (286)
Unstoppable (8)
Trend Setter
The Questioner (2)
Conferences
CVPR (29)
ICLR (10)
NIPS (9)
AAAI (7)
ICCV (7)
ECCV (6)
ICML (4)
EMNLP (3)
WACV (3)
ACL (2)
IJCAI (2)
Top co-authors
Keywords
multimodal learning (17)
vision-language model (10)
zero-shot learning (10)
object detection (9)
diffusion model (8)
transfer learning (8)
image generation (7)
visual question answering (6)
image captioning (6)
large language model (5)
video understanding (5)
video generation (5)
text-to-image generation (4)
image segmentation (4)
in-context learning (4)
few-shot learning (4)
representation learning (4)
multi-modal learning (4)
video captioning (3)
image classification (3)
Papers
Shanks: Simultaneous Hearing and Thinking for Spoken Language Models
ACL 2026
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
WACV 2026
Towards Zero-Shot Diabetic Retinopathy Grading: Learning Generalized Knowledge via Prompt-Driven Matching and Emulating
AAAI 2026
Conditional Text-to-Image Generation with Reference Guidance
WACV 2026
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
ICLR 2025
Audio-Aware Large Language Models as Judges for Speaking Styles
EMNLP 2025
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
ICML 2025
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
EMNLP 2025
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
CVPR 2025
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
ICLR 2025
GenXD: Generating Any 3D and 4D Scenes
ICLR 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
ICLR 2025
CertainlyUncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness
ICLR 2025
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
CVPR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
GRiT: A Generative Region-to-text Transformer for Object Understanding
ECCV 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
ECCV 2024
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
ECCV 2024
Bring Metric Functions into Diffusion Models
IJCAI 2024
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
ORES: Open-Vocabulary Responsible Visual Synthesis
AAAI 2024
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
ICML 2024
Completing Visual Objects via Bridging Generation and Segmentation
ICML 2024
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
ICLR 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
NIPS 2024
Interfacing Foundation Models' Embeddings
NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
NIPS 2024
MPT: Mesh Pre-Training With Transformers for Human Pose and Mesh Reconstruction
WACV 2024
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
Segment and Caption Anything
CVPR 2024
An Empirical Study of Multimodal Model Merging
EMNLP 2023
Segment Everything Everywhere All at Once
NIPS 2023
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
ACL 2023
Adaptive Human Matting for Dynamic Videos
CVPR 2023
An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
CVPR 2023
ReCo: Region-Controlled Text-to-Image Generation
CVPR 2023
LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
CVPR 2023
Generalized Decoding for Pixel, Image, and Language
CVPR 2023
Neural Voting Field for Camera-Space 3D Hand Pose Estimation
CVPR 2023
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
CVPR 2023
Non-Contrastive Learning Meets Language-Image Pre-Training
CVPR 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023
Prompting GPT-3 To Be Reliable
ICLR 2023
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images
IJCAI 2023
Injecting Semantic Concepts Into End-to-End Image Captioning
CVPR 2022
An Empirical Study of Training End-to-End Vision-and-Language Transformers
CVPR 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
NIPS 2022
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
NIPS 2022
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
AAAI 2022
OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
AAAI 2022
K-LITE: Learning Transferable Visual Models with External Knowledge
NIPS 2022
Playing Lottery Tickets with Vision and Language
AAAI 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NIPS 2022
SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning
CVPR 2022
Cross-Modal Representation Learning for Zero-Shot Action Recognition
CVPR 2022
Grounded Language-Image Pre-Training
CVPR 2022
Scaling Up Vision-Language Pre-Training for Image Captioning
CVPR 2022
A Simple Approach and Benchmark for 21,000-Category Object Detection
ECCV 2022
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption
CVPR 2021
Compressing Visual-Linguistic Model via Knowledge Distillation
ICCV 2021
End-to-End Semi-Supervised Object Detection With Soft Teacher
ICCV 2021
Mesh Graphormer
ICCV 2021
SEED: Self-supervised Distillation For Visual Representation
ICLR 2021
DAP: Detection-Aware Pre-Training With Weak Supervision
CVPR 2021
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training
CVPR 2021
End-to-End Human Pose and Mesh Reconstruction with Transformers
CVPR 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
CVPR 2021
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning
AAAI 2021
Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection
AAAI 2020
Rethinking Classification and Localization for Object Detection
CVPR 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
ECCV 2020
Large Scale Incremental Learning
CVPR 2019