Papers

2,781 papers found

LLMs are Good Action Recognizers

Haoxuan Qu, Yujun Cai, Jun Liu

2024 CVPR

Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs

Hao Fei, Shengqiong Wu, Wei Ji et al.

2024 CVPR

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed et al.

2024 CVPR

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

2024 CVPR

Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Yuechen Zhang, Shengju Qian, Bohao Peng et al.

2024 CVPR

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

Zhuowan Li, Bhavan Jasani, Peng Tang et al.

2024 CVPR

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

Lin Song, Yukang Chen, Shuai Yang et al.

2024 CVPR

Link-Context Learning for Multimodal LLMs

Yan Tai, Weichen Fan, Zhao Zhang et al.

2024 CVPR

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu, Saining Xie

2024 CVPR

Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception

Junwen He, Yifan Wang, Lijun Wang et al.

2024 CVPR

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Shiyu Xuan, Qingpei Guo, Ming Yang et al.

2024 CVPR

ModaVerse: Efficiently Transforming Modalities with LLMs

Xinyu Wang, Bohan Zhuang, Qi Wu

2024 CVPR

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Junyan Lin, Haoran Chen, Yue Fan et al.

2025 CVPR

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

Jianzong Wu, Chao Tang, Jingbo Wang et al.

2025 CVPR

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Junbo Niu, Yifei Li, Ziyang Miao et al.

2025 CVPR

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Lucas Ventura, Antoine Yang, Cordelia Schmid et al.

2025 CVPR

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.

2025 CVPR

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.

2025 CVPR

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Federico Cocchi, Nicholas Moratelli, Marcella Cornia et al.

2025 CVPR

Empowering LLMs to Understand and Generate Complex Vector Graphics

Ximing Xing, Juncheng Hu, Guotao Liang et al.

2025 CVPR

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Haiyi Qiu, Minghe Gao, Long Qian et al.

2025 CVPR

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos

Chiara Plizzari, Alessio Tonioni, Yongqin Xian et al.

2025 CVPR

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Kyungho Bae, Jinhyung Kim, Sihaeng Lee et al.

2025 CVPR

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.

2025 CVPR

SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs

Junsheng Wang, Nieqing Cao, Yan Ding et al.

2025 CVPR