Papers
2,781 papers found
LLMs are Good Action Recognizers
Haoxuan Qu, Yujun Cai, Jun Liu
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
Hao Fei, Shengqiong Wu, Wei Ji et al.
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Yuechen Zhang, Shengju Qian, Bohao Peng et al.
Synthesize Step-by-Step: Tools Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
Zhuowan Li, Bhavan Jasani, Peng Tang et al.
Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs
Lin Song, Yukang Chen, Shuai Yang et al.
Link-Context Learning for Multimodal LLMs
Yan Tai, Weichen Fan, Zhao Zhang et al.
V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
Junwen He, Yifan Wang, Lijun Wang et al.
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Shiyu Xuan, Qingpei Guo, Ming Yang et al.
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang, Bohan Zhuang, Qi Wu
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin, Haoran Chen, Yue Fan et al.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Jianzong Wu, Chao Tang, Jingbo Wang et al.
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Junbo Niu, Yifei Li, Ziyang Miao et al.
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Lucas Ventura, Antoine Yang, Cordelia Schmid et al.
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Federico Cocchi, Nicholas Moratelli, Marcella Cornia et al.
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing, Juncheng Hu, Guotao Liang et al.
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
Haiyi Qiu, Minghe Gao, Long Qian et al.
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Chiara Plizzari, Alessio Tonioni, Yongqin Xian et al.
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae, Jinhyung Kim, Sihaeng Lee et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs
Junsheng Wang, Nieqing Cao, Yan Ding et al.