Papers
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Shiyu Xuan, Qingpei Guo, Ming Yang et al.
ModaVerse: Efficiently Transforming Modalities with LLMs
Xinyu Wang, Bohan Zhuang, Qi Wu
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Junyan Lin, Haoran Chen, Yue Fan et al.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Jianzong Wu, Chao Tang, Jingbo Wang et al.
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen et al.
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Joya Chen, Ziyun Zeng, Yiqi Lin et al.
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Junbo Niu, Yifei Li, Ziyang Miao et al.
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Lucas Ventura, Antoine Yang, Cordelia Schmid et al.
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Fengxiang Wang, Hongzhen Wang, Zonghao Guo et al.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
Andong Deng, Zhongpai Gao, Anwesa Choudhuri et al.
Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering
Federico Cocchi, Nicholas Moratelli, Marcella Cornia et al.
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing
Jinlu Zhang, Yixin Chen, Zan Wang et al.
M-LLM Based Video Frame Selection for Efficient Video Understanding
Kai Hu, Feng Gao, Xiaohan Nie et al.
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Yuting Zhang, Hao Lu, Qingyong Hu et al.
Empowering LLMs to Understand and Generate Complex Vector Graphics
Ximing Xing, Juncheng Hu, Guotao Liang et al.
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
Haiyi Qiu, Minghe Gao, Long Qian et al.
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
Chiara Plizzari, Alessio Tonioni, Yongqin Xian et al.
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae, Jinhyung Kim, Sihaeng Lee et al.
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
Ian Huang, Yanan Bao, Karen Truong et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.
SKE-Layout: Spatial Knowledge Enhanced Layout Generation with LLMs
Junsheng Wang, Nieqing Cao, Yan Ding et al.
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Andrew Z. Wang, Songwei Ge, Tero Karras et al.