Papers
AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
Yi-Ting Shen, Sungmin Eum, Doheon Lee et al.
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Yikang Zhou, Tao Zhang, Shilin Xu et al.
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Jeong Hun Yeo, Minsu Kim, Chae Won Kim et al.
Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Boyang Deng, Songyou Peng, Kyle Genova et al.
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
Jiaer Xia, Bingkui Tong, Yuhang Zang et al.
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
Zuhao Yang, Yingchen Yu, Yunqing Zhao et al.
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Shraman Pramanick, Effrosyni Mavroudi, Yale Song et al.
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger, Nina Wenzel, David Griffiths et al.
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
Yihan Cao, Jiazhao Zhang, Zhinan Yu et al.
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Kaiyu Yue, Vasu Singla, Menglin Jia et al.
ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation
Cihang Peng, Qiming Hou, Zhong Ren et al.
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury, Hanan Gani, Nishit Anand et al.
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al.
Token Activation Map to Visually Explain Multimodal LLMs
Yi Li, Hualiang Wang, Xinpeng Ding et al.
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Wen Jiang, Boshu Lei, Katrina Ashton et al.
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal, Reza Shirkavand, Heng Huang et al.
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.
Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
Mahmoud Ahmed, Junjie Fei, Jian Ding et al.
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu, Siyuan Meng, Yanting Gao et al.
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
Duo Wu, Jinghe Wang, Yuan Meng et al.
SuRe: Summarizing Retrievals using Answer Candidates for Open-domain QA of LLMs
Jaehyung Kim, Jaehyun Nam, Sangwoo Mo et al.