Multi-Modal Learning

1457 directly classified papers

Papers

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models CVPR 2025

LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering ACL 2025

Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models ACL 2025

Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model ACL 2025

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models ICCV 2025

MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization ACL 2025

Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts ACL 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method CVPR 2025

Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists? ACL 2025

PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings ACL 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference ACL 2025

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues ACL 2025

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos ACL 2025

AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations ACL 2025

Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation ACL 2025

FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning ACL 2025

MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation ACL 2025

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence ACL 2025

Exploring Multimodal Relation Extraction of Hierarchical Tabular Data with Multi-task Learning ACL 2025

Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details? ACL 2025

An Appraisal Theoretic Approach to Modelling Affect Flow in Conversation Corpora ACL 2025

Applying the Character-Role Narrative Framework with LLMs to Investigate Environmental Narratives in Scientific Editorials and Tweets ACL 2025

Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions EMNLP 2025