Multi-Modal Learning

1457 directly classified papers

Papers

Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM CVPR 2025

Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation CVPR 2025

Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details? ACL 2025

Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models ACL 2025

FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning ACL 2025

MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation ACL 2025

Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation ACL 2025

Exploring Multimodal Relation Extraction of Hierarchical Tabular Data with Multi-task Learning ACL 2025

CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback ACL 2025

Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions ACL 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images ACL 2025

CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model ACL 2025

Predicting Implicit Arguments in Procedural Video Instructions ACL 2025

SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models ACL 2025

HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims ACL 2025

Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings ACL 2025

Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions ACL 2025

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory ACL 2025

S3E: Self-Supervised State Estimation for Radar-Inertial System ICCV 2025

Grounded, or a Good Guesser? A Per-Question Balanced Dataset to Separate Blind from Grounded Models for Embodied Question Answering ACL 2025

WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models ACL 2025

DRUM: Learning Demonstration Retriever for Large MUlti-modal Models ACL 2025

Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension ACL 2025

Towards Geo-Culturally Grounded LLM Generations ACL 2025

PEACE: Empowering Geologic Map Holistic Understanding with MLLMs CVPR 2025