Multi-Modal Learning

1457 directly classified papers

Papers

SYNAPSE: SYmbolic Neural-Aided Preference Synthesis Engine AAAI 2025

PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection AAAI 2025

OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use ACL 2025

Aligning VLM Assistants with Personalized Situated Cognition ACL 2025

Differentiated Vision: Unveiling Entity-Specific Visual Modality Requirements for Multimodal Knowledge Graph EMNLP 2025

Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning ACL 2025

Attacking Vision-Language Computer Agents via Pop-ups ACL 2025

AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs ACL 2025

MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies ACL 2025

Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis ACL 2025

Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering ACL 2025

Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models through Bidirectional Hidden State Intervention ACL 2025

FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation ACL 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference ACL 2025

MM-R3: On (In-)Consistency of Vision-Language Models (VLMs) ACL 2025

Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback ACL 2025

Cultivating Gaming Sense for Yourself: Making VLMs Gaming Experts ACL 2025

CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention ACL 2025

Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging ACL 2025

MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification ACL 2025

RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought ACL 2025

Incongruity-aware Tension Field Network for Multi-modal Sarcasm Detection ACL 2025

Can MLLMs Understand the Deep Implication Behind Chinese Images? ACL 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models CVPR 2025