Multi-Modal Learning

1457 directly classified papers

Papers

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text CVPR 2025

Towards Text-Image Interleaved Retrieval ACL 2025

VQAGuider: Guiding Multimodal Large Language Models to Answer Complex Video Questions ACL 2025

TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation CVPR 2025

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos CVPR 2025

Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding ACL 2025

OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use ACL 2025

Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search ACL 2025

Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning ACL 2025

EgoLM: Multi-Modal Language Model of Egocentric Motions CVPR 2025

Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning ACL 2025

Attacking Vision-Language Computer Agents via Pop-ups ACL 2025

Can MLLMs Understand the Deep Implication Behind Chinese Images? ACL 2025

Agri-CM3: A Chinese Massive Multi-modal, Multi-level Benchmark for Agricultural Understanding and Reasoning ACL 2025

Aligning VLM Assistants with Personalized Situated Cognition ACL 2025

MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies ACL 2025

AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs ACL 2025

Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation CVPR 2025

Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback ACL 2025

Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models through Bidirectional Hidden State Intervention ACL 2025

A Unified Agentic Framework for Evaluating Conditional Image Generation ACL 2025

Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging ACL 2025

Cultivating Gaming Sense for Yourself: Making VLMs Gaming Experts ACL 2025

CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention ACL 2025

Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents EMNLP 2025