Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
CVPR 2025
Towards Text-Image Interleaved Retrieval
ACL 2025
VQAGuider: Guiding Multimodal Large Language Models to Answer Complex Video Questions
ACL 2025
TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation
CVPR 2025
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos
CVPR 2025
Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding
ACL 2025
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
ACL 2025
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search
ACL 2025
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
ACL 2025
EgoLM: Multi-Modal Language Model of Egocentric Motions
CVPR 2025
Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning
ACL 2025
Attacking Vision-Language Computer Agents via Pop-ups
ACL 2025
Can MLLMs Understand the Deep Implication Behind Chinese Images?
ACL 2025
Agri-CM3: A Chinese Massive Multi-modal, Multi-level Benchmark for Agricultural Understanding and Reasoning
ACL 2025
Aligning VLM Assistants with Personalized Situated Cognition
ACL 2025
MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies
ACL 2025
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
ACL 2025
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
CVPR 2025
Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback
ACL 2025
Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models through Bidirectional Hidden State Intervention
ACL 2025
A Unified Agentic Framework for Evaluating Conditional Image Generation
ACL 2025
Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging
ACL 2025
Cultivating Gaming Sense for Yourself: Making VLMs Gaming Experts
ACL 2025
CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention
ACL 2025
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents
EMNLP 2025