Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
ViLBench: A Suite for Vision-Language Process Reward Modeling
EMNLP 2025
MemeQA: Holistic Evaluation for Meme Understanding
ACL 2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
EMNLP 2025
Cross-Modal 3D Representation with Multi-View Images and Point Clouds
CVPR 2025
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
EMNLP 2025
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
ACL 2025
X-FLoRA: Cross-modal Federated Learning with Modality-expert LoRA for Medical VQA
EMNLP 2025
MTGA: Multi-View Temporal Granularity Aligned Aggregation for Event-Based Lip-Reading
AAAI 2025
Does Acceleration Cause Hidden Instability in Vision Language Models? Uncovering Instance-Level Divergence Through a Large-Scale Empirical Study
EMNLP 2025
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ACL 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
EMNLP 2025
Visual Question Answering for Peruvian Cuisine in Regional Spanish
AAAI 2025
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
EMNLP 2025
V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me
ACL 2025
Probing Logical Reasoning of MLLMs in Scientific Diagrams
EMNLP 2025
A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models
ACL 2025
Can Vision-Language Models Solve Visual Math Equations?
EMNLP 2025
Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times
ACL 2025
LATTE: Learning to Think with Vision Specialists
EMNLP 2025
VCD: A Dataset for Visual Commonsense Discovery in Images
ACL 2025
Dual-Path Dynamic Fusion with Learnable Query for Multimodal Sentiment Analysis
EMNLP 2025
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
ACL 2025
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
EMNLP 2025
SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale
ACL 2025
GOAL: Global-local Object Alignment Learning
CVPR 2025