Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
ICCV 2025
Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance
COLING 2025
Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
ICCV 2025
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
ICCV 2025
Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
ICCV 2025
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
ICCV 2025
CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
ICCV 2025
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
ICCV 2025
AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild
ICCV 2025
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking
COLING 2025
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
ICCV 2025
Chat-Driven Text Generation and Interaction for Person Retrieval
EMNLP 2025
The Source Image is the Best Attention for Infrared and Visible Image Fusion
ICCV 2025
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
EMNLP 2025
DLRG@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
NAACL 2025
Diagram-Driven Course Questions Generation
EMNLP 2025
SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking
ICCV 2025
MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition
EMNLP 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
ACL 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
EMNLP 2025
TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration
ICCV 2025
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
EMNLP 2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
ACL 2025
Multimodal Language Models See Better When They Look Shallower
EMNLP 2025
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
EMNLP 2025