Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
CVPR 2023
CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data
CVPR 2023
VindLU: A Recipe for Effective Video-and-Language Pretraining
CVPR 2023
Delivering Arbitrary-Modal Semantic Segmentation
CVPR 2023
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
CVPR 2023
Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
CVPR 2023
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
CVPR 2023
Visual Prediction Improves Zero-Shot Cross-Modal Machine Translation
EMNLP 2023
D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
EMNLP 2023
A Multi-Modal Multilingual Benchmark for Document Image Classification
EMNLP 2023
MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
EMNLP 2023
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
EMNLP 2023
Scene Graph Enhanced Pseudo-Labeling for Referring Expression Comprehension
EMNLP 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
EMNLP 2023
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
EMNLP 2023
Unifying Text, Tables, and Images for Multimodal Question Answering
EMNLP 2023
Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning
CVPR 2023
Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding
CVPR 2023
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations
CVPR 2023
MemeCap: A Dataset for Captioning and Interpreting Memes
EMNLP 2023
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
EMNLP 2023
ESPVR: Entity Spans Position Visual Regions for Multimodal Named Entity Recognition
EMNLP 2023
Hierarchical Fusion for Online Multimodal Dialog Act Classification
EMNLP 2023
MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering
EMNLP 2023
Visual Elements Mining as Prompts for Instruction Learning for Target-Oriented Multimodal Sentiment Classification
EMNLP 2023