Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
DRAG: Dynamic Region-Aware GCN for Privacy-Leaking Image Detection
AAAI 2022
Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
AAAI 2022
Understanding Attention for Vision-and-Language Tasks
COLING 2022
MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-Based Visual Question Answering
CVPR 2022
Learning Program Representations for Food Images and Cooking Recipes
CVPR 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
EMNLP 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
NeurIPS 2022
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
NeurIPS 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
NeurIPS 2022
I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
NeurIPS 2022
TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training
NeurIPS 2022
CAESAR: An Embodied Simulator for Generating Multimodal Referring Expression Datasets
NeurIPS 2022
SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections
NeurIPS 2022
MM-GATBT: Enriching Multimodal Representation Using Graph Attention Network
NAACL 2022
Analysing the Correlation between Lexical Ambiguity and Translation Quality in a Multimodal Setting using WordNet
NAACL 2022
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding
NAACL 2022
Open-Domain, Content-Based, Multi-Modal Fact-Checking of Out-of-Context Images via Online Resources
CVPR 2022
3MASSIV: Multilingual, Multimodal and Multi-Aspect Dataset of Social Media Short Videos
CVPR 2022
A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos
CVPR 2022
MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound
CVPR 2022
Region-Aware Face Swapping
CVPR 2022
NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
CVPR 2022
An Empirical Study of Training End-to-End Vision-and-Language Transformers
CVPR 2022
OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion
CVPR 2022
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
ICML 2022