Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
WebQA: Multihop and Multimodal QA
CVPR 2022
Interact Before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition
CVPR 2022
X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning
CVPR 2022
Balanced Multimodal Learning via On-the-Fly Gradient Modulation
CVPR 2022
VALHALLA: Visual Hallucination for Machine Translation
CVPR 2022
FLAVA: A Foundational Language and Vision Alignment Model
CVPR 2022
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation
CVPR 2022
Learning With Twin Noisy Labels for Visible-Infrared Person Re-Identification
CVPR 2022
BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling
NIPS 2022
Text to Image Generation With Semantic-Spatial Aware GAN
CVPR 2022
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation
CVPR 2022
Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation
CVPR 2022
UTC: A Unified Transformer With Inter-Task Contrastive Learning for Visual Dialog
CVPR 2022
Cross-Modal Map Learning for Vision and Language Navigation
CVPR 2022
CAESAR: An Embodied Simulator for Generating Multimodal Referring Expression Datasets
NIPS 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
NIPS 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
CVPR 2022
Connecting the Complementary-View Videos: Joint Camera Identification and Subject Association
CVPR 2022
V2C: Visual Voice Cloning
CVPR 2022
One Step at a Time: Long-Horizon Vision-and-Language Navigation With Milestones
CVPR 2022
Less Is More: Generating Grounded Navigation Instructions From Landmarks
CVPR 2022
Make It Move: Controllable Image-to-Video Generation With Text Descriptions
CVPR 2022
Two-Stream Network for Sign Language Recognition and Translation
NIPS 2022
HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes
NIPS 2022
ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment
NIPS 2022