Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
No Head Left Behind – Multi-Head Alignment Distillation for Transformers
AAAI 2024
ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association
CVPR 2024
DiVAS: Video and Audio Synchronization with Dynamic Frame Rates
CVPR 2024
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
AAAI 2024
Vision-and-Language Navigation via Causal Learning
CVPR 2024
Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
AAAI 2024
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
CVPR 2024
Composing Object Relations and Attributes for Image-Text Matching
CVPR 2024
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
CVPR 2024
Weakly Supervised Multimodal Affordance Grounding for Egocentric Images
AAAI 2024
Label Propagation for Zero-shot Classification with Vision-Language Models
CVPR 2024
ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
CVPR 2024
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
AAAI 2024
Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis
CVPR 2024
Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
CVPR 2024
Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
AAAI 2024
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
CVPR 2024
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
AAAI 2024
Towards Better Vision-Inspired Vision-Language Models
CVPR 2024
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
AAAI 2024
Cycle-Consistency Learning for Captioning and Grounding
AAAI 2024
Previously on ... From Recaps to Story Summarization
CVPR 2024
A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
AAAI 2024
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
CVPR 2024
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering
AAAI 2024