Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
WonderJourney: Going from Anywhere to Everywhere
CVPR 2024
Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification
AAAI 2024
What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
CVPR 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024
No Head Left Behind – Multi-Head Alignment Distillation for Transformers
AAAI 2024
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
CVPR 2024
LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding
CVPR 2024
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
AAAI 2024
Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
CVPR 2024
Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
AAAI 2024
UniHuman: A Unified Model For Editing Human Images in the Wild
CVPR 2024
DiaLoc: An Iterative Approach to Embodied Dialog Localization
CVPR 2024
Improved Visual Grounding through Self-Consistent Explanations
CVPR 2024
Weakly Supervised Multimodal Affordance Grounding for Egocentric Images
AAAI 2024
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
CVPR 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
CVPR 2024
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
AAAI 2024
Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling
CVPR 2024
ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
CVPR 2024
Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
AAAI 2024
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
CVPR 2024
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
CVPR 2024
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
AAAI 2024
iKUN: Speak to Trackers without Retraining
CVPR 2024
Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning
CVPR 2024