Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
AAAI 2024
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
CVPR 2024
Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection
AAAI 2024
Diff-BGM: A Diffusion Model for Video Background Music Generation
CVPR 2024
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
CVPR 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
CVPR 2024
By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
EMNLP 2024
VidLA: Video-Language Alignment at Scale
CVPR 2024
Weakly Supervised Multimodal Affordance Grounding for Egocentric Images
AAAI 2024
CCEdit: Creative and Controllable Video Editing via Diffusion Models
CVPR 2024
UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity
CVPR 2024
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
AAAI 2024
Universal Segmentation at Arbitrary Granularity with Language Instruction
CVPR 2024
Language Models as Black-Box Optimizers for Vision-Language Models
CVPR 2024
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
CVPR 2024
Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
AAAI 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
CVPR 2024
Text2Loc: 3D Point Cloud Localization from Natural Language
CVPR 2024
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
AAAI 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
CVPR 2024
Physical Property Understanding from Language-Embedded Feature Fields
CVPR 2024
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
CVPR 2024
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
AAAI 2024
Cycle-Consistency Learning for Captioning and Grounding
AAAI 2024
Generating Human Motion in 3D Scenes from Text Descriptions
CVPR 2024