Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning
CVPR 2024
Revisiting Counterfactual Problems in Referring Expression Comprehension
CVPR 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
CVPR 2024
Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems
CVPR 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
CVPR 2024
Weakly Supervised Multimodal Affordance Grounding for Egocentric Images
AAAI 2024
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
CVPR 2024
VideoCon: Robust Video-Language Alignment via Contrast Captions
CVPR 2024
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
AAAI 2024
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
CVPR 2024
GlitchBench: Can Large Multimodal Models Detect Video Game Glitches?
CVPR 2024
CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs
CVPR 2024
Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network
AAAI 2024
Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts
CVPR 2024
Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions
CVPR 2024
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
AAAI 2024
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
CVPR 2024
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2024
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
CVPR 2024
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
CVPR 2024
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
AAAI 2024
JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
CVPR 2024
QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
CVPR 2024
Cycle-Consistency Learning for Captioning and Grounding
AAAI 2024
Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations
CVPR 2024