Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
NIPS 2024
LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
NIPS 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
NIPS 2024
IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation
NIPS 2024
ArtQuest: Countering Hidden Language Biases in ArtVQA
WACV 2024
Unified Generative and Discriminative Training for Multi-modal Large Language Models
NIPS 2024
Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection
NIPS 2024
CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment
NIPS 2024
MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning
AAAI 2024
Visual Fourier Prompt Tuning
NIPS 2024
Extended Multimodal Hate Speech Event Detection During Russia-Ukraine Crisis - Shared Task at CASE 2024
EACL 2024
Continual Vision-Language Retrieval via Dynamic Knowledge Rectification
AAAI 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
NIPS 2024
Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels
NIPS 2024
Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques
NIPS 2024
Rethinking Reverse Distillation for Multi-Modal Anomaly Detection
AAAI 2024
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
NIPS 2024
Sketch-Based Video Object Localization
WACV 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
CVPR 2024
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
CVPR 2024
Unsegment Anything by Simulating Deformation
CVPR 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
CVPR 2024
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
CVPR 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
CVPR 2024
AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval
CVPR 2024