Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
CVPR 2024
Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow
CVPR 2024
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
CVPR 2024
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
CVPR 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
CVPR 2024
Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras
CVPR 2024
AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis
CVPR 2024
VideoCon: Robust Video-Language Alignment via Contrast Captions
CVPR 2024
Few-Shot Adversarial Prompt Learning on Vision-Language Models
NIPS 2024
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
NIPS 2024
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
NIPS 2024
$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation
NIPS 2024
Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection
NIPS 2024
Needle In A Multimodal Haystack
NIPS 2024
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare
NIPS 2024
Bridge the Modality and Capability Gaps in Vision-Language Model Selection
NIPS 2024
Multi-Object Hallucination in Vision Language Models
NIPS 2024
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
NIPS 2024
Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis
NIPS 2024
WATT: Weight Average Test Time Adaptation of CLIP
NIPS 2024
Vision-Language Models are Strong Noisy Label Detectors
NIPS 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
NIPS 2024
XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
NIPS 2024