Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Extended Multimodal Hate Speech Event Detection During Russia-Ukraine Crisis - Shared Task at CASE 2024
EACL 2024
Continual Vision-Language Retrieval via Dynamic Knowledge Rectification
AAAI 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
NIPS 2024
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
NIPS 2024
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
NIPS 2024
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
NIPS 2024
MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution
NIPS 2024
Rethinking Reverse Distillation for Multi-Modal Anomaly Detection
AAAI 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
NIPS 2024
Benchmarking Out-of-Distribution Detection in Visual Question Answering
WACV 2024
Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models
NIPS 2024
Sketch-Based Video Object Localization
WACV 2024
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
NIPS 2024
Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model
NIPS 2024
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
NIPS 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
CVPR 2024
Interfacing Foundation Models' Embeddings
NIPS 2024
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
CVPR 2024
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
NIPS 2024
Unsegment Anything by Simulating Deformation
CVPR 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
CVPR 2024
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
CVPR 2024
Revisiting the Integration of Convolution and Attention for Vision Backbone
NIPS 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
CVPR 2024
AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval
CVPR 2024