Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Does Continual Learning Meet Compositionality? New Benchmarks and An Evaluation Framework
NeurIPS 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
ICCV 2023
Cross-view Semantic Alignment for Livestreaming Product Recognition
ICCV 2023
Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples
ICCV 2023
Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion
ICCV 2023
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
ICCV 2023
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning
CVPR 2023
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
CVPR 2023
CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation
CVPR 2023
CP3: Channel Pruning Plug-In for Point-Based Networks
CVPR 2023
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
CVPR 2023
Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture
CVPR 2023
Clover: Towards a Unified Video-Language Alignment and Fusion Model
CVPR 2023
Multi-Modal Representation Learning With Text-Driven Soft Masks
CVPR 2023
CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval
CVPR 2023
Logical Implications for Visual Question Answering Consistency
CVPR 2023
CLIPPO: Image-and-Language Understanding From Pixels Only
CVPR 2023
Improving Cross-Modal Retrieval With Set of Diverse Embeddings
CVPR 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
CVPR 2023
Vision Transformers Are Parameter-Efficient Audio-Visual Learners
CVPR 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
CVPR 2023
Egocentric Auditory Attention Localization in Conversations
CVPR 2023
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
CVPR 2023
Renderable Neural Radiance Map for Visual Navigation
CVPR 2023
pCON: Polarimetric Coordinate Networks for Neural Scene Representations
CVPR 2023