Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
NIPS 2024
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
NIPS 2024
Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
NIPS 2024
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
NIPS 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
NIPS 2024
What matters when building vision-language models?
NIPS 2024
Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models
NIPS 2024
Fusion-Vital: Video-RF Fusion Transformer for Advanced Remote Physiological Measurement
AAAI 2024
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
AAAI 2024
Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
AAAI 2024
No Head Left Behind – Multi-Head Alignment Distillation for Transformers
AAAI 2024
Towards Robust Visual Understanding: from Recognition to Reasoning
AAAI 2024
Naming, Describing, and Quantifying Visual Objects in Humans and LLMs
ACL 2024
Don’t Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models
ACL 2024
Boosting Textural NER with Synthetic Image and Instructive Alignment
ACL 2024
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
ACL 2024
Into the Unknown: Generating Geospatial Descriptions for New Environments
ACL 2024
MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing
ACL 2024
Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization
ACL 2024
Learning Human Action Representations from Temporal Context in Lifestyle Vlogs
ACL 2024
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
CVPR 2024
Interactive3D: Create What You Want by Interactive 3D Generation
CVPR 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2024
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CVPR 2024
DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
CVPR 2024