Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream
CVPR 2024
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
CVPR 2024
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
EMNLP 2024
Towards Better Vision-Inspired Vision-Language Models
CVPR 2024
Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion
CVPR 2024
Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval
EMNLP 2024
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
CVPR 2024
DIEM: Decomposition-Integration Enhancing Multimodal Insights
CVPR 2024
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
CVPR 2024
HEAL-SWIN: A Vision Transformer On The Sphere
CVPR 2024
Concept-skill Transferability-based Data Selection for Large Vision-Language Models
EMNLP 2024
From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
EMNLP 2024
Mitigating Open-Vocabulary Caption Hallucinations
EMNLP 2024
Multiple Knowledge-Enhanced Interactive Graph Network for Multimodal Conversational Emotion Recognition
EMNLP 2024
SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
EMNLP 2024
RWKV-CLIP: A Robust Vision-Language Representation Learner
EMNLP 2024
MEANT: Multimodal Encoder for Antecedent Information
EMNLP 2024
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
EMNLP 2024
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
EMNLP 2024
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
EMNLP 2024
MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering
EMNLP 2024
CommVQA: Situating Visual Question Answering in Communicative Contexts
EMNLP 2024
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
ACL 2024
ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
ACL 2024
VisDiaHalBench: A Visual Dialogue Benchmark For Diagnosing Hallucination in Large Vision-Language Models
ACL 2024