Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
CVPR 2024
Koala: Key Frame-Conditioned Long Video-LLM
CVPR 2024
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
CVPR 2024
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
CVPR 2024
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
CVPR 2024
Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
CVPR 2024
Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
CVPR 2024
Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion
CVPR 2024
Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
CVPR 2024
How to Configure Good In-Context Sequence for Visual Question Answering
CVPR 2024
Tune-An-Ellipse: CLIP Has Potential to Find What You Want
CVPR 2024
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
CVPR 2024
Tag-grounded Visual Instruction Tuning with Retrieval Augmentation
EMNLP 2024
Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison
EMNLP 2024
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
EMNLP 2024
MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension
EMNLP 2024
Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation
EMNLP 2024
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
EMNLP 2024
Efficient Vision-Language pre-training via domain-specific learning for human activities
EMNLP 2024
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
EMNLP 2024
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
EMNLP 2024
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
EMNLP 2024
Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
EMNLP 2024
Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding
EMNLP 2024
VHASR: A Multimodal Speech Recognition System With Vision Hotwords
EMNLP 2024