Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment
CVPR 2024
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
CVPR 2024
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
CVPR 2024
Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction
CVPR 2024
MAFA: Managing False Negatives for Vision-Language Pre-training
CVPR 2024
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
CVPR 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
CVPR 2024
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
CVPR 2024
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
CVPR 2024
C3Net: Compound Conditioned ControlNet for Multimodal Content Generation
CVPR 2024
MuseChat: A Conversational Music Recommendation System for Videos
CVPR 2024
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
CVPR 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
CVPR 2024
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
CVPR 2024
Open Vocabulary Semantic Scene Sketch Understanding
CVPR 2024
MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
CVPR 2024
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
EMNLP 2024
Interactive Visual Task Learning for Robots
AAAI 2024
G^2SAM: Graph-Based Global Semantic Awareness Method for Multimodal Sarcasm Detection
AAAI 2024
MICap: A Unified Model for Identity-Aware Movie Descriptions
CVPR 2024
Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
CVPR 2024
Repairs in a Block World: A New Benchmark for Handling User Corrections with Multi-Modal Language Models
EMNLP 2024
Holistic Features are almost Sufficient for Text-to-Video Retrieval
CVPR 2024
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
CVPR 2024
Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification
EMNLP 2024