Research Explorer
Multi-Modal Learning
3194 directly classified papers
Papers per year
2003: 1
2010: 1
2011: 1
2013: 5
2014: 3
2015: 9
2016: 23
2017: 49
2018: 78
2019: 158
2020: 223
2021: 261
2022: 354
2023: 471
2024: 705
2025: 835
2026: 17
Papers
AnimateAnything: Consistent and Controllable Animation for Video Generation
CVPR 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
CVPR 2025
Video Language Model Pretraining with Spatio-temporal Masking
CVPR 2025
Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
CVPR 2025
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
CVPR 2025
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
CVPR 2025
3D Part Segmentation via Geometric Aggregation of 2D Visual Features
WACV 2025
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
CVPR 2025
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
CVPR 2025
DH-Set: Improving Vision-Language Alignment with Diverse and Hybrid Set-Embeddings Learning
CVPR 2025
Non-Natural Image Understanding with Advancing Frequency-based Vision Encoders
CVPR 2025
EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting
CVPR 2025
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
CVPR 2025
AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models
CVPR 2025
Feature Design for Bridging SAM and CLIP toward Referring Image Segmentation
WACV 2025
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
CVPR 2025
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
CVPR 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
CVPR 2025
Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body
CVPR 2025
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting
CVPR 2025
Sonic: Shifting Focus to Global Audio Perception in Portrait Animation
CVPR 2025
Question-Aware Gaussian Experts for Audio-Visual Question Answering
CVPR 2025
Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes
CVPR 2025
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
CVPR 2025
CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models
EMNLP 2025