Research Explorer
Multi-Modal Learning
3194 directly classified papers
Papers per year
2003: 1
2010: 1
2011: 1
2013: 5
2014: 3
2015: 9
2016: 23
2017: 49
2018: 78
2019: 158
2020: 223
2021: 261
2022: 354
2023: 471
2024: 705
2025: 835
2026: 17
Papers
OVG-HQ: Online Video Grounding with Hybrid-modal Queries
ICCV 2025
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
ACL 2025
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
ICCV 2025
VC4VG: Optimizing Video Captions for Text-to-Video Generation
EMNLP 2025
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
ICCV 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
EMNLP 2025
Everything is a Video: Unifying Modalities through Next-Frame Prediction
ICCV 2025
SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
EMNLP 2025
Counting Stacked Objects
ICCV 2025
CoMMIT: Coordinated Multimodal Instruction Tuning
EMNLP 2025
YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
ICCV 2025
AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction
EMNLP 2025
OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM
ICCV 2025
MathBuddy: A Multimodal System for Affective Math Tutoring
EMNLP 2025
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
ICCV 2025
SUTrack: Towards Simple and Unified Single Object Tracking
AAAI 2025
Benchmarking Multimodal Large Language Models Against Image Corruptions
ICCV 2025
PresentAgent: Multimodal Agent for Presentation Video Generation
EMNLP 2025
Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion
ICCV 2025
TokenMatcher: Diverse Tokens Matching for Unsupervised Visible-Infrared Person Re-Identification
AAAI 2025
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
ICCV 2025
ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
EMNLP 2025
PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
ICCV 2025
AnyTalk: Multi-modal Driven Multi-domain Talking Head Generation
AAAI 2025
GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking
ACL 2025