Research Explorer
Multi-Modal Learning
3194 directly classified papers
Papers per year
2003: 1
2010: 1
2011: 1
2013: 5
2014: 3
2015: 9
2016: 23
2017: 49
2018: 78
2019: 158
2020: 223
2021: 261
2022: 354
2023: 471
2024: 705
2025: 835
2026: 17
Papers
AIMA at SemEval-2025 Task 1: Bridging Text and Image for Idiomatic Knowledge Extraction via Mixture of Experts
ACL 2025
RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering
ACL 2025
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains
ACL 2025
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate
CVPR 2025
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding
CVPR 2025
ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network
CVPR 2025
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
CVPR 2025
Combining Inherent Knowledge of Vision-Language Models with Unsupervised Domain Adaptation through Strong-Weak Guidance
WACV 2025
DrVideo: Document Retrieval Based Long Video Understanding
CVPR 2025
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models
CVPR 2025
SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
CVPR 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
CVPR 2025
StyleMaster: Stylize Your Video with Artistic Generation and Translation
CVPR 2025
URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration
CVPR 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025
Click&Describe: Multimodal Grounding and Tracking for Aerial Objects
WACV 2025
Animate and Sound an Image
CVPR 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
CVPR 2025
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
CVPR 2025
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
CVPR 2025
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
CVPR 2025
Continual SFT Matches Multimodal RLHF with Negative Supervision
CVPR 2025
Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition
CVPR 2025
PrevPredMap: Exploring Temporal Modeling with Previous Predictions for Online Vectorized HD Map Construction
WACV 2025
LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval
EMNLP 2025