Multi-Modal Learning

3194 directly classified papers

Papers

Stable Diffusion Models are Secretly Good at Visual In-Context Learning ICCV 2025

CUET-NLP_MP@DravidianLangTech 2025: A Transformer-Based Approach for Bridging Text and Vision in Misogyny Meme Detection in Dravidian Languages NAACL 2025

Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification ICCV 2025

Factors Affecting Translation Quality in In-context Learning for Multilingual Medical Domain EMNLP 2025

Spatial Alignment and Temporal Matching Adapter for Video-Radar Remote Physiological Measurement ICCV 2025

Exploring Multimodal Language Models for Sustainability Disclosure Extraction: A Comparative Study NAACL 2025

Probabilistic Prototype Calibration of Vision-language Models for Generalized Few-shot Semantic Segmentation ICCV 2025

S²MILE: Semantic-and-Structure-Aware Music-Driven Lyric Generation AAAI 2025

Triad: Empowering LMM-based Anomaly Detection with Expert-guided Region-of-Interest Tokenizer and Manufacturing Process ICCV 2025

Caption Generation in Cultural Heritage: Crowdsourced Data and Tuning Multimodal Large Language Models NAACL 2025

Steering Guidance for Personalized Text-to-Image Diffusion Models ICCV 2025

Attention Bootstrapping for Multi-Modal Test-Time Adaptation AAAI 2025

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction ICCV 2025

Cross-Modal Learning for Music-to-Music-Video Description Generation NAACL 2025

Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions ICCV 2025

Deep Submodular Optimization and LLM for Multimodal Content Extraction and Automatic Poster Generation from Long Document AAAI 2025

Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge ICCV 2025

VLind-Bench: Measuring Language Priors in Large Vision-Language Models NAACL 2025

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection ICCV 2025

Is Your Image a Good Storyteller? AAAI 2025

Scaling Language-Free Visual Representation Learning ICCV 2025

Survival Prediction in Lung Cancer through Multi-Modal Representation Learning WACV 2025

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation ICCV 2025

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning EMNLP 2025

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization CVPR 2025