Multi-Modal Learning

3194 directly classified papers

Papers

AIMA at SemEval-2025 Task 1: Bridging Text and Image for Idiomatic Knowledge Extraction via Mixture of Experts ACL 2025

RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering ACL 2025

Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains ACL 2025

ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate CVPR 2025

Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding CVPR 2025

ViKIENet: Towards Efficient 3D Object Detection with Virtual Key Instance Enhanced Network CVPR 2025

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization CVPR 2025

Combining Inherent Knowledge of Vision-Language Models with Unsupervised Domain Adaptation through Strong-Weak Guidance WACV 2025

DrVideo: Document Retrieval Based Long Video Understanding CVPR 2025

AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models CVPR 2025

SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks CVPR 2025

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization CVPR 2025

StyleMaster: Stylize Your Video with Artistic Generation and Translation CVPR 2025

URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration CVPR 2025

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models CVPR 2025

Click&Describe: Multimodal Grounding and Tracking for Aerial Objects WACV 2025

Animate and Sound an Image CVPR 2025

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale CVPR 2025

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation CVPR 2025

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models CVPR 2025

Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues CVPR 2025

Continual SFT Matches Multimodal RLHF with Negative Supervision CVPR 2025

Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition CVPR 2025

PrevPredMap: Exploring Temporal Modeling with Previous Predictions for Online Vectorized HD Map Construction WACV 2025

LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval EMNLP 2025