Multimodal Learning

13057 directly classified papers

Papers

Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective WACV 2026

Referring Change Detection in Remote Sensing Imagery WACV 2026

VLMs Guided Interpretable Decision Making in Autonomous Driving WACV 2026

Large Sign Language Models: Toward 3D American Sign Language Translation WACV 2026

KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding WACV 2026

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning WACV 2026

ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval WACV 2026

SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination WACV 2026

MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps WACV 2026

VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models WACV 2026

Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study WACV 2026

Ego-EXTRA: video-language Egocentric Dataset for EXpert-TRAinee assistance WACV 2026

VRAgent: Self-Refining Agent for Zero-Shot Multimodal Video Retrieval WACV 2026

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models WACV 2026

BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts WACV 2026

One-shot Portrait Stylization via Geometric Alignment WACV 2026

AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization WACV 2026

Patch Your Matcher: Correspondence-Aware Image-to-Image Translation Unlocks Cross-Modal Matching via Single-Modality Priors WACV 2026

Broadcast2Pitch: Game State Reconstruction from Unconstrained Soccer Videos WACV 2026

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos WACV 2026

Countering Multi-modal Representation Collapse through Rank-targeted Fusion WACV 2026

MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping WACV 2026

SuperRivolution: Fine-Scale Rivers from Coarse Temporal Satellite Imagery WACV 2026

VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction WACV 2026

Visual–Linguistic Abductive Reasoning with LLMs for Knowledge-based Visual Question Answering EACL 2026