Multimodal Learning

13057 directly classified papers

Papers

From Prompt to Production: Automating Brand-Safe Marketing Imagery with Text-to-Image Models WACV 2026

SOAF: Scene Occlusion-aware Neural Acoustic Field WACV 2026

Test-Time Consistency in Vision Language Models WACV 2026

Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes? WACV 2026

RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels WACV 2026

MemeTAG: Keyword-Driven Meme Classification through Tag Embedding Reconstruction WACV 2026

SimForce: Force and Surface Electromyography from Full Body Video with Graph Neural Nets WACV 2026

SceneProp: Combining Neural Network and Markov Random Field for Scene-Graph Grounding WACV 2026

Learnable Query-Enhanced Pose Transformation WACV 2026

Reconstructing Realistic and Relightable Eyes WACV 2026

UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning WACV 2026

Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships WACV 2026

Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score WACV 2026

PromptGAR: Flexible Promptive Group Activity Recognition WACV 2026

brat: Aligned Multi-View Embeddings for Brain MRI Analysis WACV 2026

Semantic Map Guided Bird's-Eye View Learning for Online HD Map Construction WACV 2026

Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images WACV 2026

FG-TRACER: Tracing Information Flow in Multimodal Large Language Models in Free-Form Generation WACV 2026

Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention WACV 2026

SurgXBench: Explainable Vision-Language Model Benchmark for Surgery WACV 2026

SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection WACV 2026

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs WACV 2026

Root Completion from Intraoral Scans of Tooth Crowns using Diffusion with Patch Perturbation WACV 2026

MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding WACV 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models EACL 2026