Multimodal Learning

13057 directly classified papers

Papers

The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs WACV 2026

Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models WACV 2026

Analysis of Text Accuracy and Visual Alignment in Vision-Language Models for Artistic Text Generation WACV 2026

ZonUI-3B: Competitive GUI Grounding with a 3B VLM Trained on a Single Consumer GPU WACV 2026

You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV 2026

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval WACV 2026

Being Positive about Negative Queries: Exclusion Aware Multimodal Retrieval using Disentangled Representations WACV 2026

SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis WACV 2026

Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space WACV 2026

Guiding What Not to Generate: Automated Negative Prompting for Text-Image Alignment WACV 2026

Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs EACL 2026

BigTokDetect: A Clinically-Informed Vision–Language Modeling Framework for Detecting Pro-Bigorexia Videos on TikTok EACL 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection EACL 2026

How effective are VLMs in assisting humans in inferring the quality of mental models from Multimodal short answers? EACL 2026

SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space EACL 2026

Expanding the Boundaries of Vision Prior Knowledge in Multi-modal Large Language Models EACL 2026

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA EACL 2026

A Unified View on Emotion Representation in Large Language Models EACL 2026

Is Information Density Uniform when Utterances are Grounded on Perception and Discourse? EACL 2026

Rethinking Reading Order: Toward Generalizable Document Understanding with LLM-based Relation Modeling EACL 2026

Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models EACL 2026

CHROMIC: Chronological Reasoning Across Multi-Panel Comics EACL 2026

DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning EACL 2026

RotBench: Evaluating Multi-modal Large Language Models on Identifying Image Rotation EACL 2026

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use AAAI 2026