Multimodal Learning

1257 directly classified papers

Papers

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting ICCV 2025

Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance COLING 2025

Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths ICCV 2025

RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions ICCV 2025

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation ICCV 2025

VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models ICCV 2025

CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection ICCV 2025

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance ICCV 2025

AM-Adapter: Appearance Matching Adapter for Exemplar-based Semantic Image Synthesis in-the-Wild ICCV 2025

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING 2025

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations ICCV 2025

Chat-Driven Text Generation and Interaction for Person Retrieval EMNLP 2025

The Source Image is the Best Attention for Infrared and Visible Image Fusion ICCV 2025

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025

DLRG@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages NAACL 2025

Diagram-Driven Course Questions Generation EMNLP 2025

SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking ICCV 2025

MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition EMNLP 2025

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation ACL 2025

Audio-centric Video Understanding Benchmark without Text Shortcut EMNLP 2025

TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration ICCV 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration EMNLP 2025

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia ACL 2025

Multimodal Language Models See Better When They Look Shallower EMNLP 2025

AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders EMNLP 2025