Multi-Modal Learning

3194 directly classified papers

Papers

R2-MultiOmnia: Leading Multilingual Multimodal Reasoning via Self-Training ACL 2025

Deduce and Select Evidences with Language Models for Training-Free Video Goal Inference WACV 2025

HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models ACL 2025

Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models ACL 2025

A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis ACL 2025

VC4VG: Optimizing Video Captions for Text-to-Video Generation EMNLP 2025

MIO: A Foundation Model on Multimodal Tokens EMNLP 2025

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025

Audio-centric Video Understanding Benchmark without Text Shortcut EMNLP 2025

R-Bind: Unified Enhancement of Attribute and Relation Binding in Text-to-Image Diffusion Models EMNLP 2025

SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning EMNLP 2025

LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts EMNLP 2025

VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions EMNLP 2025

HVGuard: Utilizing Multimodal Large Language Models for Hateful Video Detection EMNLP 2025

LATTE: Learning to Think with Vision Specialists EMNLP 2025

CoMMIT: Coordinated Multimodal Instruction Tuning EMNLP 2025

AnyMAC: Cascading Flexible Multi-Agent Collaboration via Next-Agent Prediction EMNLP 2025

CEMTM: Contextual Embedding-based Multimodal Topic Modeling EMNLP 2025

Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression EMNLP 2025

ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning EMNLP 2025

PresentAgent: Multimodal Agent for Presentation Video Generation EMNLP 2025

MathBuddy: A Multimodal System for Affective Math Tutoring EMNLP 2025

RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks EMNLP 2025

PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications EMNLP 2025

MoEdit: On Learning Quantity Perception for Multi-object Image Editing CVPR 2025