multimodal learning

4622 papers

Also known as

VLM, VLLM, MM, VLA, MLLMS, MLM, MML, MULLM, LMM, MLLM, MMT

Co-occurring keywords

large language model (12755), vision-language model (2235), visual question answering (1000), video understanding (1647), multi-modal learning (1276), contrastive learning (3979), representation learning (6174), transfer learning (5442), zero-shot learning (3637), vision language model (752)

Papers

Multi-View Empowered Structural Graph Wordification for Language Models AAAI 2025

Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models? EMNLP 2025

Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines AAAI 2025

Learning Sparsity for Effective and Efficient Music Performance Question Answering ACL 2025

LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions EMNLP 2025

Cross-modal Multi-task Learning for Multimedia Event Extraction AAAI 2025

When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models EMNLP 2025

Judge and Improve: Towards a Better Reasoning of Knowledge Graphs with Large Language Models EMNLP 2025

MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition EMNLP 2025

UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER EMNLP 2025

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents EMNLP 2025

M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework EMNLP 2025

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation EMNLP 2025

AI Knows Where You Are: Exposure, Bias, and Inference in Multimodal Geolocation with KoreaGEO EMNLP 2025

Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More EMNLP 2025

Probing Logical Reasoning of MLLMs in Scientific Diagrams EMNLP 2025

Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering EMNLP 2025

Can Vision-Language Models Solve Visual Math Equations? EMNLP 2025

Evaluating LLM-Generated Diagrams as Graphs EMNLP 2025

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning EMNLP 2025

PRIM: Towards Practical In-Image Multilingual Machine Translation EMNLP 2025

TVQACML: Benchmarking Text-Centric Visual Question Answering in Multilingual Chinese Minority Languages EMNLP 2025

Transparent and Coherent Procedural Mistake Detection EMNLP 2025

Learning to See through Sound: From VggCaps to Multi2Cap for Richer Automated Audio Captioning EMNLP 2025

URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models EMNLP 2025