multimodal learning

4622 papers

Also known as

VLM, VLLM, MM, VLA, MLLMS, MLM, MML, MULLM, LMM, MLLM, MMT

Co-occurring keywords

large language model (12755), vision-language model (2235), visual question answering (1000), video understanding (1647), multi-modal learning (1276), contrastive learning (3979), representation learning (6174), transfer learning (5442), zero-shot learning (3637), vision language model (752)

Papers

CTYUN-AI at SemEval-2025 Task 1: Learning to Rank for Idiomatic Expressions SEMEVAL 2025

How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads EMNLP 2025

UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation SEMEVAL 2025

Read, Watch and Scream! Sound Generation from Text and Video AAAI 2025

Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks IJCAI 2025

PunMemeCN: A Benchmark to Explore Vision-Language Models’ Understanding of Chinese Pun Memes EMNLP 2025

Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction EMNLP 2025

What if Othello-Playing Language Models Could See? EMNLP 2025

InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning EMNLP 2025

BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion EMNLP 2025

MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition EMNLP 2025

Judge and Improve: Towards a Better Reasoning of Knowledge Graphs with Large Language Models EMNLP 2025

SandboxSocial: A Sandbox for Social Media Using Multimodal AI Agents IJCAI 2025

UnCo: Uncertainty-Driven Collaborative Framework of Large and Small Models for Grounded Multimodal NER EMNLP 2025

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents EMNLP 2025

Feature Design for Bridging SAM and CLIP toward Referring Image Segmentation WACV 2025

When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models EMNLP 2025

Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions CVPR 2025

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting AAAI 2025

UCSC NLP T6 at SemEval-2025 Task 1: Leveraging LLMs and VLMs for Idiomatic Understanding ACL 2025

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks RSS 2025

Advancing Chart Question Answering with Robust Chart Component Recognition WACV 2025

YNU-HPCC at SemEval-2025 Task 1: Enhancing Multimodal Idiomaticity Representation via LoRA and Hybrid Loss Optimization SEMEVAL 2025

FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding WACV 2025

Generative Agents for Multimodal Controversy Detection IJCAI 2025