conftrace_

multimodal learning

4645 papers

Co-occurring keywords

large language model (13587); vision-language model (2348); visual question answering (1017); video understanding (1658); multi-modal learning (1278); contrastive learning (4032); representation learning (6206); transfer learning (5449); zero-shot learning (3650); vision language model (767)

Papers

IntelliCockpitBench: A Comprehensive Benchmark to Evaluate VLMs for Intelligent Cockpit (ACL 2025)

Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models (ACL 2025)

Analyzing the Sensitivity of Vision Language Models in Visual Question Answering (ACL 2025)

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles (CVPR 2025)

SimpleDoc: Multi‐Modal Document Understanding with Dual‐Cue Page Retrieval and Iterative Refinement (EMNLP 2025)

Patch Ranking: Token Pruning as Ranking Prediction for Efficient CLIP (WACV 2025)

Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings (ACL 2025)

VinaBench: Benchmark for Faithful and Consistent Visual Narratives (CVPR 2025)

UCSC NLP T6 at SemEval-2025 Task 1: Leveraging LLMs and VLMs for Idiomatic Understanding (ACL 2025)

CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis (ACL 2025)

Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data (ACL 2025)

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding (CVPR 2025)

YNU-HPCC at SemEval-2025 Task 1: Enhancing Multimodal Idiomaticity Representation via LoRA and Hybrid Loss Optimization (ACL 2025)

Aria-UI: Visual Grounding for GUI Instructions (ACL 2025)

3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer (CVPR 2025)

One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion (CVPR 2025)

PresentAgent: Multimodal Agent for Presentation Video Generation (EMNLP 2025)

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding (EMNLP 2025)

GeoSAFE - A Novel Geospatial Artificial Intelligence Safety Assurance Framework and Evaluation for LLM Moderation (IJCNLP 2025)

Challenging Multimodal LLMs with African Standardized Exams: A Document VQA Evaluation (ACL 2025)

Text or Pixels? Evaluating Efficiency and Understanding of LLMs with Visual Text Inputs (EMNLP 2025)

ORID: Organ-Regional Information Driven Framework for Radiology Report Generation (WACV 2025)

External Reliable Information-enhanced Multimodal Contrastive Learning for Fake News Detection (AAAI 2025)

ViFT: Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models (EMNLP 2025)

Bringing RNNs Back to Efficient Open-Ended Video Understanding (ICCV 2025)