Vision-Language Models

685 directly classified papers

Papers

Making LVLMs Look Twice: Contrastive Decoding with Contrast Images ACL 2025

VisTRA: Visual Tool-use Reasoning Analyzer for Small Object Visual Question Answering ACL 2025

SciVQA 2025: Overview of the First Scientific Visual Question Answering Shared Task ACL 2025

Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling ACL 2025

Modgenix at SemEval-2025 Task 1: Context Aware Vision Language Ranking (CAViLR) for Multimodal Idiomaticity Understanding ACL 2025

FJWU_Squad at SemEval-2025 Task 1: An Idiom Visual Understanding Dataset for Idiom Learning ACL 2025

RITT: A Retrieval-Assisted Framework with Image and Text Table Representations for Table Question Answering ACL 2025

Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data ACL 2025

Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains ACL 2025

What's in the Image? A Deep-Dive into the Vision of Vision Language Models CVPR 2025

CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex AAAI 2025

MP-GUI: Modality Perception with MLLMs for GUI Understanding CVPR 2025

ChatGarment: Garment Estimation, Generation and Editing via Large Language Models CVPR 2025

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models CVPR 2025

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision AAAI 2025

PAVE: Patching and Adapting Video Large Language Models CVPR 2025

Video Language Model Pretraining with Spatio-temporal Masking CVPR 2025

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training CVPR 2025

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP AAAI 2025

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models CVPR 2025

SAIST: Segment Any Infrared Small Target Model Guided by Contrastive Language-Image Pretraining CVPR 2025

CoLLM: A Large Language Model for Composed Image Retrieval CVPR 2025

Position-Aware Guided Point Cloud Completion with CLIP Model AAAI 2025

DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models’ Understanding on Indian Culture EMNLP 2025

M2Edit: Locate and Edit Multi-Granularity Knowledge in Multimodal Large Language Model EMNLP 2025