Vision-Language Models

685 directly classified papers

Papers

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis EMNLP 2024

Concept-skill Transferability-based Data Selection for Large Vision-Language Models EMNLP 2024

Benchmarking Vision Language Models for Cultural Understanding EMNLP 2024

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection EMNLP 2024

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models EMNLP 2024

Unifying Multimodal Retrieval via Document Screenshot Embedding EMNLP 2024

Encoding and Controlling Global Semantics for Long-form Video Question Answering EMNLP 2024

Divide and Conquer Radiology Report Generation via Observation Level Fine-grained Pretraining and Prompt Tuning EMNLP 2024

Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification EMNLP 2024

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions EMNLP 2024

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP 2024

TroL: Traversal of Layers for Large Language and Vision Models EMNLP 2024

UNICORN: A Unified Causal Video-Oriented Language-Modeling Framework for Temporal Video-Language Tasks EMNLP 2024

Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling EMNLP 2024

On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning EMNLP 2024

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models EMNLP 2024

Granular Privacy Control for Geolocation with Vision Language Models EMNLP 2024

MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification EMNLP 2024

In-Context Compositional Generalization for Large Vision-Language Models EMNLP 2024

Game on Tree: Visual Hallucination Mitigation via Coarse-to-Fine View Tree and Game Theory EMNLP 2024

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models EMNLP 2024

GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization EMNLP 2024

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality EMNLP 2024

IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning EMNLP 2024

Retrieval-enriched zero-shot image classification in low-resource domains EMNLP 2024