Research Explorer
Vision-Language Models
685 directly classified papers
Papers per year
2015: 1
2016: 1
2017: 3
2018: 1
2019: 7
2020: 12
2021: 26
2022: 57
2023: 94
2024: 235
2025: 248
Papers
MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing
ACL 2024
MLeVLM: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering
ACL 2024
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
AAAI 2024
Your Vision-Language Model Itself Is a Strong Filter: Towards High-Quality Instruction Tuning with Data Selection
ACL 2024
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
ACL 2024
Active Prompt Learning in Vision Language Models
CVPR 2024
LEGENT: Open Platform for Embodied Agents
ACL 2024
Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space
ACL 2024
AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models
AAAI 2024
Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users
CVPR 2024
MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
ACL 2024
Can't Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models
CVPR 2024
DeVAn: Dense Video Annotation for Video-Language Models
ACL 2024
Sieve: Multimodal Dataset Pruning using Image Captioning Models
CVPR 2024
The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding
CVPR 2024
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
ACL 2024
Pixel-Aligned Language Model
CVPR 2024
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
ACL 2024
SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger
AAAI 2024
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
CVPR 2024
Distilling Vision-Language Models on Millions of Videos
CVPR 2024
Discovering Syntactic Interaction Clues for Human-Object Interaction Detection
CVPR 2024
Learning Multi-Dimensional Human Preference for Text-to-Image Generation
CVPR 2024
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
CVPR 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024