From Words to Pixels: A Comprehensive Survey on Large Language Models in Visual Segmentation

Yizhou Wang; Mang Tik Chiu; Lingzhi Zhang; Xuan Shen; Sohrab Amirghodsi; Yun Fu

ACL 2026

Abstract

Visual segmentation, the task of partitioning an image into semantically meaningful regions, is a cornerstone of machine learning with widespread industrial applications. Instruction-guided visual segmentation, however, has remained challenging for many years. The difficulty largely stems from the cross-modal discrepancy between the language and image domains, which makes it hard to relate instruction semantics to pixel-level predictions. In recent years, the remarkable reasoning capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) have spurred a new wave of research aiming to bridge the gap between natural language instructions and pixel-level understanding. This survey offers the first comprehensive overview of the rapidly evolving field of LLM-driven visual segmentation. We categorize existing approaches by their core objectives and methodologies, including reasoning-based segmentation, open-vocabulary segmentation, grounding techniques that connect language to pixels, and extensions to the video domain. We review recent seminal works in LLM-based visual segmentation, analyzing their architectural innovations, training strategies, and benchmark performance. Furthermore, we discuss common datasets and evaluation metrics, and identify key challenges and promising future directions at the intersection of language and visual segmentation. We hope this survey serves as a valuable resource for researchers and practitioners seeking to understand the current landscape of LLM-based approaches to sophisticated visual segmentation tasks and applications. The resource summary is available at https://github.com/wyzjack/Awesome-LLM-Visual-Segmentation.


Topics

Computer Vision > Analysis > Semantic Segmentation
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Core AI > Vision-Language Models

Keywords

large multimodal model; open-vocabulary segmentation; visual segmentation; large language model; reasoning-based segmentation
