Action Boundary Blindness: When LLM Agents Cannot Tell Where One Action Ends and Another Begins

Zhangyi Wang; Bingnan Yu; Jiexiang Xu; Zongze Li

ACL 2026

Abstract

Large language model (LLM) agents excel at multi-step tasks yet frequently exhibit Action Boundary Blindness: the inability to correctly determine action granularity, scope, and completeness. Grounded in Event Segmentation Theory from cognitive science, we formalize three violation types: granularity confusion, scope creep, and boundary ambiguity. We propose four automatic metrics that require no human annotation: Action Boundary Score (ABS), Granularity Alignment Rate (GAR), Scope Violation Rate (SVR), and Boundary-Aware Success Rate (BASR). Experiments on 1,655 tasks across six benchmarks (τ-bench, WebArena, ALFWorld, TheAgentCompany, OSWorld) with seven LLMs reveal that: (1) the best model achieves only 0.424 ABS; (2) under a multi-label attribution framework validated by inter-annotator agreement (κ = 0.78), boundary blindness is the primary failure mode in 37.2% of failures (25.8% as sole cause; 55.9% total involvement including contributing factors); (3) under-action dominates at 48.4%; (4) BASR is consistently about 4 points lower than traditional success rate, exposing “lucky successes.” Critically, Explicit Boundary Prompting (EBP) improves ABS by 0.08–0.13 across all models, demonstrating that boundary blindness is better characterized as an elicitation gap than as a fundamental capability limitation: LLMs possess latent boundary perception that is not activated by default. This finding has implications for alignment and instruction tuning. We validate the metrics through state-based cross-validation and human audit, estimating a false positive rate of about 22% from valid alternative paths, with model rankings remaining stable (Spearman ρ = 1.0).
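
As a reading aid for the ranking-stability claim, the following minimal sketch shows what a Spearman ρ of 1.0 means: two score lists can differ in every value and still order the models identically. The function names and all score values are hypothetical and are not taken from the paper; the formula assumes no tied scores.

```python
def ranks(values):
    """Return 1-based ranks (1 = largest value). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank difference.
    """
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for seven models (illustrative only, NOT the paper's data).
raw_scores = [0.42, 0.39, 0.35, 0.31, 0.28, 0.24, 0.20]
# Hypothetical re-scored values: every number changes, the ordering does not.
adjusted_scores = [0.50, 0.46, 0.43, 0.37, 0.35, 0.29, 0.26]
# Hypothetical counter-example: the second and third models trade places.
swapped_scores = [0.50, 0.43, 0.46, 0.37, 0.35, 0.29, 0.26]

rho_same_order = spearman_rho(raw_scores, adjusted_scores)
rho_swapped = spearman_rho(raw_scores, swapped_scores)
print(rho_same_order, rho_swapped)
```

In this sketch `rho_same_order` is exactly 1.0 because no model changes rank, while a single adjacent swap already pulls `rho_swapped` below 1.0. A reported ρ of 1.0 therefore says the model ordering is unchanged; it says nothing about how much the individual scores moved.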

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Evaluation

Keywords

large language model agent; event segmentation; action boundary blindness; explicit boundary prompting
