ACL 2026

Action Boundary Blindness: When LLM Agents Cannot Tell Where One Action Ends and Another Begins

Abstract

Large language model (LLM) agents excel at multi-step tasks yet frequently exhibit Action Boundary Blindness: the inability to correctly determine action granularity, scope, and completeness. Grounded in Event Segmentation Theory from cognitive science, we formalize three violation types: granularity confusion, scope creep, and boundary ambiguity. We propose four automatic metrics that require no human annotation: Action Boundary Score (ABS), Granularity Alignment Rate (GAR), Scope Violation Rate (SVR), and Boundary-Aware Success Rate (BASR). Experiments on 1,655 tasks across six benchmarks (τ-bench, WebArena, ALFWorld, TheAgentCompany, OSWorld) with seven LLMs reveal that: (1) the best model achieves an ABS of only 0.424; (2) under a multi-label attribution framework validated by inter-annotator agreement (κ = 0.78), boundary blindness is the primary failure mode in 37.2% of failures (the sole cause in 25.8%, and involved in 55.9% once contributing factors are counted); (3) under-action dominates at 48.4%; (4) BASR is consistently about 4 points lower than the traditional success rate, exposing "lucky successes." Critically, Explicit Boundary Prompting (EBP) improves ABS by 0.08–0.13 across all models, indicating that boundary blindness is better characterized as an elicitation gap than as a fundamental capability limitation: LLMs possess latent boundary perception that is not activated by default. This finding has implications for alignment and instruction tuning. We validate the metrics through state-based cross-validation and a human audit, estimating a false positive rate of about 22% arising from valid alternative paths; model rankings remain stable (Spearman ρ = 1.0).
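
To make the ranking-stability claim concrete, the following minimal sketch computes a Spearman rank correlation between two sets of per-model scores. It is an illustration only, not the paper's evaluation code: the helper names `ranks` and `spearman_rho` are hypothetical, the seven score values are placeholders rather than reported results, and the closed-form formula used assumes no tied scores.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    # Sum of squared rank differences, then the standard no-ties formula.
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder scores for seven hypothetical models (NOT data from the paper).
raw_scores = [0.42, 0.39, 0.35, 0.33, 0.30, 0.27, 0.21]
# The same models after a hypothetical correction that raises every score.
adjusted_scores = [0.50, 0.46, 0.43, 0.38, 0.37, 0.31, 0.26]

rho = spearman_rho(raw_scores, adjusted_scores)
print(rho)  # 1.0: the two orderings of models are identical
```

A value of ρ = 1.0 says only that the ordering of models is unchanged; the absolute scores may still shift, as they do in this placeholder example.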