Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai; Lilach Eden; Michal Shmueli-Scheuer

2026 ACL ACL 2026

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract

AbstractAgentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation or creating a static taxonomy of agent errors. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

Authors

Asaf Yehudai , Lilach Eden , Michal Shmueli-Scheuer

Topics

Artificial Intelligence > Core AI > Agent Systems Artificial Intelligence > Core AI > Large Language Models Artificial Intelligence > Core AI > Evaluation

Keywords

agent evaluation error taxonomy agentic system large language model task success rate

Download PDF

Related papers

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand 2026

One-step Nonautoregressive Natural Language Generation with Shortcut Flow Matching Models 2026

Optimizing Retrieval-Augmented Generation for E-Commerce How-To Assistance 2026

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing 2026

MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation 2026