N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin; Jirui Yang; Yukui Qiu; Yubing Bao; Hengqi Guo; Yao Guan

ACL 2026

Abstract

Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with red teaming safety rankings at less than 1% of the token and runtime cost.
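The abstract does not define JSS, so the following is only a minimal sketch of the standard Jensen-Shannon divergence that the metric's name refers to, not the authors' implementation. The two input histograms are hypothetical stand-ins for the distribution of some per-layer latent feature under benign and harmful prompts; how N-GLARE actually builds those distributions from the APT is not stated here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence in bits, bounded in [0, 1].

    Assumption: p and q are normalized histograms over the same bins.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms of a latent feature at one layer.
benign = [0.7, 0.2, 0.1, 0.0]
harmful = [0.0, 0.1, 0.2, 0.7]
separability = js_divergence(benign, harmful)
```

Under this reading, a value near 1 would mean the two prompt classes occupy nearly disjoint regions of the feature histogram, and a value near 0 would mean the layer does not distinguish them.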

Topics

Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Safety
Artificial Intelligence > Core AI > Evaluation

Keywords

latent representation, safety evaluation, red teaming, large language model, safety robustness
