Tracing Logit Trajectories Across Layer Depth: Dataset-Level Explainability for Language Models

Jeesu Jung; Sangkeun Jung

ACL 2026

Abstract

Sentence-level explanations can miss the bigger picture of how a black-box model behaves across data, which matters most for complex criteria like safety that cannot be defined by a single rule. We trace **Logit-Trajectory**, which tracks adjacent-layer logit updates as vectors and aggregates them into a reproducible dataset-level trajectory pattern, enabling depth-wise explainability through signals such as coherence and angular rotation. Across 6 languages and 5 NLP tasks, we show these trajectory summaries reveal consistent depth-wise patterns that divergence- and similarity-based baselines often wash out due to scalarization. As a case study where dataset-level intermediate decision structure matters, we evaluate safety classification, reporting both trajectory-level visual separability and classification performance.
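
The abstract describes the approach only at a high level, so the following is a minimal sketch of one way adjacent-layer logit updates could be summarized at the dataset level, not the authors' implementation. Everything specific here is an assumption made for illustration: the function name `trajectory_summary`, the input layout `(n_examples, n_layers, n_classes)`, the definition of coherence as the length of the mean unit update vector across examples, and the definition of angular rotation as the angle between consecutive dataset-mean updates. The paper may define these signals differently.

```python
import numpy as np


def trajectory_summary(layer_logits, eps=1e-12):
    """Summarize per-layer logit updates across a dataset.

    layer_logits: array of shape (n_examples, n_layers, n_classes).
    This layout is a hypothetical choice for the sketch, as are the
    definitions of "coherence" and "rotation" below.
    """
    x = np.asarray(layer_logits, dtype=float)

    # Adjacent-layer updates kept as vectors, not scalars: (N, L-1, C).
    deltas = x[:, 1:, :] - x[:, :-1, :]

    # Assumed coherence: length of the mean unit update vector per layer
    # step. 1.0 means every example moves in the same direction; values
    # near 0 mean the directions cancel out across the dataset.
    norms = np.linalg.norm(deltas, axis=-1, keepdims=True)
    units = deltas / np.maximum(norms, eps)
    coherence = np.linalg.norm(units.mean(axis=0), axis=-1)

    # Assumed angular rotation: angle in degrees between the dataset-mean
    # updates of consecutive layer steps.
    mean_delta = deltas.mean(axis=0)
    a, b = mean_delta[:-1], mean_delta[1:]
    denom = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = (a * b).sum(axis=-1) / np.maximum(denom, eps)
    rotation = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    return coherence, rotation
```

Keeping the updates as vectors until the final aggregation step is the point of the sketch: a scalar divergence or similarity between adjacent layers would discard the direction information that the coherence and rotation signals rely on.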

Topics

Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Optimization & Theory > Interpretability
Artificial Intelligence > Core AI > Explainability

Keywords

safety classification, logit trajectory, dataset-level explainability, depth-wise explainability
