Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Siddhant Bhambri; Upasana Biswas; Subbarao Kambhampati

ACL 2026

Abstract

Recent advances in reasoning-oriented Large Language Models (LLMs) have been driven by the introduction of Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are used not only to guide model inference but also as supervision signals for Knowledge Distillation (KD) to improve smaller models. A prevailing but under-examined implicit assumption is that these CoT traces, when emitted at inference time, are both semantically correct and interpretable for end users. While there are reasons to believe that these intermediate tokens help improve solution accuracy, in this work we question their validity (semantic correctness) and their interpretability to the end user. To isolate the effect of trace semantics, we design experiments in the Question Answering (QA) domain using a rule-based problem decomposition method. This enables us to create Supervised Fine-Tuning (SFT) datasets for LLMs in which each QA problem is paired with either a verifiably correct or a verifiably incorrect CoT trace, while always providing the correct final solution. Trace correctness at inference time is then evaluated by checking the accuracy of every sub-step in the decomposed reasoning chains. To assess end-user interpretability, we fine-tune LLMs with three additional types of CoT traces: R1 traces, R1 trace summaries, and post-hoc explanations of R1 traces. We further conduct a human-subject study with 100 participants, asking them to rate the interpretability of each trace type on a standardized Likert scale. Our experiments reveal two key findings. (1) CoT trace correctness is not reliably correlated with the model's generation of correct final answers: correct traces led to correct solutions for only 28% of test-set problems, while incorrect traces do not necessarily degrade solution accuracy. (2) In end-user interpretability studies, fine-tuning on verbose R1 traces produced the best model performance but these
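The evaluation idea described in the abstract (mark a trace correct only if every sub-step is correct, then compare trace correctness against final-answer correctness) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Prediction` record, the function names, and the toy data are hypothetical, and the sketch assumes per-sub-step correctness flags have already been produced by some verifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """One model output on a test problem (hypothetical record layout)."""
    substeps_correct: list[bool]  # assumed: one verified flag per sub-step
    answer_correct: bool          # whether the final answer matches the gold answer


def trace_is_correct(p: Prediction) -> bool:
    # A trace counts as correct only if it has sub-steps and all of them are correct.
    return bool(p.substeps_correct) and all(p.substeps_correct)


def contingency(preds: list[Prediction]) -> dict[tuple[bool, bool], int]:
    # Counts keyed by (trace_correct, answer_correct).
    counts = {(t, a): 0 for t in (True, False) for a in (True, False)}
    for p in preds:
        counts[(trace_is_correct(p), p.answer_correct)] += 1
    return counts


def answer_accuracy_given_trace(
    preds: list[Prediction], trace_correct: bool
) -> Optional[float]:
    # Final-answer accuracy restricted to outputs whose trace is (in)correct.
    subset = [p for p in preds if trace_is_correct(p) == trace_correct]
    if not subset:
        return None
    return sum(p.answer_correct for p in subset) / len(subset)


# Toy data, invented purely for illustration.
preds = [
    Prediction([True, True], True),
    Prediction([True, True], False),
    Prediction([True, False], True),
    Prediction([False, False], False),
]
table = contingency(preds)
acc_correct_trace = answer_accuracy_given_trace(preds, True)
acc_incorrect_trace = answer_accuracy_given_trace(preds, False)
```

If the two conditional accuracies are close, trace correctness carries little information about answer correctness, which is the kind of disconnect the abstract reports.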

Authors

Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati

Topics

Artificial Intelligence > Core AI > Interpretability
Deep Learning > Learning Types > Knowledge Distillation
Deep Learning > Learning Types > Chain-of-Thought Reasoning

Keywords

knowledge distillation, question answering, chain-of-thought reasoning, model interpretability, supervised fine-tuning, large language model
