ACL 2026

Trajectory Signatures of Deception in Large Language Models

Abstract

Detecting deceptive behavior in LLMs is typically done post hoc on outputs or by probing static activations. We instead treat deception as a dynamic process: a trajectory through the model’s hidden-state space during inference. We capture layerwise activations at sparse "decision points" where the model is uncertain between competing tokens, forming activation trajectories for matched truthful vs. deceptive responses across four deception types: strategic deception, sycophancy, instructed deception, and confabulation. Across GPT-2 and Llama variants, deceptive generation is associated with changes in trajectory geometry, but increases in path length depend on both the model and the deception type. Sycophancy shows the clearest signal, whereas instructed deception yields near-null signatures. With just 7 geometric features, a lightweight classifier achieves performance comparable to PCA-reduced probing at matched dimensionality for binary sycophancy detection and shows preliminary utility for 4-way deception-type classification. These findings indicate that trajectory-based monitoring can provide process-level signals associated with deceptive generation during inference, complementing methods that focus on endpoint activation states.
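To make the notions of "decision point" and "path length" concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that a decision point is a position whose top-2 next-token probabilities differ by less than a threshold, and that a trajectory is the sequence of hidden states across layers at one such position. The function names, the `max_margin` threshold, and the array shapes are all hypothetical choices made for illustration.

```python
import numpy as np


def decision_points(probs: np.ndarray, max_margin: float = 0.1) -> np.ndarray:
    """Return indices of positions where the model is uncertain.

    probs: array of shape (num_positions, vocab_size) with next-token
    probabilities. A position counts as a decision point when the gap
    between its two most likely tokens is below max_margin
    (an assumed criterion, not taken from the paper).
    """
    top2 = np.sort(probs, axis=-1)[:, -2:]   # two largest probabilities per row
    margin = top2[:, 1] - top2[:, 0]         # largest minus second largest
    return np.flatnonzero(margin < max_margin)


def path_length(trajectory: np.ndarray) -> float:
    """Sum of Euclidean distances between consecutive hidden states.

    trajectory: array of shape (num_layers, hidden_dim), assumed to hold
    one hidden state per layer at a single decision point.
    """
    steps = np.diff(trajectory, axis=0)      # layer-to-layer displacement vectors
    return float(np.linalg.norm(steps, axis=1).sum())


def displacement(trajectory: np.ndarray) -> float:
    """Straight-line distance from the first to the last hidden state."""
    return float(np.linalg.norm(trajectory[-1] - trajectory[0]))


if __name__ == "__main__":
    # Toy data: position 0 is uncertain, position 1 is confident.
    probs = np.array([[0.50, 0.45, 0.05],
                      [0.90, 0.05, 0.05]])
    print(decision_points(probs))

    # Toy 3-layer trajectory in a 2-dimensional hidden space.
    traj = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]])
    print(path_length(traj), displacement(traj))
```

Under these assumptions, comparing `path_length` with `displacement` for matched truthful and deceptive responses gives one simple geometric contrast; the 7 features referred to above are not specified in this section, so they are not reproduced here.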