conftrace_
2026 ACL ACL 2026

REMIND: Memorization and Unlearning in LLMs Through the Lens of Input Loss Landscapes

Abstract

AbstractUnderstanding how large language models (LLMs) store, retain, and remove knowledge is critical for interpretability, reliability, and privacy compliance. We reveal a key phenomenon: machine unlearning imprints distinct geometric signatures in the model’s input loss landscape (ILL), with unlearned examples forming flat, low-curvature plateaus that contrast sharply with the high-curvature basins of retained or unseen examples. Remarkably, these patterns emerge even when pointwise losses overlap, exposing residual memorization through input-output behavior alone. Building on this insight, we introduce **REMIND (Residual Memorization in Neighborhood Dynamics)**, a framework that diagnoses memorization states (retained, forgotten, holdout) by probing local ILL curvature over semantically coherent neighborhoods. REMIND operates using only loss queries and a novel embedding-proximity perturbation method to generate controlled, interpretable variants. In evaluations, REMIND achieves 82% multi-class ROC-AUC, outperforming baselines like ROUGE-L and MIN-K%++, with roughly 2× higher AUC at 1% FPR, and remains robust on paraphrased inputs. This neighborhood-level geometric analysis provides a practical, interpretable lens on LLM knowledge retention and unlearning, detecting subtle residual signals missed by pointwise or aggregated metrics.