CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing

Rui Wang; Junda Wu; Yu Xia; Tong Yu; Ruiyi Zhang; Ryan A. Rossi; Subrata Mitra; Lina Yao; Julian McAuley

ACL 2026

Abstract

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, in which the model inadvertently responds to instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context. The pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow. To identify these neurons, we introduce a neural attribution mechanism guided by a preferential attribution loss, and theoretically connect this loss to an upper bound of the Direct Preference Optimization (DPO) objective. Further, we improve the fidelity of neural attribution by leveraging an observed triggering effect in instruction-following. Our approach does not interfere with prompt formatting or incur test-time overhead in response generation. Experiments show that CachePrune significantly reduces the attack success rate while preserving the LLM's ability to follow user instructions.
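The attribute-then-prune idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the loss, the scoring rule, or the pruning threshold, so everything below is an assumption. The function name `prune_cache`, the `grads` input (taken here to be gradients of some hypothetical preferential loss with respect to the cached context features), and the `fraction` parameter are all hypothetical. The score used is the generic activation-times-gradient rule, a first-order estimate of how much the loss would drop if an entry were zeroed.

```python
def prune_cache(cache, grads, fraction):
    """Zero the `fraction` of cache entries with the highest attribution score.

    cache, grads: lists of rows (one row per context token), same shape.
    Score = activation * gradient (an assumed, generic attribution rule,
    not necessarily the scoring rule used in the paper).
    """
    # Collect (score, row index, column index) for every cache entry.
    scores = [
        (h * g, i, j)
        for i, (row, grad_row) in enumerate(zip(cache, grads))
        for j, (h, g) in enumerate(zip(row, grad_row))
    ]
    k = int(len(scores) * fraction)            # number of entries to prune
    top = sorted(scores, reverse=True)[:k]     # highest-scoring entries
    pruned = [list(row) for row in cache]      # copy; leave the input intact
    for _, i, j in top:
        pruned[i][j] = 0.0
    return pruned


# Toy data: 2 context tokens, 3 feature dimensions each (values are made up).
cache = [[0.5, -1.0, 2.0], [1.5, 0.2, -0.3]]
grads = [[0.1, 0.4, 0.9], [-0.2, 0.0, -1.0]]
pruned = prune_cache(cache, grads, fraction=0.5)
print(pruned)
```

Because the masking in this sketch happens once, on the cached context features, nothing extra runs while the response is generated, which is in the spirit of the abstract's claim of no test-time overhead in response generation.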

Authors

Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan A. Rossi, Subrata Mitra, Lina Yao, Julian McAuley

Topics

Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Safety
Artificial Intelligence > Core AI > Security

Keywords

instruction following, large language model, prompt injection attack, kv-cache editing, neural attribution
