GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

Duy Nguyen; Archiki Prasad; Elias Stengel-Eskin; Mohit Bansal

ACL 2026

Abstract

Inference-time steering provides a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying model activations without updating weights. However, existing methods often rely on a global intervention vector, overlook token-level causal influence, and underutilize model logits, especially in multimodal settings where visual and textual inputs contribute unevenly. We propose GrAInS, a contrastive, gradient-based approach that leverages Integrated Gradients to identify top-k influential tokens and construct directional steering vectors based on their contribution to preferred over dispreferred outputs. These vectors guide activation intervention at each layer, preserving the representational scale. GrAInS outperforms fine-tuning and prior steering methods on both LLM and VLM tasks: improving TruthfulQA accuracy by 13.22% (Llama-3.1-8B), reducing MMHal-Bench hallucinations from 0.624 to 0.514 (LLaVA-1.6-7B), and increasing SPA-VL alignment by 8.11%, all without degrading fluency or general capabilities.
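
The pipeline named in the abstract (contrastive Integrated Gradients, top-k token selection, scale-preserving intervention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy scoring function `contrastive_score`, the zero baseline, ranking tokens by absolute attribution, and the rescaling rule in `steer_preserving_norm` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import numpy as np

def contrastive_score(X, w_pos, w_neg):
    # Toy stand-in (assumption) for "preferred minus dispreferred" output score.
    return np.tanh(X @ w_pos).sum() - np.tanh(X @ w_neg).sum()

def score_grad(X, w_pos, w_neg):
    # Analytic gradient of contrastive_score w.r.t. X, shape (tokens, dim).
    gp = (1 - np.tanh(X @ w_pos) ** 2)[:, None] * w_pos[None, :]
    gn = (1 - np.tanh(X @ w_neg) ** 2)[:, None] * w_neg[None, :]
    return gp - gn

def integrated_gradients(X, baseline, w_pos, w_neg, steps=256):
    # Midpoint Riemann approximation of the Integrated Gradients path integral.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [score_grad(baseline + a * (X - baseline), w_pos, w_neg) for a in alphas]
    avg_grad = np.mean(grads, axis=0)
    # Sum over the embedding dimension to get one attribution per token.
    return ((X - baseline) * avg_grad).sum(axis=1)

def top_k_tokens(attr, k):
    # Rank tokens by absolute attribution (an assumption about the ranking rule).
    return np.argsort(-np.abs(attr))[:k]

def steer_preserving_norm(h, v, lam=1.0):
    # Shift the activation along v, then rescale to the original norm
    # (one possible way to preserve representational scale).
    shifted = h + lam * v
    return shifted * (np.linalg.norm(h) / np.linalg.norm(shifted))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # 6 toy token embeddings of dimension 4
w_pos, w_neg = rng.normal(size=4), rng.normal(size=4)
baseline = np.zeros_like(X)            # zero baseline (assumption)

attr = integrated_gradients(X, baseline, w_pos, w_neg)
top = top_k_tokens(attr, k=2)

h, v = rng.normal(size=4), rng.normal(size=4)
steered = steer_preserving_norm(h, v, lam=0.5)
```

By the completeness property of Integrated Gradients, the per-token attributions in `attr` should sum approximately to the difference between the score at `X` and the score at the baseline, which gives a quick sanity check on the approximation.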

Topics

Artificial Intelligence > Core AI > Interpretability; Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Vision-Language Models

Keywords

vision-language model; integrated gradient; gradient attribution; activation steering; inference time steering
