Neel Nanda

22 papers · 2022–2025 · 5 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (5) 🌍 Conference Polyglot (5) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌈 Renaissance Researcher (5)

🗺️ Taxonomy Completionist (19) 👑 Triple Crown 🏆 Keyword Champion (2) ⚡ Prolific Year (8) ❓ The Questioner (3) 💎 Century Club (22)

Conferences

ICML (7) ICLR (6) EMNLP (4) NIPS (4) JMLR (1)

Top co-authors

Arthur Conmy (5) Senthooran Rajamanoharan (4) Curt Tigges (3) Tom Lieberum (3) János Kramár (2) Rohin Shah (2) Noura Al Moubayed (2) Callum Stuart McDougall (2) Bart Bussmann (2) Lawrence Chan (2)

Keywords

mechanistic interpretability (3) sparse autoencoder (3) linear representation (2) imitation learning (1) sentiment analysis (1) model calibration (1) bayesian inference (1) policy learning (1) neural network interpretability (1) model analysis (1) group theory (1) model interpretability (1) latent representation (1) language model (1) world model (1) circuit analysis (1) feature decomposition (1) self-supervised learning (1) feature extraction (1) online learning (1)

Papers

Scaling Sparse Feature Circuits For Studying In-Context Learning · ICML 2025
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control · ICLR 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models · ICLR 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis · ICLR 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability · ICML 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models · ICML 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders · ICML 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing · ICML 2025
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods · ICLR 2024
Explorations of Self-Repair in Language Models · ICML 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders · NIPS 2024
Transcoders find interpretable LLM feature circuits · NIPS 2024
Confidence Regulation Neurons in Language Models · NIPS 2024
Refusal in Language Models Is Mediated by a Single Direction · NIPS 2024
Language Models Linearly Represent Sentiment · EMNLP 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 · EMNLP 2024
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads · EMNLP 2024
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching · ICLR 2024
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations · ICML 2023
Progress measures for grokking via mechanistic interpretability · ICLR 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models · EMNLP 2023
Fully General Online Imitation Learning · JMLR 2022