Neel Nanda
22 papers · 2022–2025 · 5 conferences · across top CS/AI conferences
Achievements
Cross-Pollinator (5)
Conference Polyglot (5)
Interdisciplinary Bridge
Keyword Pioneer
Renaissance Researcher (5)
Taxonomy Completionist (19)
Triple Crown
Keyword Champion (2)
Prolific Year (8)
The Questioner (3)
Century Club (22)
Conferences
ICML (7)
ICLR (6)
EMNLP (4)
NIPS (4)
JMLR (1)
Keywords
mechanistic interpretability (3)
sparse autoencoder (3)
linear representation (2)
imitation learning (1)
sentiment analysis (1)
model calibration (1)
bayesian inference (1)
policy learning (1)
neural network interpretability (1)
model analysis (1)
group theory (1)
model interpretability (1)
latent representation (1)
language model (1)
world model (1)
circuit analysis (1)
feature decomposition (1)
self-supervised learning (1)
feature extraction (1)
online learning (1)
Papers
Scaling Sparse Feature Circuits For Studying In-Context Learning
ICML 2025
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
ICLR 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
ICLR 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis
ICLR 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
ICML 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
ICML 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
ICML 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
ICML 2025
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
ICLR 2024
Explorations of Self-Repair in Language Models
ICML 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders
NIPS 2024
Transcoders find interpretable LLM feature circuits
NIPS 2024
Confidence Regulation Neurons in Language Models
NIPS 2024
Refusal in Language Models Is Mediated by a Single Direction
NIPS 2024
Language Models Linearly Represent Sentiment
EMNLP 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
EMNLP 2024
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads
EMNLP 2024
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
ICLR 2024
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations
ICML 2023
Progress measures for grokking via mechanistic interpretability
ICLR 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
EMNLP 2023
Fully General Online Imitation Learning
JMLR 2022