conftrace_

Neel Nanda

22 papers · 2022–2025 · 5 conferences · across top CS/AI conferences

Achievements

🐝 Cross-Pollinator (5) 🌍 Conference Polyglot (5) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌈 Renaissance Researcher (5)
🗺️ Taxonomy Completionist (19) 👑 Triple Crown 🏆 Keyword Champion (2) ⚡ Prolific Year (8) ❓ The Questioner (3) 💎 Century Club (22)

Conferences

ICML (7) ICLR (6) EMNLP (4) NIPS (4) JMLR (1)

Papers

Scaling Sparse Feature Circuits For Studying In-Context Learning · ICML 2025
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control · ICLR 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models · ICLR 2025
Sparse Autoencoders Do Not Find Canonical Units of Analysis · ICLR 2025
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability · ICML 2025
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models · ICML 2025
Learning Multi-Level Features with Matryoshka Sparse Autoencoders · ICML 2025
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing · ICML 2025
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods · ICLR 2024
Explorations of Self-Repair in Language Models · ICML 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders · NIPS 2024
Transcoders find interpretable LLM feature circuits · NIPS 2024
Confidence Regulation Neurons in Language Models · NIPS 2024
Refusal in Language Models Is Mediated by a Single Direction · NIPS 2024
Language Models Linearly Represent Sentiment · EMNLP 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 · EMNLP 2024
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads · EMNLP 2024
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching · ICLR 2024
A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations · ICML 2023
Progress measures for grokking via mechanistic interpretability · ICLR 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models · EMNLP 2023
Fully General Online Imitation Learning · JMLR 2022