Nora Belrose
5 papers · 2023–2025 · 3 conferences · across top CS/AI conferences
Achievements
🌍 Conference Polyglot (3)
🌉 Interdisciplinary Bridge
🧭 Keyword Pioneer
🐝 Cross-Pollinator (15)
❓ The Questioner
Conferences
ICML (3)
AAAI (1)
NIPS (1)
Keywords
representation learning (1)
game playing (1)
adversarial attack (1)
linear classifier (1)
recurrent neural network (1)
language model (1)
zero-shot transfer (1)
activation manipulation (1)
concept erasure (1)
bias reduction (1)
interpretability method (1)
model steering (1)
transformer model (1)
adversarial policies (1)
agent vulnerability (1)
activation addition (1)
Papers
Do Transformer Interpretability Methods Transfer to RNNs?
AAAI 2025
Automatically Interpreting Millions of Features in Large Language Models
ICML 2025
Neural Networks Learn Statistics of Increasing Complexity
ICML 2024
LEACE: Perfect linear concept erasure in closed form
NIPS 2023
Adversarial Policies Beat Superhuman Go AIs
ICML 2023