conftrace_

Nora Belrose

5 papers · 2023–2025 · 3 conferences · across top CS/AI conferences

Achievements

🌍 Conference Polyglot (3) · 🌉 Interdisciplinary Bridge · 🧭 Keyword Pioneer · 🐝 Cross-Pollinator (15) · ❓ The Questioner

Conferences

ICML (3) · AAAI (1) · NIPS (1)

Top co-authors

Alex Troy Mallen (2) · Lucia Quirke (1) · David Schneider-Joseph (1) · Ryan Cotterell (1) · Shauli Ravfogel (1) · Adam Gleave (1) · Yawen Duan (1) · Sergey Levine (1) · Tom Tseng (1) · Michael D Dennis (1)

Keywords

representation learning (1) · game playing (1) · adversarial attack (1) · linear classifier (1) · recurrent neural network (1) · language model (1) · zero-shot transfer (1) · activation manipulation (1) · concept erasure (1) · bias reduction (1) · interpretability method (1) · model steering (1) · transformer model (1) · adversarial policies (1) · agent vulnerability (1) · activation addition (1)

Papers

Do Transformer Interpretability Methods Transfer to RNNs? · AAAI 2025
Automatically Interpreting Millions of Features in Large Language Models · ICML 2025
Neural Networks Learn Statistics of Increasing Complexity · ICML 2024
LEACE: Perfect linear concept erasure in closed form · NIPS 2023
Adversarial Policies Beat Superhuman Go AIs · ICML 2023