Thomas Kwa
3 papers · 2024 · 1 conference · across top CS/AI conferences
Achievements
🌉 Interdisciplinary Bridge
🧭 Keyword Pioneer
🐝 Cross-Pollinator (11)
🗺️ Taxonomy Completionist (15)
Conferences
NIPS (3)
Keywords
mechanistic interpretability (2)
neural network (2)
neural network verification (2)
policy optimization (1)
kl divergence (1)
reinforcement learning from human feedback (1)
formal verification (1)
heavy-tailed distribution (1)
reward misspecification (1)
reward hacking (1)
formal guarantee (1)
accuracy lower bound (1)
proof transferability (1)
causal model (1)
circuit discovery (1)
interchange intervention training (1)
performance bound (1)
transformer model (1)
transformer architecture (1)
accuracy bound (1)
Papers
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
NIPS 2024
Compact Proofs of Model Performance via Mechanistic Interpretability
NIPS 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
NIPS 2024