conftrace_

Thomas Kwa

3 papers · 2024 · 1 conference · across top CS/AI conferences

Achievements


🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (11) 🗺️ Taxonomy Completionist (15)

Conferences

NIPS (3)

Top co-authors

Adrià Garriga-Alonso (2) · Soufiane Noubir (1) · Jason Gross (1) · Alex Gibson (1) · Lawrence Chan (1) · Drake Thomas (1) · Rohan Gupta (1) · Rajashree Agrawal (1) · Euan Ong (1) · Iván Arcuschin (1)

Keywords

mechanistic interpretability (2) · neural network (2) · neural network verification (2) · policy optimization (1) · kl divergence (1) · reinforcement learning from human feedback (1) · formal verification (1) · heavy-tailed distribution (1) · reward misspecification (1) · reward hacking (1) · formal guarantee (1) · accuracy lower bound (1) · proof transferability (1) · causal model (1) · circuit discovery (1) · interchange intervention training (1) · performance bound (1) · transformer model (1) · transformer architecture (1) · accuracy bound (1)

Papers

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification · NIPS 2024

Compact Proofs of Model Performance via Mechanistic Interpretability · NIPS 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques · NIPS 2024