conftrace_

Alexander Pan

5 papers · 2022–2024 · 3 conferences · across top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌍 Conference Polyglot (3) 🐝 Cross-Pollinator (9) 👥 Mega-Team (46) ❓ The Questioner (2)

Conferences

ICML (3) ICLR (1) NIPS (1)

Top co-authors

Steven Basart (3) Dan Hendrycks (3) Alice Gatti (2) Stephen Fitz (2) Nathaniel Li (2) Andy Zou (2) Mantas Mazeika (2) Jacob Steinhardt (2) Gabriel Mukobi (2) Adam Alfred Hunt (1)

Keywords

benchmark evaluation (1) ai safety (1) spectral analysis (1) model scaling (1) safety benchmark (1) capabilities component (1) reward optimization (1) ethical behavior (1) machine ethics (1) model capabilities (1) scale correlation (1)

Papers

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? · NIPS 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning · ICML 2024
Feedback Loops With Language Models Drive In-Context Reward Hacking · ICML 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark · ICML 2023
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models · ICLR 2022