conftrace_

Andy Zou

12 papers · 2021–2025 · 4 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (11) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌍 Conference Polyglot (4) 🌈 Renaissance Researcher (7)

🤝 Dynamic Duo (11) 👥 Mega-Team (46) 🔥 Unstoppable (5) 💎 Century Club (12) ❓ The Questioner (2)

Conferences

ICML (4) NIPS (4) ICLR (3) CVPR (1)

Top co-authors

Dan Hendrycks (11) Mantas Mazeika (8) Dawn Song (6) Steven Basart (6) Jacob Steinhardt (5) Nathaniel Li (3) Maxwell Lin (3) Long Phan (3) Zifan Wang (3) Bo Li (3)

Keywords

adversarial robustness (3) anomaly detection (2) question answering (1) temporal reasoning (1) event forecasting (1) video understanding (1) ai safety (1) robustness certification (1) affective computing (1) model alignment (1) adversarial training (1) adversarial attack (1) deep neural network (1) language model (1) data augmentation (1) lipschitz constant (1) out-of-distribution detection (1) representation engineering (1) circuit breaker (1) image classification (1)

Papers

Tamper-Resistant Safeguards for Open-Weight LLMs (ICLR 2025)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (ICLR 2025)
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (ICML 2024)
Improving Alignment and Robustness with Circuit Breakers (NIPS 2024)
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning (ICML 2024)
Unlocking Deterministic Robustness Certification on ImageNet (NIPS 2023)
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark (ICML 2023)
Scaling Out-of-Distribution Detection for Real-World Settings (ICML 2022)
Forecasting Future World Events With Neural Networks (NIPS 2022)
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures (CVPR 2022)
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios (NIPS 2022)
Measuring Massive Multitask Language Understanding (ICLR 2021)