conftrace_

Rohin Shah

11 papers · 2019–2025 · 4 top CS/AI conferences

Achievements



🐣 Hot Topic Early Bird 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🗺️ Taxonomy Completionist (19) 🌍 Conference Polyglot (4)

🏃 Academic Marathon (6) 🐝 Cross-Pollinator (4) 🌈 Renaissance Researcher (5) 👑 Triple Crown 💎 Century Club (11)

Conferences

NIPS (6) ICLR (2) ICML (2) EMNLP (1)

Top co-authors

Anca Dragan (5) Pieter Abbeel (4) David Lindner (3) Vikrant Varma (3) János Kramár (3) Neel Nanda (2) Andrew Critch (2) Tom Lieberum (2) Senthooran Rajamanoharan (2) Lewis Smith (2)

Keywords

reinforcement learning (3) ai safety (2) multi-agent system (2) imitation learning (2) sparse autoencoder (2) model analysis (1) reward function (1) neural network interpretability (1) markov decision process (1) behavior cloning (1) model evaluation (1) distribution shift (1) optimal policy (1) reward learning (1) latent representation (1) pairwise comparison (1) human feedback (1) benchmark dataset (1) language model (1) inverse reinforcement learning (1)

Papers

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking · ICML 2025
On scalable oversight with weak LLMs judging strong LLMs · NIPS 2024
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders · NIPS 2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 · EMNLP 2024
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks · NIPS 2023
Learning What To Do by Simulating the Past · ICLR 2021
Optimal Policies Tend To Seek Power · NIPS 2021
The MAGICAL Benchmark for Robust Imitation · NIPS 2020
Preferences Implicit in the State of the World · ICLR 2019
On the Utility of Learning about Humans for Human-AI Coordination · NIPS 2019
On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference · ICML 2019