conftrace_

Adrià Garriga-Alonso

10 papers · 2019–2025 · 3 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (3) 🏃 Academic Marathon (6) 🐝 Cross-Pollinator (11) 🏆 Keyword Champion (2) 💎 Century Club (10) 🔥 Unstoppable (5)

Conferences

NIPS (5) ICLR (3) UAI (2)

Top co-authors

Laurence Aitchison (3) Mark van der Wilk (3) Vincent Fortuin (2) Thomas Kwa (2) Aengus Lynch (2) Thomas Bush (1) David Krueger (1) Achille Nazaret (1) Daniel Tan (1) Florian Wenzel (1)

Keywords

mechanistic interpretability (3) circuit discovery (2) neural network analysis (2) language model (2) policy optimization (1) data augmentation (1) kl divergence (1) model behavior (1) model interpretability (1) reinforcement learning from human feedback (1) hypothesis testing (1) convolutional neural network (1) heavy-tailed distribution (1) reward misspecification (1) reward hacking (1) circuit analysis (1) neural network verification (1) bayesian neural network (1) steering vector (1) causal model (1)

Papers

Interpreting Emergent Planning in Model-Free Reinforcement Learning · ICLR 2025
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification · NIPS 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques · NIPS 2024
Hypothesis Testing the Circuit Hypothesis in LLMs · NIPS 2024
Analysing the Generalisation and Reliability of Steering Vectors · NIPS 2024
Towards Automated Circuit Discovery for Mechanistic Interpretability · NIPS 2023
Data augmentation in Bayesian neural networks and the cold posterior effect · UAI 2022
Bayesian Neural Network Priors Revisited · ICLR 2022
Correlated weights in infinite limits of deep convolutional neural networks · UAI 2021
Deep Convolutional Networks as shallow Gaussian Processes · ICLR 2019