Adrià Garriga-Alonso
10 papers · 2019–2025 · 3 conferences · across top CS/AI conferences
Achievements
Keyword Pioneer
Interdisciplinary Bridge
Conference Polyglot (3)
Academic Marathon (6)
Cross-Pollinator (11)
Keyword Champion (2)
Century Club (10)
Unstoppable (5)
Conferences
NIPS (5)
ICLR (3)
UAI (2)
Keywords
mechanistic interpretability (3)
circuit discovery (2)
neural network analysis (2)
language model (2)
policy optimization (1)
data augmentation (1)
kl divergence (1)
model behavior (1)
model interpretability (1)
reinforcement learning from human feedback (1)
hypothesis testing (1)
convolutional neural network (1)
heavy-tailed distribution (1)
reward misspecification (1)
reward hacking (1)
circuit analysis (1)
neural network verification (1)
bayesian neural network (1)
steering vector (1)
causal model (1)
Papers
Interpreting Emergent Planning in Model-Free Reinforcement Learning
ICLR 2025
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
NIPS 2024
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
NIPS 2024
Hypothesis Testing the Circuit Hypothesis in LLMs
NIPS 2024
Analysing the Generalisation and Reliability of Steering Vectors
NIPS 2024
Towards Automated Circuit Discovery for Mechanistic Interpretability
NIPS 2023
Data augmentation in Bayesian neural networks and the cold posterior effect
UAI 2022
Bayesian Neural Network Priors Revisited
ICLR 2022
Correlated weights in infinite limits of deep convolutional neural networks
UAI 2021
Deep Convolutional Networks as shallow Gaussian Processes
ICLR 2019