David Krueger

27 papers · 2017–2025 · 5 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🗺️ Taxonomy Completionist (12) 🌍 Conference Polyglot (5)

🏃 Academic Marathon (8) 🐝 Cross-Pollinator (7) 👑 Triple Crown 🧬 Topic Evolution 🔥 Unstoppable (5) ⚡ Prolific Year (10) 📈 Trend Setter 💎 Century Club (27) 🗃️ Keyword Collector (79)

Conferences

ICML (10) ICLR (9) NIPS (6) ACML (1) NAACL (1)

Top co-authors

Bruno Kacper Mlodozeniec (4) Fazl Barez (3) Stephen Chung (3) Dmitrii Krasheninnikov (3) Philip Torr (3) Tegan Maharaj (3) Aaron Courville (3) Ekdeep Singh Lubana (3) Hidenori Tanaka (2) Robert Kirk (2)

Keywords

reinforcement learning (3) world model (2) large language model (2) out-of-distribution generalization (2) causal inference (2) domain generalization (1) group theory (1) adversarial robustness (1) neural network interpretability (1) model safety (1) feature learning (1) loss landscape (1) model evaluation (1) reinforcement learning from human feedback (1) probability distribution (1) covariate shift (1) model-based reinforcement learning (1) reward function (1) action prediction (1) data augmentation (1)

Papers

Analyzing (In)Abilities of SAEs via Formal Languages · NAACL 2025
Interpreting Emergent Planning in Model-Free Reinforcement Learning · ICLR 2025
Protecting against simultaneous data poisoning attacks · ICLR 2025
Towards Interpreting Visual Information Processing in Vision-Language Models · ICLR 2025
Input Space Mode Connectivity in Deep Neural Networks · ICLR 2025
Influence Functions for Scalable Data Attribution in Diffusion Models · ICLR 2025
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret · ICML 2025
PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data · ICML 2025
Position: Humanity Faces Existential Risk from Gradual Disempowerment · ICML 2025
Position: Probabilistic Modelling is Sufficient for Causal Inference · ICML 2025
Reward Model Ensembles Help Mitigate Overoptimization · ICLR 2024
Implicit meta-learning may lead language models to trust more reliable sources · ICML 2024
Interpreting Learned Feedback Patterns in Large Language Models · NIPS 2024
Predicting Future Actions of Reinforcement Learning Agents · NIPS 2024
Stress-Testing Capability Elicitation With Password-Locked Models · NIPS 2024
A Generative Model of Symmetry Transformations · NIPS 2024
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks · ICLR 2024
Mechanistic Mode Connectivity · ICML 2023
Thinker: Learning to Plan and Act · NIPS 2023
Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics · ICLR 2023
Broken Neural Scaling Laws · ICLR 2023
Defining and Characterizing Reward Gaming · NIPS 2022
Goal Misgeneralization in Deep Reinforcement Learning · ICML 2022
Out-of-Distribution Generalization via Risk Extrapolation (REx) · ICML 2021
Neural Autoregressive Flows · ICML 2018
Nested LSTMs · ACML 2017
A Closer Look at Memorization in Deep Networks · ICML 2017