Jacob Steinhardt

68 papers · 2011–2025 · 10 top CS/AI conferences

Achievements


🧭 Keyword Pioneer 🐣 Hot Topic Early Bird 🗺️ Taxonomy Completionist (16) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (10)

🏃 Academic Marathon (14) 🏠 Conference Loyalist (21) 🌟 Keyword Trendsetter Combo (4) 🤝 Dynamic Duo (12) 👑 Triple Crown 🧬 Topic Evolution 🏆 Keyword Champion (3) 🔥 Unstoppable (12) 🚀 Conference Pioneer ⚡ Prolific Year (9) 🗃️ Keyword Collector (216) ❓ The Questioner (9) 💎 Century Club (68) 📈 Trend Setter

Conferences

ICML (21) NIPS (18) ICLR (16) CVPR (4) AISTATS (3) COLT (2) ACL (1) ICCV (1) IJCNLP (1) RSS (1)

Top co-authors

Percy Liang (12) Dan Hendrycks (8) Ruiqi Zhong (8) Dawn Song (8) Erik Jones (6) Dan Klein (6) Steven Basart (6) Alexander Wei (5) Mantas Mazeika (5) Andy Zou (5)

Research topics

Optimization (1)

Keywords

language model (6) large language model (4) adversarial robustness (3) data augmentation (3) approximate inference (3) scaling law (3) distribution shift (3) adversarial example (3) neural network (3) semidefinite programming (3) out-of-distribution detection (3) regret bound (2) latent variable (2) natural language (2) sparse linear regression (2) event forecasting (2) anomaly detection (2) sample complexity (2) image classification (2) model robustness (2)

Papers

Interpreting the Second-Order Effects of Neurons in CLIP ICLR 2025
Monitoring Latent World States in Language Models with Propositional Probes ICLR 2025
Language Models Learn to Mislead Humans via RLHF ICLR 2025
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models ICLR 2025
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language ICLR 2025
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts ICML 2025
Adversaries Can Misuse Combinations of Safe Models ICML 2025
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision ICLR 2025
What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning? ICML 2025
Which Attention Heads Matter for In-Context Learning? ICML 2025
Eliciting Language Model Behaviors with Investigator Agents ICML 2025
Overthinking the Truth: Understanding how Language Models Process False Demonstrations ICLR 2024
Feedback Loops With Language Models Drive In-Context Reward Hacking ICML 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation ICML 2024
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations ICML 2024
Approaching Human-Level Forecasting with Language Models NIPS 2024
Explaining Datasets in Words: Statistical Models with Natural Language Parameters NIPS 2024
Describing Differences in Image Sets with Natural Language CVPR 2024
Interpreting CLIP's Image Representation via Text-Based Decomposition ICLR 2024
How do Language Models Bind Entities in Context? ICLR 2024
Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition NIPS 2023
Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws AISTATS 2023
Goal Driven Discovery of Distributional Differences via Language Descriptions NIPS 2023
Mass-Producing Failures of Multimodal Systems with Language Models NIPS 2023
Supply-Side Equilibria in Recommender Systems NIPS 2023
Progress measures for grokking via mechanistic interpretability ICLR 2023
Jailbroken: How Does LLM Safety Training Fail? NIPS 2023
Discovering Latent Knowledge in Language Models Without Supervision ICLR 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small ICLR 2023
Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations ICML 2023
Automatically Auditing Large Language Models via Discrete Optimization ICML 2023
Capturing Failures of Large Language Models via Human Cognitive Biases NIPS 2022
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios NIPS 2022
Forecasting Future World Events With Neural Networks NIPS 2022
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures CVPR 2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models ICLR 2022
Scaling Out-of-Distribution Detection for Real-World Settings ICML 2022
More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize ICML 2022
Predicting Out-of-Distribution Error with the Projection Norm ICML 2022
Describing Differences between Text Distributions with Natural Language ICML 2022
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization ICCV 2021
Natural Adversarial Examples CVPR 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level IJCNLP 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level ACL 2021
Learning Equilibria in Matching Markets from Bandit Feedback NIPS 2021
Grounding Representation Similarity Through Statistical Testing NIPS 2021
Aligning AI With Shared Human Values ICLR 2021
Measuring Massive Multitask Language Understanding ICLR 2021
Limitations of Post-Hoc Feature Alignment for Robustness CVPR 2021
Identifying Statistical Bias in Dataset Replication ICML 2020
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks ICML 2020
Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming NIPS 2020
Sever: A Robust Meta-Algorithm for Stochastic Optimization ICML 2019
Certified Defenses against Adversarial Examples ICLR 2018
Semidefinite relaxations for certifying robustness to adversarial examples NIPS 2018
Certified Defenses for Data Poisoning Attacks NIPS 2017
Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction NIPS 2016
Unsupervised Risk Estimation Using Only Conditional Independence Structure NIPS 2016
Memory, Communication, and Statistical Queries COLT 2016
Learning Fast-Mixing Models for Structured Prediction ICML 2015
Learning with Relaxed Supervision NIPS 2015
Minimax rates for memory-bounded sparse linear regression COLT 2015
Learning Where to Sample in Structured Prediction AISTATS 2015
Reified Context Models ICML 2015
Filtering with Abstract Particles ICML 2014
Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm ICML 2014
Flexible Martingale Priors for Deep Hierarchies AISTATS 2012
Finite-Time Regional Verification of Stochastic Nonlinear Systems RSS 2011