Jacob Steinhardt
68 papers · 2011–2025 · 10 top CS/AI conferences
Achievements
Keyword Pioneer
Hot Topic Early Bird
Taxonomy Completionist (16)
Interdisciplinary Bridge
Conference Polyglot (10)
Academic Marathon (14)
Conference Loyalist (21)
Keyword Trendsetter Combo (4)
Dynamic Duo (12)
Triple Crown
Topic Evolution
Keyword Champion (3)
Unstoppable (12)
Conference Pioneer
Prolific Year (9)
Keyword Collector (216)
The Questioner (9)
Century Club (68)
Trend Setter
Conferences
ICML (21)
NIPS (18)
ICLR (16)
CVPR (4)
AISTATS (3)
COLT (2)
ACL (1)
ICCV (1)
IJCNLP (1)
RSS (1)
Research topics
Keywords
language model (6)
large language model (4)
adversarial robustness (3)
data augmentation (3)
approximate inference (3)
scaling law (3)
distribution shift (3)
adversarial example (3)
neural network (3)
semidefinite programming (3)
out-of-distribution detection (3)
regret bound (2)
latent variable (2)
natural language (2)
sparse linear regression (2)
event forecasting (2)
anomaly detection (2)
sample complexity (2)
image classification (2)
model robustness (2)
Papers
Interpreting the Second-Order Effects of Neurons in CLIP
ICLR 2025
Monitoring Latent World States in Language Models with Propositional Probes
ICLR 2025
Language Models Learn to Mislead Humans via RLHF
ICLR 2025
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
ICLR 2025
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
ICLR 2025
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
ICML 2025
Adversaries Can Misuse Combinations of Safe Models
ICML 2025
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
ICLR 2025
What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?
ICML 2025
Which Attention Heads Matter for In-Context Learning?
ICML 2025
Eliciting Language Model Behaviors with Investigator Agents
ICML 2025
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
ICLR 2024
Feedback Loops With Language Models Drive In-Context Reward Hacking
ICML 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
ICML 2024
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
ICML 2024
Approaching Human-Level Forecasting with Language Models
NIPS 2024
Explaining Datasets in Words: Statistical Models with Natural Language Parameters
NIPS 2024
Describing Differences in Image Sets with Natural Language
CVPR 2024
Interpreting CLIP's Image Representation via Text-Based Decomposition
ICLR 2024
How do Language Models Bind Entities in Context?
ICLR 2024
Improved Bayes Risk Can Yield Reduced Social Welfare Under Competition
NIPS 2023
Reward Learning as Doubly Nonparametric Bandits: Optimal Design and Scaling Laws
AISTATS 2023
Goal Driven Discovery of Distributional Differences via Language Descriptions
NIPS 2023
Mass-Producing Failures of Multimodal Systems with Language Models
NIPS 2023
Supply-Side Equilibria in Recommender Systems
NIPS 2023
Progress measures for grokking via mechanistic interpretability
ICLR 2023
Jailbroken: How Does LLM Safety Training Fail?
NIPS 2023
Discovering Latent Knowledge in Language Models Without Supervision
ICLR 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
ICLR 2023
Are Neurons Actually Collapsed? On the Fine-Grained Structure in Neural Representations
ICML 2023
Automatically Auditing Large Language Models via Discrete Optimization
ICML 2023
Capturing Failures of Large Language Models via Human Cognitive Biases
NIPS 2022
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
NIPS 2022
Forecasting Future World Events With Neural Networks
NIPS 2022
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
CVPR 2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
ICLR 2022
Scaling Out-of-Distribution Detection for Real-World Settings
ICML 2022
More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize
ICML 2022
Predicting Out-of-Distribution Error with the Projection Norm
ICML 2022
Describing Differences between Text Distributions with Natural Language
ICML 2022
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
ICCV 2021
Natural Adversarial Examples
CVPR 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
IJCNLP 2021
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
ACL 2021
Learning Equilibria in Matching Markets from Bandit Feedback
NIPS 2021
Grounding Representation Similarity Through Statistical Testing
NIPS 2021
Aligning AI With Shared Human Values
ICLR 2021
Measuring Massive Multitask Language Understanding
ICLR 2021
Limitations of Post-Hoc Feature Alignment for Robustness
CVPR 2021
Identifying Statistical Bias in Dataset Replication
ICML 2020
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks
ICML 2020
Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming
NIPS 2020
Sever: A Robust Meta-Algorithm for Stochastic Optimization
ICML 2019
Certified Defenses against Adversarial Examples
ICLR 2018
Semidefinite relaxations for certifying robustness to adversarial examples
NIPS 2018
Certified Defenses for Data Poisoning Attacks
NIPS 2017
Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction
NIPS 2016
Unsupervised Risk Estimation Using Only Conditional Independence Structure
NIPS 2016
Memory, Communication, and Statistical Queries
COLT 2016
Learning Fast-Mixing Models for Structured Prediction
ICML 2015
Learning with Relaxed Supervision
NIPS 2015
Minimax rates for memory-bounded sparse linear regression
COLT 2015
Learning Where to Sample in Structured Prediction
AISTATS 2015
Reified Context Models
ICML 2015
Filtering with Abstract Particles
ICML 2014
Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm
ICML 2014
Flexible Martingale Priors for Deep Hierarchies
AISTATS 2012
Finite-Time Regional Verification of Stochastic Nonlinear Systems
RSS 2011