Himabindu Lakkaraju

43 papers · 2016–2026 · 9 top CS/AI conferences

Achievements

🌍 Conference Polyglot (7) 🏃 Academic Marathon (9) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (13)

🗺️ Taxonomy Completionist (51) 🧭 Keyword Pioneer 🌍 Conference Polyglot (7) 🔬 Deep Specialist (20) 🏆 Keyword Champion (2) 👑 Triple Crown 🏆 Grand Slam 🧬 Topic Evolution 🔥 Unstoppable (6) ⚡ Prolific Year (7) 💎 Century Club (40) ❓ The Questioner (3) 🗃️ Keyword Collector (147)

Conferences

NIPS (16) AISTATS (6) ICML (6) ICLR (4) UAI (4) NAACL (3) ACL (2) AAAI (1) EACL (1)

Top co-authors

Chirag Agarwal (9) Suraj Srinivas (8) Martin Pawelczyk (6) Satyapriya Krishna (5) Jiaqi Ma (4) Dylan Slack (3) Zhenting Qi (3) Tessa Han (3) Marinka Zitnik (3) Usha Bhalla (3)

Research topics

Privacy (1)

Keywords

counterfactual explanation (5) large language model (5) feature attribution (4) adversarial training (4) model interpretability (4) post hoc explanation (4) adversarial robustness (4) uncertainty quantification (3) algorithmic recourse (3) algorithmic fairness (3) benchmark evaluation (2) chain-of-thought prompting (2) right to be forgotten (2) post-hoc explanation (2) data deletion (2) generative model (2) interpretable model (2) causal inference (2) in-context learning (2) explainable ai (2)

Papers

Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders EACL 2026
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models ACL 2026
How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior ACL 2026
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems ICLR 2025
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning NAACL 2025
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness ICLR 2025
Quantifying Generalization Complexity for Large Language Models ICLR 2025
In-Context Unlearning: Language Models as Few-Shot Unlearners ICML 2024
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models NIPS 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) NIPS 2024
Quantifying Uncertainty in Natural Language Explanations of Large Language Models AISTATS 2024
Fair Machine Unlearning: Data Removal while Mitigating Disparities AISTATS 2024
Understanding the Effects of Iterative Prompting on Truthfulness ICML 2024
Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications NAACL 2024
A Study on the Calibration of In-context Learning NAACL 2024
Characterizing Data Point Vulnerability as Average-Case Robustness UAI 2024
Towards Bridging the Gaps between the Right to Explanation and the Right to be Forgotten ICML 2023
$\mathcal{M}^4$: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models NIPS 2023
On the Privacy Risks of Algorithmic Recourse AISTATS 2023
On Minimizing the Impact of Dataset Shifts on Actionable Explanations UAI 2023
Probabilistically Robust Recourse: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse ICLR 2023
Post Hoc Explanations of Language Models Can Improve Language Models NIPS 2023
Discriminative Feature Attributions: Bridging Post Hoc Explainability and Inherent Interpretability NIPS 2023
On the Impact of Algorithmic Recourse on Social Segregation ICML 2023
Which Models have Perceptually-Aligned Gradients? An Explanation via Off-Manifold Robustness NIPS 2023
Efficient Training of Low-Curvature Neural Networks NIPS 2022
OpenXAI: Towards a Transparent Evaluation of Model Explanations NIPS 2022
Data poisoning attacks on off-policy policy evaluation methods UAI 2022
Probing GNN Explainers: A Rigorous Theoretical and Empirical Analysis of GNN Explanation Methods AISTATS 2022
Exploring Counterfactual Explanations Through the Lens of Adversarial Examples: A Theoretical and Empirical Analysis AISTATS 2022
Which Explanation Should I Choose? A Function Approximation Perspective to Characterizing Post Hoc Explanations NIPS 2022
Towards a unified framework for fair and stable graph representation learning UAI 2021
Learning Models for Actionable Recourse NIPS 2021
Towards Robust and Reliable Algorithmic Recourse NIPS 2021
Towards the Unification and Robustness of Perturbation and Gradient Based Explanations ICML 2021
Counterfactual Explanations Can Be Manipulated NIPS 2021
Reliable Post hoc Explanations: Modeling Uncertainty in Explainability NIPS 2021
Fair Influence Maximization: a Welfare Optimization Approach AAAI 2021
Beyond Individualized Recourse: Interpretable and Interactive Summaries of Actionable Recourses NIPS 2020
Robust and Stable Black Box Explanations ICML 2020
Incorporating Interpretable Output Constraints in Bayesian Neural Networks NIPS 2020
Learning Cost-Effective and Interpretable Treatment Regimes AISTATS 2017
Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making NIPS 2016