Interpretability

7318 directly classified papers

Papers

TaeBench: Improving Quality of Toxic Adversarial Examples NAACL 2025

TactfulToM: Do LLMs have the Theory of Mind ability to understand White Lies? EMNLP 2025

Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer EMNLP 2025

Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization EMNLP 2025

Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty? EMNLP 2025

Sparse Activation Editing for Reliable Instruction Following in Narratives EMNLP 2025

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning EMNLP 2025

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios EMNLP 2025

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements EMNLP 2025

Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance EMNLP 2025

Great Memory, Shallow Reasoning: Limits of kNN-LMs NAACL 2025

Morables: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables EMNLP 2025

Do RAG Systems Really Suffer From Positional Bias? EMNLP 2025

Improving Large Language Model Safety with Contrastive Representation Learning EMNLP 2025

Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection EMNLP 2025

LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder EMNLP 2025

Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles EMNLP 2025

Unsupervised Concept Vector Extraction for Bias Control in LLMs EMNLP 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding EMNLP 2025

Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms EMNLP 2025

Probing Narrative Morals: A New Character-Focused MFT Framework for Use with Large Language Models EMNLP 2025

Probing and Boosting Large Language Models Capabilities via Attention Heads EMNLP 2025

Explaining Differences Between Model Pairs in Natural Language through Sample Learning EMNLP 2025

Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models EMNLP 2025

Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models EMNLP 2025