Interpretability

7318 directly classified papers

Papers

Revealing and Mitigating the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing EMNLP 2025

Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations EMNLP 2025

Data Descriptions from Large Language Models with Influence Estimation EMNLP 2025

Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations EMNLP 2025

LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference EMNLP 2025

From Input Perception to Predictive Insight: Modeling Model Blind Spots Before They Become Errors EMNLP 2025

AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts EMNLP 2025

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs EMNLP 2025

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions EMNLP 2025

NormXLogit: The Head-on-Top Never Lies EMNLP 2025

FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks EMNLP 2025

A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs EMNLP 2025

Quantifying Logical Consistency in Transformers via Query-Key Alignment EMNLP 2025

CourtReasoner: Can LLM Agents Reason Like Judges? EMNLP 2025

Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews EMNLP 2025

Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models EMNLP 2025

LingConv: An Interactive Toolkit for Controlled Paraphrase Generation with Linguistic Attribute Control EMNLP 2025

AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories EMNLP 2025

CafGa: Customizing Feature Attributions to Explain Language Models EMNLP 2025

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models EMNLP 2025

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment EMNLP 2025

o-MEGA: Optimized Methods for Explanation Generation and Analysis EMNLP 2025

TRACE: Training and Inference-Time Interpretability Analysis for Language Models EMNLP 2025

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens EMNLP 2025

Hybrid Concept Bottleneck Models CVPR 2025