Interpretability

7318 directly classified papers

Papers

Multi-Attribute Steering of Language Models via Targeted Intervention ACL 2025

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

Efficient Counterexample-Guided Fairness Verification and Repair of Neural Networks Using Satisfiability Modulo Convex Programming IJCAI 2025

How well do LLMs reason over tabular data, really? ACL 2025

Vulnerability of LLMs to Vertically Aligned Text Manipulations ACL 2025

Steering off Course: Reliability Challenges in Steering Language Models ACL 2025

FiRC-NLP at SemEval-2025 Task 3: Exploring Prompting Approaches for Detecting Hallucinations in LLMs ACL 2025

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? ACL 2025

ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs ACL 2025

Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach CVPR 2025

COGUMELO at SemEval-2025 Task 3: A Synthetic Approach to Detecting Hallucinations in Language Models based on Named Entity Recognition ACL 2025

SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection ACL 2025

Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing ACL 2025

Hierarchical Attention Generates Better Proofs ACL 2025

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation CVPR 2025

SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View ACL 2025

Neuron-Level Sequential Editing for Large Language Models ACL 2025

Interpretable Generative Models through Post-hoc Concept Bottlenecks CVPR 2025

FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error CVPR 2025

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges ACL 2025

IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory ACL 2025

Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies IJCNLP 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

Towards Objective Fine-tuning: How LLMs’ Prior Knowledge Causes Potential Poor Calibration? ACL 2025

Inner Information Analysis Algorithm for Deep Neural Network based on Community ICLR 2025