Interpretability

7,318 papers

Papers

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders EMNLP 2025

Evil twins are not that evil: Qualitative insights into machine-generated prompts EMNLP 2025

Steering Prepositional Phrases in Language Models: A Case of with-headed Adjectival and Adverbial Complements in Gemma-2 EMNLP 2025

Not a nuisance but a useful heuristic: Outlier dimensions favor frequent tokens in language models EMNLP 2025

Interpreting Language Models Through Concept Descriptions: A Survey EMNLP 2025

When LRP Diverges from Leave-One-Out in Transformers EMNLP 2025

Circuit-Tracer: A New Library for Finding Feature Circuits EMNLP 2025

Mechanistic Fine-tuning for In-context Learning EMNLP 2025

Understanding How CodeLLMs (Mis)Predict Types with Activation Steering EMNLP 2025

The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models EMNLP 2025

Exploring Large Language Models’ World Perception: A Multi-Dimensional Evaluation through Data Distribution EMNLP 2025

On the Representations of Entities in Auto-regressive Large Language Models EMNLP 2025

Can Language Neuron Intervention Reduce Non-Target Language Output? EMNLP 2025

Fine-Grained Manipulation of Arithmetic Neurons EMNLP 2025

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks EMNLP 2025

BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection EMNLP 2025

BlackboxNLP-2025 MIB Shared Task: IPE: Isolating Path Effects for Improving Latent Circuit Identification EMNLP 2025

BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods EMNLP 2025

Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models EMNLP 2025

Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals EMNLP 2025

Entity Tracking in Small Language Models: An Attention-Based Study of Parameter-Efficient Fine-Tuning EMNLP 2025

TripleCheck: Transparent Post-Hoc Verification of Biomedical Claims in AI-Generated Answers EMNLP 2025

Predictive Modeling of Human Developers’ Evaluative Judgment of Generated Code as a Decision Process EMNLP 2025

From Regulation to Interaction: Expert Views on Aligning Explainable AI with the EU AI Act EMNLP 2025

FIRMA: Bidirectional Formal-Informal Mathematical Language Alignment with Proof-Theoretic Grounding EMNLP 2025