Interpretability

7,318 papers

Papers

Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models EMNLP 2025

From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation EMNLP 2025

Do We Know What LLMs Don’t Know? A Study of Consistency in Knowledge Probing EMNLP 2025

Probing Political Ideology in Large Language Models: How Latent Political Representations Generalize Across Tasks EMNLP 2025

Understanding GUI Agent Localization Biases through Logit Sharpness EMNLP 2025

No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery EMNLP 2025

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models EMNLP 2025

SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models EMNLP 2025

PropXplain: Can LLMs Enable Explainable Propaganda Detection? EMNLP 2025

Promptception: How Sensitive Are Large Multimodal Models to Prompts? EMNLP 2025

Evaluating Compound AI Systems through Behaviors, Not Benchmarks EMNLP 2025

Dissecting Persona-Driven Reasoning in Language Models via Activation Patching EMNLP 2025

Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation EMNLP 2025

Mixed Signals: Decoding VLMs’ Reasoning and Underlying Bias in Vision-Language Conflict EMNLP 2025

Mitigating Hallucination in Large Vision-Language Models through Aligning Attention Distribution to Information Flow EMNLP 2025

Reliability Crisis of Reference-free Metrics for Grammatical Error Correction EMNLP 2025

Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks EMNLP 2025

Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering EMNLP 2025

SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization EMNLP 2025

Who’s the Author? How Explanations Impact User Reliance in AI-Assisted Authorship Attribution EMNLP 2025

How to Generalize the Detection of AI-Generated Text: Confounding Neurons EMNLP 2025

Saudi-Alignment Benchmark: Assessing LLMs Alignment with Cultural Norms and Domain Knowledge in the Saudi Context EMNLP 2025

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs EMNLP 2025

IslamicEval 2025: The First Shared Task of Capturing LLMs Hallucination in Islamic Content EMNLP 2025

Two ways into the hall of mirrors: Language exposure and lossy memory drive cross-linguistic grammaticality illusions in language models EMNLP 2025