LLMs (Almost) Never Abstain Under Medical Uncertainty

Alessio Cocchieri; Luca Ragazzi; Giuseppe Tagliavini; Gianluca Moro

ACL 2026

Abstract

Medical multiple-choice question answering (MCQA) benchmarks implicitly assume that large language models (LLMs) should always commit to an answer. However, in clinical practice, uncertainty is pervasive and abstaining is often the safest action. We introduce MedQAbstain, a benchmark explicitly designed to evaluate medical abstention under uncertainty. MedQAbstain repurposes standard medical MCQA datasets by removing the gold answer and introducing an explicit "I abstain" option, framed as a safety-critical decision with clinical consequences. The benchmark supports systematic analysis across abstention regimes, distractor complexity, and input modalities, and elicits self-reported model confidence to study calibration. Across all settings, we find that state-of-the-art LLMs systematically overcommit, rarely abstaining even when the question itself is hidden. These results reveal a fundamental mismatch between LLM behavior and clinical norms, highlighting abstention as a critical but overlooked dimension of medical decision-making evaluation.
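
The repurposing step described in the abstract (remove the gold answer, add an explicit "I abstain" option) can be illustrated with a minimal sketch. This is not the authors' implementation: the `MCQItem` type, the function names, the choice to place the abstain option last, and the simple abstention-rate metric are all assumptions made for illustration, and the actual benchmark may shuffle options, frame the prompt differently, or score abstention in another way.

```python
from dataclasses import dataclass

ABSTAIN = "I abstain"  # option wording quoted in the abstract


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    options: list[str]
    gold_index: int  # position of the correct option in `options`


def to_abstention_item(item: MCQItem) -> tuple[str, list[str], int]:
    """Drop the gold answer and append an explicit abstain option.

    Returns (question, new_options, abstain_index). Once the gold answer is
    removed, no remaining option is correct, so abstaining is the only safe
    choice. Putting the abstain option last is an assumption of this sketch.
    """
    distractors = [o for i, o in enumerate(item.options) if i != item.gold_index]
    new_options = distractors + [ABSTAIN]
    return item.question, new_options, len(new_options) - 1


def abstention_rate(choices: list[int], abstain_indices: list[int]) -> float:
    """Fraction of items on which the chosen index is the abstain option."""
    if not choices:
        return 0.0
    hits = sum(c == a for c, a in zip(choices, abstain_indices))
    return hits / len(choices)


# Usage with a made-up item (not taken from any dataset).
item = MCQItem("Placeholder question?", ["A", "B", "C", "D"], gold_index=2)
question, options, abstain_index = to_abstention_item(item)
```

Under this setup, a model that always commits to one of the remaining distractors would score an abstention rate of 0.0, which is the overcommitment pattern the abstract reports.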

Topics

Healthcare & Medicine > Clinical > Medical AI
Artificial Intelligence > Core AI > Uncertainty Quantification
Artificial Intelligence > Core AI > Evaluation

Keywords

model calibration, clinical decision-making, medical question answering, large language model, abstention behavior
