DisCal: Distribution-Aware Calibration for Mathematical Reasoning Under Character-Level Noisy Inputs

Bo Zhang; Jiawei Zhang; Cong Gao; Bingxu Han; Minghao Hu; Jun Zhang; Yunbo Cao; Zhunchen Luo; Wen Yao; Guotong Geng; Zhong Wang

ACL 2026

Abstract

Although large reasoning models (LRMs) exhibit exceptional mathematical reasoning capabilities on clean inputs, their reasoning accuracy drops substantially in the presence of character-level noise such as typographical errors. Critically, their confidence estimates fail to reflect the corresponding decline in reasoning accuracy. While confidence calibration offers a principled solution, existing methods predominantly target clean inputs, leaving noisy scenarios largely unexplored. To address this gap, we propose DisCal (Distribution-aware Calibration), a confidence calibration framework for character-level noisy inputs. DisCal extracts uncertainty signals from both the empirical answer distribution and the model’s predictive distribution, and integrates them via a learned calibrator to produce well-calibrated confidence. Experiments across multiple mathematical reasoning benchmarks demonstrate that DisCal consistently outperforms existing calibration methods under noisy inputs, reducing Expected Calibration Error (ECE) by up to 39.21% and improving Area Under the Receiver Operating Characteristic Curve (AUROC) by up to 31.44%.
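The abstract does not specify which signals DisCal extracts or what form its calibrator takes. As a rough illustration of the general pipeline it describes, the Python sketch below makes several assumptions. It uses three hypothetical features: majority-answer agreement and normalized entropy over several sampled final answers, plus the geometric-mean token probability of one reasoning trace. A plain logistic regression stands in for the learned calibrator. ECE is computed with equal-width bins, and AUROC in its rank-based (Mann-Whitney) form. None of these choices should be read as the paper's actual method, and the toy data is made up.

```python
import math
from collections import Counter


def answer_distribution_features(answers):
    """Hypothetical signals from the empirical answer distribution: (majority answer,
    agreement, normalized entropy), where agreement = share of samples equal to the majority."""
    counts = Counter(answers)
    n = len(answers)
    majority, top = counts.most_common(1)[0]
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by log(n) so the value lies in [0, 1]; one sample has zero entropy.
    norm_entropy = entropy / math.log(n) if n > 1 else 0.0
    return majority, top / n, norm_entropy


def predictive_features(token_logprobs):
    """Hypothetical predictive-distribution signal for one reasoning trace:
    mean token log-probability and the geometric-mean token probability."""
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    return mean_lp, math.exp(mean_lp)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class LogisticCalibrator:
    """Stand-in learned calibrator: logistic regression fit by batch gradient descent."""

    def __init__(self, n_features, lr=0.5, epochs=2000):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr, self.epochs = lr, epochs

    def predict(self, x):
        return sigmoid(sum(wj * xj for wj, xj in zip(self.w, x)) + self.b)

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            grad_w = [0.0] * len(self.w)
            grad_b = 0.0
            for x, label in zip(X, y):
                err = self.predict(x) - label  # d(log-loss)/d(logit)
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
                grad_b += err
            self.w = [wj - self.lr * gj / n for wj, gj in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self


def expected_calibration_error(confidences, labels, n_bins=10):
    """Equal-width-bin ECE: sum over bins B of |B|/N * |acc(B) - conf(B)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))  # c = 1.0 -> last bin
    ece = 0.0
    for b in bins:
        if b:
            acc = sum(y for _, y in b) / len(b)
            conf = sum(c for c, _ in b) / len(b)
            ece += len(b) / n * abs(acc - conf)
    return ece


def auroc(scores, labels):
    """AUROC as P(score of a correct answer > score of a wrong one); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy usage on made-up samples (illustrative only; not data from the paper).
toy = [  # (sampled final answers, token log-probs of one trace, is_correct)
    (["42", "42", "42", "42"], [-0.1, -0.2, -0.1], 1),
    (["12", "12", "12", "15"], [-0.3, -0.4, -0.2], 1),
    (["7", "9", "7", "3"], [-1.2, -0.9, -1.5], 0),
    (["5", "8", "2", "6"], [-2.0, -1.7, -2.4], 0),
]
X, y = [], []
for answers, logprobs, correct in toy:
    _, agreement, entropy = answer_distribution_features(answers)
    _, geo_prob = predictive_features(logprobs)
    X.append([agreement, entropy, geo_prob])
    y.append(correct)
cal = LogisticCalibrator(n_features=3).fit(X, y)
conf = [cal.predict(x) for x in X]
print("ECE:", expected_calibration_error(conf, y), "AUROC:", auroc(conf, y))
```

AUROC depends only on how confidences rank correct against incorrect answers. ECE also depends on their absolute values. A calibrator can therefore improve one metric without the other, which is why it is useful to report both.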

Topics

Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Uncertainty Quantification; Machine Learning > Learning Types > Calibration

Keywords

mathematical reasoning, confidence calibration, expected calibration error, predictive distribution, large reasoning model, character-level noise
