DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang; Leonardo F. R. Ribeiro; Momchil Hardalov; Bhuwan Dhingra; Markus Dreyer; Venkatesh Saligrama

ACL 2026

Abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers usually target general-domain atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark for DRR fact-checkers is itself difficult because it requires expert judgments over cognitively demanding, domain-specific claims. In a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on hidden known-answer claims. We therefore propose evolving benchmarking via **Audit-then-Score** (**AtS**), in which labels and rationales remain revisable: when a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before scoring. After three additional **AtS** rounds, expert accuracy rises to 90.9%, showing that experts are better auditors than one-shot labelers. We instantiate **AtS** as **DeepFactBench**, a versioned DRR factuality benchmark with auditable rationales, and introduce **DeepFactEval**, a claim-level verifier. On the frozen **DeepFactBench** release, **DeepFactEval** achieves 83.4% accuracy, outperforming the best prior deep-research and traditional fact-checkers by 14.3 and 24.9 points, respectively, and transferring well to external factuality datasets.
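The Audit-then-Score loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Entry`, `Prediction`, `Benchmark`, `audit_then_score`, and the auditor callback are hypothetical, and the sketch assumes binary labels, one prediction per benchmark claim, and an auditor that simply returns accept or reject.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Entry:
    """Hypothetical benchmark entry: a revisable label plus its rationale."""
    label: bool
    rationale: str


@dataclass
class Prediction:
    """Hypothetical verifier output: a label and the evidence supporting it."""
    label: bool
    evidence: str


@dataclass
class Benchmark:
    """Hypothetical versioned benchmark keyed by claim id."""
    entries: Dict[str, Entry]
    version: int = 0


# Assumed auditor interface: returns True to accept the verifier's challenge.
Auditor = Callable[[str, Entry, Prediction], bool]


def audit_then_score(bench: Benchmark,
                     predictions: Dict[str, Prediction],
                     auditor: Auditor) -> float:
    """Audit every disagreement first, then score against the updated labels."""
    revised = False
    for claim_id, pred in predictions.items():
        entry = bench.entries[claim_id]
        # Only disagreements are escalated, together with the submitted evidence.
        if pred.label != entry.label and auditor(claim_id, entry, pred):
            # Accepted revision: label and rationale are both updated.
            bench.entries[claim_id] = Entry(pred.label, pred.evidence)
            revised = True
    if revised:
        bench.version += 1  # one new benchmark version per round with revisions
    correct = sum(pred.label == bench.entries[cid].label
                  for cid, pred in predictions.items())
    return correct / len(predictions)


if __name__ == "__main__":
    bench = Benchmark({"c1": Entry(True, "initial"), "c2": Entry(False, "initial")})
    preds = {"c1": Prediction(True, "source A"), "c2": Prediction(True, "source B")}
    score = audit_then_score(bench, preds, lambda cid, entry, pred: True)
    print(score, bench.version)
```

The point of the ordering is that scoring happens only after adjudication, so a verifier is not penalized for a disagreement that the auditor upholds. With an auditor that rejects every challenge, the same call reduces to ordinary scoring against fixed labels.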

Topics

Natural Language Processing > Applications > Fact-Checking; Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Evaluation

Keywords

fact verification; language model agent; deep research report
