WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement

Fangyuan Li; Pengfei Li; Shijie Wang; Junqi Gao; Jianxing Liu; Biqing Qi; Yuqiang Li

ACL 2026

Abstract

Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improving language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without requiring any pre-arranged domain corpus. WIST incrementally expands a domain tree to structure exploration, then retrieves and cleans path-consistent web evidence to construct a controllable training environment. It then performs Challenger-Solver self-play with verifiable rewards and feeds learnability signals back to update node posteriors, guiding subsequent exploration through an adaptive curriculum. Across four backbones, WIST consistently improves over the base models and typically outperforms both purely endogenous self-evolution and corpus-grounded self-play baselines, with Overall gains reaching +9.8 (Qwen3-4B-Base) and +9.7 (OctoThinker-8B-Hybrid-Base). WIST is also domain-steerable, improving Qwen3-8B-Base by +14.79 in medicine and Qwen3-4B-Base by +5.28 on PhyBench. Ablations further confirm the importance of WIST’s key components for stable open-web learning. Our code is available at https://github.com/lfy-123/WIST.
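The feedback loop described above (learnability signals updating node posteriors, which then steer exploration) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the posterior family, the selection rule, or the learnability formula. The sketch assumes Beta posteriors per node, Thompson sampling for node selection, and a hypothetical learnability score that peaks when the Solver succeeds about half the time. All names (`Node`, `learnability`, `select_leaf`, `update`) are illustrative.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One topic in a hypothetical domain tree with a Beta(alpha, beta) posterior."""
    name: str
    alpha: float = 1.0  # pseudo-count of "learnable" evidence (assumed prior)
    beta: float = 1.0   # pseudo-count of "not learnable" evidence (assumed prior)
    children: List["Node"] = field(default_factory=list)


def learnability(solver_accuracy: float) -> float:
    """Assumed signal in [0, 1]: 1 at 50% Solver accuracy, 0 at 0% or 100%."""
    return 1.0 - abs(2.0 * solver_accuracy - 1.0)


def select_leaf(root: Node, rng: random.Random) -> List[Node]:
    """Walk from the root to a leaf, choosing each child by Thompson sampling."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: rng.betavariate(c.alpha, c.beta))
        path.append(node)
    return path


def update(path: List[Node], signal: float) -> None:
    """Credit every node on the path with a fractional Beta update."""
    for node in path:
        node.alpha += signal
        node.beta += 1.0 - signal


# Usage: pick a topic path, pretend the Solver scored 50% on it, update posteriors.
rng = random.Random(0)
tree = Node("medicine", children=[Node("cardiology"), Node("pharmacology")])
chosen = select_leaf(tree, rng)
update(chosen, learnability(0.5))
```

Under these assumptions, topics that are too easy or too hard for the Solver accumulate mass in `beta` and are sampled less often, which gives one simple form of adaptive curriculum.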

Topics

Reinforcement Learning > Methods > Deep RL
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Reinforcement Learning

Keywords

reinforcement learning; language model; web evidence; domain tree
