J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Austin Xu; Yilun Zhou; Xuan-Phi Nguyen; Caiming Xiong; Shafiq Joty

2026 ACL ACL 2026

J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Abstract

AbstractTo keep pace with the increasing velocity of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

Authors

Austin Xu , Yilun Zhou , Xuan-Phi Nguyen , Caiming Xiong , Shafiq Joty

Topics

Artificial Intelligence > Core AI > Reasoning Deep Learning > Learning Types > Reinforcement Learning Artificial Intelligence > Core AI > Evaluation

Keywords

reinforcement learning position bia reasoning evaluation group relative policy optimization

Download PDF

Related papers

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand 2026

One-step Nonautoregressive Natural Language Generation with Shortcut Flow Matching Models 2026

Optimizing Retrieval-Augmented Generation for E-Commerce How-To Assistance 2026

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing 2026

MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation 2026