Reward Alignment Optimization: A Direct Point-wise Alignment Approach

Zelin Li; Jia Leng; Dawei Song; Yangen Hu

ACL 2026

Abstract

Direct Alignment Algorithms (DAAs) such as DPO simplify RLHF by optimizing policies directly from preference pairs. However, the Bradley–Terry probability-gap objective can induce likelihood displacement and, under weak KL constraints, may even reduce the probability of preferred responses, while implicit rewards can be limited in generalization. We propose Reward Alignment Optimization (RAO), a point-wise direct alignment method that uses an explicit reward model to specify exact target generation probabilities and aligns the policy offline towards them. Our key insight is a theoretical principle we call "prefix consistency", which links the normalization terms of prompts that share a prefix. Leveraging this property, RAO decouples target reward differentials from bias terms, prevents the probabilities of preferred responses from decreasing, and better exploits reward information both within and across prompts. Extensive experiments on multiple base LLMs show that RAO consistently outperforms existing DAAs while enabling controllable target probability distributions.
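The likelihood-displacement concern raised in the abstract can be illustrated with the standard DPO loss, which depends only on the gap between the preferred and rejected log-ratios. The sketch below is not the RAO method; it is a minimal illustration of the baseline objective, and the function name `dpo_loss`, the value of `beta`, and all log-probabilities are assumptions chosen for the example.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # Bradley-Terry margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical reference log-probabilities (preferred, rejected).
ref_w, ref_l = -10.0, -12.0

# Policy equal to the reference.
before = dpo_loss(-10.0, -12.0, ref_w, ref_l)

# Both responses lose 5 nats: the preferred response is less likely,
# but the margin, and therefore the loss, is unchanged.
shifted = dpo_loss(-15.0, -17.0, ref_w, ref_l)

# Preferred loses 2 nats, rejected loses 6 nats: the loss goes down
# even though the preferred response became less likely.
lower = dpo_loss(-12.0, -18.0, ref_w, ref_l)

print(before, shifted, lower)
```

Because only the gap enters the loss, nothing in this objective by itself stops the preferred response's probability from falling, which is the behavior a point-wise target is meant to rule out.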

Topics

Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Reinforcement Learning
Deep Learning > Learning Types > Reinforcement Learning from Human Feedback

Keywords

policy alignment; preference pair; direct alignment algorithm; reward alignment optimization; prefix consistency
