Papers
Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs
ACL 2024
Policy Mirror Descent with Lookahead
NeurIPS 2024
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
ACL 2024
Variational Delayed Policy Optimization
NeurIPS 2024
Off-Agent Trust Region Policy Optimization
IJCAI 2024