reward modeling

159 papers

Also known as

RLHF RM

Co-occurring keywords

large language model (12755), reinforcement learning (4122), reinforcement learning from human feedback (261), reward model (251), preference learning (411), language model alignment (142), human feedback (161), direct preference optimization (317), policy optimization (630), language model (4573)

Papers

Rating-Based Reinforcement Learning AAAI 2024

Rethinking the Role of Proxy Rewards in Language Model Alignment EMNLP 2024

Reward Modeling Requires Automatic Adjustment Based on Data Quality EMNLP 2024

Sing it, Narrate it: Quality Musical Lyrics Translation EMNLP 2024

Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents EMNLP 2024

Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning EMNLP 2024

A General Theoretical Paradigm to Understand Learning from Human Preferences AISTATS 2024

Aligning to Thousands of Preferences via System Message Generalization NIPS 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification NIPS 2024

LLM Evaluators Recognize and Favor Their Own Generations NIPS 2024

KAUCUS - Knowledgeable User Simulators for Training Large Language Models EACL 2024

When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback NIPS 2024

Direct Preference-based Policy Optimization without Reward Modeling NIPS 2023

Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons ICML 2023

Aligning Large Language Models through Synthetic Feedback EMNLP 2023

Reward Gaming in Conditional Text Generation ACL 2023

Aligning Factual Consistency for Clinical Studies Summarization through Reinforcement Learning ACL 2023

Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback EMNLP 2023

A Last Switch Dependent Analysis of Satiation and Seasonality in Bandits AISTATS 2022

Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning NIPS 2022

Aligning Generative Language Models with Human Values NAACL 2022

Thank you BART! Rewarding Pre-Trained Models Improves Formality Style Transfer IJCNLP 2021

Imperfect also Deserves Reward: Multi-Level and Sequential Reward Modeling for Better Dialog Management NAACL 2021

Reconciling Rewards with Predictive State Representations IJCAI 2021

PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training ICML 2021