preference learning

411 papers

Also known as

DPO PL

Co-occurring keywords

large language model (12755), reinforcement learning (4122), direct preference optimization (317), reinforcement learning from human feedback (261), language model alignment (142), reward model (251), human feedback (161), reward modeling (159), model alignment (219), human preference (120)

Papers

p²-TQA: A Process-based Preference Learning Framework for Self-Improving Table Question Answering Models AACL 2025

LLaVA-Critic: Learning to Evaluate Multimodal Models CVPR 2025

Improving Large Vision and Language Models by Learning from a Panel of Peers ICCV 2025

MVReward: Better Aligning and Evaluating Multi-View Diffusion Models with Human Preferences AAAI 2025

Aligning to Constraints for Data-Efficient Language Model Customization NAACL 2025

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization ICCV 2025

Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models NAACL 2025

Leveraging Human Input to Enable Robust, Interactive, and Aligned AI Systems AAAI 2025

Aligning to What? Limits to RLHF Based Alignment NAACL 2025

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback ACL 2025

Generative Reward Modeling via Synthetic Criteria Preference Learning ACL 2025

Understanding Reference Policies in Direct Preference Optimization NAACL 2025

Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch ACL 2025

Uncertainty-Aware Iterative Preference Optimization for Enhanced LLM Reasoning ACL 2025

MWPO: Enhancing LLMs Performance through Multi-Weight Preference Strength and Length Optimization ACL 2025

Cognitive-Level Adaptive Generation via Capability-Aware Retrieval and Style Adaptation EMNLP 2025

Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter EMNLP 2025

Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs EMNLP 2025

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness CVPR 2025

Tuning Less, Prompting More: In-Context Preference Learning Pipeline for Natural Language Transformation EMNLP 2025

Language Models as Continuous Self-Evolving Data Engineers EMNLP 2025

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents EMNLP 2025

Binary Classifier Optimization for Large Language Model Alignment ACL 2025

Self-Training Meets Consistency: Improving LLMs’ Reasoning with Consistency-Driven Rationale Evaluation NAACL 2025

COPR: Continual Human Preference Learning via Optimal Policy Regularization ACL 2025