Reinforcement Learning

2932 directly classified papers

Papers

Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up ACL 2025

Steering LLM Reasoning Through Bias-Only Adaptation EMNLP 2025

Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models EMNLP 2025

Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework EMNLP 2025

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment ACL 2025

Structural Reward Model: Enhancing Interpretability, Efficiency, and Scalability in Reward Modeling EMNLP 2025

Mapping Smarter, Not Harder: A Test-Time Reinforcement Learning Agent That Improves Without Labels or Model Updates EMNLP 2025

T-REG: Preference Optimization with Token-Level Reward Regularization ACL 2025

AutoDSPy: Automating Modular Prompt Design with Reinforcement Learning for Small and Large Language Models EMNLP 2025

STACKFEED: Structured Textual Actor-Critic Knowledge base editing with FEEDback EMNLP 2025

Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls ACL 2025

Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits EMNLP 2025

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond ACL 2025

Removing Prompt-template Bias in Reinforcement Learning from Human Feedback ACL 2025

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving ACL 2025

CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback ACL 2025

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models ACL 2025

FastMCTS: A Simple Sampling Strategy for Data Synthesis ACL 2025

Uncertainty-Aware Iterative Preference Optimization for Enhanced LLM Reasoning ACL 2025

InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating ACL 2025

HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks ACL 2025

Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences ACL 2025

Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning ACL 2025

PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment ACL 2025

ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning EMNLP 2025