Reinforcement Learning

2932 directly classified papers

Papers

Rejected Dialects: Biases Against African American Language in Reward Models NAACL 2025

Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models EMNLP 2025

Learning to Reason via Self-Iterative Process Feedback for Small Language Models COLING 2025

StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models COLING 2025

Understanding Reference Policies in Direct Preference Optimization NAACL 2025

InstructionCP: A Simple yet Effective Approach for Transferring Large Language Models to Target Languages ACL 2025

2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision NAACL 2025

Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement ACL 2025

Faster Machine Translation Ensembling with Reinforcement Learning and Competitive Correction NAACL 2025

Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up ACL 2025

Interaction-Required Suggestions for Control, Ownership, and Awareness in Human-AI Co-Writing NAACL 2025

Enhancing Machine Translation with Self-Supervised Preference Data ACL 2025

An Analysis of Scoring Methods for Reranking in Large Language Model Story Generation NAACL 2025

Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls ACL 2025

TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods NAACL 2025

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models ACL 2025

Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences ACL 2025

Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning ACL 2025

Breaking the Reasoning Barrier: A Survey on LLM Complex Reasoning through the Lens of Self-Evolution ACL 2025

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond ACL 2025

CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling ACL 2025

Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization ACL 2025

Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points ACL 2025

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving ACL 2025

Learning Structured World Models From and For Physical Interactions AAAI 2025