Reasoning Structure Matters for Safety Alignment of Reasoning Models

Yeonjun In; Wonjoong Kim; Sangwu Park; Chanyoung Park

ACL 2026

Abstract

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable: it requires no complex reinforcement learning (RL) training or reward design, only supervised fine-tuning (SFT) on a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.

Authors

Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park

Topics

Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Learning Types > Fine-Tuning
Artificial Intelligence > Core AI > Safety

Keywords

safety alignment, supervised fine-tuning, large reasoning model, reasoning structure

