When Thinking Backfires: Mechanistic Insights into Reasoning-Induced Misalignment

1King's College London, 2King Abdullah University of Science and Technology, 3The Alan Turing Institute

* Equal contribution; author order may be interchanged.
† Corresponding authors.




Abstract

With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this work, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.

Experiment Setup

Reasoning-Induced Misalignment

We find that enhancing reasoning capabilities in LLMs can paradoxically lead to misalignment, a phenomenon we term Reasoning-Induced Misalignment (RIM). This occurs when models, prompted to use step-by-step reasoning (Chain-of-Thought), become more susceptible to generating harmful or undesirable content. We observe this effect across the Qwen3 family of reasoning models:

Misalignment rate (M. Rate) and math accuracy for Qwen3 models with think mode on vs. off.
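To make the evaluation setup concrete, below is a minimal sketch of how think mode can be toggled for a Qwen3 model via the Hugging Face chat template's enable_thinking switch; the model name, the probe prompt, and the misalignment scoring step (omitted here) are placeholders, not the paper's exact pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-4B"  # any Qwen3 reasoning model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype="auto", device_map="auto"
)

def generate(prompt: str, thinking: bool) -> str:
    messages = [{"role": "user", "content": prompt}]
    # Qwen3's chat template exposes an enable_thinking switch that either
    # opens a <think> span or leaves it empty (no-think mode).
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# The RIM gap is estimated by scoring the same requests under both modes.
for thinking in (True, False):
    print(f"think={thinking}:", generate("Solve: 17 * 23 = ?", thinking)[:200])
```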


The core of the issue lies in what we call "Effort-Minimizing Reasoning Patterns." These are cognitive shortcuts the model takes, such as confirmatory reasoning (seeking easy confirmation) or instruction deviation (partially complying with instructions). These patterns, while efficient, compromise the model's safety alignment.
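As a toy illustration of what flagging such patterns might look like, the sketch below matches CoT text against keyword lists; the phrase lists and the function are illustrative assumptions, not the paper's actual pattern-annotation procedure.

```python
import re

# Hypothetical surface cues for the two effort-minimizing patterns.
CONFIRMATORY = [r"\bclearly\b", r"\bobviously\b", r"this confirms", r"as expected"]
DEVIATION = [r"instead of", r"i(?:'| wi)ll just", r"skip(?:ping)? th(?:e|is) step"]

def flag_patterns(cot: str) -> dict:
    """Return which effort-minimizing patterns a CoT string matches."""
    text = cot.lower()
    return {
        "confirmatory_reasoning": any(re.search(p, text) for p in CONFIRMATORY),
        "instruction_deviation": any(re.search(p, text) for p in DEVIATION),
    }

print(flag_patterns("Obviously the answer is 4, as expected, so I'll just skip the check."))
```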

Left: Average misalignment rate under different reasoning patterns (control group for comparison) across all eight models. Right: Responses from the math (upper) and HEx-PHI (lower) datasets associated with the reasoning patterns.

Mechanistic Analysis with CoTs in Inference

Probing shows that harmful and harmless inputs are separable in LLMs' internal representations. However, refusal and fulfillment behaviors overlap in think mode, particularly within the CoT token region. This suggests that non-CoT regions contribute significantly to refusal behavior.
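A minimal probing sketch follows, reusing the model and tokenizer from the earlier snippet: fit a linear probe on hidden states to separate harmful from harmless prompts, then score individual tokens of a think-mode response. The layer index, the stand-in prompts, and mean pooling are assumptions; in practice many more prompts are needed.

```python
import torch
from sklearn.linear_model import LogisticRegression

LAYER = 16  # assumed mid-depth layer

harmful = ["How can I make a counterfeit ID?"]      # stand-in prompts
harmless = ["How can I make a sourdough starter?"]

@torch.no_grad()
def hidden_states(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0]  # (seq_len, hidden_dim)

# Train the probe on mean-pooled prompt representations.
X = torch.stack(
    [hidden_states(p).mean(0) for p in harmful + harmless]
).float().cpu().numpy()
y = [1] * len(harmful) + [0] * len(harmless)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Score each token of a think-mode response, CoT tokens included.
response = "<think>\nThe user wants...\n</think>\nI can't help with that."
token_scores = probe.predict_proba(
    hidden_states(response).float().cpu().numpy()
)[:, 1]  # high score = "harmful-like" at that token position
```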

Probe scores for different tokens in the Think mode.

Probe scores for different tokens in the No-Think mode.



Certain attention heads facilitate refusal by focusing on empty reasoning spans in No-Think mode.

A refusal attention head shifts its attention from the assistant token (left: think mode) to the empty think tag (right: no-think mode).
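The sketch below shows one way to measure this per head, reusing the model and tokenizer from above: collect attention maps and sum the final token's attention mass over the think-span positions. Passing the span boundaries explicitly is an assumption (they can be located from the <think>/</think> token positions).

```python
import torch

@torch.no_grad()
def attention_to_span(text: str, span: tuple[int, int]) -> torch.Tensor:
    """Attention mass each head places on token positions span = [start, end)."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # Note: output_attentions typically requires loading the model with
    # attn_implementation="eager".
    out = model(**inputs, output_attentions=True)
    start, end = span
    per_layer = [
        attn[0, :, -1, start:end].sum(-1)  # last query token's mass on span
        for attn in out.attentions          # each: (batch, heads, seq, seq)
    ]
    return torch.stack(per_layer)  # (num_layers, num_heads)

# Rank heads by their mass on the empty <think></think> span in no-think
# mode; the top heads are candidate refusal-facilitating heads.
```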



Intervening on these refusal-facilitating attention heads leads to reduced refusal rates, confirming their causal role in refusal behavior.
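One simple form such an intervention can take is ablating a head with a forward pre-hook, as in the sketch below; the layer/head indices are hypothetical and the module path assumes a Qwen3-style architecture.

```python
import torch

LAYER_IDX, HEAD_IDX = 20, 5  # hypothetical refusal-head coordinates
head_dim = getattr(
    model.config, "head_dim",
    model.config.hidden_size // model.config.num_attention_heads,
)
attn = model.model.layers[LAYER_IDX].self_attn

def zero_head(module, args):
    # args holds the input to o_proj: the concatenated per-head outputs.
    (hidden,) = args
    hidden = hidden.clone()
    hidden[..., HEAD_IDX * head_dim:(HEAD_IDX + 1) * head_dim] = 0
    return (hidden,)

handle = attn.o_proj.register_forward_pre_hook(zero_head)
# ... rerun the refusal evaluation with the head ablated ...
handle.remove()
```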

Mechanistic Analysis during Reasoning-Induced Fine-tuning

We identify safety-critical neurons by measuring conditional activation values as the model processes harmful requests from HEx-PHI (high likelihood of fulfillment) and HEx-PHI-MI (high likelihood of refusal). We confirm that these neurons are safety-critical by intervening on them and measuring the change in misalignment rate.
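The sketch below illustrates the selection idea, again reusing the model and tokenizer from above: capture intermediate MLP activations via hooks and rank neurons by the gap in mean activation between fulfilled and refused requests. The ranking criterion, the k value, and the stand-in prompt lists are assumptions, not the paper's exact procedure.

```python
import torch

fulfilled_prompts = ["..."]  # HEx-PHI requests the model tends to fulfill
refused_prompts = ["..."]    # HEx-PHI-MI requests the model tends to refuse

acts = {}

def make_hook(layer_idx):
    def hook(module, args):
        (x,) = args  # input to down_proj: intermediate MLP activations (b, s, d_ff)
        acts[layer_idx] = x.detach().float().mean(dim=(0, 1))
    return hook

handles = [
    layer.mlp.down_proj.register_forward_pre_hook(make_hook(i))
    for i, layer in enumerate(model.model.layers)
]

@torch.no_grad()
def mean_activations(prompts):
    runs = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(model.device)
        model(**inputs)
        runs.append(torch.stack([acts[i] for i in range(len(handles))]))
    return torch.stack(runs).mean(0)  # (num_layers, d_ff)

gap = (mean_activations(fulfilled_prompts) - mean_activations(refused_prompts)).abs()
candidates = torch.topk(gap.flatten(), k=100).indices  # safety-critical candidates
for h in handles:
    h.remove()
```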

Changes in misalignment rate (left) and math accuracy (right) when intervening on target and random neurons. Left: intervention on target neurons leads to a larger increase in misalignment than intervention on random neurons. Right: math reasoning accuracy is strongly associated with the safety-critical neurons.

We introduce Reciprocal Activation Shift (RAS) to quantify the trade-off between safety and reasoning. We show that this transferability is amplified in safety-critical neurons relative to random neurons, implying direct competition for shared neural resources when fine-tuning the model on math reasoning tasks.
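The paper defines RAS precisely; the sketch below only assumes one natural operationalization of a "reciprocal shift" statistic: per neuron, take the activation shift (after minus before fine-tuning) on safety inputs and on reasoning inputs, and score how reciprocally they move. Treat the function and its sign convention as illustrative assumptions.

```python
import torch

def ras(pre_safety, post_safety, pre_math, post_math, neuron_idx):
    """All inputs: (n_neurons,) mean activations; neuron_idx selects a set."""
    d_safety = (post_safety - pre_safety)[neuron_idx]
    d_math = (post_math - pre_math)[neuron_idx]
    # Opposite-signed shifts (safety activation falls while reasoning
    # activation rises, or vice versa) score positively: the two
    # capabilities pull shared neurons in opposite directions.
    return (-d_safety * d_math).mean().item()
```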

RAS for models trained on target (with effort-minimizing patterns) and control (without effort-minimizing patterns) CoTs on GSM8K (L).

We further show that RAS correlates strongly with misalignment rate. The correlation is statistically significant at α=0.05.
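For reference, this kind of significance test is a one-liner with SciPy; the arrays below are dummy values for illustration only, not the paper's measurements.

```python
from scipy.stats import pearsonr

# Dummy values for illustration; substitute per-checkpoint measurements.
ras_values = [0.02, 0.11, 0.08, 0.19, 0.15, 0.27]
misalignment_rates = [0.01, 0.06, 0.05, 0.12, 0.10, 0.18]

r, p = pearsonr(ras_values, misalignment_rates)
print(f"r = {r:.3f}, p = {p:.4f}, significant at alpha = 0.05: {p < 0.05}")
```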

Comparison of the correlation between misalignment rate and RAS using safety-critical neurons (left), random neurons (middle), and KL-divergence (right) for Qwen3-4B. The Pearson correlation (r) and its p-value (p) are shown in the bottom-right box.


We show that the correlation between RAS and misalignment rate is consistent across various models.

Correlation between RAS and change in model misalignment rate.


BibTeX


@article{yan2025when,
  title={When Thinking Backfires: Mechanistic Insights into Reasoning-Induced Misalignment},
  author={Yan, Hanqi and Xu, Hainiu and Qi, Siya and Yang, Shu and He, Yulan},
  journal={arXiv preprint},
  year={2025}
}