Research · Jun 26, 2026

Post-training helpfulness degrades compassion values more than coding training in Llama 3.1 8B

A study finds supervised fine-tuning and reinforcement learning for helpfulness reduce animal compassion scores by up to 46.5 percentage points compared to coding-focused training, with moral reasoning gaps up to 25.5 points.


1 source · cross-referenced

TL;DR
  • Helpfulness-oriented post-training significantly degrades mid-trained compassion values in Llama 3.1 8B, while coding-focused training preserves them.
  • Animal compassion scores fell to 35.7% (SFT) and 18.7% (GRPO) under helpfulness training vs. 65.2% and 32.0% under coding training on the Animal Harm Benchmark.
  • General moral reasoning on English MORU items scored 25.5 percentage points lower after helpfulness training than after coding training (46.4% vs. 71.9%).
  • The compassion degradation effect transfers across languages, while the helpfulness-driven moral reasoning gap does not appear in multilingual MORU benchmarks.

A new arXiv preprint investigates how the domain of post-training data affects the retention of compassion values in a Llama 3.1 8B model that was mid-trained on compassion-oriented synthetic data. The study compares supervised fine-tuning (SFT) and group relative policy optimization (GRPO) when the post-training domain is helpfulness-focused versus coding-focused.

Across both training paradigms, helpfulness-oriented post-training significantly degraded animal compassion performance relative to coding-oriented training on the Animal Harm Benchmark (AHB 2.2). Under SFT, the compassion score was 35.7% after helpfulness training versus 65.2% after coding training; under GRPO, the scores were 18.7% and 32.0%. These results replicated across two independent helpfulness datasets and both training methods.

The study also evaluated general moral reasoning using the Moral Reasoning Under Uncertainty (MORU) benchmark. On English-language MORU items, helpfulness training degraded moral reasoning by 25.5 percentage points compared to coding training (46.4% vs. 71.9%). The authors describe this gap as "striking" and comparable in magnitude to the compassion degradation effect.

However, the domain-dependent effect did not transfer cross-lingually. On the multilingual MORU benchmark, the difference between helpfulness and coding training domains disappeared (SFT: 52.3% vs. 51.2%). In contrast, the compassion degradation effect remained consistent across languages, with coding-focused post-training yielding larger gains over the base model on non-English items than on English items.
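The percentage-point gaps implied by the scores quoted above can be checked with a short calculation. This is a minimal sketch that uses only the figures reported in this article. The dictionary layout and the names `REPORTED`, `gap_points`, and `GAPS` are illustrative. The sketch assumes that "52.3% vs. 51.2%" lists helpfulness first, matching the order of the English MORU figures.

```python
# Scores (percent) as quoted in the article.
# Each value is (helpfulness_score, coding_score).
REPORTED = {
    ("AHB 2.2", "SFT"): (35.7, 65.2),
    ("AHB 2.2", "GRPO"): (18.7, 32.0),
    # The article does not state the training method for this figure.
    ("MORU English", "not stated"): (46.4, 71.9),
    # Assumption: helpfulness is listed first, as in the English figures.
    ("MORU multilingual", "SFT"): (52.3, 51.2),
}


def gap_points(helpfulness: float, coding: float) -> float:
    """Return coding minus helpfulness, in percentage points."""
    return round(coding - helpfulness, 1)


# Like-for-like gaps: same benchmark, same training method.
GAPS = {key: gap_points(*scores) for key, scores in REPORTED.items()}

# Cross-method spread on AHB: best coding score minus worst helpfulness score.
AHB_CROSS_METHOD_SPREAD = round(
    REPORTED[("AHB 2.2", "SFT")][1] - REPORTED[("AHB 2.2", "GRPO")][0], 1
)

if __name__ == "__main__":
    for (benchmark, method), gap in GAPS.items():
        print(f"{benchmark} ({method}): {gap:+.1f} points")
    print(f"AHB cross-method spread: {AHB_CROSS_METHOD_SPREAD:.1f} points")
```

Within a single training method, the AHB gap is 29.5 points under SFT and 13.3 points under GRPO. A spread of 46.5 points appears only when the coding SFT score is compared with the helpfulness GRPO score. Under the stated ordering assumption, the multilingual MORU gap is -1.1 points, meaning helpfulness training scored marginally higher.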

The authors interpret these results as evidence that values instilled through mid-training are encoded more deeply and cross-lingually than reasoning improvements derived from domain-specific post-training. They suggest that labs building on value-laden mid-training may better preserve those values by using coding-domain post-training rather than helpfulness-domain post-training, without sacrificing general reasoning performance.

Sources
  1. arXiv cs.CL: "Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training"

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.