Helpfulness post-training degrades compassion values more than coding post-training in Llama 3.1 8B
A study finds that helpfulness-focused supervised fine-tuning and reinforcement learning leave animal compassion scores 29.5 and 13.3 percentage points below coding-focused training, respectively (46.5 points separate the best and worst configurations), with moral reasoning gaps of up to 25.5 points.
- Helpfulness-oriented post-training significantly degrades mid-trained compassion values in Llama 3.1 8B, while coding-focused training preserves them.
- Animal compassion scores fell to 35.7% (SFT) and 18.7% (GRPO) under helpfulness training vs. 65.2% and 32.0% under coding training on the Animal Harm Benchmark.
- General moral reasoning on English MORU items was 25.5 percentage points lower after helpfulness training than after coding training (46.4% vs. 71.9%).
- The compassion degradation effect transfers across languages, while the helpfulness-driven moral reasoning gap does not appear in multilingual MORU benchmarks.
A new arXiv preprint investigates how the domain of post-training data affects the retention of compassion values in a Llama 3.1 8B model that was mid-trained on compassion-oriented synthetic data. The study compares supervised fine-tuning (SFT) and group relative policy optimization (GRPO) when the post-training domain is helpfulness-focused versus coding-focused.
Across both training paradigms, helpfulness-oriented post-training significantly degraded animal compassion performance relative to coding-oriented training on the Animal Harm Benchmark (AHB 2.2). Under SFT, helpfulness training reduced compassion scores to 35.7% versus 65.2% for coding training; under GRPO, scores were 18.7% versus 32.0%, respectively. These results replicated across two independent helpfulness datasets and two training methods.
The study also evaluated general moral reasoning using the Moral Reasoning Under Uncertainty (MORU) benchmark. On English-language MORU items, helpfulness training degraded moral reasoning by 25.5 percentage points compared to coding training (46.4% vs. 71.9%). The authors describe this gap as "striking" and comparable in magnitude to the compassion degradation effect.
However, the moral reasoning gap between training domains did not transfer cross-lingually. On the multilingual MORU benchmark, the difference between helpfulness and coding training all but disappeared (SFT: 52.3% vs. 51.2%). In contrast, the compassion degradation effect remained consistent across languages, with coding-focused post-training yielding larger gains over the base model on non-English items than on English items.
The authors interpret these results as evidence that values instilled through mid-training are encoded more deeply and cross-lingually than reasoning improvements derived from domain-specific post-training. They suggest that labs building on value-laden mid-training may better preserve those values by using coding-domain post-training rather than helpfulness-domain post-training, without sacrificing general reasoning performance.
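The percentage-point gaps quoted above can be recomputed from the reported scores. The following is a minimal sketch, not the study's own analysis code: the dictionary layout and the names `SCORES`, `gap_in_points`, and `GAPS` are illustrative, and it assumes the multilingual MORU figures are listed helpfulness first, matching the order used elsewhere in the article.

```python
# Scores in percent, copied from the article. The layout and names are
# illustrative assumptions, not taken from the study.
SCORES = {
    "AHB 2.2 / SFT": {"helpfulness": 35.7, "coding": 65.2},
    "AHB 2.2 / GRPO": {"helpfulness": 18.7, "coding": 32.0},
    "MORU English": {"helpfulness": 46.4, "coding": 71.9},
    # Assumption: 52.3% is helpfulness and 51.2% is coding, following the
    # "helpfulness vs. coding" order of the surrounding sentence.
    "MORU multilingual / SFT": {"helpfulness": 52.3, "coding": 51.2},
}


def gap_in_points(scores: dict) -> float:
    """Return the coding score minus the helpfulness score, in percentage points."""
    return round(scores["coding"] - scores["helpfulness"], 1)


GAPS = {name: gap_in_points(scores) for name, scores in SCORES.items()}

for name, gap in GAPS.items():
    print(f"{name}: coding minus helpfulness = {gap:+.1f} points")
```

A positive gap means coding-focused training scored higher. These are differences in percentage points (absolute), not relative changes: the 29.5-point SFT gap on AHB 2.2 corresponds to a much larger relative drop from the 65.2% coding score.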