Research · May 10, 2026

Apple proposes variance-regularized method to prevent language models from ignoring critical constraints in multi-objective alignment

RVPO addresses constraint neglect in reinforcement learning from human feedback by penalizing inter-reward variance, improving medical reasoning benchmarks while maintaining general capabilities across model scales.

Trust: 70

Hype: Low

Sources: 1 (single source)


TL;DR

Apple researchers introduce Reward-Variance Policy Optimization (RVPO), a risk-sensitive alignment framework that keeps language models from scoring well on easy objectives while failing critical constraints such as safety or formatting.
The method relies on a LogSumExp (SoftMin) operator that acts as a smooth variance penalty during multi-objective reward aggregation, shifting optimization from maximizing the reward sum to maximizing consistency across up to 17 concurrent LLM-judged reward signals.
Evaluated on medical and scientific reasoning tasks with Qwen2.5 models from 1.5B to 14B parameters, RVPO scores 0.261 on HealthBench versus 0.215 for GDPO at the 14B scale, and avoids late-stage accuracy degradation on GPQA-Diamond.

Apple's Machine Learning Research team has published a paper proposing Reward-Variance Policy Optimization (RVPO), a method designed to address constraint neglect in multi-objective reinforcement learning from human feedback. The core problem: when multiple reward signals are aggregated as a simple arithmetic mean during training, a model can score highly on easy objectives while completely failing critical ones. The average hides what the authors call "bottleneck" rewards, which are essential for safe and reliable behavior.
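To make the masking effect concrete, here is a minimal sketch in Python. The reward values and the `summarize` helper are hypothetical illustrations, not taken from the paper. It shows two reward vectors with the same arithmetic mean, one of which has a failed objective that only the variance reveals.

```python
from statistics import fmean, pvariance


def summarize(rewards):
    """Return (mean, population variance) of per-objective rewards."""
    return fmean(rewards), pvariance(rewards)


# Hypothetical per-objective rewards in [0, 1]; values are illustrative only.
balanced = [0.75, 0.75, 0.75, 0.75]
# The last entry stands in for a failed "bottleneck" objective (e.g. safety).
neglectful = [1.0, 1.0, 1.0, 0.0]

for name, rewards in [("balanced", balanced), ("neglectful", neglectful)]:
    mean, var = summarize(rewards)
    print(f"{name}: mean={mean:.4f} variance={var:.4f}")
```

Both vectors average to the same value, so a mean-based objective cannot tell them apart. The inter-reward variance is zero for the balanced response and positive for the neglectful one, which is the signal a variance penalty acts on.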

RVPO introduces a variance penalty into the policy optimization objective, shifting the optimization target from maximizing the reward sum to maximizing reward consistency across objectives. The authors demonstrate via Taylor expansion that a LogSumExp (SoftMin) operator functions as a smooth variance penalty, which lets the method work within existing critic-less RLHF architectures without redesign.
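The link between a soft minimum and a variance penalty can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes a normalized form, softmin_beta(r) = -(1/beta) * log(mean(exp(-beta * r_k))), with a hypothetical temperature parameter `beta`. For small `beta`, the standard cumulant expansion of this expression gives mean(r) - (beta/2) * Var(r) plus higher-order terms, where Var is the population variance across objectives. The exact operator, normalization, and temperature used by the authors are not specified in this article.

```python
import math


def softmin(rewards, beta):
    """Normalized LogSumExp soft minimum (assumed form, see lead-in):
    -(1/beta) * log(mean(exp(-beta * r_k))).
    """
    k = len(rewards)
    lo = min(rewards)
    # Shift by the minimum so every exponent is <= 0 (numerical stability).
    total = sum(math.exp(-beta * (r - lo)) for r in rewards)
    return lo - math.log(total / k) / beta


def mean_minus_variance(rewards, beta):
    """Second-order approximation: mean - (beta / 2) * population variance."""
    k = len(rewards)
    mu = sum(rewards) / k
    var = sum((r - mu) ** 2 for r in rewards) / k
    return mu - 0.5 * beta * var


# Hypothetical reward vectors with identical means; values are illustrative.
balanced = [0.75, 0.75, 0.75, 0.75]
neglectful = [1.0, 1.0, 1.0, 0.0]

for beta in (0.1, 1.0, 5.0):
    exact = softmin(neglectful, beta)
    approx = mean_minus_variance(neglectful, beta)
    print(f"beta={beta}: softmin={exact:.4f} mean-minus-variance={approx:.4f}")
```

With equal rewards the soft minimum equals the mean, so a consistent response is not penalized. When one objective fails, the aggregate drops below the mean, and the drop grows with `beta`. As `beta` becomes large the operator approaches the hard minimum, so in this sketch `beta` controls how strongly the worst objective dominates.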

The team evaluated RVPO on rubric-based medical and scientific reasoning tasks with up to 17 concurrent LLM-judged reward signals, testing on Qwen2.5 models at 1.5B, 3B, 7B, and 14B parameter scales. On HealthBench, RVPO achieved a score of 0.261 compared to 0.215 for GDPO at the 14B scale (p < 0.001). The method also maintains competitive performance on general reasoning (GPQA-Diamond) without the late-stage accuracy degradation observed in other multi-reward methods, suggesting that variance regularization does not trade general capability for constraint adherence.

Sources

01 Apple Machine Learning Research, "RVPO: Risk-Sensitive Alignment via Variance Regularization"
