Apple proposes variance-regularized method to prevent language models from ignoring critical constraints in multi-objective alignment
RVPO addresses constraint neglect in reinforcement learning from human feedback by penalizing inter-reward variance, improving scores on a medical reasoning benchmark while maintaining general capabilities across model scales.
- Apple researchers introduce Reward-Variance Policy Optimization (RVPO), a risk-sensitive alignment framework that prevents language models from scoring high on easy objectives while failing critical constraints such as safety or formatting.
- The method aggregates rewards with a LogSumExp (SoftMin) operator, which acts as a smooth variance penalty and shifts optimization from maximizing the reward sum to maximizing consistency across up to 17 concurrent LLM-judged reward signals.
- On medical and scientific reasoning tasks, tested on Qwen2.5 models from 1.5B to 14B parameters, RVPO scores 0.261 on HealthBench versus 0.215 for GDPO at the 14B scale, while avoiding late-stage accuracy degradation on GPQA-Diamond.
Apple's Machine Learning Research team has published a paper proposing Reward-Variance Policy Optimization (RVPO), a method designed to address constraint neglect in multi-objective reinforcement learning from human feedback. The core problem: when multiple reward signals are aggregated as a simple arithmetic mean, a model can achieve high scores on easy objectives while completely failing at critical ones. The average masks what the authors call "bottleneck" rewards, which are essential for safe and reliable behavior.
RVPO introduces a variance penalty into the policy optimization objective, shifting the optimization target from maximizing reward sum to maximizing reward consistency across objectives. The authors demonstrate via Taylor expansion that a LogSumExp (SoftMin) operator functions as a smooth variance penalty, enabling the method to work within existing critic-less RLHF architectures without architectural redesign.
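The relationship between SoftMin aggregation and a variance penalty can be checked numerically. The sketch below is illustrative only and is not taken from the paper: the normalized form `-(1/beta) * log(mean(exp(-beta * r)))`, the temperature name `beta`, the function names, and the reward values are all assumptions. For small `beta`, a second-order Taylor expansion of this form gives approximately `mean(r) - (beta / 2) * var(r)`, which is the sense in which a LogSumExp operator behaves like a smooth variance penalty.

```python
import math


def mean(rewards):
    """Arithmetic mean of the per-objective rewards."""
    return sum(rewards) / len(rewards)


def variance(rewards):
    """Population variance of the per-objective rewards."""
    m = mean(rewards)
    return sum((r - m) ** 2 for r in rewards) / len(rewards)


def softmin(rewards, beta):
    """Assumed SoftMin form: -(1/beta) * log(mean(exp(-beta * r))).

    The max-shift keeps the exponentials numerically stable.
    """
    scaled = [-beta * r for r in rewards]
    peak = max(scaled)
    log_mean_exp = peak + math.log(
        sum(math.exp(s - peak) for s in scaled) / len(scaled)
    )
    return -log_mean_exp / beta


def variance_penalized_mean(rewards, beta):
    """Second-order Taylor approximation of softmin for small beta."""
    return mean(rewards) - 0.5 * beta * variance(rewards)


# Hypothetical reward vectors with the same arithmetic mean (about 0.7).
balanced = [0.7, 0.7, 0.7, 0.7]
lopsided = [1.0, 1.0, 0.8, 0.0]  # one "bottleneck" reward has collapsed

for name, rewards in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, mean(rewards), softmin(rewards, beta=5.0))

# With a small beta the two aggregates nearly coincide.
print(softmin(lopsided, beta=0.1), variance_penalized_mean(lopsided, beta=0.1))
```

Under these assumptions the arithmetic mean cannot tell the two vectors apart, while SoftMin at `beta=5.0` scores the lopsided one at roughly 0.27, well below the balanced one. Larger `beta` values push the aggregate toward the minimum reward, and smaller values push it toward the plain mean.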
The team evaluated RVPO on rubric-based medical and scientific reasoning tasks with up to 17 concurrent LLM-judged reward signals, testing on Qwen2.5 models at 1.5B, 3B, 7B, and 14B parameter scales. On HealthBench, RVPO achieved a score of 0.261 compared to 0.215 for GDPO at the 14B scale (p < 0.001). Critically, the method maintains competitive performance on general reasoning (GPQA-Diamond) without the late-stage accuracy degradation observed in other multi-reward methods, suggesting variance regularization does not trade off general capability for constraint adherence.