Research · May 10, 2026

Apple proposes variance-regularized method to prevent language models from ignoring critical constraints in multi-objective alignment

RVPO addresses constraint neglect in reinforcement learning from human feedback by penalizing inter-reward variance, improving scores on a medical reasoning benchmark while maintaining general capabilities across model scales.

Trust: 70
Hype: Low

1 source · single source

TL;DR
  • Apple researchers introduce Reward-Variance Policy Optimization (RVPO), a risk-sensitive alignment framework that prevents language models from scoring highly on easy objectives while failing critical constraints such as safety or formatting.
  • The method aggregates multiple rewards with a LogSumExp (SoftMin) operator, which acts as a smooth variance penalty and shifts optimization from maximizing the reward sum to maximizing consistency across up to 17 concurrent LLM-judged reward signals.
  • Evaluation on medical and scientific reasoning tasks, using Qwen2.5 models from 1.5B to 14B parameters, shows RVPO scoring 0.261 on HealthBench versus 0.215 for GDPO at the 14B scale, while avoiding late-stage accuracy degradation on GPQA-Diamond.

Apple's Machine Learning Research team has published a paper proposing Reward-Variance Policy Optimization (RVPO), a method designed to address constraint neglect in multi-objective reinforcement learning from human feedback. The core problem: when multiple reward signals are aggregated as a simple arithmetic mean, a model can achieve high scores on easy objectives while completely failing at critical ones. The average hides those failures, masking what the authors call "bottleneck" rewards that are essential for safe and reliable behavior.
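The masking effect is easy to see with toy numbers. The sketch below is illustrative only: the reward values and the names `lopsided` and `balanced` are hypothetical and do not come from the paper. It shows two responses with the same mean reward, one of which fails a rubric outright.

```python
from statistics import fmean, pvariance

# Hypothetical per-rubric scores in [0, 1] for two candidate responses.
lopsided = [1.0, 1.0, 1.0, 0.0]      # perfect on three rubrics, fails the fourth
balanced = [0.75, 0.75, 0.75, 0.75]  # moderate on every rubric

for name, rewards in [("lopsided", lopsided), ("balanced", balanced)]:
    print(
        f"{name}: mean={fmean(rewards):.2f} "
        f"min={min(rewards):.2f} variance={pvariance(rewards):.4f}"
    )
```

Both responses receive the same mean, so a mean-based objective cannot tell them apart. The inter-reward variance and the minimum do separate them, which is the signal a variance-sensitive objective can exploit.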

RVPO introduces a variance penalty into the policy optimization objective, shifting the optimization target from maximizing the reward sum to maximizing reward consistency across objectives. The authors demonstrate via Taylor expansion that a LogSumExp (SoftMin) operator functions as a smooth variance penalty, which lets the method work within existing critic-less RLHF architectures without redesigning them.
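A minimal numerical sketch of that connection follows. It assumes a normalized soft minimum of the form -(1/beta) * log(mean(exp(-beta * r_i))) with a temperature `beta`; this summary does not state the exact form or temperature used in the paper, so the function names, the `beta` parameter, and the reward values here are all assumptions. Under that form, a second-order Taylor expansion around beta = 0 gives mean(r) - (beta / 2) * var(r), with higher-order terms dropped.

```python
import math

def soft_min(rewards, beta):
    """Normalized LogSumExp soft minimum (assumed form, beta > 0)."""
    m = min(rewards)  # shift by the minimum so every exponent is <= 0
    s = sum(math.exp(-beta * (r - m)) for r in rewards) / len(rewards)
    return m - math.log(s) / beta

def mean_minus_variance(rewards, beta):
    """Second-order approximation: mean minus (beta / 2) * population variance."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    return mu - 0.5 * beta * var

rewards = [0.9, 0.8, 0.7, 0.2]  # hypothetical rubric scores

for beta in (0.1, 1.0, 10.0):
    exact = soft_min(rewards, beta)
    approx = mean_minus_variance(rewards, beta)
    print(f"beta={beta:>4}: soft_min={exact:.4f}  mean - beta/2 * var={approx:.4f}")
```

For small `beta` the two quantities nearly coincide, so the soft minimum behaves like the mean with a variance penalty. As `beta` grows, the soft minimum moves toward the worst single reward, which is what pushes optimization toward the bottleneck objective.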

The team evaluated RVPO on rubric-based medical and scientific reasoning tasks with up to 17 concurrent LLM-judged reward signals, testing on Qwen2.5 models at 1.5B, 3B, 7B, and 14B parameter scales. On HealthBench, RVPO achieved a score of 0.261 compared to 0.215 for GDPO at the 14B scale (p < 0.001). Critically, the method maintains competitive performance on general reasoning (GPQA-Diamond) without the late-stage accuracy degradation observed in other multi-reward methods, suggesting variance regularization does not trade off general capability for constraint adherence.

Sources
  1. Apple Machine Learning Research, "RVPO: Risk-Sensitive Alignment via Variance Regularization"

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.
