ServiceNow engineers document vLLM V1 migration: fixing backend correctness in reinforcement learning pipelines
A detailed technical post describes how ServiceNow's PipelineRL addressed train-inference mismatches when migrating from vLLM 0.8.5 to 0.18.1, identifying and fixing four backend compatibility issues before tuning the RL objective.
- ServiceNow engineers migrated PipelineRL from vLLM V0 (0.8.5) to V1 (0.18.1) and discovered that backend behavior differences—not algorithmic issues—caused training divergence between the two versions.
- Four specific fixes restored parity: switching to processed_logprobs mode, disabling prefix caching and async scheduling to match V0 defaults, aligning inflight weight-update semantics, and computing the final lm_head in fp32 precision.
- The team prioritized backend correctness verification over RL objective changes, using trainer-side metrics (clip rate, KL, entropy, reward) to isolate and diagnose each mismatch layer.
- The final V1 run matched the V0 reference trajectory across all measured metrics after these fixes, demonstrating that architecture rewrites can hide subtle correctness issues in online RL systems.
- This approach mirrors correctness verification documented in other large-scale RL work (MiniMax-M1, ScaleRL), where logits and inference precision directly affect policy gradient computation.
ServiceNow's PipelineRL system uses vLLM as its inference engine for rollout generation, with the engine producing token logprobs that feed directly into policy gradient updates. When the team began migrating from vLLM 0.8.5 to the substantially rewritten vLLM 0.18.1, they encountered training divergence that suggested either a backend problem or an algorithmic issue requiring RL tuning. The post documents how they systematically identified and fixed four backend-level issues that had caused the mismatch.
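The coupling described above, where engine-produced logprobs feed the policy gradient, can be sketched as a per-token clipped-ratio objective. This is an illustrative stdlib-only sketch, not PipelineRL's actual loss; the function name, `eps` default, and variable names are hypothetical.

```python
import math

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO-style clipped objective.

    logp_old is the rollout logprob reported by the inference engine, so any
    backend mismatch in how it is computed biases the importance ratio directly.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)
```

If the backend and trainer agree, the ratio at the first update step is exactly 1; a backend that reports logprobs under different semantics shifts every ratio away from 1 before any learning has happened.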
The first problem was semantic. vLLM V1 returns logprobs from raw model outputs by default, whereas PipelineRL expected logprobs from the post-processed distribution (after temperature scaling and sampling penalties). Switching to logprobs-mode=processed_logprobs aligned the logits semantics but did not fully resolve the gap, pointing to differences in how the two versions executed the same requests.
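The semantic gap is easy to reproduce with plain log-softmax arithmetic: at any sampling temperature other than 1.0, the logprob of a token under the raw distribution differs from its logprob under the post-temperature distribution. A minimal stdlib sketch, independent of vLLM:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

logits = [2.0, 1.0, 0.5]     # illustrative raw model outputs for three tokens
temperature = 0.7

raw = log_softmax(logits)                                    # raw-output semantics
processed = log_softmax([x / temperature for x in logits])   # post-temperature semantics

# The same sampled token carries a different logprob under each convention,
# so a trainer expecting one and receiving the other computes biased ratios.
```

With temperature below 1 the processed distribution is sharper, so the most likely token's logprob is strictly higher under the processed convention than under the raw one.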
Runtime defaults created a second source of divergence. The initial V1 run applied prefix caching and asynchronous scheduling automatically, behaviors that differed from the V0 reference. In an online RL setting with inflight model updates, prefix-cache hits could reuse state computed before weights changed, creating correctness issues. Explicitly disabling prefix caching and async scheduling removed this degree of freedom from the comparison.
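A pinned configuration for this kind of parity run might look like the fragment below. The exact argument names (`logprobs_mode`, `enable_prefix_caching`, `async_scheduling`) are assumptions based on common vLLM engine arguments and may differ between versions; the post does not quote the team's actual configuration.

```python
# Engine settings to pin V1 behavior to the V0 reference during the parity
# comparison. Argument names are assumptions, not confirmed by the source.
v1_engine_args = {
    "logprobs_mode": "processed_logprobs",  # logprobs from the post-processed distribution
    "enable_prefix_caching": False,         # no KV reuse across inflight weight updates
    "async_scheduling": False,              # match V0's synchronous scheduling
}

# llm = LLM(model="<policy-checkpoint>", **v1_engine_args)  # requires vLLM; illustrative only
```

Disabling these features costs throughput, but for a correctness bisection the goal is to remove every behavioral difference that is not under test.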
A third issue emerged in how vLLM V1 handled weight synchronization during active inference. The team replicated V0's behavior by pausing generation at engine boundaries, loading new weights, and resuming without explicit cache invalidation—matching the older system's implicit behavior more closely than V1's stricter default modes.
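The pause/swap/resume sequence can be sketched with a stand-in engine class. The class and its method names are hypothetical, written only to illustrate the V0-style inflight-update semantics the team replicated; the real integration operates on vLLM's engine loop.

```python
class InferenceEngine:
    """Minimal stand-in for an inference-engine wrapper (hypothetical API)."""

    def __init__(self):
        self.generating = True
        self.weights_version = 0

    def pause_generation(self):
        # Drain in-flight work to an engine-step boundary before touching weights.
        self.generating = False

    def load_weights(self, version):
        # Swap weights in place; note there is NO explicit cache invalidation,
        # matching the older system's implicit behavior described in the post.
        self.weights_version = version

    def resume_generation(self):
        self.generating = True

def inflight_update(engine, new_version):
    """Replicate V0 semantics: pause at a step boundary, swap weights, resume."""
    engine.pause_generation()
    engine.load_weights(new_version)
    engine.resume_generation()
```

This is precisely why prefix caching had to be off in the previous step: with caches surviving the swap, any reused prefix state would reflect stale weights.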
The final refinement involved numerical precision. The trainer computed policy ratios using fp32 precision for the final lm_head projection, but the V1 inference backend had not been configured to match. Once the team aligned this computational path, the V1 training curves matched the V0 reference across clip rate, KL, entropy, and reward metrics, demonstrating that backend correctness precedes objective-level design choices in online RL systems.
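Why a precision mismatch in one projection matters can be seen with two lines of arithmetic: a small per-token logprob discrepancy between trainer and backend enters the importance ratio as exp(delta) and compounds over the sequence. The logprob values below are illustrative, not taken from the post.

```python
import math

# Illustrative per-token logprobs for the same sampled token:
logp_trainer = -1.250000   # trainer's fp32 lm_head path
logp_backend = -1.257812   # same token under a lower-precision projection

# Per-token bias in the importance ratio exp(logp_new - logp_old):
ratio_error = math.exp(logp_trainer - logp_backend)   # ~0.8% off from 1.0

# Over a 64-token completion the per-token bias compounds multiplicatively:
seq_ratio = math.exp(64 * (logp_trainer - logp_backend))   # ~1.65x
```

A trainer sees this as a systematic drift in clip rate and KL even though the policy has not actually moved, which is exactly the signature the team used to isolate the precision mismatch.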