Tools · Jul 4, 2026

Amazon SageMaker AI adds multi-turn reinforcement learning training loop with serverless execution

New SageMaker AI multi-turn reinforcement learning capabilities provide a managed training loop, hardware, and orchestration for agentic workflows, including serverless execution and a native algorithm library for PPO, CISPO, and importance-sampling losses.

TL;DR

Amazon SageMaker AI now supports multi-turn reinforcement learning (RL) training with a managed loop, hardware, and orchestration for agentic workflows.
The service offers serverless execution, asynchronous rollout, and a native algorithm library including PPO, CISPO, and importance-sampling losses.
Best practices emphasize building a trustworthy simulated environment, setting up external evaluation, and aligning rewards with end tasks.
Trajectory and reward observability are integrated with MLflow managed by SageMaker AI.

Amazon SageMaker AI now provides a managed training loop for multi-turn reinforcement learning (RL) agent workflows, handling orchestration, hardware, and the RL loop itself. The system targets agentic tasks in which the model reads instructions, makes tool calls, reads the results, decides the next action, and recovers from mistakes before committing to an answer.

SageMaker AI multi-turn RL (SageMaker AI MTRL) connects to agents running on Amazon Bedrock AgentCore, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Compute Cloud (Amazon EC2), AWS Fargate, or custom infrastructure via a small adapter. The service manages serverless execution, asynchronous rollout with bounded off-policy staleness, and parallel generation and gradient updates to speed training.
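To make "bounded off-policy staleness" concrete, here is a minimal sketch, assuming each rollout is tagged with the version of the policy that generated it and the trainer discards anything too many versions behind. The article does not describe how the service implements this; `Rollout`, `StalenessBoundedBuffer`, and `max_staleness` are hypothetical names used only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the policy that generated this trajectory
    payload: str         # stand-in for the actual trajectory data


class StalenessBoundedBuffer:
    """Hypothetical buffer that discards rollouts older than max_staleness versions."""

    def __init__(self, max_staleness: int) -> None:
        self.max_staleness = max_staleness
        self._queue = deque()

    def add(self, rollout: Rollout) -> None:
        # Rollout workers push trajectories here while the trainer keeps updating.
        self._queue.append(rollout)

    def take_batch(self, current_version: int, batch_size: int) -> list:
        """Pop up to batch_size rollouts whose staleness is within the bound."""
        batch = []
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            if current_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
            # Rollouts beyond the bound are dropped rather than trained on.
        return batch
```

With `max_staleness=2` and the trainer at version 5, rollouts from versions 3, 4, and 5 are used and a rollout from version 1 is discarded. A larger bound keeps more generated data but lets the training data drift further from the current policy.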

The native algorithm library includes Proximal Policy Optimization (PPO), Clipped Importance Sampling Policy Optimization (CISPO), and importance-sampling losses, paired with multiple group-based advantage estimators such as GRPO, GRPO pass@k, and RLOO. Sequence-extension training is supported to reduce wall-clock time on long multi-turn trajectories.
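As a rough illustration of the terms above, the sketch below computes GRPO-style and RLOO-style group advantages and the per-token PPO clipped objective, using the formulas as they are commonly written. It is not the SageMaker AI library's code or API; the function names, the `eps` and `clip_eps` defaults, and the use of the population standard deviation are assumptions.

```python
import statistics


def grpo_advantages(rewards: list, eps: float = 1e-6) -> list:
    """GRPO-style: normalize each reward by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ on this
    return [(r - mean) / (std + eps) for r in rewards]


def rloo_advantages(rewards: list) -> list:
    """RLOO-style: baseline for each sample is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def ppo_clipped_objective(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token PPO surrogate: the smaller of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Both estimators score each trajectory relative to the other trajectories sampled for the same prompt, so no separate value network is needed. Here `ratio` is the probability of a token under the current policy divided by its probability under the policy that generated the rollout.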

Trajectory and reward observability are integrated with MLflow managed by Amazon SageMaker AI, enabling turn-by-turn inspection of agent behavior and training steps. Evaluation jobs report reward, pass@k, trajectory metrics, and more before deployment to a SageMaker AI endpoint or Amazon Bedrock.
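For readers unfamiliar with pass@k, the sketch below uses the widely used unbiased estimator: given n sampled attempts of which c succeed, it estimates the probability that at least one of k randomly drawn attempts is a success. The article does not say how SageMaker AI evaluation jobs compute the metric, so treat this as an assumption about the definition rather than the service's implementation.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts with c successes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k attempts contains a success.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with 10 attempts and 3 successes, `pass_at_k(10, 3, 1)` is 0.3, which matches the plain success rate.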

Best practices outlined in the post emphasize building a training environment that is cheap, reproducible, and representative. The environment should be sandboxed or simulated to avoid live traffic impact, with tool calls and responses driven by recorded responses or isolated state. Three patterns are recommended: read-only tools using recorded responses, stateful tools with per-episode resource allocation and cleanup, and verifiable outcomes via isolated execution environments for code, SQL, or math.
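The first two patterns can be sketched as follows, assuming a simple in-memory lookup for recorded responses and a temporary directory as the per-episode resource. `RecordedTool`, `make_key`, and `episode_workspace` are hypothetical helpers written for illustration; they are not part of SageMaker AI or the post.

```python
import json
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class RecordedTool:
    """Read-only tool that replays recorded responses instead of calling a live service."""

    def __init__(self, recordings: dict) -> None:
        self._recordings = recordings

    @staticmethod
    def make_key(name: str, arguments: dict) -> str:
        # Sort keys so the same call always maps to the same recording.
        return f"{name}:{json.dumps(arguments, sort_keys=True)}"

    def call(self, name: str, **arguments) -> str:
        key = self.make_key(name, arguments)
        if key not in self._recordings:
            # Fail loudly so gaps in the recordings are noticed during training.
            raise KeyError(f"no recorded response for {key}")
        return self._recordings[key]


@contextmanager
def episode_workspace():
    """Allocate an isolated scratch directory for one episode and always clean it up."""
    path = Path(tempfile.mkdtemp(prefix="episode_"))
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

Replaying recordings keeps episodes cheap and reproducible, while the context manager ensures state written by one episode cannot leak into the next, even when the episode ends with an error.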

External evaluation should be set up before training to measure success directly against the end goal, because RL optimizes exactly the reward signal it is given, including any loopholes in it. The post also highlights the need to design rewards aligned with the end task and to monitor metrics that indicate when to iterate.

Examples draw on the SOP-Bench dataset, an Amazon Science benchmark evaluating agents’ ability to resolve tasks based on complex Standard Operating Procedures across 12 business domains.

Sources

01 AWS Machine Learning Blog: Best practices for multi-turn reinforcement learning in Amazon SageMaker AI
