Tools · Jun 21, 2026

AWS SageMaker adds over 100 detailed inference metrics and CloudWatch Insights dashboard for generative AI endpoints

New observability features include token-level latency, KV cache pressure, and GPU health metrics, with a built-in CloudWatch dashboard and PromQL compatibility for custom tooling.

TL;DR

AWS SageMaker now emits over 100 detailed inference metrics for generative AI endpoints, including token-level latency, KV cache pressure, and GPU health signals.
A new SageMaker Insights dashboard in CloudWatch provides pre-built visualizations for fleet, endpoint, and inference-component levels across Performance, Capacity, and Reliability views.
Detailed observability is enabled by default for new endpoints and opt-in for existing ones via the EnableDetailedObservability parameter.
Metrics are exposed in OpenTelemetry format and are queryable via PromQL, so they can feed custom dashboards in tools such as Grafana or Datadog.
The update targets production generative AI workloads using SageMaker’s Single-model or Inference component endpoints.

Amazon SageMaker now provides over 100 detailed inference metrics for generative AI workloads, covering GPU health, token-level latency, KV cache pressure, traffic distribution across Availability Zones, inference component placement, and cold start diagnostics. These metrics are emitted to Amazon CloudWatch and visualized in a new SageMaker Insights dashboard, which supports both Single-model endpoints and Inference component endpoints.

The SageMaker Insights dashboard is located in the CloudWatch console under Infrastructure Monitoring → SageMaker Insights. It queries metrics using PromQL and renders visualizations at the fleet, endpoint, and inference-component levels across three tabs: Performance, Capacity, and Reliability. The Performance view includes token latency, throughput, errors, and engine pressure; the Capacity view shows GPU, CPU, and memory utilization; and the Reliability view tracks Availability Zone distribution, scaling events, cold start patterns, and insufficient capacity errors.

Detailed observability is controlled by the EnableDetailedObservability parameter in the endpoint configuration. It defaults to true, so new SageMaker endpoints emit the detailed metrics automatically. Existing endpoints must opt in: create a new endpoint configuration with EnableDetailedObservability set to true, then update the endpoint to use it. The MetricsPublishFrequencyInSeconds parameter defaults to 60 seconds and can be set lower for near real-time monitoring.
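The opt-in flow for an existing endpoint can be sketched in Python. This is a minimal sketch, not the documented request shape: the article names EnableDetailedObservability and MetricsPublishFrequencyInSeconds but does not say where they sit in the request, so placing them as top-level keys is an assumption, and the helper names are hypothetical. The client is passed in so the flow can be exercised without AWS credentials; create_endpoint_config and update_endpoint are the existing SageMaker client operations.

```python
from typing import Any


def build_observability_config(
    config_name: str,
    production_variants: list[dict[str, Any]],
    publish_frequency_seconds: int = 60,
) -> dict[str, Any]:
    """Build a CreateEndpointConfig-style request with detailed observability on."""
    if publish_frequency_seconds <= 0:
        raise ValueError("publish_frequency_seconds must be positive")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": production_variants,
        # ASSUMPTION: top-level placement of the next two keys is a guess;
        # the article only names the parameters, not their position.
        "EnableDetailedObservability": True,
        "MetricsPublishFrequencyInSeconds": publish_frequency_seconds,
    }


def opt_in_existing_endpoint(
    sm_client: Any, endpoint_name: str, request: dict[str, Any]
) -> None:
    """Create the new configuration, then point the existing endpoint at it."""
    sm_client.create_endpoint_config(**request)
    sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=request["EndpointConfigName"],
    )
```

With boto3, sm_client would be boto3.client("sagemaker"). An SDK version that predates these parameters would reject the unknown keys during client-side parameter validation, so the installed version is worth checking first.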

Metrics are emitted in OpenTelemetry format and are queryable via a PromQL-compatible endpoint in CloudWatch. This allows teams to connect SageMaker metrics to external observability tools such as Grafana or Datadog. Native OpenTelemetry metrics flow automatically to CloudWatch after enablement, but existing classic metrics require OTel enrichment to be visible in the SageMaker Insights dashboard and queryable with PromQL.
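To illustrate what querying these metrics with PromQL might look like from custom tooling, the sketch below builds a p99 latency query and an instant-query URL. Everything specific here is an assumption: the metric name, the endpoint_name label, and the base URL are hypothetical; the query assumes the metric is exposed as a Prometheus-style histogram with _bucket series and an le label; and the /api/v1/query path follows the Prometheus HTTP API convention, which the article does not confirm for CloudWatch. Request signing and authentication are omitted.

```python
from urllib.parse import urlencode


def p99_latency_query(metric: str, endpoint_name: str, window: str = "5m") -> str:
    """Return PromQL for the 99th percentile of a histogram metric."""
    # ASSUMPTION: classic histogram layout (<metric>_bucket with an `le` label)
    # and a hypothetical `endpoint_name` label.
    selector = f'{metric}_bucket{{endpoint_name="{endpoint_name}"}}'
    return f"histogram_quantile(0.99, sum by (le) (rate({selector}[{window}])))"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant-query URL with the query URL-encoded."""
    # ASSUMPTION: the endpoint follows the Prometheus HTTP API path layout.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    query = p99_latency_query("time_to_first_token_seconds", "my-endpoint")
    print(query)
    print(instant_query_url("https://example.invalid", query))
```

Keeping query construction separate from the HTTP call makes it easy to reuse the same PromQL string in a Grafana panel or an alert rule.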

SageMaker supports two endpoint architectures for generative AI: Single-model endpoints, which host one model on dedicated instances, and Inference component endpoints, which allow multiple models to share GPU infrastructure with independent scaling and high availability through Availability Zone distribution. The Insights dashboard automatically adapts to display panels specific to inference components (ICs) when it detects them.

Sources

01 · AWS Machine Learning Blog: Monitor and debug generative AI inference with SageMaker detailed metrics and Insights dashboard on CloudWatch
