Models · May 9, 2026

OpenAI releases GPT-Realtime-2 voice model with expanded reasoning and 128K context window

Three new streaming audio models—GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper—are now available via the Realtime API, with independent benchmarks showing instruction retention improvements and faster response times.

TL;DR

OpenAI launched GPT-Realtime-2, positioning it as a speech-to-speech model with 'GPT-5-class reasoning' capable of tool use, interruption recovery, and longer conversations via an expanded 128K-token context window.
Two companion models release simultaneously: GPT-Realtime-Translate for streaming translation across 70+ input languages into 13 outputs, and GPT-Realtime-Whisper for low-latency streaming transcription.
Scale AI reported GPT-Realtime-2 achieved top ranking on its Audio MultiChallenge leaderboard with instruction retention rising from 36.7% to 70.8%, while Artificial Analysis measured 96.6% on Big Bench Audio speech-to-speech reasoning.
Adjustable reasoning effort levels (minimal through xhigh) let developers trade reasoning depth against response latency; minimal reasoning reaches a 1.12s time-to-first-audio versus 2.33s at high reasoning.
Early adopters reported gains: Glean saw a 42.9% relative improvement in helpfulness in internal evaluations, and Genspark a 26% increase in effective conversation completion rate.

OpenAI has released three new streaming audio models integrated into its Realtime API. GPT-Realtime-2 is positioned as a native speech-to-speech model designed for production voice agents that can reason during conversation, invoke multiple tools concurrently, recover gracefully from user interruptions, and maintain longer dialogue sessions. The context window has expanded from 32K to 128K tokens, supporting more complex conversational histories.

Two companion models address adjacent use cases. GPT-Realtime-Translate enables streaming translation from over 70 input languages into 13 output languages in real time. GPT-Realtime-Whisper provides low-latency streaming transcription for captions, note-taking, and continuous speech understanding. All three models are available immediately via the Realtime API; a ChatGPT voice upgrade incorporating these improvements remains pending.

Independent evaluators have published performance metrics. Scale AI's Audio MultiChallenge leaderboard ranks GPT-Realtime-2 first overall, with instruction retention rising to 70.8% from the 36.7% recorded by the prior GPT-Realtime-1.5 model. Artificial Analysis measured 96.6% accuracy on Big Bench Audio speech-to-speech reasoning and 96.1% on its Conversational Dynamics benchmark. Time-to-first-audio ranges from 1.12 seconds at minimal reasoning effort to 2.33 seconds at high reasoning effort. Audio pricing remains unchanged at $1.15 per hour for input and $4.61 per hour for output.
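
To put these figures in perspective, the short Python sketch below works through the arithmetic using only the numbers quoted above. The function names and the example call lengths are illustrative assumptions, not part of any OpenAI tooling, and the cost estimate assumes billing scales linearly with audio duration, which the reporting does not confirm.

```python
# Prices and scores are the figures quoted in the article.
# Everything else (names, call lengths) is illustrative only.
INPUT_USD_PER_HOUR = 1.15
OUTPUT_USD_PER_HOUR = 4.61


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


def audio_cost(input_minutes: float, output_minutes: float) -> float:
    """Estimated audio cost in USD, assuming linear per-hour billing."""
    input_cost = (input_minutes / 60) * INPUT_USD_PER_HOUR
    output_cost = (output_minutes / 60) * OUTPUT_USD_PER_HOUR
    return input_cost + output_cost


# Instruction retention: 36.7% (GPT-Realtime-1.5) to 70.8% (GPT-Realtime-2).
retention_points = 70.8 - 36.7
retention_relative = relative_gain(36.7, 70.8)

# Extra wait before first audio when moving from minimal to high effort.
latency_penalty = 2.33 - 1.12

print(f"Instruction retention: +{retention_points:.1f} points "
      f"({retention_relative:.1%} relative)")
print(f"Added time-to-first-audio, high vs minimal: {latency_penalty:.2f}s")
print(f"Hypothetical call (20 min in, 10 min out): ${audio_cost(20, 10):.2f}")
```

The absolute gain in instruction retention is about 34 percentage points, which is close to a doubling in relative terms, while the move from minimal to high reasoning effort adds roughly 1.2 seconds before the first audio arrives.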

Developers can now set reasoning effort to one of five levels (minimal, low, medium, high, and extra-high), with low as the default. The model supports preambles that precede main responses (e.g., 'let me check that') and spoken status updates during tool execution (e.g., 'checking your calendar now'), both designed to keep users engaged while the model processes a request. Early enterprise adopters report measurable gains: Glean observed a 42.9% relative improvement in helpfulness in internal evaluations of real-time organizational voice interactions, while Genspark reported a 26% increase in effective conversation completion rate for its Call for Me Agent.
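
One way a developer might reason about the effort and latency trade-off is sketched below in Python. This is a hypothetical helper, not an OpenAI API: the function name `pick_effort`, the `xhigh` spelling of the top level (taken from the summary above), and the idea of a per-application latency budget are all assumptions. Only the two latency figures reported in the article are included; the remaining levels would need to be measured.

```python
from typing import Mapping, Optional

# Effort levels in ascending order, as named in the article.
EFFORT_LEVELS = ["minimal", "low", "medium", "high", "xhigh"]

# Time-to-first-audio in seconds. Only these two points are reported;
# other levels are left out rather than guessed or interpolated.
REPORTED_TTFA = {"minimal": 1.12, "high": 2.33}


def pick_effort(
    budget_seconds: float,
    measured: Mapping[str, float] = REPORTED_TTFA,
) -> Optional[str]:
    """Return the highest effort level whose measured latency fits the budget.

    Levels without a measurement are skipped. Returns None when no
    measured level fits within the budget.
    """
    best = None
    for level in EFFORT_LEVELS:
        latency = measured.get(level)
        if latency is not None and latency <= budget_seconds:
            best = level
    return best


print(pick_effort(1.5))  # tight budget, e.g. a fast-paced phone agent
print(pick_effort(3.0))  # looser budget, e.g. a research assistant
print(pick_effort(1.0))  # no reported level is this fast
```

With the reported figures, a 1.5-second budget selects minimal effort, a 3-second budget selects high, and a 1-second budget selects nothing. Passing in latencies measured on a team's own infrastructure would fill in the unreported levels.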
