Research · Apr 18, 2026

Google DeepMind releases Gemini 3.1 Flash TTS with granular audio controls for expressive speech synthesis

The new text-to-speech model introduces audio tags for fine-grained control over vocal style, pacing, and delivery across 70+ languages, with quality benchmarked on the Artificial Analysis leaderboard.

TL;DR
  • Google DeepMind announced Gemini 3.1 Flash TTS, a text-to-speech model offering improved speech quality and expressiveness with granular natural language controls via audio tags.
  • The model achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard and ranks in the top efficiency category for quality-to-cost ratio.
  • Gemini 3.1 Flash TTS supports over 70 languages, multi-speaker dialogue, and includes SynthID watermarking to identify AI-generated audio.
  • The model is rolling out in preview via the Gemini API, Google AI Studio, and Vertex AI, with availability for Google Workspace users through Google Vids.
  • Developers can use audio tags to control vocal style, pace, tone, and accent, with parameters exportable as API code for consistent voice reproduction.

Google DeepMind has released Gemini 3.1 Flash TTS, a text-to-speech model that gives developers and enterprises finer control over AI-generated speech. The model targets production use and emphasizes both speech quality and fine-grained expressiveness.

The system introduces audio tags: natural-language controls embedded directly in the text input that let users adjust vocal characteristics mid-sentence without retraining the model. Users can specify scene direction, speaker-level parameters such as pace and tone, and inline expression changes. Developers can configure voices in Google AI Studio and export the resulting parameters as API code, so the same voice can be reproduced across deployments.
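To make the idea of inline controls concrete, here is a minimal sketch of how a script with a scene direction and per-line tags might be assembled into one text input. The bracketed tag syntax, the tag names, and the `Line` and `render_script` helpers are all illustrative assumptions; the announcement does not specify the exact tag format, and this code does not call the Gemini API.

```python
from dataclasses import dataclass


@dataclass
class Line:
    speaker: str
    text: str
    # Hypothetical inline tags such as "whispering"; the real tag
    # vocabulary and syntax are not given in the source.
    tags: tuple[str, ...] = ()


def render_script(scene: str, lines: list[Line]) -> str:
    """Build one text prompt: a scene direction, then tagged speaker lines."""
    parts = [f"Scene: {scene}"]
    for line in lines:
        # Assumed convention: each tag is wrapped in square brackets
        # and placed before the words it should affect.
        tag_prefix = "".join(f"[{tag}] " for tag in line.tags)
        parts.append(f"{line.speaker}: {tag_prefix}{line.text}")
    return "\n".join(parts)


script = render_script(
    "A quiet library, late evening.",
    [
        Line("Ana", "Did you find the book?", tags=("whispering",)),
        Line("Ben", "Yes, it was on the top shelf!", tags=("excited", "fast")),
    ],
)
print(script)
```

Keeping the script as structured data and rendering it in one place means a change to the tag convention only touches `render_script`, which fits the reproducibility goal described above.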

On the Artificial Analysis TTS leaderboard, which aggregates blind human preference tests across thousands of samples, the model achieved an Elo score of 1,211. It also landed in what the benchmark describes as its most attractive quadrant, meaning high-quality output at lower cost than competing systems. The model supports 70+ languages and native multi-speaker dialogue generation.
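For readers unfamiliar with Elo scores, the sketch below shows how a rating gap translates into an expected head-to-head preference rate. It assumes the standard Elo formula with a 400-point scale, which the leaderboard may or may not use exactly, and the competitor rating of 1,150 is a made-up value for illustration only.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A is preferred over B.

    Assumes the conventional 400-point logistic scale; the leaderboard's
    exact rating method is not described in the source.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1211 is the score reported in the article; 1150 is a hypothetical rival.
p = expected_preference(1211, 1150)
print(f"Expected preference rate: {p:.1%}")
```

Under these assumptions, a gap of a few dozen points corresponds to a modest edge in blind comparisons, not a decisive one.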

Google is rolling out the model in preview: developers can access it via the Gemini API and Google AI Studio, enterprises can use it on Vertex AI, and Google Workspace subscribers can access it through Google Vids. All generated audio includes SynthID watermarking, which marks synthetic speech as machine-generated to support transparency and to limit misuse such as audio deepfakes.

Sources
  1. Google DeepMind, Blog: "Gemini 3.1 Flash TTS: the next generation of expressive AI speech"