Google DeepMind releases Gemini 3.1 Flash TTS with granular audio controls for expressive speech synthesis
The new text-to-speech model introduces audio tags for fine-grained control over vocal style, pacing, and delivery across 70+ languages, with quality benchmarked on the Artificial Analysis leaderboard.
- Google DeepMind announced Gemini 3.1 Flash TTS, a text-to-speech model offering improved speech quality and expressiveness with granular natural language controls via audio tags.
- The model achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard and ranks in the top efficiency category for quality-to-cost ratio.
- Gemini 3.1 Flash TTS supports over 70 languages, multi-speaker dialogue, and includes SynthID watermarking to identify AI-generated audio.
- The model is rolling out in preview via the Gemini API, Google AI Studio, and Vertex AI, with availability for Google Workspace users through Google Vids.
- Developers can use audio tags to control vocal style, pace, tone, and accent, with parameters exportable as API code for consistent voice reproduction.
Google DeepMind has released Gemini 3.1 Flash TTS, a text-to-speech model that gives developers and enterprises finer control over AI-generated speech. The model targets both speech quality and fine-grained expressiveness for production use.
The system introduces audio tags: natural-language controls embedded directly in the text input that let users adjust vocal characteristics mid-sentence without retraining the model. Users can specify scene direction, speaker-level parameters such as pace and tone, and inline expression changes. Developers can configure voices in Google AI Studio and export the resulting parameters as API code, so the same voice can be reproduced across deployments.
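To make the idea of tags embedded in the text input concrete, here is a minimal Python sketch that assembles a two-speaker script with a scene direction and one inline expression tag. The square-bracket tag format, the `Scene:` prefix, and the names `Line` and `build_script` are all assumptions made for illustration; the announcement does not specify the tag syntax, and the sketch does not call any Google API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    speaker: str
    text: str
    tag: Optional[str] = None  # hypothetical inline expression tag, e.g. "whispering"


def build_script(scene: str, lines: List[Line]) -> str:
    """Assemble a plain-text TTS input with a scene direction and inline tags.

    The bracketed tag format is an illustrative assumption, not documented syntax.
    """
    parts = [f"Scene: {scene}"]
    for line in lines:
        # Prepend the tag only when the line carries one.
        prefix = f"[{line.tag}] " if line.tag else ""
        parts.append(f"{line.speaker}: {prefix}{line.text}")
    return "\n".join(parts)


script = build_script(
    "A quiet library, late evening.",
    [
        Line("Ana", "Did you find the book?", tag="whispering"),
        Line("Ben", "Yes, it was on the top shelf."),
    ],
)
print(script)
```

Keeping the script as structured data until the last step means the same lines can be re-rendered if the real tag syntax differs from the one assumed here.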
On the Artificial Analysis TTS leaderboard, which aggregates blind human preference tests across thousands of samples, the model achieved an Elo score of 1,211. It also landed in what the benchmark describes as its most attractive quadrant, meaning high-quality output at low cost relative to competing systems. The model supports 70+ languages and native multi-speaker dialogue generation.
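For readers unfamiliar with Elo scores, the sketch below shows how a rating gap translates into an expected preference rate under the standard logistic Elo formula. Two things here are assumptions: the source does not state which rating formula or scale Artificial Analysis uses, and the opponent rating of 1,111 is a hypothetical value chosen only to illustrate a 100-point gap.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head comparisons won by A under standard Elo.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# 1211 is the score reported in the article; 1111 is a hypothetical competitor.
p = elo_expected_score(1211, 1111)
print(round(p, 2))  # 0.64
```

Under these assumptions, a 100-point lead corresponds to listeners preferring the higher-rated model in roughly 64% of blind comparisons.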
Google is rolling out the model in preview: developers can access it via the Gemini API and Google AI Studio, enterprises can use it on Vertex AI, and Google Workspace subscribers can access it through Google Vids. All generated audio includes SynthID watermarking, a technique that marks synthetic speech as machine-generated to aid transparency and limit misuse such as audio deepfakes.