Introducing next-generation audio models in the API
https://openai.com/index/introducing-our-next-generation-audio-models/
Three new models: gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-tts.
- Beats Whisper v2 and v3 on word error rate (WER) across all languages evaluated.
- ~35% lower WER on the Common Voice and FLEURS benchmarks.
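For context, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch of that calculation follows. The function name and the lowercase/whitespace normalization are illustrative assumptions, not the evaluation pipeline behind the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: simple lowercase + whitespace tokenization; real
    benchmarks usually apply heavier text normalization first.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one deletion (jumps) over 5 reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))
```

Lower is better: a perfect transcript scores 0.0, and the score can exceed 1.0 when the hypothesis contains many inserted words.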
TTS steerability: instruct the model HOW to speak, not just what to say.
- An instruction such as “Speak like a calm therapist” changes the delivery on the fly, with no reprogramming.
- 11 base voices; openai.fm playground for live testing.
- Pricing: gpt-4o-transcribe $0.006/min, gpt-4o-mini-transcribe $0.003/min; gpt-4o-mini-tts $0.015/min.
- RL training on diverse data; robust to accents, noise, fast speech.
- Agents SDK integration enables continuous listen→process→speak loops.
- Security risk: stage directions embedded in TTS scripts are enforced inconsistently.
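The steerability and stage-direction notes above can be sketched together: keep the speaking style in the request's `instructions` field and strip direction-like text out of the script before it becomes `input`. This is a minimal sketch, not a vetted mitigation. The helper names, the choice of the `coral` voice, and the assumption that stage directions appear in square brackets are all illustrative. The `synthesize` function assumes the official `openai` Python package and an `OPENAI_API_KEY` in the environment, and it is not called here, so the example runs offline.

```python
import re

# Assumption: stage directions are written in square brackets, e.g. "[whispering]".
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")


def strip_stage_directions(script: str) -> str:
    """Remove bracketed directions and collapse the leftover whitespace."""
    without_directions = STAGE_DIRECTION.sub(" ", script)
    return re.sub(r"\s+", " ", without_directions).strip()


def build_speech_request(script: str, style: str, voice: str = "coral") -> dict:
    """Keep WHAT to say (input) separate from HOW to say it (instructions)."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": strip_stage_directions(script),
        "instructions": style,
    }


def synthesize(request: dict, out_path: str) -> None:
    """Send the request and write the audio to out_path (network call)."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)


request = build_speech_request(
    "Take a slow breath. [shout the next line] You are doing fine.",
    style="Speak like a calm therapist.",
)
print(request["input"])
```

Stripping only square brackets leaves ordinary parentheses in the script untouched. Whether that is strict enough depends on where the script text comes from.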
| Type | Link |
| Added | Apr 16, 2026 |
| Modified | Apr 16, 2026 |