Introducing next-generation audio models in the API

Three new models: gpt-4o-transcribe and gpt-4o-mini-transcribe (speech-to-text), and gpt-4o-mini-tts (text-to-speech).
- The transcribe models beat Whisper v2 and v3 on word error rate (WER) across all languages evaluated.
- ~35% lower WER on the Common Voice and FLEURS benchmarks.
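
For readers unfamiliar with the metric, here is a minimal sketch of how WER is usually computed: word-level edit distance divided by the number of reference words. The function names and sample sentences are illustrative, not taken from the announcement. Treating the ~35% figure as a relative reduction is an assumption, and the WER values in the usage lines are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
# Made-up numbers: a baseline WER of 10% falling to 6.5%.
gain = relative_reduction(0.10, 0.065)
```

Lowercasing and splitting on whitespace is the simplest possible normalization; published benchmarks typically apply stricter text normalization before scoring.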
TTS steerability: instruct the model how to speak, not just what to say.
- A prompt such as “Speak like a calm therapist” adjusts the delivery dynamically, with no reprogramming.
- 11 base voices; openai.fm playground for live testing.
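
A steering prompt can be sent alongside the text roughly as follows. This is a sketch that assumes the official `openai` Python package and an `OPENAI_API_KEY` in the environment. The helper names, the choice of the "coral" voice, and the output path are illustrative. The request is built in a separate function so it can be inspected without calling the API.

```python
def build_speech_request(text: str, instructions: str, voice: str = "coral") -> dict:
    """Assemble the parameters for a steerable TTS request."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,                 # WHAT to say
        "instructions": instructions,  # HOW to say it
    }


def synthesize(text: str, instructions: str, out_path: str = "speech.mp3") -> None:
    """Send the request and stream the audio to a file (makes a network call)."""
    from openai import OpenAI  # imported lazily; needs `pip install openai`

    client = OpenAI()
    request = build_speech_request(text, instructions)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)


request = build_speech_request(
    "Take a slow breath. You are doing fine.",
    "Speak like a calm therapist.",
)
```

Calling `synthesize(...)` with the same two arguments would write the spoken result to `speech.mp3`.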
Pricing: gpt-4o-transcribe $0.006/min, gpt-4o-mini-transcribe $0.003/min, gpt-4o-mini-tts $0.015/min.
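
As a quick worked example, the per-minute rates above translate into job costs like this. The function name and the audio durations are illustrative, and the sketch assumes simple per-minute billing with no rounding or minimum charge.

```python
from decimal import Decimal

# Per-minute rates in USD, as listed in the pricing line above.
RATE_PER_MINUTE = {
    "gpt-4o-transcribe": Decimal("0.006"),
    "gpt-4o-mini-transcribe": Decimal("0.003"),
    "gpt-4o-mini-tts": Decimal("0.015"),
}


def estimate_cost(model: str, minutes: float) -> Decimal:
    """Estimated cost in USD for `minutes` of audio on `model`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Convert via str so floats like 2.5 do not carry binary rounding noise.
    return RATE_PER_MINUTE[model] * Decimal(str(minutes))


hour_full = estimate_cost("gpt-4o-transcribe", 60)       # one hour, full model
hour_mini = estimate_cost("gpt-4o-mini-transcribe", 60)  # one hour, mini model
ten_min_tts = estimate_cost("gpt-4o-mini-tts", 10)       # ten minutes of speech
```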
Training: reinforcement learning (RL) on diverse data; robust to accents, noise, and fast speech.
Agents SDK integration enables continuous listen→process→speak loops.
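
The shape of such a loop can be sketched in plain Python. This is not the Agents SDK API: `voice_loop` and the three callables passed to it are hypothetical stand-ins for a transcription call, an agent or model call, and a TTS call.

```python
from typing import Callable, Iterable, Iterator


def voice_loop(
    audio_chunks: Iterable[bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield one spoken reply per non-empty utterance in the audio stream."""
    for chunk in audio_chunks:
        text = transcribe(chunk)  # listen
        if not text.strip():
            continue              # skip silence or empty transcripts
        reply = respond(text)     # process
        yield speak(reply)        # speak


# Stub stages so the loop runs without any audio hardware or network access.
replies = list(
    voice_loop(
        [b"hello", b"   ", b"goodbye"],
        transcribe=lambda audio: audio.decode(),
        respond=lambda text: text.upper(),
        speak=lambda text: text.encode(),
    )
)
```

Writing the loop as a generator keeps it continuous: each reply is produced as soon as its utterance has been processed, without waiting for the input stream to end.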
Security risk: stage directions embedded in TTS scripts are enforced inconsistently.
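
One possible mitigation, not something the announcement prescribes, is to strip stage directions out of the script before synthesis and pass the delivery guidance through a separate, trusted channel. The sketch below assumes directions are written in square brackets. The function name and sample script are illustrative.

```python
import re

# Assumption: stage directions appear as [bracketed] spans with no nesting.
_DIRECTION = re.compile(r"\[[^\[\]]*\]")


def split_stage_directions(script: str) -> tuple[str, list[str]]:
    """Return (text to be spoken, list of extracted stage directions)."""
    directions = [match.group(0)[1:-1].strip() for match in _DIRECTION.finditer(script)]
    spoken = _DIRECTION.sub(" ", script)
    spoken = re.sub(r"\s+", " ", spoken).strip()    # collapse leftover gaps
    spoken = re.sub(r"\s+([,.!?;:])", r"\1", spoken)  # no space before punctuation
    return spoken, directions


spoken, directions = split_stage_directions(
    "[whispering] Hello there. [pause] How are you?"
)
```

The extracted directions can then be reviewed or discarded, and only the cleaned text is sent as the TTS input.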