Introducing next-generation audio models in the API

https://openai.com/index/introducing-our-next-generation-audio-models/
  • Three new models: gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-tts.
    • The transcribe models beat Whisper v2 and v3 on word error rate (WER) across all languages evaluated.
    • ~35% lower word error rate on Common Voice and FLEURS benchmarks.
  • TTS steerability: instruct the model HOW to speak, not just what.
    • A prompt such as “Speak like a calm therapist” adjusts delivery dynamically, with no reprogramming.
    • 11 base voices; openai.fm playground for live testing.
  • Pricing: gpt-4o-transcribe $0.006/min; gpt-4o-mini-transcribe $0.003/min; gpt-4o-mini-tts $0.015/min.
  • RL training on diverse data; robust to accents, noise, fast speech.
  • Agents SDK integration enables continuous listen→process→speak loops.
  • Security risk: stage directions embedded in TTS scripts are inconsistently enforced.
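
To make the pricing bullet concrete, here is a minimal cost-estimation sketch. The per-minute rates are copied from the notes above; the function name `estimate_cost` and the example workload are hypothetical, and actual billing may be metered differently, so treat the result as a rough estimate only.

```python
# Per-minute rates in USD, copied from the pricing bullet in these notes.
# Assumption: cost scales linearly with audio duration.
RATES_PER_MINUTE = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-4o-mini-tts": 0.015,
}


def estimate_cost(model: str, minutes: float) -> float:
    """Return the estimated USD cost of processing `minutes` of audio."""
    if model not in RATES_PER_MINUTE:
        raise KeyError(f"no rate recorded for model: {model}")
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Round to 4 decimals to hide floating-point noise in small amounts.
    return round(RATES_PER_MINUTE[model] * minutes, 4)


if __name__ == "__main__":
    # Hypothetical workload: transcribe a 60-minute call, then speak a
    # 5-minute summary back to the user.
    transcription = estimate_cost("gpt-4o-transcribe", 60)
    speech = estimate_cost("gpt-4o-mini-tts", 5)
    print(f"transcription: ${transcription:.3f}")
    print(f"speech:        ${speech:.3f}")
    print(f"total:         ${transcription + speech:.3f}")
```

At these rates, the mini transcription model costs half as much as the full one for the same audio, and a minute of TTS output costs more than a minute of transcription with either model.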

Type Link
Added Apr 16, 2026
Modified Apr 16, 2026