microsoft/VibeVoice

TLDR

  • Microsoft’s MIT-licensed VibeVoice transcribes one hour of audio in ~9 minutes on an M5 Max Mac, with speaker diarization built in.

Key Takeaways

  • Runs with a single uv command using the 5.71GB 4-bit mlx-community/VibeVoice-ASR-4bit model; the full model is 17.3GB.
  • Benchmarked at 524 seconds (about 8 minutes 44 seconds) for 60 minutes of audio on a 128GB M5 Max MacBook Pro, peaking at 61.5GB RAM during prefill.
  • Output is timestamped JSON with text, start, end, duration, and speaker_id per segment, ready to load into Datasette Lite.
  • The default --max-tokens of 8192 covers ~25 minutes of audio; set it to 32768 for a full hour. Audio longer than one hour must be split manually into overlapping chunks.
  • Diarization correctly separated two speakers across a podcast and identified a third voice, used only for intros and sponsor reads, as a separate speaker.
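
The manual splitting mentioned above can be planned with a few lines of arithmetic. The following is a minimal sketch, not the method from the original post: the function name `plan_chunks` and the 30-second overlap are assumptions chosen for illustration, and only the one-hour cap comes from the summary. It computes chunk boundaries only; cutting the audio and merging the transcripts are left out.

```python
def plan_chunks(total_seconds, max_chunk=3600, overlap=30):
    """Return (start, end) second offsets covering the whole recording.

    max_chunk=3600 reflects the one-hour cap described above.
    overlap=30 is an illustrative assumption, not a recommended value.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= overlap < max_chunk:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk")

    step = max_chunk - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        end = min(start + max_chunk, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return chunks


if __name__ == "__main__":
    # A two-hour recording needs three chunks once overlap is accounted for.
    for start, end in plan_chunks(7200):
        print(f"{start:>5} s to {end:>5} s")
```

Each chunk after the first starts `overlap` seconds before the previous one ends, so a sentence cut at a boundary appears whole in at least one chunk. Note that speaker IDs from separate runs may not line up, so merging chunked transcripts would likely need extra reconciliation.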

Why It Matters

  • Fully local, MIT-licensed speech-to-text with diarization closes a gap that previously required cloud APIs or separate post-processing steps.
  • The 1-hour cap and high RAM ceiling (61.5GB peak) are hard constraints for anyone processing long-form audio on consumer hardware.
  • Structured JSON output with per-segment speaker IDs makes downstream search, transcript editing, and analysis straightforward without extra tooling.
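
As an illustration of that downstream analysis, here is a minimal sketch that totals speaking time per speaker. It assumes the output is a top-level JSON list of segment objects; the field names (text, start, end, duration, speaker_id) come from the summary above, but the exact file layout is an assumption, and the sample data and the helper name `speaking_time_by_speaker` are invented for the example.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like the segments described above.
SAMPLE = """
[
  {"text": "Welcome to the show.", "start": 0.0, "end": 4.0, "duration": 4.0, "speaker_id": 0},
  {"text": "Thanks for having me.", "start": 4.0, "end": 6.5, "duration": 2.5, "speaker_id": 1},
  {"text": "Let's get started.", "start": 6.5, "end": 8.0, "duration": 1.5, "speaker_id": 0}
]
"""


def speaking_time_by_speaker(segments):
    """Sum the duration field of each segment, grouped by speaker_id."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker_id"]] += segment["duration"]
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    segments = json.loads(SAMPLE)
    for speaker, seconds in speaking_time_by_speaker(segments).items():
        print(f"speaker {speaker}: {seconds:.1f} s")
```

With a real transcript, `json.load` on the output file would replace the inline sample, provided the file matches the assumed list-of-segments layout.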

Simon Willison, Simon Willison’s Weblog · 2026-04-27