microsoft/VibeVoice
TLDR
- Microsoft’s MIT-licensed VibeVoice transcribes one hour of audio in ~9 minutes on an M5 Max Mac, with speaker diarization built in.
Key Takeaways
- Run via a single `uv` command using the 5.71GB `mlx-community/VibeVoice-ASR-4bit` model; the full model is 17.3GB.
- Benchmarked at 524 seconds for 60 minutes of audio on a 128GB M5 Max MacBook Pro, peaking at 61.5GB RAM during prefill.
- Output is timestamped JSON with `text`, `start`, `end`, `duration`, and `speaker_id` per segment, ready to load into Datasette Lite.
- Default `--max-tokens` of 8192 covers ~25 minutes; set to 32768 for a full hour. Audio longer than one hour requires manual splitting with overlap.
- Diarization correctly separated two speakers across a podcast, and flagged a third voice used only for intros and sponsor reads.
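The manual splitting with overlap mentioned above could be planned roughly as follows. This is a minimal sketch, not something taken from the original post: the `plan_chunks` helper, the one-hour chunk length, and the 30-second overlap are all illustrative assumptions, and the code only computes time boundaries without touching any audio.

```python
from typing import List, Tuple


def plan_chunks(
    total_seconds: float,
    chunk_seconds: float = 3600.0,   # assumed: stay within the one-hour limit
    overlap_seconds: float = 30.0,   # assumed: arbitrary overlap, tune as needed
) -> List[Tuple[float, float]]:
    """Return (start, end) boundaries in seconds for overlapping chunks.

    Hypothetical helper: each chunk starts `overlap_seconds` before the
    previous chunk ended, so speech cut at a boundary appears in both.
    """
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")
    if total_seconds <= 0:
        return []

    step = chunk_seconds - overlap_seconds
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return chunks


if __name__ == "__main__":
    # A 2.5-hour recording split into chunks of at most one hour.
    for start, end in plan_chunks(9000.0):
        print(f"{start:8.1f}s -> {end:8.1f}s")
```

Each boundary pair could then be handed to whatever audio tool is used to cut the file. Timestamps in each chunk's transcript would need the chunk's start offset added back, and segments that fall inside an overlap would need de-duplicating.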
Why It Matters
- Fully local, MIT-licensed speech-to-text with diarization closes a gap that previously required cloud APIs or separate post-processing steps.
- The 1-hour cap and high RAM ceiling (61.5GB peak) are hard constraints for anyone processing long-form audio on consumer hardware.
- Structured JSON output with per-segment speaker IDs makes downstream search, transcript editing, and analysis straightforward without extra tooling.
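As an illustration of that downstream analysis, here is a minimal sketch that totals talk time per speaker and prints a speaker-labelled transcript. It assumes the output is a JSON array of segment objects carrying the `text`, `start`, `end`, `duration`, and `speaker_id` fields named above; the sample data, the array layout, and both helper function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical sample; real output may be laid out differently.
SAMPLE = """[
  {"text": "Welcome to the show.", "start": 0.0, "end": 2.5, "duration": 2.5, "speaker_id": 0},
  {"text": "Thanks for having me.", "start": 2.5, "end": 4.0, "duration": 1.5, "speaker_id": 1},
  {"text": "Let's get started.", "start": 4.0, "end": 5.5, "duration": 1.5, "speaker_id": 0}
]"""


def talk_time_by_speaker(segments: List[dict]) -> Dict[int, float]:
    """Sum the `duration` field for each `speaker_id`."""
    totals: Dict[int, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["duration"]
    return dict(totals)


def format_transcript(segments: List[dict]) -> List[str]:
    """Render one line per segment: start time, speaker label, text."""
    return [
        f"[{seg['start']:7.1f}s] Speaker {seg['speaker_id']}: {seg['text']}"
        for seg in segments
    ]


if __name__ == "__main__":
    segments = json.loads(SAMPLE)
    print(talk_time_by_speaker(segments))
    print("\n".join(format_transcript(segments)))
```

Only the standard library is used here, which is the sense in which no extra tooling is needed.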
Simon Willison, Simon Willison’s Weblog · 2026-04-27 · Read the original