microsoft/VibeVoice

Apr 27, 2026 · coding

TLDR

Microsoft’s MIT-licensed VibeVoice transcribes one hour of audio in ~9 minutes on an M5 Max Mac, with speaker diarization built in.

Run via a single uv command using the 5.71GB mlx-community/VibeVoice-ASR-4bit model; the full model is 17.3GB.
Benchmarked at 524 seconds for 60 minutes of audio on a 128GB M5 Max MacBook Pro, peaking at 61.5GB RAM during prefill.
Output is timestamped JSON with text, start, end, duration, and speaker_id per segment, ready to load into Datasette Lite.
The default --max-tokens of 8192 covers ~25 minutes of audio; set it to 32768 for a full hour. Audio longer than one hour requires manual splitting with overlap.
Diarization correctly separated two speakers across a podcast, and flagged a third voice used only for intros and sponsor reads.
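
The manual splitting mentioned above can be planned with a little arithmetic. The following is a minimal sketch, not part of VibeVoice: the function name plan_chunks and the 30-second overlap are illustrative assumptions, and the 3600-second default simply mirrors the one-hour cap. It only computes (start, end) offsets in seconds; cutting the audio itself is left to whatever tool you prefer.

```python
def plan_chunks(total_seconds, max_chunk_seconds=3600.0, overlap_seconds=30.0):
    """Return (start, end) offsets in seconds covering the whole recording.

    Hypothetical helper: each chunk is at most max_chunk_seconds long and
    begins overlap_seconds before the previous chunk ends, so speech cut at
    a boundary appears complete in at least one chunk.
    """
    if total_seconds <= 0:
        return []
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")

    step = max_chunk_seconds - overlap_seconds  # distance between chunk starts
    chunks = []
    start = 0.0
    while True:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return chunks


if __name__ == "__main__":
    # A two-hour recording split into chunks of at most one hour.
    for start, end in plan_chunks(7200, overlap_seconds=60):
        print(f"{start:8.1f} -> {end:8.1f}")
```

Timestamps in each chunk's transcript are relative to that chunk, so the chunk's start offset would need to be added back before merging, and segments that fall inside an overlap would need de-duplicating.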

Fully local, MIT-licensed speech-to-text with diarization closes a gap that previously required cloud APIs or separate post-processing steps.
The one-hour cap and 61.5GB peak RAM usage are hard constraints for anyone processing long-form audio on consumer hardware.
Structured JSON output with per-segment speaker IDs makes downstream search, transcript editing, and analysis straightforward without extra tooling.
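
As a rough illustration of that downstream work, here is a minimal sketch using only the Python standard library. It assumes the transcript is a JSON array of segment objects carrying the five fields listed above (text, start, end, duration, speaker_id); the exact top-level layout, the sample data, and the helper names are assumptions for illustration, not taken from VibeVoice.

```python
import json
from collections import defaultdict

# Hypothetical sample in the assumed shape: a JSON array of segments.
SAMPLE = """[
  {"text": "Welcome to the show.", "start": 0.0, "end": 2.5, "duration": 2.5, "speaker_id": 0},
  {"text": "Thanks for having me.", "start": 2.5, "end": 6.5, "duration": 4.0, "speaker_id": 1},
  {"text": "Let's talk about local models.", "start": 6.5, "end": 8.0, "duration": 1.5, "speaker_id": 0}
]"""


def speaking_time_by_speaker(segments):
    """Sum segment durations (seconds) for each speaker_id."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker_id"]] += segment["duration"]
    return dict(totals)


def find_segments(segments, term):
    """Return segments whose text contains term, ignoring case."""
    needle = term.lower()
    return [s for s in segments if needle in s["text"].lower()]


if __name__ == "__main__":
    segments = json.loads(SAMPLE)
    print(speaking_time_by_speaker(segments))
    for hit in find_segments(segments, "local"):
        print(f'[{hit["start"]:.1f}s] speaker {hit["speaker_id"]}: {hit["text"]}')
```

If the real output wraps the segments in an outer object, only the json.loads step would need adjusting.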

Simon Willison, Simon Willison’s Weblog · 2026-04-27