Microsoft VibeVoice: Open-Source Frontier Voice AI

TLDR

  • Microsoft open-sources VibeVoice, a family of speech recognition (ASR) and text-to-speech (TTS) models built on continuous speech tokenizers running at 7.5 Hz, targeting long-form audio of up to 90 minutes.

Key Takeaways

  • VibeVoice-ASR-7B handles 60 minutes of audio in a single 64K-token pass, jointly producing speaker labels (Who), timestamps (When), and the transcript (What).
  • Custom hotword injection lets users supply domain-specific terms (names, jargon) to improve their recognition without fine-tuning.
  • VibeVoice-TTS-1.5B synthesizes up to 90 minutes of multi-speaker dialogue with up to 4 distinct speakers; accepted as an Oral at ICLR 2026.
  • VibeVoice-Realtime-0.5B is a lightweight streaming TTS model built for deployment, with roughly 300 ms latency to first audible output and a robust generation window of about 10 minutes.
  • TTS code was pulled in September 2025 after misuse reports; ASR and Realtime models remain available via Hugging Face, and ASR is now integrated into Hugging Face Transformers.
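
The 7.5 Hz frame rate is what makes hour-long audio fit in one context window. The figures above can be checked with a back-of-the-envelope sketch. It assumes one token per tokenizer frame and reads "64K" as 65,536; the summary does not say how prompt text or any additional token streams count against the budget, so treat the percentages as rough upper-level estimates rather than the model's actual accounting.

```python
FRAME_RATE_HZ = 7.5          # tokenizer frame rate stated in the summary
CONTEXT_TOKENS = 64 * 1024   # assumption: "64K" means 65,536 tokens


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    # Durations mentioned in the summary: 10 min (Realtime), 60 min (ASR), 90 min (TTS).
    for minutes in (10, 60, 90):
        frames = speech_frames(minutes)
        share = frames / CONTEXT_TOKENS  # fraction of the assumed budget
        print(f"{minutes:>3} min -> {frames:>6} frames ({share:.0%} of 64K)")
```

Under these assumptions, 60 minutes of audio comes to 27,000 frames and 90 minutes to 40,500, both below 65,536, which leaves room for text tokens in the same pass.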

Hacker News Comment Review

  • The TTS removal history is the dominant concern: commenters note Microsoft pulled the TTS code over misuse and question what safeguards, if any, have changed before this resurfacing.
  • Some commenters see VibeVoice-ASR as over-engineered relative to Whisper and Parakeet for typical dictation workloads; the extra model size buys diarization and long-context coherence, not a lower word error rate (WER).
  • The “Vibe” branding drew dry commentary about AI naming trends, and the spontaneous-singing TTS demo was flagged as unsettling rather than impressive.

Notable Comments

  • @embedding-shape: points out TTS was previously pulled for safety reasons and asks what has changed since then.
  • @walthamstow: contrasts the size of the ASR model with the lighter Parakeet and Whisper for quick dictation; calls the spontaneous-singing clip “creepy as fuck”.
