VibeVoice-TTS-1.5B synthesizes up to 90 minutes of multi-speaker dialogue with up to 4 distinct speakers; accepted as an Oral at ICLR 2026.
VibeVoice-Realtime-0.5B is a deployment-friendly streaming TTS model with ~300 ms first-audible latency and a robust generation window of ~10 minutes.
TTS code was pulled in September 2025 after misuse reports; ASR and Realtime models remain available via Hugging Face, and ASR is now integrated into Hugging Face Transformers.
Hacker News Comment Review
The TTS removal history is the dominant concern: commenters note Microsoft pulled the TTS code over misuse and question what safeguards, if any, have changed before this resurfacing.
Some commenters see VibeVoice-ASR as over-engineered relative to Whisper and Parakeet for typical dictation workloads; the size premium buys diarization and long-context coherence, not a lower raw word error rate (WER).
The “Vibe” branding drew dry commentary about AI naming trends, and the spontaneous-singing TTS demo was flagged as unsettling rather than impressive.
Notable Comments
@embedding-shape: points out TTS was previously pulled for safety reasons and asks what has changed since then.
@walthamstow: contrasts the ASR model's size with the lighter Parakeet and Whisper for quick dictation; calls the spontaneous-singing clip “creepy as fuck”.