Voice-AI-for-Beginners – A curated learning path for developers

· ai ai-agents open-source

TLDR

  • A structured curriculum that takes developers from a first speech-to-text (STT) call through production telephony, covering LiveKit, Pipecat, voice activity detection (VAD), turn detection, and scaling.

Key Takeaways

  • Recommended stack: WebRTC or telephony transport + streaming STT/LLM/TTS pipeline + semantic turn detection; LiveKit Agents and Pipecat are the safest open-source starting points.
  • Practical latency targets that change how a conversation feels in production: LLM time to first token (TTFT) under 300 ms and TTS first-byte latency under 200 ms.
  • Pure acoustic VAD is insufficient; pair Silero VAD with a semantic end-of-utterance model like LiveKit’s SmolLM-based turn detector.
  • Ultravox (fixie-ai) skips the separate ASR stage entirely, for ~150 ms TTFT; Moshi is the leading open-source full-duplex speech-to-speech model to study.
  • For the fastest managed path to a first call: Vapi, Retell, or Bland. For self-hosted TTS on CPU: Kokoro 82M (Apache-licensed), or Piper for edge/offline use.
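
The pairing of acoustic VAD with a semantic end-of-utterance model, as recommended in the takeaways, can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the LiveKit or Silero API: `TurnState`, `should_end_turn`, and every threshold value are hypothetical names and numbers chosen for the example. It assumes an upstream VAD reports trailing silence in milliseconds and a separate model reports an end-of-utterance probability.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    """Hypothetical snapshot of the signals available at one decision point."""
    silence_ms: float        # trailing silence reported by an acoustic VAD
    eou_probability: float   # semantic end-of-utterance score in [0, 1]


def should_end_turn(
    state: TurnState,
    min_silence_ms: float = 200.0,   # assumed: ignore very short pauses
    max_silence_ms: float = 1500.0,  # assumed: hard timeout regardless of semantics
    eou_threshold: float = 0.6,      # assumed: semantic confidence needed to end early
) -> bool:
    """Decide whether the user has finished speaking.

    Acoustic silence alone only gates the decision; between the two
    silence bounds, the semantic score decides.
    """
    if state.silence_ms < min_silence_ms:
        return False  # user is still speaking or merely taking a breath
    if state.silence_ms >= max_silence_ms:
        return True   # long silence: end the turn even if the sentence looks unfinished
    return state.eou_probability >= eou_threshold


if __name__ == "__main__":
    # A mid-sentence pause ("I would like to ...") should not end the turn.
    print(should_end_turn(TurnState(silence_ms=400, eou_probability=0.2)))
    # The same pause after a complete sentence should.
    print(should_end_turn(TurnState(silence_ms=400, eou_probability=0.9)))
```

The design choice here is that the semantic score can shorten the wait after a complete sentence, while the hard timeout keeps the agent from hanging if the semantic model stays unsure.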

Hacker News Comment Review

  • No substantive HN discussion yet.
