Structured curriculum taking developers from first STT call through production telephony, covering LiveKit, Pipecat, VAD, turn detection, and scaling.
Key Takeaways
Recommended stack: WebRTC or telephony transport + streaming STT/LLM/TTS pipeline + semantic turn detection; LiveKit Agents and Pipecat are the safest open-source starting points.
Sub-300 ms LLM TTFT and sub-200 ms TTS first-byte are the practical latency targets that change conversation feel in production.
Pure acoustic VAD is insufficient; pair Silero VAD with a semantic end-of-utterance model like LiveKit’s SmolLM-based turn detector.
Ultravox (fixie-ai) skips the separate ASR stage entirely for ~150 ms TTFT; Moshi is the leading OSS full-duplex speech-to-speech model to study.
For managed time-to-first-call: Vapi, Retell, Bland. For self-hosted TTS on CPU: Kokoro 82M (Apache-licensed) or Piper for edge/offline.