I spent a day (~$100 in API credits) rebuilding the core orchestration loop of a real-time AI voice agent from scratch instead of using an all-in-one SDK. The hard part isn’t STT, LLMs, or TTS in isolation, but turn-taking: detecting when the user starts and stops speaking, cancelling in-flight generation instantly, and pipelining everything to minimize time-to-first-audio.
The write-up covers why VAD (voice activity detection) alone fails for real turn detection, how voice agents reduce to a minimal speaking/listening loop, why STT → LLM → TTS must be streaming rather than sequential, why time to first token (TTFT) matters more than raw model quality in voice, and why geography dominates latency. By colocating Twilio, Deepgram, ElevenLabs, and the orchestration layer, I reached ~790ms end-to-end latency, slightly faster than an equivalent Vapi setup.
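To make the speaking/listening loop concrete, here is a minimal asyncio sketch of the shape described above: stream transcript events in, cancel any in-flight generation the moment the user starts talking (barge-in), and pipeline LLM → TTS so playback starts on the first chunk. The `transcript_events`, `generate_reply`, and `speak` helpers are hypothetical stand-ins, not the actual Deepgram/ElevenLabs clients used in the project.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    text: str
    is_final: bool        # STT marked this as an end-of-turn candidate
    speech_started: bool  # user began speaking again (barge-in signal)


async def transcript_events():
    """Hypothetical async stream of STT events (e.g. from a streaming STT socket)."""
    yield TranscriptEvent("hello there", is_final=True, speech_started=False)


async def generate_reply(user_text: str):
    """Hypothetical streaming LLM call yielding text chunks as they arrive."""
    for chunk in ("Hi! ", "How can I help?"):
        await asyncio.sleep(0.05)
        yield chunk


async def speak(text_chunks):
    """Hypothetical streaming TTS: start audio as soon as the first chunk lands."""
    async for chunk in text_chunks:
        print(f"[tts] {chunk}")


async def agent_loop():
    speaking_task: asyncio.Task | None = None
    async for event in transcript_events():
        # Barge-in: the instant the user starts talking, cancel any
        # in-flight generation/playback so the agent stops talking over them.
        if event.speech_started and speaking_task and not speaking_task.done():
            speaking_task.cancel()
        # End of the user's turn: kick off LLM -> TTS as one pipelined task,
        # so audio begins as soon as the first LLM tokens stream back.
        if event.is_final and event.text.strip():
            speaking_task = asyncio.create_task(speak(generate_reply(event.text)))
    if speaking_task:
        await speaking_task


asyncio.run(agent_loop())
```

The point of structuring it this way is that cancellation is a single `Task.cancel()` on the in-flight pipeline, and time-to-first-audio is bounded by the first LLM chunk rather than the full response.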