One turn on a clock
A composite of a typical on-prem turn (KKB Gemma 4 31B, warm cache, no node transition). Each lane is a stage; lanes that share a column run concurrently. STT finalization hides behind the waitSeconds floor, and extraction hides behind intent-match.
Felt total ≈ 1.6–1.9 s to first audio, close to the production p50 of ~1.5 s. The single felt-latency metric is turn.user_bot_latency_seconds, measured from VADUserStoppedSpeaking to BotStartedSpeaking.
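As a rough illustration of what that metric measures, the sketch below derives one felt latency per turn from a list of timestamped events. Only the two event names come from this guide. The `(name, timestamp)` tuple format, the function name, and the rule of pairing each BotStartedSpeaking with the most recent VADUserStoppedSpeaking are assumptions made for the example, not the platform's actual implementation.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical event record: (event name, timestamp in seconds).
Event = Tuple[str, float]


def user_bot_latencies(events: Iterable[Event]) -> List[float]:
    """Return one felt latency per turn, in seconds.

    Assumption: a turn's latency runs from the most recent
    VADUserStoppedSpeaking to the next BotStartedSpeaking.
    """
    latencies: List[float] = []
    last_stop: Optional[float] = None
    for name, ts in sorted(events, key=lambda e: e[1]):
        if name == "VADUserStoppedSpeaking":
            # A later stop replaces an earlier one (the user resumed speaking).
            last_stop = ts
        elif name == "BotStartedSpeaking" and last_stop is not None:
            latencies.append(ts - last_stop)
            last_stop = None  # The turn is closed; wait for the next stop.
    return latencies


# Made-up timestamps, for illustration only.
sample_events = [
    ("VADUserStoppedSpeaking", 10.0),
    ("BotStartedSpeaking", 11.5),
    ("VADUserStoppedSpeaking", 20.0),
    ("VADUserStoppedSpeaking", 21.0),
    ("BotStartedSpeaking", 22.75),
]
sample_latencies = user_bot_latencies(sample_events)
```

With these invented timestamps the second turn is measured from the later stop at 21.0 s, since under this sketch's pairing rule the earlier stop was superseded when the user resumed speaking.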
The path of one turn
Click a stage to see which part owns it. Every contributor in this guide lives at exactly one of these stages.
The per-turn budget table
The platform's shared definition of "healthy", kept identical in the latency-analyzer and debug-call-audio skills. These are expert-set targets, not measured percentiles, so use them as a sanity check rather than a hard rule. The user-perceived gap is the Total row; the sub-segments tell you which stage to blame.
| Segment (stage) | Healthy median | Suspect above |
|---|---|---|
| user_stop → vad_end (endpointing) | 200–350 ms | >500 ms |
| vad_end → stt_final (STT) | 150–400 ms | >800 ms |
| stt_final → llm_first_tok (routing + speaking) | 300–700 ms | >1.2 s |
| llm_first_tok → tts_first_aud (TTS) | 100–300 ms | >600 ms |
| Total: user-stop → agent-audio | 0.8–1.7 s | >2.5 s |
Production reference on KKB on-prem (Gemma 4 31B, 4×H100): felt per-turn p50 ~1.5 s, p90 ~2.2 s, max 3.6 s.
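To make the budget table easier to apply, here is a minimal sketch that classifies one turn's segment durations against it. The thresholds are copied from the table. The dictionary keys, the function names, the intermediate "watch" label for values between the healthy range and the suspect threshold, and the choice to compute the total as the sum of the four segments are all assumptions of this example, not part of the latency-analyzer or debug-call-audio skills.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Budget:
    healthy_lo_ms: float
    healthy_hi_ms: float
    suspect_above_ms: float


# Thresholds from the per-turn budget table; the key names are hypothetical.
BUDGET: Dict[str, Budget] = {
    "endpointing": Budget(200, 350, 500),        # user_stop -> vad_end
    "stt": Budget(150, 400, 800),                # vad_end -> stt_final
    "routing_speaking": Budget(300, 700, 1200),  # stt_final -> llm_first_tok
    "tts": Budget(100, 300, 600),                # llm_first_tok -> tts_first_aud
    "total": Budget(800, 1700, 2500),            # user-stop -> agent-audio
}


def classify(segment: str, ms: float) -> str:
    """Return 'healthy', 'watch', or 'suspect' for one segment duration."""
    budget = BUDGET[segment]
    if ms > budget.suspect_above_ms:
        return "suspect"
    if ms <= budget.healthy_hi_ms:
        # Faster than the healthy range is also treated as healthy here.
        return "healthy"
    return "watch"  # Assumed label for the gap the table leaves unnamed.


def diagnose(turn_ms: Dict[str, float]) -> Dict[str, str]:
    """Classify every segment, plus a total taken as the sum of segments."""
    report = {name: classify(name, ms) for name, ms in turn_ms.items()}
    report["total"] = classify("total", sum(turn_ms.values()))
    return report


# Made-up durations for one turn, in milliseconds.
example_turn = {"endpointing": 300, "stt": 900, "routing_speaking": 650, "tts": 250}
example_report = diagnose(example_turn)
```

In this invented turn the total (2100 ms) sits between the healthy range and the suspect threshold, while the STT segment alone crosses its own threshold. That is the pattern the table is meant to expose: the Total shows that something is off, and the sub-segment points at the stage.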
The twelve parts
Anatomy of one turn
The exact serial path, the three overlap windows, and the turn-close formula.
VAD & endpointing: the dead-air tax
stop_secs, waitSeconds, smart turn, the backstops. 100% policy, 0% compute.
STT: finalization, streaming, and the filters in front of it
Batch vs streaming, finalize timeout, and the noise-cancel hop inside VAD.
The LLM stage: one turn fires a bundle of calls
Extraction, routing, speaking, guardrails — one turn fires several LLM calls.
Prefill vs decode, prefix caching, and warmups
Why TTFT swings from 0.2 s to seconds, and how the warmup tasks hide it.
On-prem vLLM serving: capacity math and the Garanti case study
KV cache sizing, eviction curve, and the Garanti cold-prefill case study.
The cloud provider path
Routing remaps, priority processing, and what changes off the on-prem box.
Prompt & workflow design: the author sets the floor
Prompt size, node structure, conditions that serialize the bundle.
TTS: from first token to first audio
Sentence aggregation, TTFB, and the trailing-silence pad.
Network & telephony: the invisible 150–300 ms
The legs that never appear in any trace but that the caller still feels.
The long tail: API calls, fillers, guardrails, voicemail, DTMF, transfers, call start
API calls, fillers, guardrails, voicemail, DTMF, transfers, call start.
Measurement, and the symptom → fix playbook
The span-by-span decomposition map and what to change for each complaint.