Freya End-to-End Latency — Field Engineer's Deep Dive

One turn on a clock

A composite of a typical on-prem turn (KKB Gemma 4 31B, warm cache, no node transition). Each lane is a stage; lanes that share a column run concurrently. STT finalize hides behind the waitSeconds floor, and extraction hides behind intent-match, so neither adds to the critical path.
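The overlap rule above can be sketched as a small critical-path sum: stages that run one after another add up, while stages that run concurrently cost only as much as the slower one. This is a minimal sketch, not the platform's own calculation. The function name `felt_ms` and every duration in the example are hypothetical placeholders, not measurements.

```python
from typing import Dict, List


def felt_ms(serial: Dict[str, float], overlapped: List[Dict[str, float]]) -> float:
    """Critical-path latency in milliseconds.

    serial:     stages that run strictly one after another.
    overlapped: groups of stages that run concurrently; each group
                contributes only its slowest member.
    """
    serial_cost = sum(serial.values())
    overlap_cost = sum(max(group.values()) for group in overlapped)
    return serial_cost + overlap_cost


# Hypothetical durations, chosen only to illustrate the arithmetic.
example_serial = {"tts_first_audio": 200, "transport": 100}
example_overlapped = [
    {"stt_finalize": 250, "wait_floor": 400},    # STT finalize hides behind the floor
    {"extraction": 300, "intent_match": 500},    # extraction hides behind intent-match
]

example_total = felt_ms(example_serial, example_overlapped)
```

With these placeholder numbers the hidden stages (250 ms and 300 ms) do not appear in the result at all. Shortening them would not change what the caller feels unless they grew past the stage they hide behind.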

Turn waterfall — caller stops → agent first audio

Legend: VAD / endpointing, STT, LLM (route + speak), TTS, Transport. Ghost bars = overlapped (not on critical path).

Felt total ≈ 1.6–1.9 s to first audio, in line with the production p50 of ~1.5 s. The single felt-latency metric is turn.user_bot_latency_seconds, measured from VADUserStoppedSpeaking to BotStartedSpeaking.
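As an illustration of how that metric is derived, the sketch below pairs each VADUserStoppedSpeaking event with the next BotStartedSpeaking event and reports the gap in seconds. The event log format (a list of name and timestamp pairs) and the function name `user_bot_latencies` are assumptions made for this example. Using the most recent stop before the bot speaks is also an assumption, and the platform may pair events differently.

```python
from typing import Iterable, List, Tuple


def user_bot_latencies(events: Iterable[Tuple[str, float]]) -> List[float]:
    """Return one latency (seconds) per turn: user stop -> bot start.

    events: (event_name, timestamp_seconds) pairs in chronological order.
    """
    latencies: List[float] = []
    stop_ts = None
    for name, ts in events:
        if name == "VADUserStoppedSpeaking":
            # Assumption: if the user resumes and stops again, the latest stop wins.
            stop_ts = ts
        elif name == "BotStartedSpeaking" and stop_ts is not None:
            latencies.append(ts - stop_ts)
            stop_ts = None  # the turn is closed; wait for the next user stop
    return latencies


# Hypothetical event log for two turns.
sample_events = [
    ("VADUserStoppedSpeaking", 10.0),
    ("BotStartedSpeaking", 11.5),
    ("VADUserStoppedSpeaking", 20.0),
    ("VADUserStoppedSpeaking", 22.0),  # user resumed, then stopped again
    ("BotStartedSpeaking", 23.75),
]
sample_latencies = user_bot_latencies(sample_events)
```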

The path of one turn

Every contributor in this guide lives at exactly one of these stages.

The per-turn budget table

This is the platform's shared definition of "healthy", kept identical in the latency-analyzer and debug-call-audio skills. The targets are expert-set, not measured percentiles, so use them as a sanity check, not a hard rule. The user-perceived gap is the Total row; the sub-segments tell you who to blame.

Segment (stage)	Healthy median	Suspect above
user_stop → vad_end (endpointing)	200–350 ms	>500 ms
vad_end → stt_final (STT)	150–400 ms	>800 ms
stt_final → llm_first_tok (routing + speaking)	300–700 ms	>1.2 s
llm_first_tok → tts_first_aud (TTS)	100–300 ms	>600 ms
Total: user-stop → agent-audio	0.8–1.7 s	>2.5 s
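To show how the table might be applied to a single turn, here is a minimal sketch that encodes the thresholds and labels each segment. The dictionary keys, the function names, and the middle label "borderline" (for values above the healthy range but not yet past the suspect line) are inventions for this example; the table itself defines only the healthy range and the suspect threshold. The sketch also assumes the four segments chain end to end, so their sum stands in for the Total row.

```python
from typing import Dict, Tuple

# (healthy_low_ms, healthy_high_ms, suspect_above_ms), copied from the budget table.
BUDGET_MS: Dict[str, Tuple[int, int, int]] = {
    "endpointing": (200, 350, 500),   # user_stop -> vad_end
    "stt": (150, 400, 800),           # vad_end -> stt_final
    "llm": (300, 700, 1200),          # stt_final -> llm_first_tok
    "tts": (100, 300, 600),           # llm_first_tok -> tts_first_aud
    "total": (800, 1700, 2500),       # user-stop -> agent-audio
}


def classify(segment: str, ms: float) -> str:
    """Label one segment duration against the budget."""
    _low, high, suspect = BUDGET_MS[segment]
    if ms > suspect:
        return "suspect"
    if ms > high:
        return "borderline"  # hypothetical label: above healthy, below suspect
    return "healthy"


def review_turn(segments_ms: Dict[str, float]) -> Dict[str, str]:
    """Classify each segment, plus the total as the sum of the segments."""
    verdicts = {name: classify(name, ms) for name, ms in segments_ms.items()}
    verdicts["total"] = classify("total", sum(segments_ms.values()))
    return verdicts


# Hypothetical turn: STT is slow, everything else is within range.
sample_verdicts = review_turn({"endpointing": 300, "stt": 900, "llm": 650, "tts": 250})
```

In this made-up turn the total (2100 ms) is only borderline, yet the per-segment view still singles out STT, which is the point of keeping the sub-segments alongside the Total.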