Forward-Deployed Engineering

Where every millisecond goes.

From the caller's last syllable to the agent's first audio sample: VAD, STT, the workflow LLM bundle, prefill and prefix caching, vLLM serving, the cloud path, TTS, and the telephony legs no trace ever shows. Each stage with its real cost, its file:line, and the symptom it produces.

Grounded in pipecat-agent@dev (5b29206c) + freya-dashboard@staging-eu. 12 parts. Companion to guide.md.

The one-turn latency equation
felt latency ≈ stop_secs                                  VAD silence confirmation
             + max(stt_final − vad_stop, waitSeconds)     turn-close gate; STT overlaps the wait
             + max(extraction, intent_match) [+ g4 eval]  workflow routing bundle
             + speaking-LLM TTFT                          prefill: warm ~0.2s, cold seconds
             + first-sentence generation                  decode to the first sentence boundary
             + TTS TTFB                                   first audio byte
             + ~40ms transport chunking
             [+ ~150–300ms telephony legs not in any trace]

Every part of this guide tunes exactly one term of that equation.
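As a rough illustration, the equation can be written as a small function. This is a sketch, not code from pipecat-agent: the class and field names are hypothetical and simply mirror the terms above, and the example numbers are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class TurnTimings:
    """One turn's terms, in seconds. Field names are hypothetical."""
    stop_secs: float                 # VAD silence confirmation
    stt_final_after_vad_stop: float  # stt_final - vad_stop
    wait_seconds: float              # waitSeconds floor of the turn-close gate
    extraction: float
    intent_match: float
    speaking_ttft: float             # prefill: warm ~0.2 s, cold seconds
    first_sentence: float            # decode to the first sentence boundary
    tts_ttfb: float                  # TTS time to first audio byte
    g4_eval: float = 0.0             # optional term, 0 when it does not run
    transport: float = 0.040         # ~40 ms transport chunking
    telephony: float = 0.0           # ~0.15-0.30 s, absent from every trace


def felt_latency(t: TurnTimings) -> float:
    # STT finalization overlaps the wait, so only the longer of the two counts.
    turn_close = max(t.stt_final_after_vad_stop, t.wait_seconds)
    # Extraction and intent match run concurrently; g4 eval adds on top.
    routing = max(t.extraction, t.intent_match) + t.g4_eval
    return (t.stop_secs + turn_close + routing + t.speaking_ttft
            + t.first_sentence + t.tts_ttfb + t.transport + t.telephony)


# Illustrative values only, not measurements.
example = TurnTimings(
    stop_secs=0.2, stt_final_after_vad_stop=0.25, wait_seconds=0.3,
    extraction=0.2, intent_match=0.3, speaking_ttft=0.2,
    first_sentence=0.3, tts_ttfb=0.2,
)
print(f"felt latency ~ {felt_latency(example):.2f} s")
```

The two `max` terms are why speeding up an overlapped stage can leave the total unchanged: in this example, cutting STT finalization from 0.25 s to 0.10 s changes nothing, because the waitSeconds floor of 0.3 s still sets the turn-close gate.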

One turn on a clock

A composite of a typical on-prem turn (KKB Gemma 4 31B, warm cache, no node transition). Each lane is a stage; lanes that share a column run concurrently. STT finalize hides behind the waitSeconds floor; extraction hides behind intent-match.

Figure: turn waterfall, caller stops → agent first audio. Lanes: VAD / endpointing, STT, LLM (route + speak), TTS, Transport. Ghost bars mark overlapped work that is not on the critical path.

Felt total ≈ 1.6–1.9 s to first audio, slightly above the production p50 of ~1.5 s and below the p90 of ~2.2 s. The single felt-latency metric is turn.user_bot_latency_seconds (VADUserStoppedSpeaking → BotStartedSpeaking).

The path of one turn

Every contributor in this guide lives at exactly one stage of this path.

The per-turn budget table

The platform's shared definition of "healthy", kept identical in the latency-analyzer and debug-call-audio skills. Expert-set targets, not measured percentiles — use as a sanity check, not a hard rule. The user-perceived gap is the Total; the sub-segments tell you who to blame.

Segment                                          Healthy median   Suspect above
user_stop → vad_end (endpointing)                200–350 ms       >500 ms
vad_end → stt_final (STT)                        150–400 ms       >800 ms
stt_final → llm_first_tok (routing + speaking)   300–700 ms       >1.2 s
llm_first_tok → tts_first_aud (TTS)              100–300 ms       >600 ms
Total: user-stop → agent-audio                   0.8–1.7 s        >2.5 s

Production reference on KKB on-prem (Gemma 4 31B, 4×H100): felt per-turn p50 ~1.5 s, p90 ~2.2 s, max 3.6 s.
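A minimal sketch of how the budget table might be applied to one turn's measured segments. The dictionary keys, function names, and the "watch" label for values between the healthy range and the suspect threshold are assumptions made for illustration; they are not taken from the latency-analyzer or debug-call-audio skills.

```python
# (healthy_low_ms, healthy_high_ms, suspect_above_ms), copied from the table.
BUDGET_MS = {
    "endpointing": (200, 350, 500),         # user_stop -> vad_end
    "stt": (150, 400, 800),                 # vad_end -> stt_final
    "routing_speaking": (300, 700, 1200),   # stt_final -> llm_first_tok
    "tts": (100, 300, 600),                 # llm_first_tok -> tts_first_aud
    "total": (800, 1700, 2500),             # user-stop -> agent-audio
}


def classify(segment: str, observed_ms: float) -> str:
    """Return 'healthy', 'watch', or 'suspect' for one segment."""
    _low, high, suspect = BUDGET_MS[segment]
    if observed_ms > suspect:
        return "suspect"
    if observed_ms <= high:
        return "healthy"  # at or below the healthy range
    return "watch"        # above healthy, not yet suspect (hypothetical label)


def who_to_blame(turn_ms: dict[str, float]) -> list[str]:
    """List the sub-segments (not the total) that exceed their threshold."""
    return [name for name, value in turn_ms.items()
            if name != "total" and classify(name, value) == "suspect"]


# Illustrative turn, not a real trace.
turn = {"endpointing": 300, "stt": 900, "routing_speaking": 1000,
        "tts": 250, "total": 2450}
for name, value in turn.items():
    print(f"{name:18s} {value:6.0f} ms  {classify(name, value)}")
print("blame:", who_to_blame(turn))
```

Because the targets are expert-set rather than measured percentiles, a result from a check like this is a prompt to look at the segment, not a verdict.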

The twelve parts

PART 1

Anatomy of one turn

The exact serial path, the three overlap windows, and the turn-close formula.

5 steps
PART 2

VAD & endpointing: the dead-air tax

stop_secs, waitSeconds, smart turn, the backstops. 100% policy, 0% compute.

5 steps
PART 3

STT: finalization, streaming, and the filters in front of it

Batch vs streaming, finalize timeout, and the noise-cancel hop inside VAD.

PART 4

The LLM stage: one turn fires a bundle of calls

Extraction, routing, speaking, guardrails: the calls that make up the bundle.

PART 5

Prefill vs decode, prefix caching, and warmups

Why TTFT swings from 0.2 s to seconds, and how the warmup tasks hide it.

PART 6

On-prem vLLM serving: capacity math and the Garanti case study

KV cache sizing, eviction curve, and the Garanti cold-prefill case study.

PART 7

The cloud provider path

Routing remaps, priority processing, and what changes off the on-prem box.

PART 8

Prompt & workflow design: the author sets the floor

Prompt size, node structure, conditions that serialize the bundle.

PART 9

TTS: from first token to first audio

Sentence aggregation, TTFB, and the trailing-silence pad.

PART 10

Network & telephony: the invisible 150–300 ms

The legs that never appear in any trace but that the caller still feels.

PART 11

The long tail: API calls, fillers, guardrails, voicemail, DTMF, transfers, call start

PART 12

Measurement, and the symptom → fix playbook

The span-by-span decomposition map and what to change for each complaint.
