Part 5 — Prefill, prefix cache, warmups · Freya End-to-End Latency

Step 23

The prefill-bound shape, with numbers LLM

What it is. Every turn's LLM time splits into two phases. Prefill reads the whole prompt and builds its KV cache (cost scales with input tokens). Decode then emits the reply one token at a time (cost scales with output tokens). When the prompt is 15–30K tokens and the spoken reply is ~50–60 tokens, prefill dominates — the model is not slow, the prompt is huge (latency-analyzer SKILL.md:54-66).

turn = prefill(in) + decode(out) speaking prompt 14.6–30.1K tok output ~50–60 tok Turkish ~2.7 chars/token

Prompt shape. Garanti borc_sorgulama speaking prompt: 14.6–15.8K input tokens per turn. Anadolu dogrulama / anket: 18.8K / 30.1K (freya-ops latency-benchmark/workflow-fixtures/anadolu-survey/meta.json).

Warm vs cold. Same prompt on KKB: a warm turn is 0.33–0.96 s total with TTFT 0.15–0.27 s; a cold (cache-miss) turn is 2.6–4.3 s. "The 3–4 s delta is pure prompt prefill" (case study report.md:28-30).

Per-1K cold prefill cost: loaded KKB box ~0.25–0.28 s/1K; idle KKB ~0.155 s/1K (3,774 ms for a 24.4K prompt); cloud GPT-4o ~15 ms/1K (analysis/llm-latency-benchmark/4-prompt-size: 809 → 22,752 tok = +330 ms). A single on-prem GPU pair pays prefill 10–20× more per token than a cloud fleet — bloat that is invisible in cloud testing becomes the dominant on-prem cost.
Decode side: Gemma 31B decodes ~53 tok/s (freya-onprem gpu/llm/gemma-4-31b/README.md:216), TPOT 19–21 ms/tok. A 57-token reply is ~1.1 s of decode, but TTS streams per sentence so only time-to-first-sentence (~480–620 ms at C=1) is felt.

The verdict taxonomy (latency-analyzer references/interpreting.md:23-35): PREFILL-BOUND (median in ≥8K, out ≤200) → cache + prompt-size problem; DECODE-BOUND (out ≥300) → shorten spoken turns, caching can't help; neither → suspect GPU contention.

Symptom it causes/fixes: "the agent is slow every turn" on a big-prompt on-prem node = prefill-bound. The fix is the prompt, not the GPU. Inline: a 24.4K cold prefill is +3,774 ms on an idle box.

On the clock

One on-prem turn: warm vs cold (24.4K-token node prompt)

prefill decode (first sentence)

Try it — prefill-cost estimator

Prefill vs decode budget

Set the prompt size and the box, then toggle the warm prefix cache. Warm collapses prefill to TTFT (~150–270 ms); cold pays the full per-1K rate.

Prompt tokens

15000

Per-1K prefill cost preset

Warm (cached) — prefix already in KV cache

prefill decode (57 tok @ 19 ms)

Adjust the inputs.

Step 24

Prefix caching is keyed on token IDs, byte-exactness included LLM

What it is. vLLM reuses cached KV for the longest matching token prefix and prefills only the divergent suffix. Two design-level consequences the agent codes around:

Bytes must match. The warmup mirrors production tool serialization because "vLLM keys its prefix cache on token IDs, so the bytes the chat template renders for tools must match the production request… otherwise the cache chain breaks at the first divergent block" (pipecat-agent src/core/boot_steps.py:1523-1527). Dynamic placeholders (system.uuid4()) are skipped so non-determinism doesn't get baked into the warmed prefix (boot_steps.py:1532-1543).
Node hops are step functions. Within a node, ~95% of the prompt is prefix-cacheable across turns (history grows as a cache-friendly suffix); a node transition swaps the big system block and re-pays a large prefill all at once. The prompt-size curve an author actually experiences is not the gentle warm slope (~3–15 ms/1K) — it is "free until the prefix is cold, then ~0.13–0.28 s per 1K tokens all at once."

Symptom it causes/fixes: one slow turn right after a node transition, then fast turns again. That is the cache step function, not a flaky GPU.

Try it — node-hop cache simulator

Click through a 4-node workflow

Each hop swaps the system block. With an entry message, the per-node warmup pre-pays the cold prefill during playback (hidden). With the entry message off, a silent hop eats the ~3.8 s cold prefill on turn 1.

Entry message on (gives a warmup window)

cached prefix (reused) re-prefilled system block warmup window (entry msg)

Click a node to hop.

Step 25

The warmup machinery: pre-paying prefill off the critical path LLM

What it is. Two paths, both firing real 1-token inferences through the production adapter so tokenization matches.

Boot-time warmup (single-prompt agents). warmup_llm_cache is a serializable Wave-2 boot step that runs at ringtime in the telephony prefetch: "by the time the websocket connects, vLLM has already cached the (huge) system-prompt prefix" (boot_steps.py:1691-1758,1696-1705). It sends system → [assistant(greeting)] → user(".") with max_tokens=1 (boot_steps.py:1666-1676). Workflow agents skip this entirely (boot_steps.py:1735-1737).
Per-node warmup (workflow agents). On entering a node with a playback window (TTS / Audio / LLM entry message), _schedule_entry_warmup (processor.py:2788-2866) fires up to three parallel background primes: speaking prompt+tools (llm_warmup.node_entry, max_tokens=1, processor.py:2868-2936), router (llm_warmup.router, max_tokens 512 — 1 token can't form valid JSON, router.py:232-234), extractor (llm_warmup.extractor, max_tokens 256, extractor.py:262-293). Terminal nodes are skipped. All best-effort; failures never block the call.

Cost being hiddenThe first-ever prefill of a ~24K node prompt is 3,774 ms TTFT; warm, the same warmup completes in 170–430 ms (freya-ops workflow_latency_benchmark_20260525_*, warmup_node_entry rows). The entry-message playback is the budget the warmup hides inside.

Step 26

The once-per-node dedupe (the load-bearing limitation) limitation

What it is. self._warmed_nodes: set[str] is a per-call dedupe (processor.py:274-275): _schedule_entry_warmup returns early if the node was already warmed (processor.py:2814-2815). There is no code path that re-warms a node mid-call.

After a mid-call cache eviction (the Garanti case study's turns 4–5) nothing restores the prefix — the next real user turn pays the full cold prefill.
The silent-hop foot-gun: a node without an entry message gets no warmup window at all (processor.py:2805-2813) — its first live turn eats the ~3.8 s cold prefill.

Symptom it causes/fixes: a node that is normally fast suddenly costs +3.8 s once mid-call — an eviction the once-per-node dedupe refuses to re-cover. Authoring fix: keep prompts small enough to survive eviction, and give every non-terminal node an entry message.

Step 27

Warmup contention, and the anti-pattern anti-pattern

What it is. The warmup is itself a 15–30K-token prefill competing for the same GPU as live routing / extraction / speaking calls. When it overruns the entry message (or the user answers fast), it collides: routing >2 s fat tails = collision, not the routing prompt (interpreting.md:37-41). At benchmark concurrency 8, warmup TTFT stretched to p50 3,268 / p95 4,669 ms while live speaking totals hit p95 4,169 ms in the same cells.

The anti-pattern: never "warm harder"Re-warming a 30K prompt every turn adds exactly the contention causing the problem (latency-analyzer references/serving-config.md:106-108; case study report.md:70-72). The durable fix is shrinking the prompt.

Symptom it causes/fixes: routing latency with a fat p95 tail under load, while the routing prompt itself is small. That is warmup/serving contention, not routing cost.

Step 28

Cache-busting variables in prompts cache-buster

What it is. task_message, role_message, loop_prompt, and the Global Prompt all run {{variable}} substitution every turn (src/core/workflow/workflow.py:522-528,600-619). Two authored patterns silently destroy the cache:

Turn counters. <node>.current_turns / <node>.total_turns update every turn (workflow.py:573-591); referencing them in the system/task text changes the prefix every turn → full re-prefill from the divergence point.
system.uuid4(). Fresh hex per substitution (workflow.py:616); guarantees a per-turn cache miss anywhere in a prompt or tool description.

Same logic for any variable that changes mid-node. Also note the Loop Prompt (loop_prompt) replaces the Task from the first loop onward (workflow.py:526-527,549) — activating it is a mid-node prefix swap costing one fresh prefill.

Symptom it causes/fixes: a node that should be warm yet is slow on every turn = a cache-buster in the prompt. Move per-turn state to a trailing context block or drop it.

Try it — cache-buster demo

Insert a per-turn token, watch the prefix diverge

Pick what to inject into the node's task text. A static line stays cacheable; a turn counter or uuid4() changes every turn, so vLLM re-prefills everything after the divergence point.

What to put in the task message

Position of the injected token in a 30K-token prompt

40%

Choose an injection.

Checkpoint: an author adds "Soru {{anket.current_turns}}/10" to the anket node's task message to help the model track progress. What happens to latency?

Every turn becomes a cold prefill of everything after that token. The turn counter changes each turn (workflow.py:573-591), so the system prefix diverges at that point and vLLM re-prefills the remaining ~30K-token prompt — roughly +3–5 s per turn on the loaded KKB box — and the once-per-node warmup can't help because the warmed bytes never match. Track progress in a trailing context block or not at all.

What dominates here

The cold-prefill step function. Warm turns cost ~0.2 s of TTFT; one eviction or one un-warmed node hop costs 3–4 s. Everything in Parts 6 and 8 is about keeping the step from firing.

Ask Claude Code: "Show me every place _schedule_entry_warmup and self._warmed_nodes are read in pipecat-agent processor.py, and confirm there is no mid-call re-warm path."