The prefill-bound shape, with numbers LLM
What it is. Every turn's LLM time splits into two phases. Prefill reads the
whole prompt and builds its KV cache (cost scales with input tokens). Decode
then emits the reply one token at a time (cost scales with output tokens). When the
prompt is 15–30K tokens and the spoken reply is ~50–60 tokens, prefill dominates —
the model is not slow, the prompt is huge (latency-analyzer SKILL.md:54-66).
Prompt shape. Garanti borc_sorgulama speaking prompt: 14.6–15.8K
input tokens per turn. Anadolu dogrulama / anket: 18.8K / 30.1K
(freya-ops latency-benchmark/workflow-fixtures/anadolu-survey/meta.json).
Warm vs cold. Same prompt on KKB: a warm turn is 0.33–0.96 s total with
TTFT 0.15–0.27 s; a cold (cache-miss) turn is 2.6–4.3 s. "The 3–4 s
delta is pure prompt prefill" (case study report.md:28-30).
- Per-1K cold prefill cost: loaded KKB box ~0.25–0.28 s/1K; idle KKB ~0.155 s/1K
(3,774 ms for a 24.4K prompt); cloud GPT-4o ~15 ms/1K
(
analysis/llm-latency-benchmark/4-prompt-size: 809 → 22,752 tok = +330 ms). A single on-prem GPU pair pays prefill 10–20× more per token than a cloud fleet — bloat that is invisible in cloud testing becomes the dominant on-prem cost. - Decode side: Gemma 31B decodes ~53 tok/s (freya-onprem
gpu/llm/gemma-4-31b/README.md:216), TPOT 19–21 ms/tok. A 57-token reply is ~1.1 s of decode, but TTS streams per sentence so only time-to-first-sentence (~480–620 ms at C=1) is felt.
The verdict taxonomy (latency-analyzer references/interpreting.md:23-35):
PREFILL-BOUND (median in ≥8K, out ≤200) → cache + prompt-size problem;
DECODE-BOUND (out ≥300) → shorten spoken turns, caching can't help;
neither → suspect GPU contention.
Symptom it causes/fixes: "the agent is slow every turn" on a big-prompt on-prem node = prefill-bound. The fix is the prompt, not the GPU. Inline: a 24.4K cold prefill is +3,774 ms on an idle box.
On the clock
Try it — prefill-cost estimator
Set the prompt size and the box, then toggle the warm prefix cache. Warm collapses prefill to TTFT (~150–270 ms); cold pays the full per-1K rate.
Prefix caching is keyed on token IDs, byte-exactness included LLM
What it is. vLLM reuses cached KV for the longest matching token prefix and prefills only the divergent suffix. Two design-level consequences the agent codes around:
- Bytes must match. The warmup mirrors production tool serialization because "vLLM keys
its prefix cache on token IDs, so the bytes the chat template renders for
toolsmust match the production request… otherwise the cache chain breaks at the first divergent block" (pipecat-agentsrc/core/boot_steps.py:1523-1527). Dynamic placeholders (system.uuid4()) are skipped so non-determinism doesn't get baked into the warmed prefix (boot_steps.py:1532-1543). - Node hops are step functions. Within a node, ~95% of the prompt is prefix-cacheable across turns (history grows as a cache-friendly suffix); a node transition swaps the big system block and re-pays a large prefill all at once. The prompt-size curve an author actually experiences is not the gentle warm slope (~3–15 ms/1K) — it is "free until the prefix is cold, then ~0.13–0.28 s per 1K tokens all at once."
Symptom it causes/fixes: one slow turn right after a node transition, then fast turns again. That is the cache step function, not a flaky GPU.
Try it — node-hop cache simulator
Each hop swaps the system block. With an entry message, the per-node warmup pre-pays the cold prefill during playback (hidden). With the entry message off, a silent hop eats the ~3.8 s cold prefill on turn 1.
The warmup machinery: pre-paying prefill off the critical path LLM
What it is. Two paths, both firing real 1-token inferences through the production adapter so tokenization matches.
- Boot-time warmup (single-prompt agents).
warmup_llm_cacheis a serializable Wave-2 boot step that runs at ringtime in the telephony prefetch: "by the time the websocket connects, vLLM has already cached the (huge) system-prompt prefix" (boot_steps.py:1691-1758,1696-1705). It sendssystem → [assistant(greeting)] → user(".")withmax_tokens=1(boot_steps.py:1666-1676). Workflow agents skip this entirely (boot_steps.py:1735-1737). - Per-node warmup (workflow agents). On entering a node with a playback window
(TTS / Audio / LLM entry message),
_schedule_entry_warmup(processor.py:2788-2866) fires up to three parallel background primes: speaking prompt+tools (llm_warmup.node_entry,max_tokens=1,processor.py:2868-2936), router (llm_warmup.router,max_tokens 512— 1 token can't form valid JSON,router.py:232-234), extractor (llm_warmup.extractor,max_tokens 256,extractor.py:262-293). Terminal nodes are skipped. All best-effort; failures never block the call.
workflow_latency_benchmark_20260525_*, warmup_node_entry rows). The
entry-message playback is the budget the warmup hides inside.The once-per-node dedupe (the load-bearing limitation) limitation
What it is. self._warmed_nodes: set[str] is a per-call dedupe
(processor.py:274-275): _schedule_entry_warmup returns early if the
node was already warmed (processor.py:2814-2815).
There is no code path that re-warms a node mid-call.
- After a mid-call cache eviction (the Garanti case study's turns 4–5) nothing restores the prefix — the next real user turn pays the full cold prefill.
- The silent-hop foot-gun: a node without an entry message gets no warmup window
at all (
processor.py:2805-2813) — its first live turn eats the ~3.8 s cold prefill.
Symptom it causes/fixes: a node that is normally fast suddenly costs +3.8 s once mid-call — an eviction the once-per-node dedupe refuses to re-cover. Authoring fix: keep prompts small enough to survive eviction, and give every non-terminal node an entry message.
Warmup contention, and the anti-pattern anti-pattern
What it is. The warmup is itself a 15–30K-token prefill competing for the same GPU
as live routing / extraction / speaking calls. When it overruns the entry message (or the user
answers fast), it collides: routing >2 s fat tails = collision, not the routing prompt
(interpreting.md:37-41). At benchmark concurrency 8, warmup TTFT stretched to
p50 3,268 / p95 4,669 ms while live speaking totals hit p95 4,169 ms in the same cells.
references/serving-config.md:106-108; case study report.md:70-72).
The durable fix is shrinking the prompt.Symptom it causes/fixes: routing latency with a fat p95 tail under load, while the routing prompt itself is small. That is warmup/serving contention, not routing cost.
Cache-busting variables in prompts cache-buster
What it is. task_message, role_message,
loop_prompt, and the Global Prompt all run {{variable}} substitution
every turn (src/core/workflow/workflow.py:522-528,600-619). Two authored patterns
silently destroy the cache:
- Turn counters.
<node>.current_turns/<node>.total_turnsupdate every turn (workflow.py:573-591); referencing them in the system/task text changes the prefix every turn → full re-prefill from the divergence point. system.uuid4(). Fresh hex per substitution (workflow.py:616); guarantees a per-turn cache miss anywhere in a prompt or tool description.
Same logic for any variable that changes mid-node. Also note the Loop Prompt
(loop_prompt) replaces the Task from the first loop onward
(workflow.py:526-527,549) — activating it is a mid-node prefix swap costing
one fresh prefill.
Symptom it causes/fixes: a node that should be warm yet is slow on every turn = a cache-buster in the prompt. Move per-turn state to a trailing context block or drop it.
Try it — cache-buster demo
Pick what to inject into the node's
task text. A static line stays cacheable; a turn counter or uuid4() changes every
turn, so vLLM re-prefills everything after the divergence point.
Checkpoint: an author adds "Soru {{anket.current_turns}}/10" to the anket node's task message to help the model track progress. What happens to latency?
Every turn becomes a cold prefill of everything after that token. The turn counter changes
each turn (workflow.py:573-591), so the system prefix diverges at that point and
vLLM re-prefills the remaining ~30K-token prompt — roughly +3–5 s per turn
on the loaded KKB box — and the once-per-node warmup can't help because the warmed bytes
never match. Track progress in a trailing context block or not at all.
What dominates here
The cold-prefill step function. Warm turns cost ~0.2 s of TTFT; one eviction or one un-warmed node hop costs 3–4 s. Everything in Parts 6 and 8 is about keeping the step from firing.
_schedule_entry_warmup and self._warmed_nodes are read in pipecat-agent
processor.py, and confirm there is no mid-call re-warm path."