Part 8 — Prompt & workflow design · Freya End-to-End Latency

Step 40

Anatomy of the speaking prompt LLM

What it is. Every turn the processor rebuilds the system instruction. compose_system_instruction() = Global Prompt (system_prompt) + optional per-node Role (role_message) + "## Current Task\n" + the node's Task (task_message), all with {{variable}} substitution. On top go the full conversation history and the node's tools.

workflow.py:537–552 set on service processor.py:3447–3481 chars→tok ~2.7 c/tok (TR)

Runtime effect. The authored characters silently become per-turn prefill tokens. Real sizes from an anonymized production Anadolu fixture (freya-ops latency-benchmark/workflow-fixtures/anadolu-survey/meta.json):

Node	System prompt chars	Observed input tokens (system + history + tools)	Tools
`dogrulama` (identity check)	54,238	18,814	1
`anket` (survey)	77,136	30,108	2 (incl. the full survey decision-tree in one tool description)

The usual authored-bloat offenders, named by the latency-analyzer skill (SKILL.md:155–158): an oversized task_message (hand-built rules tables, STT-mishear maps, decision trees) or a verbose tool schema carried on every turn.

Symptom it causes/fixes: 50–77K characters of authored text become 19–30K tokens of per-turn prefill ~3.8 s cold on a node hop. The Task message and tool schemas dominate.

Try it — prompt-size cost estimator

Paste your authored text; convert at 2.7 chars/tok (TR)

Task message (task_message) — rules tables, decision trees Tool schema JSON (serialized into every speaking request) KV budget (concurrent prompt tokens the cache can hold)

150000

authored chars: 0 authored tok: 0 per-turn total (incl. ~6K base system): 0 tok warm TTFT delta (~15 ms/1K cloud): +0 ms cold prefill (0.15–0.28 s/1K on-prem): +0 s prompts that fit the KV budget: 0

Paste a task message and a tool schema.

Step 41

Tool schemas and transitions are token dials LLM

Tools are serialized into every speaking request and into the warmup (processor.py:2701–2782; boot_steps.py:1519–1574). The anket survey_fill_standard tool carries the entire survey decision-tree plus a worked example inside its description. Every tool-call transition adds one more FunctionSchema.
Transitions feed the routing call twice: ~260 input tokens per transition on the dogrulama fixture (2,627 tok / 10 transitions) and ~5–6 output tokens of guided JSON each (~60–120 ms decode at this stack's TPOT). Cutting transitions and tightening intent descriptions shrinks both.

Symptom it causes/fixes: a routing call whose decode grows with every transition; a tool description bloated with worked examples that rides every single speaking turn.

Step 42

Entry messages gate warmups; near-duplicate nodes multiply cold prefixes LLM

Two structural rules that fall straight out of Part 5:

Every node reachable mid-conversation should carry an entry message (TTS/Audio/LLM-prompt) long enough to cover ~1–4 s of warmup; a silent hop into a 30K node puts the full ~3.8 s cold prefill on the user's next turn (processor.py:2805–2813).
Merge near-identical nodes. Ten per-question nodes sharing 95% of their prompt are ten distinct 20–30K prefixes competing for KV residency plus ten warmup spends; one node with a question variable in a trailing position is one prefix.

Why "trailing position" mattersPut the per-question variable at the end of the prompt so the shared prefix stays byte-identical and stays cached — a variable near the top rewrites the prefix and forces a miss (see Step 43 and Part 7's trailing-context-block rule).

Try it — turn-timeline composer (edit the workflow, watch the felt latency)

Authoring dials → perceived per-turn latency

Transitions on this node (each ~260 in-tok + a few decode tok)

Extraction groups (concurrent — share extraction_prompt to merge)

Speaking-prompt size (system + task + tools + history)

18.0K tok

Cold prefix (silent hop / cache evicted) Variable-dependent condition (serial g4)

Extraction wait Routing / speaking LLM TTS first sentence

perceived turn latency: 0 s

Move a dial.

Step 43

History growth and truncation LLM

History is usually NOT the prefill problem — node-prompt swaps are — but the levers matter on long calls:

Node-level: Compaction Mode (truncation_strategy, enum keep_all | keep_recent | summarize; default KEEP_ALL, window default 10 — parser.py:273–276). KEEP_RECENT keeps system prefix + last N messages (processor.py:3507–3509); SUMMARIZE compresses older messages via an extra LLM call on the extractor client (processor.py:3511–3555) — the summarize strategy itself costs one LLM round-trip on the turn it fires and rewrites the prefix (cache miss on the next speaking call). keep_recent is the cache-friendlier choice.
Agent-level: contextSummarization (llm.additional_settings.contextSummarization, defaults 8000/6000/20/4 — types.py:117–129), with a safety net that force-compacts at 90% of the model's context window down to 60% (base_service.py:134–139,151–263).
suppress_tool_results (per node) compacts bulky API payloads in history to status-only summaries before each speaking call (processor.py:3383–3445) — the lever for API-heavy flows.
Routing and extraction never see full history anyway: the default templates embed {{router.conversation.last-10}} / {{extractor.conversation.last-10}} (router.py:65, extractor.py:40), overridable per node via Routing Prompt (routing_prompt).

Net effect measured on production traces: ~+700 tokens over a 40-turn call once compaction is active (memory-derived, 2026-05-25 — not re-verified in a repo artifact).

Try it — node-hop cache simulator (with a loop-prompt activation event)

Hop between nodes and fire the loop prompt; watch the prefix-swap cost

Silent hops eat the cold prefill on turn 1; entering via an entry message hides the warmup behind playback; activating the Loop Prompt replaces the Task mid-node, swapping the prefix and forcing a fresh prefill.

No events yet.

warm / hidden cold prefill / cache miss

Step 44

The shrink-levers checklist (priority order) LLM

What to do when the diagnosis says "the prompt is the problem":

Shrink the speaking prompt — target the task_message's rules tables / decision trees first.
Trim tool schemas (move worked examples out of descriptions).
Cut transitions and tighten intent descriptions.
Consolidate extraction groups (share extraction_prompt across same-type fields).
Keep intent conditions variable-free (g3 ∥ extraction instead of serial g4).
Turn on history compaction for long nodes (keep_recent over summarize); suppress_tool_results for API-heavy flows.
Give every reachable node an entry message; merge near-duplicate nodes.
Keep per-turn-changing variables out of system/task text (Step 28).

Checkpoint: cloud testing of a new workflow shows fine latency; the on-prem pilot of the same workflow feels sluggish on every node change. Nothing in the infra changed. Most likely cause?

Prompt size. Cloud prefill is ~15 ms per 1K tokens, so a 25K-token node prompt costs ~0.4 s extra and hides in the noise; the on-prem pair pays ~0.15–0.28 s per 1K — the same node hop is a 3.8–5 s cold prefill, and with several fat nodes the KV cache can't hold them all (Part 6 math).

The fix is authorial: shrink the prompts and merge nodes, not "faster GPUs" (processor.py:2805–2813).

What dominates hereThe Task message and tool schemas — the two places where 50–77K characters of authored text silently become 19–30K tokens of per-turn prefill.

Ask Claude Code: "Show me compose_system_instruction() in pipecat-agent src/core/workflow/workflow.py:537–552 and count the tokens of my node's task_message + tool schemas at 2.7 chars/tok."