Part 8

Prompt & workflow design: the author sets the floor

No serving-side fix can undo a 30K-token node prompt. What the workflow author types — global prompt, task messages, tool schemas, transitions, extraction fields — directly sets the token count of every LLM request per turn, and therefore the prefill, the cache footprint, and the round-trip count.

Step 40

Anatomy of the speaking prompt LLM

What it is. Every turn the processor rebuilds the system instruction. compose_system_instruction() = Global Prompt (system_prompt) + optional per-node Role (role_message) + "## Current Task\n" + the node's Task (task_message), all with {{variable}} substitution. On top go the full conversation history and the node's tools.

Source: workflow.py:537–552 · set on service: processor.py:3447–3481 · chars→tok: ~2.7 c/tok (TR)
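The composition rule above can be sketched in a few lines. This is an illustrative reconstruction from the description only, not the code at workflow.py:537–552: the function name compose_system_instruction_sketch, the blank-line joiner between parts, and the choice to leave unknown {{variable}} names untouched are all assumptions.

```python
import re
from typing import Optional


def substitute(text: str, variables: dict) -> str:
    """Replace {{name}} placeholders; unknown names are left as-is (assumption)."""
    def repl(match: re.Match) -> str:
        return str(variables.get(match.group(1), match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", repl, text)


def compose_system_instruction_sketch(
    system_prompt: str,
    task_message: str,
    role_message: Optional[str] = None,
    variables: Optional[dict] = None,
) -> str:
    """Global Prompt + optional Role + '## Current Task\\n' + Task, then substitution."""
    parts = [system_prompt]
    if role_message:
        parts.append(role_message)
    parts.append("## Current Task\n" + task_message)
    # Joining with a blank line is an assumption; the real separator may differ.
    return substitute("\n\n".join(parts), variables or {})


instruction = compose_system_instruction_sketch(
    system_prompt="You are {{agent}}.",
    task_message="Ask for {{field}}.",
    variables={"agent": "Ada", "field": "the ID"},
)
print(len(instruction), "chars rebuilt for this turn")
```

Because the whole string is rebuilt every turn, every character in system_prompt, role_message and task_message is paid for again as prefill unless a cache covers it.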

Runtime effect. The authored characters silently become per-turn prefill tokens. Real sizes from an anonymized production Anadolu fixture (freya-ops latency-benchmark/workflow-fixtures/anadolu-survey/meta.json):

| Node | System prompt chars | Observed input tokens (system + history + tools) | Tools |
|---|---|---|---|
| dogrulama (identity check) | 54,238 | 18,814 | 1 |
| anket (survey) | 77,136 | 30,108 | 2 (incl. the full survey decision-tree in one tool description) |

The usual authored-bloat offenders, named by the latency-analyzer skill (SKILL.md:155–158): an oversized task_message (hand-built rules tables, STT-mishear maps, decision trees) or a verbose tool schema carried on every turn.

Symptom it causes/fixes: 50–77K characters of authored text become 19–30K tokens of per-turn prefill, which is roughly 3.8 s of cold prefill on a node hop. The Task message and tool schemas dominate.

Try it: prompt-size cost estimator (interactive widget)

Paste a task message and a tool schema. The widget converts authored characters at 2.7 chars/tok (TR) and reports: authored tokens; the per-turn total including a ~6K-token base system prompt; the warm TTFT delta (~15 ms per 1K tokens, cloud); the cold prefill (0.15–0.28 s per 1K tokens, on-prem); and how many such prompts fit the KV budget.

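The same arithmetic can be run offline. The sketch below uses only the constants quoted for the estimator (2.7 chars/tok, a ~6K-token base, ~15 ms/1K warm on cloud, 0.15–0.28 s/1K cold on-prem). Two things are assumptions: the deltas are applied to the authored tokens only, and kv_budget_tokens is a hypothetical parameter you must supply for your own deployment.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 2.7           # Turkish chars-per-token figure from the text
BASE_SYSTEM_TOKENS = 6_000      # approximate base system prompt
WARM_MS_PER_1K = 15.0           # cloud, warm TTFT delta per 1K tokens
COLD_S_PER_1K = (0.15, 0.28)    # on-prem, cold prefill per 1K tokens (low, high)


@dataclass
class PromptCost:
    authored_tokens: int
    per_turn_tokens: int
    warm_ttft_delta_ms: float
    cold_prefill_s: tuple        # (low, high) estimate in seconds
    prompts_in_kv_budget: int


def estimate_prompt_cost(authored_chars: int, kv_budget_tokens: int) -> PromptCost:
    authored_tokens = round(authored_chars / CHARS_PER_TOKEN)
    per_turn_tokens = authored_tokens + BASE_SYSTEM_TOKENS
    thousands = authored_tokens / 1000
    return PromptCost(
        authored_tokens=authored_tokens,
        per_turn_tokens=per_turn_tokens,
        warm_ttft_delta_ms=thousands * WARM_MS_PER_1K,
        cold_prefill_s=(thousands * COLD_S_PER_1K[0], thousands * COLD_S_PER_1K[1]),
        # How many prompts of this size fit the (hypothetical) KV budget.
        prompts_in_kv_budget=kv_budget_tokens // per_turn_tokens,
    )


# 27,000 authored chars against an illustrative 100K-token KV budget.
cost = estimate_prompt_cost(authored_chars=27_000, kv_budget_tokens=100_000)
print(cost)
```

The useful comparison is between the warm and cold fields: the same authored text that adds a small fraction of a second on a warm cloud path adds whole seconds on a cold on-prem prefill.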
Step 41

Tool schemas and transitions are token dials LLM

Symptom it causes/fixes: a routing call whose decode grows with every transition; a tool description bloated with worked examples that rides every single speaking turn.
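To see how much of a tool schema is description text, a rough audit like the one below can help. It is a sketch under stated assumptions: the tool is shaped like a JSON-schema function definition (the record_answer tool is hypothetical), the serialized JSON length stands in for what the service actually sends, and tokens are estimated at the document's 2.7 chars/tok rather than with a real tokenizer.

```python
import json

CHARS_PER_TOKEN = 2.7  # the document's Turkish chars-per-token figure


def iter_descriptions(node):
    """Yield every string stored under a "description" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "description" and isinstance(value, str):
                yield value
            else:
                yield from iter_descriptions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_descriptions(item)


def audit_tool_schema(tool_schema: dict) -> dict:
    """Estimate total tokens and the share carried by description strings."""
    total_chars = len(json.dumps(tool_schema, ensure_ascii=False))
    description_chars = sum(len(text) for text in iter_descriptions(tool_schema))
    return {
        "total_tokens": round(total_chars / CHARS_PER_TOKEN),
        "description_tokens": round(description_chars / CHARS_PER_TOKEN),
        "description_share": description_chars / total_chars,
    }


# Hypothetical tool whose description is padded with worked examples.
example_tool = {
    "name": "record_answer",
    "description": "Record the caller's answer. " + "Example: caller says yes. " * 40,
    "parameters": {
        "type": "object",
        "properties": {
            "answer": {"type": "string", "description": "Verbatim answer."},
        },
    },
}
report = audit_tool_schema(example_tool)
print(report)
```

A high description_share suggests the worked examples are the part worth moving out of the schema, since the schema is carried on every speaking turn.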

Step 42

Entry messages gate warmups; near-duplicate nodes multiply cold prefixes LLM

Two structural rules that fall straight out of Part 5:

  1. Every node reachable mid-conversation should carry an entry message (TTS/Audio/LLM-prompt) long enough to cover ~1–4 s of warmup; a silent hop into a 30K node puts the full ~3.8 s cold prefill on the user's next turn (processor.py:2805–2813).
  2. Merge near-identical nodes. Ten per-question nodes sharing 95% of their prompt are ten distinct 20–30K prefixes competing for KV residency plus ten warmup spends; one node with a question variable in a trailing position is one prefix.

Why "trailing position" matters. Put the per-question variable at the end of the prompt so the shared prefix stays byte-identical and stays cached; a variable near the top rewrites the prefix and forces a miss (see Step 43 and Part 7's trailing-context-block rule).
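The effect of variable placement on the shared prefix can be checked with plain string comparison. This is a minimal sketch: SHARED_RULES stands in for the authored body, both render functions are hypothetical, and character-level prefix length is only a proxy, since real prefix caches typically match on tokens or fixed-size blocks.

```python
import os

# Stand-in for the large authored body shared by every question.
SHARED_RULES = "Rules: be brief. Confirm each answer. " * 50
QUESTION_LABEL = "Current question: "


def render_trailing(question: str) -> str:
    """Per-question variable at the end: the shared body comes first."""
    return SHARED_RULES + "\n" + QUESTION_LABEL + question


def render_leading(question: str) -> str:
    """Per-question variable at the top: the shared body comes after it."""
    return QUESTION_LABEL + question + "\n" + SHARED_RULES


def shared_prefix_chars(prompts: list) -> int:
    """Length of the longest prefix common to all prompts."""
    return len(os.path.commonprefix(prompts))


questions = ["How satisfied are you?", "Would you recommend us?"]
trailing = shared_prefix_chars([render_trailing(q) for q in questions])
leading = shared_prefix_chars([render_leading(q) for q in questions])
print(f"trailing variable: {trailing} shared chars; leading variable: {leading}")
```

With the variable trailing, the prompts agree on the entire shared body and diverge only at the question itself; with the variable leading, they diverge almost immediately, so nothing after the label can be reused.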

Try it: turn-timeline composer (interactive widget; edit the workflow, watch the felt latency)

Authoring dials map to perceived per-turn latency, drawn as three segments: extraction wait, routing / speaking LLM, and TTS first sentence.

Step 43

History growth and truncation LLM

History is usually NOT the prefill problem — node-prompt swaps are — but the levers matter on long calls:

Net effect measured on production traces: ~+700 tokens over a 40-turn call once compaction is active (memory-derived, 2026-05-25 — not re-verified in a repo artifact).

Try it: node-hop cache simulator (interactive widget, with a loop-prompt activation event)

Hop between nodes and fire the loop prompt to watch the prefix-swap cost. Silent hops eat the cold prefill on turn 1; entering via an entry message hides the warmup behind playback; activating the Loop Prompt replaces the Task mid-node, swapping the prefix and forcing a fresh prefill. The widget marks each event as either warm / hidden or cold prefill / cache miss.

Step 44

The shrink-levers checklist (priority order) LLM

What to do when the diagnosis says "the prompt is the problem":

  1. Shrink the speaking prompt — target the task_message's rules tables / decision trees first.
  2. Trim tool schemas (move worked examples out of descriptions).
  3. Cut transitions and tighten intent descriptions.
  4. Consolidate extraction groups (share extraction_prompt across same-type fields).
  5. Keep intent conditions variable-free (g3 ∥ extraction instead of serial g4).
  6. Turn on history compaction for long nodes (keep_recent over summarize); suppress_tool_results for API-heavy flows.
  7. Give every reachable node an entry message; merge near-duplicate nodes.
  8. Keep per-turn-changing variables out of system/task text (Step 28).

Checkpoint: cloud testing of a new workflow shows fine latency; the on-prem pilot of the same workflow feels sluggish on every node change. Nothing in the infra changed. Most likely cause?

Prompt size. Cloud prefill is ~15 ms per 1K tokens, so a 25K-token node prompt costs ~0.4 s extra and hides in the noise; the on-prem pair pays ~0.15–0.28 s per 1K — the same node hop is a 3.8–5 s cold prefill, and with several fat nodes the KV cache can't hold them all (Part 6 math).

The fix is authorial: shrink the prompts and merge nodes, not "faster GPUs" (processor.py:2805–2813).

What dominates here. The Task message and tool schemas: the two places where 50–77K characters of authored text silently become 19–30K tokens of per-turn prefill.

Ask Claude Code: "Show me compose_system_instruction() in pipecat-agent src/core/workflow/workflow.py:537–552 and count the tokens of my node's task_message + tool schemas at 2.7 chars/tok."