Anatomy of the speaking prompt LLM
What it is. Every turn the processor rebuilds the system instruction. compose_system_instruction() = Global Prompt (system_prompt) + optional per-node Role (role_message) + "## Current Task\n" + the node's Task (task_message), all with {{variable}} substitution. On top go the full conversation history and the node's tools.
Runtime effect. The authored characters silently become per-turn prefill tokens. Real sizes from an anonymized production Anadolu fixture (freya-ops latency-benchmark/workflow-fixtures/anadolu-survey/meta.json):
| Node | System prompt chars | Observed input tokens (system + history + tools) | Tools |
|---|---|---|---|
dogrulama (identity check) | 54,238 | 18,814 | 1 |
anket (survey) | 77,136 | 30,108 | 2 (incl. the full survey decision-tree in one tool description) |
The usual authored-bloat offenders, named by the latency-analyzer skill (SKILL.md:155–158): an oversized task_message (hand-built rules tables, STT-mishear maps, decision trees) or a verbose tool schema carried on every turn.
Symptom it causes/fixes: 50–77K characters of authored text become 19–30K tokens of per-turn prefill ~3.8 s cold on a node hop. The Task message and tool schemas dominate.
Try it — prompt-size cost estimator
Tool schemas and transitions are token dials LLM
- Tools are serialized into every speaking request and into the warmup (
processor.py:2701–2782;boot_steps.py:1519–1574). Theanketsurvey_fill_standardtool carries the entire survey decision-tree plus a worked example inside its description. Every tool-call transition adds one more FunctionSchema. - Transitions feed the routing call twice: ~260 input tokens per transition on the dogrulama fixture (2,627 tok / 10 transitions) and ~5–6 output tokens of guided JSON each (~60–120 ms decode at this stack's TPOT). Cutting transitions and tightening intent descriptions shrinks both.
Symptom it causes/fixes: a routing call whose decode grows with every transition; a tool description bloated with worked examples that rides every single speaking turn.
Entry messages gate warmups; near-duplicate nodes multiply cold prefixes LLM
Two structural rules that fall straight out of Part 5:
- Every node reachable mid-conversation should carry an entry message (TTS/Audio/LLM-prompt) long enough to cover ~1–4 s of warmup; a silent hop into a 30K node puts the full ~3.8 s cold prefill on the user's next turn (
processor.py:2805–2813). - Merge near-identical nodes. Ten per-question nodes sharing 95% of their prompt are ten distinct 20–30K prefixes competing for KV residency plus ten warmup spends; one node with a question variable in a trailing position is one prefix.
Try it — turn-timeline composer (edit the workflow, watch the felt latency)
History growth and truncation LLM
History is usually NOT the prefill problem — node-prompt swaps are — but the levers matter on long calls:
- Node-level: Compaction Mode (
truncation_strategy, enumkeep_all | keep_recent | summarize; default KEEP_ALL, window default 10 —parser.py:273–276). KEEP_RECENT keeps system prefix + last N messages (processor.py:3507–3509); SUMMARIZE compresses older messages via an extra LLM call on the extractor client (processor.py:3511–3555) — the summarize strategy itself costs one LLM round-trip on the turn it fires and rewrites the prefix (cache miss on the next speaking call). keep_recent is the cache-friendlier choice. - Agent-level:
contextSummarization(llm.additional_settings.contextSummarization, defaults 8000/6000/20/4 —types.py:117–129), with a safety net that force-compacts at 90% of the model's context window down to 60% (base_service.py:134–139,151–263). suppress_tool_results(per node) compacts bulky API payloads in history to status-only summaries before each speaking call (processor.py:3383–3445) — the lever for API-heavy flows.- Routing and extraction never see full history anyway: the default templates embed
{{router.conversation.last-10}}/{{extractor.conversation.last-10}}(router.py:65,extractor.py:40), overridable per node via Routing Prompt (routing_prompt).
Net effect measured on production traces: ~+700 tokens over a 40-turn call once compaction is active (memory-derived, 2026-05-25 — not re-verified in a repo artifact).
Try it — node-hop cache simulator (with a loop-prompt activation event)
Silent hops eat the cold prefill on turn 1; entering via an entry message hides the warmup behind playback; activating the Loop Prompt replaces the Task mid-node, swapping the prefix and forcing a fresh prefill.
The shrink-levers checklist (priority order) LLM
What to do when the diagnosis says "the prompt is the problem":
- Shrink the speaking prompt — target the
task_message's rules tables / decision trees first. - Trim tool schemas (move worked examples out of descriptions).
- Cut transitions and tighten intent descriptions.
- Consolidate extraction groups (share
extraction_promptacross same-type fields). - Keep intent conditions variable-free (g3 ∥ extraction instead of serial g4).
- Turn on history compaction for long nodes (
keep_recentoversummarize);suppress_tool_resultsfor API-heavy flows. - Give every reachable node an entry message; merge near-duplicate nodes.
- Keep per-turn-changing variables out of system/task text (Step 28).
Checkpoint: cloud testing of a new workflow shows fine latency; the on-prem pilot of the same workflow feels sluggish on every node change. Nothing in the infra changed. Most likely cause?
Prompt size. Cloud prefill is ~15 ms per 1K tokens, so a 25K-token node prompt costs ~0.4 s extra and hides in the noise; the on-prem pair pays ~0.15–0.28 s per 1K — the same node hop is a 3.8–5 s cold prefill, and with several fat nodes the KV cache can't hold them all (Part 6 math).
The fix is authorial: shrink the prompts and merge nodes, not "faster GPUs" (processor.py:2805–2813).
compose_system_instruction() in pipecat-agent src/core/workflow/workflow.py:537–552 and count the tokens of my node's task_message + tool schemas at 2.7 chars/tok."