One turn, a bundle of differently-shaped calls LLM
A common misconception is that "the LLM call" is one call. A single user turn fires a bundle of differently-shaped LLM calls — extraction (JSON), routing / intent-matching (guided-decode JSON), then the speaking call (streamed free text + tools) — with explicit dependency rules between them.
The canonical spec is the docstring at pipecat-agent
src/core/workflow/processor.py:1214-1248. Read this part as: where each
call sits, what makes one wait for another, and which dials are workflow-design choices
rather than serving knobs.
What dominates here: max(extraction, intent_match) —
typically ~0.5 s on-prem — plus the decode of the routing JSON.
The author controls both: extraction grouping and transition count are workflow-design choices, not infra.
The dependency DAG LLM
What it is. When the turn closes, _extract_and_route classifies each
transition (processor.py:1409-1443) into one of four buckets that decide what
runs, what waits, and what is free:
- det: fully deterministic (variable comparisons). Pure Python, no LLM (router.py:284-319).
- g3: needs the LLM (e.g. MATCHES_INTENT) and does not reference this turn's extraction scope. Fires in parallel with extraction.
- g4: needs the LLM and references freshly-extracted variables. Must wait for extraction to finish. This is the serialization trap.
- always: bare ALWAYS fallback. Free.
Runtime effect. After classification (processor.py:1445-1537), the extract
task fires immediately; g3_prefetch fires concurrently — all local +
global intent leaves are batched into ONE LLM call
(router.prefetch_intent_batch, router.py:170-197 →
_evaluate_intent_leaves :1111-1150, indexed-boolean
json_schema, one workflow.intent_matching span). Extraction is
awaited at :1508; only then can g4_prefetch fire with the
substituted variables.
- Priority walk (:1539-1664): loop → local det → local g3 → local g4 → global det → global g3 → global g4 → local ALWAYS → global ALWAYS. First match wins; pending tasks are cancelled.
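The fan-out described above can be sketched with asyncio tasks. This is a minimal illustration, not the code in processor.py: `run_turn`, its inner coroutines, the returned variable, and the sleep durations are all hypothetical stand-ins for the real LLM calls.

```python
import asyncio


async def run_turn(has_g4: bool, log: list[str]) -> None:
    """Hypothetical model of the extract/route fan-out; sleeps replace LLM calls."""

    async def extract() -> dict:
        await asyncio.sleep(0.03)  # stand-in for the extraction LLM call
        log.append("extract done")
        return {"payment_date": "2026-01-01"}  # made-up extracted variable

    async def g3_prefetch() -> None:
        await asyncio.sleep(0.01)  # stand-in for the single batched intent call
        log.append("g3 done")

    async def g4_prefetch(variables: dict) -> None:
        log.append("g4 start")  # can only start once variables exist
        await asyncio.sleep(0.01)
        log.append("g4 done")

    # Extraction and the g3 batch start together and overlap.
    extract_task = asyncio.create_task(extract())
    g3_task = asyncio.create_task(g3_prefetch())

    # g4 conditions need the substituted variables, so they wait for extraction.
    variables = await extract_task
    if has_g4:
        await g4_prefetch(variables)
    await g3_task


if __name__ == "__main__":
    events: list[str] = []
    asyncio.run(run_turn(has_g4=True, log=events))
    print(events)
```

With `has_g4=False` the g4 step never appears and the stall is bounded by the slower of the two overlapped calls.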
Symptom it explains: a turn that "looked simple" but stalled —
a single {{var}} condition flipped a transition from g3 (overlapped) to g4
(serial-after-extraction).
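The four buckets reduce to a small decision rule. The sketch below assumes a hypothetical `Transition` shape with three boolean flags; the real classifier in processor.py derives these from the condition tree and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    """Hypothetical shape, for illustration only."""
    needs_llm: bool
    reads_fresh_var: bool = False  # condition references a freshly-extracted {{var}}
    is_always: bool = False        # bare ALWAYS fallback


def bucket(t: Transition) -> str:
    """Return the bucket name used in the text: det, g3, g4, or always."""
    if t.is_always:
        return "always"
    if not t.needs_llm:
        return "det"
    # An LLM condition overlaps extraction unless it reads a fresh variable.
    return "g4" if t.reads_fresh_var else "g3"


if __name__ == "__main__":
    intent_only = Transition(needs_llm=True)
    intent_with_var = Transition(needs_llm=True, reads_fresh_var=True)
    print(bucket(intent_only), bucket(intent_with_var))  # g3 g4
```

The only difference between the two example transitions is the `reads_fresh_var` flag, which is the single-condition flip the symptom describes.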
Extraction is on the critical path (overlapped, not free) LLM
What it is. Variable extraction from the user's turn, awaited before the routing walk and therefore before the speaking LLM.
Runtime effect. await tasks["extract"] at
processor.py:1508 is the receipt; the routing decision is awaited at
:1720 before the speaking context push at :1832. The turn's
pre-speaking stall is:
max(extraction, g3_routing)    # nothing references variables
extraction + g4_routing        # a transition reads a fresh {{var}}
PR #224 (2026-03) made the overlap real: the measured win was −1,581 ms
on an intent-only turn (extraction 885 ms ∥ routing 18 ms;
analysis/llm-latency-benchmark/5-extraction-parallel-latency/findings.md).
The authored fan-out dial. The extractor groups fields by
(type, extraction_prompt, retry_prompt, has_enum_values) and fires
one LLM call per group, concurrently
(src/core/workflow/extractor.py:684-696,176-236). Eleven fields with eleven
unique extraction prompts = eleven concurrent extractor calls; consolidating 11 → 3
groups measured ~600–1,000 ms saved on extraction-heavy turns
(exp-5 findings). Failed validations add a serial workflow.extraction_retry call
(extractor.py:449).
- When to change: share one extraction_prompt across same-type fields; keep intent conditions variable-free so g3 overlaps extraction instead of waiting. A condition like {{payment_date}} is set on a transition forces serial extract → route.
workflow.extraction rows).
Symptom it causes/fixes: a fat pre-speaking gap on a node with many uniquely-prompted fields, 1.6–1.8 s when 8–11 extractor groups queue. Consolidate the prompts and the gap collapses toward the slowest single group.
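To see how prompt sharing changes the fan-out, the grouping key can be modeled directly. This is a sketch under assumptions: the field dictionaries and the `group_fields` helper are hypothetical, and only the four-part key (type, extraction_prompt, retry_prompt, has_enum_values) comes from the text.

```python
from collections import defaultdict


def group_fields(fields: list[dict]) -> dict[tuple, list[str]]:
    """Group field names by the four-part key; one group means one extractor call."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for f in fields:
        key = (
            f["type"],
            f.get("extraction_prompt"),
            f.get("retry_prompt"),
            bool(f.get("enum_values")),  # modeled here as has_enum_values
        )
        groups[key].append(f["name"])
    return dict(groups)


# Eleven fields, each with its own prompt: eleven concurrent calls.
unique_prompts = [
    {"name": f"f{i}", "type": "string", "extraction_prompt": f"prompt {i}"}
    for i in range(11)
]
# The same eleven fields sharing three prompts: three calls.
shared_prompts = [
    {"name": f"f{i}", "type": "string", "extraction_prompt": f"prompt {i % 3}"}
    for i in range(11)
]

if __name__ == "__main__":
    print(len(group_fields(unique_prompts)), len(group_fields(shared_prompts)))  # 11 3
```

Any difference in one element of the key splits a group, so two fields with the same prompt but different types still cost two calls.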
Try it — the turn-timeline composer
Constants from KKB 4×H100 Gemma @ C=1: extraction ~698–885 ms per group (the slowest group sets the wall, degrading slightly as groups queue), routing TTFT 52 ms + ~5–6 output tok per transition at ~13 ms/tok, speaking warm TTFT ~155 ms (cold ~3,774 ms), first-sentence decode ~480 ms.
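As a rough stand-in for the composer, the constants above can be plugged into the stall formula by hand. The function name and its defaults are assumptions for illustration: the defaults take the 52 ms routing TTFT and ~13 ms/tok quoted above and the midpoint of the 5–6 tokens per transition, and the linear model ignores queueing.

```python
def pre_speaking_stall_ms(
    extraction_ms: float,
    n_transitions: int,
    reads_fresh_var: bool,
    routing_ttft_ms: float = 52.0,
    tokens_per_transition: float = 5.5,
    ms_per_token: float = 13.0,
) -> float:
    """Estimate the pre-speaking stall for one turn under a simple linear model."""
    routing_ms = routing_ttft_ms + n_transitions * tokens_per_transition * ms_per_token
    if reads_fresh_var:
        # g4: routing waits for extraction, so the two costs add.
        return extraction_ms + routing_ms
    # g3: routing overlaps extraction, so the slower one sets the wall.
    return max(extraction_ms, routing_ms)


if __name__ == "__main__":
    print(pre_speaking_stall_ms(885.0, 5, reads_fresh_var=False))  # 885.0
    print(pre_speaking_stall_ms(885.0, 5, reads_fresh_var=True))   # 1294.5
```

In this example the same node costs about 410 ms more once a single condition reads a fresh variable, because routing can no longer hide behind extraction.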
Routing latency is an output-length problem LLM
What it is. The one batched workflow.intent_matching call per turn that
decides the transition.
Runtime effect. Guided decoding into a compact indexed-boolean object
{"0":false,"1":true,…} (router.py:982-991), temperature 0,
max_tokens 512. Measured at C=1 on KKB: routing total p50
439 ms, of which TTFT is only 52 ms — ~88% of routing wall time is decoding
the ~31-token boolean JSON (freya-ops
latency-benchmark/results/workflow_latency_benchmark_20260525_174551__anadolu-survey.md).
Every additional transition adds a key (~5–6 output tokens ≈
60–120 ms at this stack's 12–20 ms/tok).
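For a sense of what the indexed-boolean output looks like, the sketch below builds a JSON Schema with one boolean property per transition and serializes a result compactly. The exact schema emitted by router.py is not shown here, so the helper names and schema layout are assumptions; only the `{"0":false,"1":true,…}` output shape comes from the text.

```python
import json


def indexed_boolean_schema(n_transitions: int) -> dict:
    """Assumed schema shape: keys "0".."n-1", each constrained to a boolean."""
    keys = [str(i) for i in range(n_transitions)]
    return {
        "type": "object",
        "properties": {k: {"type": "boolean"} for k in keys},
        "required": keys,
        "additionalProperties": False,
    }


def compact_result(matches: list[bool]) -> str:
    """Serialize match flags without whitespace, as a guided decoder would emit them."""
    payload = {str(i): m for i, m in enumerate(matches)}
    return json.dumps(payload, separators=(",", ":"))


if __name__ == "__main__":
    print(compact_result([False, True]))  # {"0":false,"1":true}
    print(sorted(indexed_boolean_schema(3)["properties"]))  # ['0', '1', '2']
```

Each extra transition adds one key and one boolean to the output, which is where the per-transition token cost comes from.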
- When to change: transitions-per-node is a latency dial; the compact schema is why it isn't worse. Routing >2 s with a fat tail = GPU contention (it collided with a big warmup or other calls), not the ~2K routing prompt (latency-analyzer references/interpreting.md:37-41).
- Non-intent LLM conditions (SEMANTIC_MATCH, LLM_COMPARE) each cost one extra LLM call inside the batch evaluation (router.py:1032-1087, max_tokens 32).
Symptom it causes/fixes: routing wall that grows with transition count even though the prompt barely changes, because you are paying decode, not prefill.
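The decode-dominated split can be checked with the measured numbers quoted above (439 ms p50 total, 52 ms TTFT, ~31 output tokens). The helper names are hypothetical and the arithmetic assumes a constant per-token decode rate.

```python
def decode_share(total_ms: float, ttft_ms: float) -> float:
    """Fraction of routing wall time spent after the first token."""
    return (total_ms - ttft_ms) / total_ms


def implied_ms_per_token(total_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average decode cost per output token implied by the measurement."""
    return (total_ms - ttft_ms) / output_tokens


def added_ms_per_transition(tokens_per_key: float, ms_per_token: float) -> float:
    """Extra decode time for one more key in the boolean JSON."""
    return tokens_per_key * ms_per_token


if __name__ == "__main__":
    print(round(decode_share(439, 52), 2))              # 0.88
    print(round(implied_ms_per_token(439, 52, 31), 1))  # 12.5
    print(added_ms_per_transition(5, 12), added_ms_per_transition(6, 20))  # 60 120
```

The implied ~12.5 ms/tok sits at the low end of the 12–20 ms/tok range, and the two bounds of the per-transition cost reproduce the 60–120 ms figure.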
The loop condition is NOT an extra call; two-phase routing is the worst case LLM
Per the 2026-02-27 team decision (post-PR #210 revert), the loop condition is a
self-transition evaluated inside the normal phase-1 routing call — verified at
router.py:246-282 ("Phase 1: Evaluate transitions (local + loop-condition
self-transition)", :257,264). Phase 2 (global transitions) only runs when phase 1
had no match.
_has_llm_routes (processor.py:1255,1290-1304) skips even opening the routing span on deterministic-only nodes, making them near-free.
Symptom it explains: a turn with two routing spans is the two-phase worst case, not a bug. A node with only det routes showing no routing span at all is the short-circuit, not a missing measurement.
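The phase ordering can be illustrated with a small function that counts routing calls. This is a sketch, not router.py: `two_phase_route` and the `evaluate` callback are hypothetical, and the callback stands in for one batched LLM evaluation that returns the matched transition name or None.

```python
from typing import Callable, Optional, Sequence, Tuple


def two_phase_route(
    local: Sequence[str],
    loop_condition: Optional[str],
    global_: Sequence[str],
    evaluate: Callable[[Sequence[str]], Optional[str]],
) -> Tuple[Optional[str], int]:
    """Return (matched transition, number of routing calls made)."""
    calls = 0
    # Phase 1: local transitions plus the loop condition as a self-transition.
    phase1 = list(local) + ([loop_condition] if loop_condition else [])
    if phase1:
        calls += 1
        match = evaluate(phase1)
        if match is not None:
            return match, calls
    # Phase 2: global transitions, only when phase 1 found nothing.
    if global_:
        calls += 1
        return evaluate(global_), calls
    # Nothing to evaluate: no routing call at all.
    return None, calls


if __name__ == "__main__":
    def pick_global(names: Sequence[str]) -> Optional[str]:
        return "g" if "g" in names else None

    print(two_phase_route(["a"], "loop", ["g"], pick_global))  # ('g', 2)
```

A loop-condition match returns after one call, while a match that only exists among the globals costs two, which is the two-span worst case.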
Routing suppression and the no-response shortcut LLM
If the current node hasn't produced an assistant response yet (and isn't silent/passthrough),
routing and node-extraction context are skipped for that turn — it goes straight to
the speaking LLM (processor.py:1691-1700).
Why it matters: this is why some turns in a trace show no routing spans at all. Don't read it as a missing measurement — it's the suppression shortcut.
Transition execution: speak first, work second LLM
On a matched transition, _execute_transition (processor.py:2544) swaps
workflow state, tool registry, system message and extractor schemas atomically, then
— the load-bearing ordering — pushes the entry message BEFORE running blocking
pre-actions:
- Entry-message playback is also the time budget for the three next-turn warmups (:2679-2681, Part 5).
- If the transition pushed TTS, this turn's speaking-LLM push defers until BotStoppedSpeakingFrame (:2687-2689).
Why it matters: the user hears the entry line immediately while a slow API pre-action runs underneath — the ordering hides blocking work behind speech, and gives the next turn's warmups a window to prime the cache.
Checkpoint: a node has 10 MATCHES_INTENT transitions, 8 extraction fields each with its own extraction prompt, and one transition condition reading "{{tckn}} is set". What is the pre-speaking stall shape?
extraction + g4: the 8 unique extraction prompts fan out into
8 concurrent extractor calls (wall ≈ slowest group, degrading under extractor
queueing), and because one condition references {{tckn}}, the g4 intent
evaluation can only fire after extraction completes
(processor.py:1523-1537) — serial, not overlapped.
Fix order: consolidate the extraction prompts into 1–2 groups, and if possible rewrite the condition as a pure intent so it rides the g3 prefetch instead.
_extract_and_route in src/core/workflow/processor.py and explain where
g3 vs g4 fires — with the exact lines where extraction is awaited before the g4 prefetch."