Part 4 — The LLM stage: one turn fires a bundle of calls

Framing

One turn, a bundle of differently-shaped calls

A common misconception is that "the LLM call" is one call. A single user turn fires a bundle of differently-shaped LLM calls — extraction (JSON), routing / intent-matching (guided-decode JSON), then the speaking call (streamed free text + tools) — with explicit dependency rules between them.

[Diagram, processor.py:1214–1248: extraction (JSON), routing (guided JSON), speaking (streamed + tools)]

The canonical spec is the docstring at pipecat-agent src/core/workflow/processor.py:1214-1248. Read this part as: where each call sits, what makes one wait for another, and which dials are workflow-design choices rather than serving knobs.

What dominates here: max(extraction, intent_match) — typically ~0.5 s on-prem — plus the decode of the routing JSON. The author controls both: extraction grouping and transition count are workflow-design choices, not infra.

Step 17

The dependency DAG

What it is. When the turn closes, _extract_and_route classifies each transition (processor.py:1409-1443) into one of four buckets that decide what runs, what waits, and what is free:

det: Fully deterministic (variable comparisons). Pure Python, no LLM (router.py:284-319).

g3: Needs the LLM (e.g. MATCHES_INTENT) and does not reference this turn's extraction scope. Fires in parallel with extraction.

g4: Needs the LLM and references freshly-extracted variables. Must wait for extraction to finish. This is the serialization trap.

always: Bare ALWAYS fallback. Free.
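The four buckets can be sketched as a small classifier. This is a minimal illustration, not the code at processor.py:1409-1443: the Transition shape, the needs_llm flag, the {{var}} regex, and the classify function are all hypothetical names introduced here.

```python
import re
from dataclasses import dataclass
from enum import Enum

# Hypothetical pattern for {{variable}} references inside a condition string.
VAR_REF = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class Bucket(Enum):
    DET = "det"
    G3 = "g3"
    G4 = "g4"
    ALWAYS = "always"


@dataclass(frozen=True)
class Transition:
    condition: str   # e.g. "ALWAYS", "{{balance}} > 0", "user wants to cancel"
    needs_llm: bool  # True for MATCHES_INTENT-style conditions


def classify(t: Transition, extraction_scope: set[str]) -> Bucket:
    """Place one transition into det / g3 / g4 / always."""
    if t.condition.strip() == "ALWAYS":
        return Bucket.ALWAYS
    if not t.needs_llm:
        return Bucket.DET
    referenced = set(VAR_REF.findall(t.condition))
    # Only variables extracted on THIS turn force the wait.
    return Bucket.G4 if referenced & extraction_scope else Bucket.G3
```

Under these assumptions, adding a single in-scope {{var}} reference to an LLM condition is enough to move it from g3 to g4.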

Runtime effect. Then (processor.py:1445-1537): the extract task fires immediately; g3_prefetch fires concurrently — all local + global intent leaves are batched into ONE LLM call (router.prefetch_intent_batch, router.py:170-197 → _evaluate_intent_leaves :1111-1150, indexed-boolean json_schema, one workflow.intent_matching span). Extraction is awaited at :1508; only then can g4_prefetch fire with the substituted variables.

Priority walk (:1539-1664): loop → local det → local g3 → local g4 → global det → global g3 → global g4 → local ALWAYS → global ALWAYS. First match wins; pending tasks are cancelled.
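The firing order and the first-match rule described above can be sketched with asyncio. This is a toy model under stated assumptions: fake_call stands in for an LLM call by sleeping, the walk is reduced to g3, then g4, then ALWAYS, and none of these names come from processor.py.

```python
import asyncio


async def fake_call(name: str, seconds: float, result, log: list[str]):
    """Stand-in for an LLM call: logs start/end and sleeps instead of decoding."""
    log.append(f"start {name}")
    await asyncio.sleep(seconds)
    log.append(f"end {name}")
    return result


async def extract_and_route(g3_matches: bool, has_g4: bool) -> tuple[str, list[str]]:
    log: list[str] = []
    # Extraction and the g3 prefetch fire together.
    extract = asyncio.create_task(fake_call("extract", 0.05, {"tckn": "1"}, log))
    g3 = asyncio.create_task(fake_call("g3", 0.01, g3_matches, log))

    # g4 needs the extracted variables, so it can only be created after this await.
    await extract
    g4 = asyncio.create_task(fake_call("g4", 0.01, True, log)) if has_g4 else None

    # Simplified priority walk: first match wins.
    winner = "always"
    if await g3:
        winner = "g3"
    elif g4 is not None and await g4:
        winner = "g4"

    # Anything still pending after a match is cancelled.
    if g4 is not None and not g4.done():
        g4.cancel()
        log.append("cancel g4")
    return winner, log
```

Running asyncio.run(extract_and_route(False, True)) shows "start g4" only after "end extract", which is the serialization the g4 bucket introduces.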

Symptom it explains: a turn that "looked simple" but stalled — a single {{var}} condition flipped a transition from g3 (overlapped) to g4 (serial-after-extraction).

Try it — the DAG stepper

classify → fire(extract ∥ g3) → await extract → g4 → priority walk

Toggle whether any transition's condition reads a freshly-extracted variable ({{var}} is set). Watch the g4 step appear and serialize after extraction.


Step 18

Extraction is on the critical path (overlapped, not free)

What it is. Variable extraction from the user's turn, awaited before the routing walk and therefore before the speaking LLM.

Runtime effect. await tasks["extract"] at processor.py:1508 is the receipt; the routing decision is awaited at :1720 before the speaking context push at :1832. The turn's pre-speaking stall is:

pre-speaking stall

max(extraction, g3_routing)   # nothing references variables
extraction + g4_routing      # a transition reads a fresh {{var}}

PR #224 (2026-03) made the overlap real: the measured win was −1,581 ms on an intent-only turn (extraction 885 ms ∥ routing 18 ms; analysis/llm-latency-benchmark/5-extraction-parallel-latency/findings.md).

The authored fan-out dial. The extractor groups fields by (type, extraction_prompt, retry_prompt, has_enum_values) and fires one LLM call per group, concurrently (src/core/workflow/extractor.py:684-696,176-236). Eleven fields with eleven unique extraction prompts = eleven concurrent extractor calls; consolidating 11 → 3 groups measured ~600–1,000 ms saved on extraction-heavy turns (exp-5 findings). Failed validations add a serial workflow.extraction_retry call (extractor.py:449).
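The grouping rule can be sketched as follows. The Field dataclass and group_fields are hypothetical stand-ins for whatever extractor.py actually uses; only the four-part grouping key is taken from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical field shape; only the grouping attributes matter here.
    name: str
    type: str
    extraction_prompt: str
    retry_prompt: str = ""
    enum_values: tuple[str, ...] = ()


def group_fields(fields: list[Field]) -> dict[tuple, list[str]]:
    """Group fields by (type, extraction_prompt, retry_prompt, has_enum_values)."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for f in fields:
        key = (f.type, f.extraction_prompt, f.retry_prompt, bool(f.enum_values))
        groups[key].append(f.name)
    return dict(groups)


# Eleven unique prompts produce eleven groups, i.e. eleven concurrent calls.
unique = [Field(f"f{i}", "string", f"prompt {i}") for i in range(11)]
print(len(group_fields(unique)))

# Sharing one prompt across same-type fields collapses them into one group.
shared = [Field(f"f{i}", "string", "Extract the value the caller stated.") for i in range(11)]
print(len(group_fields(shared)))
```

The number of keys in the returned dict is the number of extractor calls the node would fire under this model, so it is a quick way to see what a prompt consolidation buys.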

When to change: share one extraction_prompt across same-type fields; keep intent conditions variable-free so g3 overlaps extraction instead of waiting. A condition like "{{payment_date}} is set" on a transition forces serial extract → route.

On-prem Gemma caveat: KKB Gemma agents that capture variables via a tool inside the speaking call make no separate extraction call. This depends on the agent and workflow, though, and is not an on-prem rule (the Garanti case-study call shows 20 workflow.extraction rows).

Symptom it causes/fixes: a fat pre-speaking gap on a node with many uniquely-prompted fields — 1.6–1.8 s when 8–11 extractor groups queue. Consolidate the prompts and the gap collapses toward the slowest single group.

Try it — the turn-timeline composer

extraction ∥ routing → speaking TTFT → first sentence


Constants from KKB 4×H100 Gemma @ C=1: extraction ~698–885 ms per group (the slowest group sets the wall, degrading slightly as groups queue), routing TTFT 52 ms + ~5–6 output tok per transition at ~13 ms/tok, speaking warm TTFT ~155 ms (cold ~3,774 ms), first-sentence decode ~480 ms.
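A rough composer built from those constants might look like the sketch below. It is an approximation, not the widget's code: it takes the upper end of the extraction range as the wall, uses 5.5 tokens per transition as a midpoint, and ignores the queueing degradation, which the text does not quantify. StackConstants and compose_turn are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackConstants:
    extraction_ms: float = 885.0          # slowest group, upper end of 698-885 ms
    routing_ttft_ms: float = 52.0
    tokens_per_transition: float = 5.5    # assumed midpoint of ~5-6 tokens
    tpot_ms: float = 13.0
    speaking_ttft_warm_ms: float = 155.0
    speaking_ttft_cold_ms: float = 3774.0
    first_sentence_ms: float = 480.0


def compose_turn(transitions: int, reads_fresh_var: bool, cold_prefix: bool,
                 k: StackConstants = StackConstants()) -> dict[str, float]:
    """Estimate routing cost, pre-speaking stall, and time to end of first sentence."""
    routing = 0.0
    if transitions > 0:
        routing = k.routing_ttft_ms + transitions * k.tokens_per_transition * k.tpot_ms
    if reads_fresh_var:
        stall = k.extraction_ms + routing      # g4: serial after extraction
    else:
        stall = max(k.extraction_ms, routing)  # g3: overlapped with extraction
    ttft = k.speaking_ttft_cold_ms if cold_prefix else k.speaking_ttft_warm_ms
    return {
        "routing_ms": routing,
        "stall_ms": stall,
        "first_sentence_done_ms": stall + ttft + k.first_sentence_ms,
    }


print(compose_turn(transitions=5, reads_fresh_var=False, cold_prefix=False))
print(compose_turn(transitions=5, reads_fresh_var=True, cold_prefix=False))
```

Comparing the two printed dicts shows the cost of a single fresh {{var}} reference: the routing time moves from hidden under extraction to added on top of it.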

Step 19

Routing latency is an output-length problem

What it is. The one batched workflow.intent_matching call per turn that decides the transition.

Runtime effect. Guided decoding into a compact indexed-boolean object {"0":false,"1":true,…} (router.py:982-991), temperature 0, max_tokens 512. Measured at C=1 on KKB: routing total p50 439 ms, of which TTFT is only 52 ms — ~88% of routing wall time is decoding the ~31-token boolean JSON (freya-ops latency-benchmark/results/workflow_latency_benchmark_20260525_174551__anadolu-survey.md). Every additional transition adds a key (~5–6 output tokens ≈ 60–120 ms at this stack's 12–20 ms/tok).
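To make the indexed-boolean shape concrete, here is one way such a schema and its reader could look. The real schema at router.py:982-991 is not reproduced in this document, so the structure below is an assumption inferred from the example object, and both function names are hypothetical.

```python
import json
from typing import Optional


def indexed_boolean_schema(n_transitions: int) -> dict:
    """Build a JSON Schema with one required boolean per transition index."""
    keys = [str(i) for i in range(n_transitions)]
    return {
        "type": "object",
        "properties": {k: {"type": "boolean"} for k in keys},
        "required": keys,
        "additionalProperties": False,
    }


def first_match(raw_json: str, n_transitions: int) -> Optional[int]:
    """Return the lowest transition index the model marked true, if any."""
    verdicts = json.loads(raw_json)
    for i in range(n_transitions):
        if verdicts.get(str(i)) is True:
            return i
    return None


print(json.dumps(indexed_boolean_schema(2)))
print(first_match('{"0":false,"1":true}', 2))
```

Each transition contributes exactly one short key and one boolean to the output, which is why output length scales with transition count rather than with condition text.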

When to change: transitions-per-node is a latency dial; the compact schema is why it isn't worse. Routing >2 s with a fat tail = GPU contention (it collided with a big warmup or other calls), not the ~2K routing prompt (latency-analyzer references/interpreting.md:37-41).
Non-intent LLM conditions (SEMANTIC_MATCH, LLM_COMPARE) each cost one extra LLM call inside the batch evaluation (router.py:1032-1087, max_tokens 32).

Symptom it causes/fixes: routing wall that grows with transition count even though the prompt barely changes — because you are paying decode, not prefill.
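The decode-versus-prefill split can be checked with the figures quoted in this step (439 ms total, 52 ms TTFT, ~31 output tokens, 5-6 tokens per key, 12-20 ms/tok). The helper names are hypothetical, and the model assumes decode time is simply total minus TTFT.

```python
def decode_share(total_ms: float, ttft_ms: float) -> float:
    """Fraction of routing wall time spent after the first token."""
    return (total_ms - ttft_ms) / total_ms


def implied_tpot_ms(total_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average time per output token implied by the measurement."""
    return (total_ms - ttft_ms) / output_tokens


def added_decode_range_ms(extra_transitions: int,
                          tokens_per_key: tuple[int, int] = (5, 6),
                          tpot_ms: tuple[float, float] = (12.0, 20.0)) -> tuple[float, float]:
    """Low/high estimate of extra decode time for additional transitions."""
    low = extra_transitions * tokens_per_key[0] * tpot_ms[0]
    high = extra_transitions * tokens_per_key[1] * tpot_ms[1]
    return low, high


print(round(decode_share(439, 52), 2))
print(round(implied_tpot_ms(439, 52, 31), 1))
print(added_decode_range_ms(1))
```

Under this model the prompt size never appears in the per-transition cost, which matches the symptom: the wall grows with keys, not with prefill.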

Try it — the routing decode cost widget

transitions → output JSON keys → decode milliseconds


Step 20

The loop condition is NOT an extra call; two-phase routing is the worst case

Per the 2026-02-27 team decision (post-PR #210 revert), the loop condition is a self-transition evaluated inside the normal phase-1 routing call — verified at router.py:246-282 ("Phase 1: Evaluate transitions (local + loop-condition self-transition)", :257,264). Phase 2 (global transitions) only runs when phase 1 had no match.

Worst case = 2 sequential routing-LLM calls. A phase-1 LLM miss followed by phase-2 globals that also need the LLM means two sequential routing calls in one turn, not a third evaluator. Deterministic conditions short-circuit with zero LLM calls: _has_llm_routes (processor.py:1255,1290-1304) skips even opening the routing span on deterministic-only nodes, making them near-free.

Symptom it explains: a turn with two routing spans is the two-phase worst case, not a bug. A node with only det routes showing no routing span at all is the short-circuit, not a missing measurement.
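The call count in each case can be sketched as a two-phase loop. This is a simplified model, not router.py: route and llm_batch are hypothetical, every condition is assumed to need the LLM, and the loop condition is represented as just another entry in the phase-1 list.

```python
from typing import Callable, Optional, Sequence


def route(local_and_loop: Sequence[str],
          global_: Sequence[str],
          llm_batch: Callable[[Sequence[str]], Optional[str]]) -> tuple[Optional[str], int]:
    """Return (matched transition, number of routing LLM calls made)."""
    calls = 0
    for phase in (local_and_loop, global_):
        if not phase:
            continue  # nothing to evaluate in this phase, so no call is made
        calls += 1
        hit = llm_batch(phase)
        if hit is not None:
            return hit, calls  # phase 2 never runs after a phase-1 match
    return None, calls


def matcher_for(target: str) -> Callable[[Sequence[str]], Optional[str]]:
    """Toy batch evaluator that matches exactly one named transition."""
    return lambda batch: target if target in batch else None


local = ["loop_condition", "wants_payment_plan"]
globals_ = ["wants_agent", "wants_to_hang_up"]
print(route(local, globals_, matcher_for("loop_condition")))  # one call
print(route(local, globals_, matcher_for("wants_agent")))     # two calls
```

The loop condition rides inside the first batch, so matching it costs one call, and the count never exceeds two however many transitions each phase holds.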

Step 21

Routing suppression and the no-response shortcut

If the current node hasn't produced an assistant response yet (and isn't silent/passthrough), routing and node-extraction context are skipped for that turn — it goes straight to the speaking LLM (processor.py:1691-1700).

Why it matters: this is why some turns in a trace show no routing spans at all. Don't read it as a missing measurement — it's the suppression shortcut.
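Read as a predicate, the shortcut is a single boolean expression. The function and parameter names below are hypothetical; the actual check sits at processor.py:1691-1700 and may consider more state than this sketch does.

```python
def routing_suppressed(node_has_assistant_response: bool,
                       node_is_silent: bool,
                       node_is_passthrough: bool) -> bool:
    """True when the turn should skip routing and go straight to the speaking LLM."""
    speaks_normally = not (node_is_silent or node_is_passthrough)
    return speaks_normally and not node_has_assistant_response


# A normal node that has not spoken yet: routing is skipped for this turn.
print(routing_suppressed(False, False, False))
# A silent node is exempt, so routing still runs.
print(routing_suppressed(False, True, False))
```

When reading a trace, a turn with no routing spans on a node that had not yet spoken is consistent with this predicate returning True.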

Step 22

Transition execution: speak first, work second

On a matched transition, _execute_transition (processor.py:2544) swaps workflow state, tool registry, system message and extractor schemas atomically, then — the load-bearing ordering — pushes the entry message BEFORE running blocking pre-actions:

processor.py:2653-2675: "Speak entry message before pre-actions so the user hears it while blocking actions (e.g. API calls) execute."

Entry-message playback is also the time budget for the three next-turn warmups (:2679-2681, Part 5).
If the transition pushed TTS, this turn's speaking-LLM push defers until BotStoppedSpeakingFrame (:2687-2689).

Why it matters: the user hears the entry line immediately while a slow API pre-action runs underneath — the ordering hides blocking work behind speech, and gives the next turn's warmups a window to prime the cache.
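The effect of that ordering can be shown with a small asyncio sketch. It is a toy model, not _execute_transition: a queue stands in for the frame pipeline, sleeps stand in for TTS playback and the API call, and all names are hypothetical.

```python
import asyncio


async def player(queue: asyncio.Queue, log: list[str]) -> None:
    """Downstream consumer that 'plays' whatever is pushed onto the queue."""
    text = await queue.get()
    log.append(f"playing: {text}")
    await asyncio.sleep(0.02)   # stand-in for TTS playback time
    log.append("playback done")


async def execute_transition() -> list[str]:
    log: list[str] = []
    queue: asyncio.Queue = asyncio.Queue()
    playback = asyncio.create_task(player(queue, log))

    # 1. Push the entry message first.
    await queue.put("One moment while I check that.")
    log.append("entry pushed")

    # 2. Only then run the blocking pre-action; playback overlaps with it.
    log.append("pre-action start")
    await asyncio.sleep(0.05)   # stand-in for a slow API call
    log.append("pre-action done")

    await playback
    return log


print(asyncio.run(execute_transition()))
```

In the printed log, playback finishes while the pre-action is still running. Swapping steps 1 and 2 would leave the caller in silence for the full duration of the API call.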

Checkpoint: a node has 10 MATCHES_INTENT transitions, 8 extraction fields each with its own extraction prompt, and one transition condition reading "{{tckn}} is set". What is the pre-speaking stall shape?

extraction + g4: the 8 unique extraction prompts fan out into 8 concurrent extractor calls (wall ≈ slowest group, degrading under extractor queueing), and because one condition references {{tckn}}, the g4 intent evaluation can only fire after extraction completes (processor.py:1523-1537) — serial, not overlapped.

Fix order: consolidate the extraction prompts into 1–2 groups, and if possible rewrite the condition as a pure intent so it rides the g3 prefetch instead.

Ask Claude Code: "In pipecat-agent, show me _extract_and_route in src/core/workflow/processor.py and explain where g3 vs g4 fires — with the exact lines where extraction is awaited before the g4 prefetch."