Part 1

Anatomy of one turn

The exact sequential path of one turn — which pieces overlap and which do not — and the turn-close formula that sets the floor for every other knob in this guide.

Step 1

The pipeline, as actually assembled (turn)

What it is. Before any knob makes sense you need the literal, sequential order in which frames flow through the voice pipeline. assemble_pipeline_step builds it at pipecat-agent src/core/boot_steps.py:2969-2993. Every element below is a real processor in the list, in order.

boot_steps.py:2969–2993 — the assembled order
transport.input()                  :2970   audio in
[FreyaDTMFAggregator]              :2971   keypad digits
[stt]                              :2972   FreyaSTTService (batch) | streaming
AudioPipelineMonitor()             :2973   diagnostics; emits the "vad" span
[voicemail_detector]               :2974
[input_guardrail]                  :2975
[confidence_reask_processor]       :2976
context_aggregator.user()          :2977   VAD + turn start/stop live HERE
[terminal_gate]                    :2978
[workflow_processor]               :2979   StatefulWorkflowProcessor: extract + route
[audio_playback]                   :2980
fallback_pipeline | llm            :2981   the SPEAKING LLM service
[output_guardrail]                 :2982
[voicemail_gate]                   :2983
call_action_processor              :2984   echo grace + call actions
[tts]                              :2985   FreyaTTSService (sentence aggregation)
[aec_tap]                          :2986
[SilenceFilterProcessor]           :2987   on-prem only
transport.output()                 :2988   audio out
The subtlety that trips everyone: VAD is not a pipeline element. The Silero (or DeepFilterNet-wrapped) analyzer is handed to the user aggregator via LLMUserAggregatorParams.vad_analyzer (src/services/base_service.py:1684,1690) and run by a VADController inside LLMUserAggregator (pipecat processors/aggregators/llm_response_universal.py:476-488). VAD start/stop frames are broadcast both upstream and downstream (pipecat audio/vad/vad_controller.py:237-241) — that is how the STT service, which sits upstream of the aggregator, learns when to finalize.
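
To make the two-way broadcast concrete, here is a toy model in plain Python. It does not use pipecat's API: `Stage`, `emit_both_ways`, and the shortened stage list are hypothetical stand-ins. It only illustrates why a processor sitting upstream of the aggregator still sees a VAD-stop frame when that frame is pushed in both directions.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Toy stand-in for a pipeline processor; records every frame it sees."""
    name: str
    seen: list = field(default_factory=list)


def emit_both_ways(chain, origin, frame):
    """Deliver `frame` from `origin` to every stage upstream and downstream of it."""
    idx = next(i for i, s in enumerate(chain) if s.name == origin)
    upstream = chain[:idx][::-1]      # nearest upstream stage first
    downstream = chain[idx + 1:]
    for stage in upstream + downstream:
        stage.seen.append(frame)
    return [s.name for s in upstream], [s.name for s in downstream]


# Shortened, illustrative chain; the real list is the listing above.
chain = [Stage("transport.input"), Stage("stt"), Stage("user_aggregator"),
         Stage("workflow_processor"), Stage("llm")]

up, down = emit_both_ways(chain, "user_aggregator", "vad_stop")
print("upstream:", up)      # upstream: ['stt', 'transport.input']
print("downstream:", down)  # downstream: ['workflow_processor', 'llm']
```

If the frame travelled downstream only, the `stt` stage in this model would never record it, which is the failure mode the broadcast avoids.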

Why it matters for latency: the two stages that own most of the visible delay (the VAD timers and the LLM bundle) live at two specific processors — context_aggregator.user() and workflow_processor — not spread across the whole chain.

Step 2

The three overlap windows (parallel)

What it is. The chain VAD-stop → turn-close → extract/route → (transition) → speaking LLM → sentence aggregation → TTS → audio out is strictly serial, except for three deliberate overlaps. If you sum the stages naively you will over-count by the size of these three windows.

No speculative speaking start. Routing strictly precedes speaking. The routing decision is awaited at processor.py:1720 before the speaking context is pushed at processor.py:1832 — there is no early/parallel speaking-LLM kickoff.
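
The ordering constraint can be sketched with a few lines of asyncio. This is an illustration only: `route`, `speak`, and `handle_turn` are hypothetical names, not the functions at the cited lines, and the `sleep(0)` stands in for the real extract/route calls.

```python
import asyncio

events = []


async def route(turn_text):
    """Hypothetical routing step; extraction would be awaited inside it."""
    events.append("route:start")
    await asyncio.sleep(0)          # stand-in for the extract/route LLM calls
    events.append("route:done")
    return "next_node"


async def speak(node):
    """Hypothetical speaking-LLM kickoff."""
    events.append(f"speak:start:{node}")


async def handle_turn(turn_text):
    node = await route(turn_text)   # routing is fully awaited first...
    await speak(node)               # ...and only then does speaking start


asyncio.run(handle_turn("I'd like to reschedule"))
print(events)  # ['route:start', 'route:done', 'speak:start:next_node']
```

Because `speak` is only called after `route` returns, routing time adds directly to time-to-first-audio; nothing in this shape lets the speaking call start early.
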
Step 3

The turn-close formula (endpointing)

What it is. The exact rule for when the user's turn ends and LLM work may begin.

ttsConfig.startSpeakingPlan.waitSeconds → user_speech_timeout: runtime fallback 0.6 s
stop_secs: default 0.2 s
batch STT p99: 0.55 s
SpeechTimeoutUserTurnStopStrategy — speech_timeout_user_turn_stop_strategy.py:191-203,236-244
armed timer  = max(stt_ttfs_p99 − stop_secs, user_speech_timeout)
on finalized transcript:
   wait collapses to user_speech_timeout   # measured from the VAD-stop timestamp

user_speech_timeout = waitSeconds   # base_service.py:1565-1571, fallback 0.6 s

The consequence. With on-prem defaults (stop_secs = 0.2, batch STT p99 0.55, waitSeconds = 0.6) the armed timer is max(0.55 − 0.2, 0.6) = 0.6 s from VAD-stop, and VAD-stop itself fires 0.2 s after speech ends. The turn therefore closes 0.8 s after the caller's last syllable, every turn, by construction. A 100 ms STT is invisible behind the 0.6 s waitSeconds floor; making STT faster buys nothing unless waitSeconds drops too.
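
As a quick check of that arithmetic, here is a minimal sketch of the rule in Python. The function names are illustrative, not the strategy's real methods, and the second function assumes VAD-stop fires exactly stop_secs after speech ends and that the armed timer then runs to completion.

```python
def armed_timer(stt_ttfs_p99, stop_secs, user_speech_timeout):
    """Timer armed at VAD-stop, in seconds, per the formula above."""
    return max(stt_ttfs_p99 - stop_secs, user_speech_timeout)


def close_after_last_syllable(stt_ttfs_p99, stop_secs, user_speech_timeout):
    """Seconds from the caller's last syllable to turn close.

    Assumption: VAD-stop fires stop_secs after speech ends, then the
    armed timer runs to completion.
    """
    return stop_secs + armed_timer(stt_ttfs_p99, stop_secs, user_speech_timeout)


defaults = dict(stt_ttfs_p99=0.55, stop_secs=0.2, user_speech_timeout=0.6)
print(round(armed_timer(**defaults), 2))                  # 0.6
print(round(close_after_last_syllable(**defaults), 2))    # 0.8

# A much faster STT does not move the close time while the floor is 0.6 s.
faster_stt = dict(defaults, stt_ttfs_p99=0.10)
print(round(close_after_last_syllable(**faster_stt), 2))  # 0.8
```

The STT term only wins the max() when stt_ttfs_p99 − stop_secs exceeds user_speech_timeout, which with these defaults would need a p99 above 0.8 s.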

Symptom it causes/fixes: “there's always a beat of silence before it answers” is usually this configured floor of about 0.8 s (0.2 s stop_secs + 0.6 s waitSeconds), not a slow model.

Step 4

What is NOT on the critical path (off-path)

Things that look like latency stages but are deliberately off-path. Don't waste a tuning cycle on them:

Step 5

A composed typical turn, with real numbers (budget)

On-prem defaults, warm cache, no node transition (KKB Gemma 4 31B, sources in Parts 2–9):

guide.md Part 1, Step 5 — warm turn, no transition
0.2 s   VAD silence confirmation (stop_secs)
        (STT finalize ~0.1–0.36 s, fully overlapped)
0.6 s   turn closes (waitSeconds floor from VAD-stop)
~0.5 s  max(extraction, intent-match)        intent p50 0.48 s
~0.2 s  speaking-LLM TTFT                     prefix cache warm: 0.15–0.21 s
~0.25 s first-sentence generation + aggregation
~0.27 s TTS TTFB (KKB wall, C=1)
−−−−−
≈ 1.6–1.9 s to first audio

This matches the production p50 of ~1.5 s (some turns skip extraction/routing entirely). The single felt-latency metric is turn.user_bot_latency_seconds (VADUserStoppedSpeaking → BotStartedSpeaking, pipecat observers/user_bot_latency_observer.py:87-88,147); the span-by-span decomposition map is in Part 12.
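
The budget can be re-added in a few lines of Python, using the midpoint-style figures from the listing above. The variable names are illustrative. One inference, not something the guide states: the serial stages sum to about 1.8 s when counted from VAD-stop, which falls inside the quoted 1.6–1.9 s range and matches where turn.user_bot_latency_seconds starts its clock; counting from the caller's last syllable adds the 0.2 s of stop_secs on top.

```python
# Stage estimates in seconds, copied from the warm-turn budget above.
STOP_SECS = 0.2
WAIT_FLOOR = 0.6          # STT finalize overlaps this window and adds nothing
AFTER_CLOSE = {
    "max(extraction, intent-match)": 0.5,
    "speaking-LLM TTFT": 0.2,
    "first sentence + aggregation": 0.25,
    "TTS TTFB": 0.27,
}

# Clock starts at VAD-stop, like the felt-latency metric.
from_vad_stop = WAIT_FLOOR + sum(AFTER_CLOSE.values())
# Clock starts when the caller actually stops talking.
from_last_syllable = STOP_SECS + from_vad_stop

print(f"from VAD-stop:      {from_vad_stop:.2f} s")       # 1.82 s
print(f"from last syllable: {from_last_syllable:.2f} s")  # 2.02 s
```

Keeping the two clocks separate avoids comparing a dashboard number that starts at VAD-stop with a stopwatch reading that starts at the last syllable.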

Double-counting warning. workflow.routing.batch contains extraction time (extraction is awaited inside it). Never add workflow.extraction.batch on top when summing a turn's budget.
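
A small helper makes the warning mechanical. This is a sketch under stated assumptions: the two workflow span names come from the text above, while `speak.ttft` and all durations are made up for illustration.

```python
def turn_budget(spans):
    """Sum span durations (seconds), skipping extraction when routing is present."""
    counted = dict(spans)
    if "workflow.routing.batch" in counted:
        # routing already contains extraction time; do not count it twice
        counted.pop("workflow.extraction.batch", None)
    return sum(counted.values())


spans = {
    "workflow.extraction.batch": 0.30,   # illustrative durations, not measurements
    "workflow.routing.batch": 0.50,      # includes the 0.30 s of extraction
    "speak.ttft": 0.20,                  # hypothetical span name
}

print(round(turn_budget(spans), 2))      # 0.7
print(round(sum(spans.values()), 2))     # 1.0  (naive sum, over-counts by 0.30)
```
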

Figure: The warm turn on the clock (overlaps shown as ghost lanes). Lanes: VAD, STT, wait floor, LLM (route + speak), TTS.

What dominates here: the configured timers (0.2 s stop_secs + 0.6 s waitSeconds floor) plus the LLM stage; everything else is rounding error on a warm turn.

Checkpoint: STT on your on-prem box finalizes in 120 ms. You drop in a 2× faster STT model. How much faster does the turn get?

Zero. Once a finalized transcript exists, the turn closes at VAD-stop + max(stt_final, waitSeconds). With the runtime default waitSeconds = 0.6, any STT that finalizes within ~600 ms of VAD-stop is fully hidden behind the wait floor (pipecat speech_timeout_user_turn_stop_strategy.py:236-244). To cash in the faster STT you must also lower waitSeconds. Try it in the race visualizer below: drag STT below the wait line and watch the armed timer ignore it.
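
The checkpoint answer can be verified numerically. This sketch encodes only the max() rule quoted in the answer; the function name is hypothetical, and it assumes the STT finalize time is measured from VAD-stop.

```python
def turn_close_after_vad_stop(stt_final, wait_seconds):
    """Seconds from VAD-stop to turn close once a final transcript exists."""
    return max(stt_final, wait_seconds)


baseline = turn_close_after_vad_stop(stt_final=0.120, wait_seconds=0.6)
faster = turn_close_after_vad_stop(stt_final=0.060, wait_seconds=0.6)
tuned = turn_close_after_vad_stop(stt_final=0.060, wait_seconds=0.2)

print(round(baseline - faster, 3))  # 0.0  (2x faster STT alone gains nothing)
print(round(baseline - tuned, 3))   # 0.4  (the gain comes from lowering waitSeconds)

# Once waitSeconds drops below the STT finalize time, STT becomes the limit.
print(turn_close_after_vad_stop(stt_final=0.120, wait_seconds=0.1))  # 0.12
```
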

Try it — turn-timeline composer

Compose one turn; read “first audio at X ms”

Try it — turn-close race

VAD-stop arms max(waitSeconds, sttP99 − stop_secs) — who wins?

Try it — Langfuse trace decoder

Pick a canned span list; reconstruct the per-stage waterfall and the felt-latency sum
Ask Claude Code: “In pipecat-agent, show me every place waitSeconds and stop_secs end up in the turn-close timer, with file:line, and confirm workflow.routing.batch awaits extraction inside it.”