Part 1 — Anatomy of one turn · Freya End-to-End Latency

Step 1

The pipeline, as actually assembled

What it is. Before any knob makes sense, you need the literal, sequential order in which frames flow through the voice pipeline. assemble_pipeline_step builds it in pipecat-agent at src/core/boot_steps.py:2969-2993. Every element below is a real processor in that list, in order.

boot_steps.py:2969–2993 — the assembled order

transport.input()                  :2970   audio in
[FreyaDTMFAggregator]              :2971   keypad digits
[stt]                              :2972   FreyaSTTService (batch) | streaming
AudioPipelineMonitor()             :2973   diagnostics; emits the "vad" span
[voicemail_detector]               :2974
[input_guardrail]                  :2975
[confidence_reask_processor]       :2976
context_aggregator.user()          :2977   VAD + turn start/stop live HERE
[terminal_gate]                    :2978
[workflow_processor]               :2979   StatefulWorkflowProcessor: extract + route
[audio_playback]                   :2980
fallback_pipeline | llm            :2981   the SPEAKING LLM service
[output_guardrail]                 :2982
[voicemail_gate]                   :2983
call_action_processor              :2984   echo grace + call actions
[tts]                              :2985   FreyaTTSService (sentence aggregation)
[aec_tap]                          :2986
[SilenceFilterProcessor]           :2987   on-prem only
transport.output()                 :2988   audio out

The subtlety that trips everyone: VAD is not a pipeline element. The Silero (or DeepFilterNet-wrapped) analyzer is handed to the user aggregator via LLMUserAggregatorParams.vad_analyzer (src/services/base_service.py:1684,1690) and run by a VADController inside LLMUserAggregator (pipecat processors/aggregators/llm_response_universal.py:476-488). VAD start/stop frames are broadcast both upstream and downstream (pipecat audio/vad/vad_controller.py:237-241) — that is how the STT service, which sits upstream of the aggregator, learns when to finalize.

Why it matters for latency: the two stages that own most of the visible delay (the VAD timers and the LLM bundle) live at two specific processors — context_aggregator.user() and workflow_processor — not spread across the whole chain.
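To make the ordering concrete, the listing above can be modelled as plain data. This is a minimal sketch, not the code in boot_steps.py: it assumes the bracketed entries in the listing are optional processors that are dropped when not configured, and the names PIPELINE and assemble_order are hypothetical, introduced only for illustration.

```python
# Illustrative model of the listing above; not the real assemble_pipeline_step.
# Assumption: a bracketed entry in the listing is optional (second field True).
PIPELINE = [
    ("transport.input()", False),
    ("FreyaDTMFAggregator", True),
    ("stt", True),
    ("AudioPipelineMonitor()", False),
    ("voicemail_detector", True),
    ("input_guardrail", True),
    ("confidence_reask_processor", True),
    ("context_aggregator.user()", False),
    ("terminal_gate", True),
    ("workflow_processor", True),
    ("audio_playback", True),
    ("fallback_pipeline | llm", False),
    ("output_guardrail", True),
    ("voicemail_gate", True),
    ("call_action_processor", False),
    ("tts", True),
    ("aec_tap", True),
    ("SilenceFilterProcessor", True),
    ("transport.output()", False),
]


def assemble_order(enabled):
    """Return processor names in pipeline order, skipping optional ones not in `enabled`."""
    return [name for name, optional in PIPELINE if not optional or name in enabled]


order = assemble_order({"stt", "workflow_processor", "tts"})

# STT sits upstream of the user aggregator, which is why VAD start/stop frames
# have to be broadcast upstream as well as downstream for STT to see them.
assert order.index("stt") < order.index("context_aggregator.user()")
print(" -> ".join(order))
```

The only point of the model is positional: the two latency-owning processors are single entries in a fixed order, and STT comes before the aggregator that actually runs VAD.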

Step 2

The three overlap windows

What it is. The chain VAD-stop → turn-close → extract/route → (transition) → speaking LLM → sentence aggregation → TTS → audio out is strictly serial, except for three deliberate overlaps. If you sum the stages naively you will over-count by the size of these three windows.

1. STT finalization overlaps the waitSeconds wait. The turn-close timer and the STT finalize round-trip both run from VAD-stop; the turn pays the slower of the two, not the sum (pipecat turns/user_stop/speech_timeout_user_turn_stop_strategy.py:191-203,236-244).
2. Extraction overlaps the batched intent-match. Variable extraction and the routing intent call fire concurrently; the turn pays max(extraction, intent_match) — unless a transition condition references an extracted variable, which serializes them (src/core/workflow/processor.py:1445-1496; Part 4).
3. Node-entry TTS overlaps pre-actions and next-turn warmups. On a transition the entry message is spoken first; blocking API pre-actions plus prefix-cache warmups run behind the playback (processor.py:2653-2681,2788; Parts 5 and 11).

No speculative speaking start. Routing strictly precedes speaking. The routing decision is awaited at processor.py:1720 before the speaking context is pushed at processor.py:1832 — there is no early/parallel speaking-LLM kickoff.
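The second overlap is the one people most often get wrong when budgeting, so here is a small arithmetic sketch of it. It is not the logic in processor.py: route_stage_ms is a hypothetical helper, and the 300 ms extraction figure is an invented example value (the 480 ms intent match is the p50 quoted later in this part).

```python
def route_stage_ms(extraction_ms, intent_match_ms, condition_uses_extracted_var=False):
    """Wall-clock cost of the extract/route stage under the overlap rule above."""
    if condition_uses_extracted_var:
        # Routing has to wait for the extracted value, so the two calls serialize.
        return extraction_ms + intent_match_ms
    # Otherwise both calls are in flight together and the turn pays the slower one.
    return max(extraction_ms, intent_match_ms)


overlapped = route_stage_ms(300, 480)                                      # 480
serialized = route_stage_ms(300, 480, condition_uses_extracted_var=True)   # 780

print("overlapped:", overlapped, "ms")
print("serialized:", serialized, "ms")
print("naive sum over-counts the overlapped case by", (300 + 480) - overlapped, "ms")
```

The same max-not-sum shape applies to the first overlap (STT finalize versus the waitSeconds timer); only the serializing condition is specific to extraction and routing.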

Step 3

The turn-close formula

What it is. The exact rule for when the user's turn ends and LLM work may begin.

ttsConfig.startSpeakingPlan.waitSeconds → user_speech_timeout · runtime fallback 0.6 s · stop_secs default 0.2 s · batch STT p99 0.55 s

SpeechTimeoutUserTurnStopStrategy — speech_timeout_user_turn_stop_strategy.py:191-203,236-244

armed timer  = max(stt_ttfs_p99 − stop_secs, user_speech_timeout)
on finalized transcript:
   wait collapses to user_speech_timeout   # measured from the VAD-stop timestamp

user_speech_timeout = waitSeconds   # base_service.py:1565-1571, fallback 0.6 s

The consequence. With on-prem defaults (stop_secs = 0.2, batch STT p99 0.55, waitSeconds = 0.6) the armed timer is max(0.55 − 0.2, 0.6) = 0.6 s, so the turn closes 0.2 + 0.6 = 0.8 s after the caller's last syllable, every turn, by construction. A 100 ms STT is invisible behind the 0.6 s waitSeconds floor; making STT faster buys nothing unless waitSeconds drops too.
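The formula is easy to check by hand, but a few lines of code make it simple to try other settings. This is a sketch of the rule as stated above, not the pipecat implementation; the function names are hypothetical, and values are in integer milliseconds to keep the arithmetic exact.

```python
def armed_timer_ms(stt_ttfs_p99_ms, stop_secs_ms, user_speech_timeout_ms):
    """Timer armed at VAD-stop, per the formula above."""
    return max(stt_ttfs_p99_ms - stop_secs_ms, user_speech_timeout_ms)


def close_after_last_syllable_ms(stt_ttfs_p99_ms, stop_secs_ms, user_speech_timeout_ms):
    """Silence confirmation (stop_secs) plus the armed timer.

    This is the upper bound: a finalized transcript can only shorten the wait
    down to user_speech_timeout, never below it.
    """
    return stop_secs_ms + armed_timer_ms(stt_ttfs_p99_ms, stop_secs_ms, user_speech_timeout_ms)


# On-prem defaults from the text: batch STT p99 0.55 s, stop_secs 0.2 s, waitSeconds 0.6 s.
print(armed_timer_ms(550, 200, 600))                # 600
print(close_after_last_syllable_ms(550, 200, 600))  # 800

# Lowering waitSeconds to 0.2 s exposes the STT term instead.
print(armed_timer_ms(550, 200, 200))                # 350
```

With the defaults the waitSeconds term wins the max, which is why STT speed does not show up in the result until waitSeconds drops below stt_ttfs_p99 − stop_secs.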

Symptom it causes/fixes: “there's always a beat of silence before it answers” is usually this configured 0.8 s floor, not a slow model.

Step 4

What is NOT on the critical path

Things that look like latency stages but are deliberately off-path. Don't waste a tuning cycle on them:

Warmups (llm_warmup.node_entry / .router / .extractor) — background asyncio.create_task prefix-cache primes during entry-message playback (processor.py:2788-3011). Their absence puts a cold prefill on the next turn (Part 5).
Langfuse export — spans go through a BatchSpanProcessor on a separate thread, explicitly to remove the audio jitter the old synchronous exporter caused (src/core/tracing.py:93-106).
Idle handler — held during routing via _hold_idle("workflow_route_llm") (processor.py:1265,1279) so slow routing can't trigger a false “are you there”.
Echo grace (botSpeechGraceSecs) — affects barge-in responsiveness only, never response start (boot_steps.py:2848-2867).
Trailing silence pad — 0.1 s appended after the utterance (src/exts/tts/freya_tts_v2.py:23,409-410) affects when the bot finishes, not when it starts.
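The warmup item is the clearest example of what "off-path" means in practice: the work is started with asyncio.create_task and the turn only awaits playback. The sketch below shows that pattern with stand-in coroutines. It is an assumption-laden illustration, not the code in processor.py: warm_prefix_cache, play_entry_message, and enter_node are hypothetical names, and the sleeps stand in for real LLM and TTS calls.

```python
import asyncio


async def warm_prefix_cache(events):
    # Stand-in for a prefix-cache prime against the LLM backend.
    await asyncio.sleep(0.01)
    events.append("warmup_done")


async def play_entry_message(events):
    # Stand-in for entry-message TTS playback.
    await asyncio.sleep(0.05)
    events.append("playback_done")


async def enter_node():
    events = []
    # Keep a reference to the task so it is not garbage-collected mid-flight.
    warmup = asyncio.create_task(warm_prefix_cache(events))
    events.append("warmup_scheduled")
    # Only playback is awaited on the turn path; the warmup runs behind it.
    await play_entry_message(events)
    # Demo only: wait for the background task so the run ends cleanly.
    await warmup
    return events


print(asyncio.run(enter_node()))
```

Because the warmup is shorter than the playback it finishes unnoticed; if it were skipped, nothing on this turn would change, and the cost would appear as a cold prefill on the next one.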

Step 5

A composed typical turn, with real numbers

On-prem defaults, warm cache, no node transition (KKB Gemma 4 31B, sources in Parts 2–9):

guide.md Part 1, Step 5 — warm turn, no transition

0.2 s   VAD silence confirmation (stop_secs)
        (STT finalize ~0.1–0.36 s, fully overlapped)
0.6 s   turn closes (waitSeconds floor from VAD-stop)
~0.5 s  max(extraction, intent-match)        intent p50 0.48 s
~0.2 s  speaking-LLM TTFT                     prefix cache warm: 0.15–0.21 s
~0.25 s first-sentence generation + aggregation
~0.27 s TTS TTFB (KKB wall, C=1)
−−−−−
≈ 1.6–1.9 s to first audio

This is consistent with the production p50 of ~1.5 s, which sits lower because some turns skip extraction/routing entirely. The single felt-latency metric is turn.user_bot_latency_seconds (VADUserStoppedSpeaking → BotStartedSpeaking, pipecat observers/user_bot_latency_observer.py:87-88,147); the span-by-span decomposition map is in Part 12.
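The budget can be re-added mechanically, which also shows how much the total depends on where the clock starts. The sketch below uses the stage values from the table in integer milliseconds; the stage labels and first_audio_ms are hypothetical names. Treating the headline range as measured from VAD-stop is our reading, based on turn.user_bot_latency_seconds starting at VADUserStoppedSpeaking; the table itself does not say.

```python
# Stage values copied from the warm-turn table, in milliseconds.
# STT finalize is omitted because it is fully overlapped by the wait floor.
STAGES_MS = [
    ("vad_stop_secs", 200),
    ("wait_floor", 600),
    ("route_max_extract_intent", 500),
    ("speaking_llm_ttft", 200),
    ("first_sentence", 250),
    ("tts_ttfb", 270),
]


def first_audio_ms(stages, from_vad_stop=True):
    """Sum the serial stages, optionally starting the clock at VAD-stop."""
    skip = {"vad_stop_secs"} if from_vad_stop else set()
    return sum(ms for name, ms in stages if name not in skip)


print(first_audio_ms(STAGES_MS, from_vad_stop=True))   # 1820
print(first_audio_ms(STAGES_MS, from_vad_stop=False))  # 2020
```

Measured from VAD-stop the listed stages come to about 1.8 s, inside the quoted range; measured from the caller's last syllable they come to about 2.0 s, because stop_secs is added on top.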

Double-counting warning. workflow.routing.batch contains extraction time (extraction is awaited inside it). Never add workflow.extraction.batch on top when summing a turn's budget.
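If you sum spans from a trace in a script, the double-counting rule is worth encoding rather than remembering. A minimal sketch, assuming spans arrive as a name-to-milliseconds mapping: turn_budget_ms is a hypothetical helper, the durations are invented example values, and "speaking_llm.ttft" is a placeholder span name, not one taken from the tracing code.

```python
def turn_budget_ms(spans):
    """Sum span durations, skipping extraction when routing already contains it."""
    total = 0
    for name, ms in spans.items():
        if name == "workflow.extraction.batch" and "workflow.routing.batch" in spans:
            # Extraction is awaited inside the routing span; adding it again
            # would count the same wall-clock time twice.
            continue
        total += ms
    return total


spans = {
    "workflow.routing.batch": 500,     # already includes the extraction time
    "workflow.extraction.batch": 300,
    "speaking_llm.ttft": 200,          # placeholder span name for illustration
}

print(turn_budget_ms(spans))   # 700
print(sum(spans.values()))     # 1000, the double-counted figure
```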

[Figure: The warm turn on the clock (overlaps shown as ghost lanes). Lanes: VAD, STT, wait floor, LLM (route + speak), TTS.]

What dominates here: the configured timers (0.2 s stop_secs + 0.6 s waitSeconds floor) plus the LLM stage; everything else is rounding error on a warm turn.

Checkpoint: STT on your on-prem box finalizes in 120 ms. You drop in a 2× faster STT model. How much faster does the turn get?

Zero. Once a finalized transcript exists, the turn closes at VAD-stop + max(stt_final, waitSeconds); with the runtime default waitSeconds = 0.6, any STT that finalizes less than ~600 ms after VAD-stop is fully hidden behind the wait floor (pipecat speech_timeout_user_turn_stop_strategy.py:236-244). To cash in the faster STT you must also lower waitSeconds. The turn-close race visualizer below shows the same thing: with STT below the wait line, the armed timer ignores it.
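The checkpoint answer can be verified in a few lines. This is a sketch of the rule quoted in the answer, not the pipecat strategy class: turn_close_from_vad_stop_ms is a hypothetical helper, and it assumes the 120 ms finalize time is measured from VAD-stop.

```python
def turn_close_from_vad_stop_ms(stt_final_ms, wait_ms=600):
    """Turn-close time after VAD-stop once a finalized transcript exists."""
    return max(stt_final_ms, wait_ms)


# Default waitSeconds = 0.6 s: halving STT from 120 ms to 60 ms changes nothing.
print(turn_close_from_vad_stop_ms(120))  # 600
print(turn_close_from_vad_stop_ms(60))   # 600

# With waitSeconds lowered to 0.1 s the faster STT finally shows up.
print(turn_close_from_vad_stop_ms(120, wait_ms=100))  # 120
print(turn_close_from_vad_stop_ms(60, wait_ms=100))   # 100
```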

[Interactive widget: turn-timeline composer. Compose one turn and read “first audio at X ms”. Sliders: VAD silence (stop_secs, default 0.20 s); wait seconds (waitSeconds, default 0.60 s); first-sentence length (chars); intent leaves (routing branch count). Toggles, as labelled on the page: Streaming STT (else batch p99 0.55) Extraction on g4 routing fires (LLM intent match) Warm prefix cache. Lanes: VAD, STT, wait, LLM, TTS.]

[Interactive widget: turn-close race. VAD-stop arms max(waitSeconds, sttP99 − stop_secs); the widget shows which term wins. Sliders: STT p99 finalize (default 0.55 s); wait seconds (waitSeconds, default 0.60 s); stop secs (stop_secs, default 0.20 s).]

[Interactive widget: Langfuse trace decoder. Pick a canned span list; reconstruct the per-stage waterfall and the felt-latency sum.]

Ask Claude Code: “In pipecat-agent, show me every place waitSeconds and stop_secs end up in the turn-close timer, with file:line, and confirm workflow.routing.batch awaits extraction inside it.”