The pipeline, as actually assembled turn
What it is. Before any knob makes sense you need the literal, sequential
order in which frames flow through the voice pipeline. assemble_pipeline_step
builds it at pipecat-agent src/core/boot_steps.py:2969-2993. Every
element below is a real processor in the list, in order.
transport.input()             :2970  audio in
[FreyaDTMFAggregator]         :2971  keypad digits
[stt]                         :2972  FreyaSTTService (batch) | streaming
AudioPipelineMonitor()        :2973  diagnostics; emits the "vad" span
[voicemail_detector]          :2974
[input_guardrail]             :2975
[confidence_reask_processor]  :2976
context_aggregator.user()     :2977  VAD + turn start/stop live HERE
[terminal_gate]               :2978
[workflow_processor]          :2979  StatefulWorkflowProcessor: extract + route
[audio_playback]              :2980
fallback_pipeline | llm       :2981  the SPEAKING LLM service
[output_guardrail]            :2982
[voicemail_gate]              :2983
call_action_processor         :2984  echo grace + call actions
[tts]                         :2985  FreyaTTSService (sentence aggregation)
[aec_tap]                     :2986
[SilenceFilterProcessor]      :2987  on-prem only
transport.output()            :2988  audio out
VAD is not a separate processor in this list. The analyzer is passed in as
LLMUserAggregatorParams.vad_analyzer (src/services/base_service.py:1684,1690)
and run by a VADController inside LLMUserAggregator
(pipecat processors/aggregators/llm_response_universal.py:476-488).
VAD start/stop frames are broadcast both upstream and downstream
(pipecat audio/vad/vad_controller.py:237-241) — that is how the STT
service, which sits upstream of the aggregator, learns when to finalize.
Why it matters for latency: the two stages that own most of the
visible delay (the VAD timers and the LLM bundle) live at two specific processors
— context_aggregator.user() and workflow_processor —
not spread across the whole chain.
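A toy model of the two-way broadcast described above, assuming a flat list of processors and direct delivery. ToyProcessor and broadcast are hypothetical names for illustration and are not pipecat APIs; real frames travel hop by hop through linked processors. The sketch only shows why a processor upstream of context_aggregator.user() still sees the VAD-stop frame.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProcessor:
    """Hypothetical stand-in for a pipeline processor; records frames it sees."""
    name: str
    seen: list = field(default_factory=list)

    def receive(self, frame: str, direction: str) -> None:
        self.seen.append((frame, direction))


def broadcast(chain: list, origin: str, frame: str) -> None:
    """Deliver `frame` from `origin` to every other processor in the chain."""
    idx = [p.name for p in chain].index(origin)
    for p in chain[:idx]:          # processors before the origin
        p.receive(frame, "upstream")
    for p in chain[idx + 1:]:      # processors after the origin
        p.receive(frame, "downstream")


names = ("transport.input", "stt", "context_aggregator.user",
         "workflow_processor", "llm")
chain = [ToyProcessor(n) for n in names]
broadcast(chain, "context_aggregator.user", "VADUserStoppedSpeaking")

stt = chain[1]
print(stt.seen)  # [('VADUserStoppedSpeaking', 'upstream')]
```

If the frame were pushed downstream only, stt.seen would stay empty and the STT service would have no signal to finalize on.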
The three overlap windows parallel
What it is. The chain
VAD-stop → turn-close → extract/route → (transition) → speaking LLM → sentence aggregation → TTS → audio out
is strictly serial, except for three deliberate overlaps. If you sum the stages
naively you will over-count by the size of these three windows.
1. STT finalization overlaps the waitSeconds wait. The turn-close timer and the STT finalize round-trip both run from VAD-stop; the turn pays the slower of the two, not the sum (pipecat turns/user_stop/speech_timeout_user_turn_stop_strategy.py:191-203,236-244).
2. Extraction overlaps the batched intent-match. Variable extraction and the routing intent call fire concurrently; the turn pays max(extraction, intent_match), unless a transition condition references an extracted variable, which serializes them (src/core/workflow/processor.py:1445-1496; Part 4).
3. Node-entry TTS overlaps pre-actions and next-turn warmups. On a transition the entry message is spoken first; blocking API pre-actions plus prefix-cache warmups run behind the playback (processor.py:2653-2681,2788; Parts 5 and 11).
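The second window can be written as a small cost rule. This is a minimal sketch, not the processor's code: routing_stage_seconds is a hypothetical helper, and the 0.35 s extraction time is an illustrative value (only the 0.48 s intent p50 comes from this guide).

```python
def routing_stage_seconds(extraction_s: float,
                          intent_match_s: float,
                          condition_uses_extracted_var: bool) -> float:
    """Wall-clock cost of the extract/route stage for one turn."""
    if condition_uses_extracted_var:
        # Routing must wait for the extracted value: the calls serialize.
        return extraction_s + intent_match_s
    # Otherwise both calls fire together and the turn pays the slower one.
    return max(extraction_s, intent_match_s)


overlapped = routing_stage_seconds(0.35, 0.48, condition_uses_extracted_var=False)
serialized = routing_stage_seconds(0.35, 0.48, condition_uses_extracted_var=True)
print(f"overlapped={overlapped:.2f}s serialized={serialized:.2f}s")
```

Under these assumed numbers a single transition condition that reads an extracted variable adds the full extraction time back onto the turn.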
processor.py:1720 before the speaking context is pushed at
processor.py:1832; there is no early/parallel speaking-LLM kickoff.
The turn-close formula endpointing
What it is. The exact rule for when the user's turn ends and LLM work may begin.
armed timer = max(stt_ttfs_p99 − stop_secs, user_speech_timeout)
on finalized transcript: wait collapses to user_speech_timeout   # measured from the VAD-stop timestamp
user_speech_timeout = waitSeconds   # base_service.py:1565-1571, fallback 0.6 s
The consequence. With on-prem defaults
(stop_secs = 0.2, batch STT p99 0.55, waitSeconds = 0.6)
the turn closes 0.8 s after the caller's last syllable, every turn, by
construction. A 100 ms STT is invisible behind the 0.6 s waitSeconds
floor; making STT faster buys nothing unless waitSeconds drops too.
Symptom it causes/fixes: “there's always a beat of silence before it answers” is usually this configured 0.8 s floor, not a slow model.
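The formula and its 0.8 s consequence can be checked with a few lines of arithmetic. This is a sketch of the rule as stated above, not the pipecat strategy itself; the function names are hypothetical, and the 1.0 s p99 in the second call is an illustrative value.

```python
def armed_timer_s(stt_ttfs_p99: float, stop_secs: float,
                  user_speech_timeout: float) -> float:
    """Timer armed at VAD-stop, per the formula in this section."""
    return max(stt_ttfs_p99 - stop_secs, user_speech_timeout)


def close_after_last_syllable_s(stt_ttfs_p99: float = 0.55,
                                stop_secs: float = 0.2,
                                wait_seconds: float = 0.6) -> float:
    """Seconds from the caller's last syllable to turn close."""
    # VAD-stop fires stop_secs after speech ends; the timer runs from there.
    return stop_secs + armed_timer_s(stt_ttfs_p99, stop_secs, wait_seconds)


defaults = close_after_last_syllable_s()
slow_stt = close_after_last_syllable_s(stt_ttfs_p99=1.0)
print(f"defaults={defaults:.2f}s slow_stt={slow_stt:.2f}s")
```

With the on-prem defaults the waitSeconds term wins the max; the STT term only takes over once the p99 exceeds stop_secs + waitSeconds.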
What is NOT on the critical path off-path
Things that look like latency stages but are deliberately off-path. Don't waste a tuning cycle on them:
- Warmups (llm_warmup.node_entry/.router/.extractor): background asyncio.create_task prefix-cache primes during entry-message playback (processor.py:2788-3011). Their absence puts a cold prefill on the next turn (Part 5).
- Langfuse export: spans go through a BatchSpanProcessor on a separate thread, explicitly to remove the audio jitter the old synchronous exporter caused (src/core/tracing.py:93-106).
- Idle handler: held during routing via _hold_idle("workflow_route_llm") (processor.py:1265,1279) so slow routing can't trigger a false “are you there”.
- Echo grace (botSpeechGraceSecs): affects barge-in responsiveness only, never response start (boot_steps.py:2848-2867).
- Trailing silence pad: 0.1 s appended after the utterance (src/exts/tts/freya_tts_v2.py:23,409-410) affects when the bot finishes, not when it starts.
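The warmup item relies on a standard asyncio pattern: start the work with asyncio.create_task and keep awaiting the thing the caller actually hears. A minimal sketch follows; warmup, play_entry_message, and enter_node are hypothetical stand-ins with made-up sleep durations, not the functions in processor.py.

```python
import asyncio


async def warmup(events: list) -> None:
    # Hypothetical stand-in for a prefix-cache prime request.
    await asyncio.sleep(0.01)
    events.append("warmup done")


async def play_entry_message(events: list) -> None:
    # Hypothetical stand-in for entry-message TTS playback.
    await asyncio.sleep(0.05)
    events.append("playback done")


async def enter_node() -> list:
    events: list = []
    # Keep a reference so the background task is not garbage-collected.
    warm_task = asyncio.create_task(warmup(events))
    await play_entry_message(events)  # the only wait the caller perceives
    await warm_task  # already finished; awaited so errors are not lost
    return events


events = asyncio.run(enter_node())
print(events)
```

With these sleep values the warmup completes while playback is still running, so it adds nothing to the time the caller waits.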
A composed typical turn, with real numbers budget
On-prem defaults, warm cache, no node transition (KKB Gemma 4 31B, sources in Parts 2–9):
0.2 s VAD silence confirmation (stop_secs)
(STT finalize ~0.1–0.36 s, fully overlapped)
0.6 s turn closes (waitSeconds floor from VAD-stop)
~0.5 s max(extraction, intent-match) intent p50 0.48 s
~0.2 s speaking-LLM TTFT prefix cache warm: 0.15–0.21 s
~0.25 s first-sentence generation + aggregation
~0.27 s TTS TTFB (KKB wall, C=1)
−−−−−
≈ 1.6–1.9 s to first audio
This matches the production p50 of ~1.5 s (some turns skip extraction/routing
entirely). The single felt-latency metric is
turn.user_bot_latency_seconds (VADUserStoppedSpeaking → BotStartedSpeaking,
pipecat observers/user_bot_latency_observer.py:87-88,147); the span-by-span
decomposition map is in Part 12.
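Summing the point estimates from the table is a quick sanity check on the stated range. The sketch below assumes the total is counted from VAD-stop, which is where turn.user_bot_latency_seconds starts; that reading is an assumption, so the stop_secs term is kept separate and both totals are shown.

```python
# Point estimates copied from the budget table above (seconds).
STOP_SECS = 0.20  # elapses before VAD-stop, so before the metric starts

after_vad_stop = {
    "turn close (waitSeconds floor)": 0.60,
    "max(extraction, intent-match)": 0.50,
    "speaking-LLM TTFT": 0.20,
    "first sentence + aggregation": 0.25,
    "TTS TTFB": 0.27,
}

from_vad_stop = sum(after_vad_stop.values())
from_last_syllable = STOP_SECS + from_vad_stop

print(f"from VAD-stop:      {from_vad_stop:.2f} s")
print(f"from last syllable: {from_last_syllable:.2f} s")
```

Counted from VAD-stop the point estimates give about 1.82 s, inside the 1.6–1.9 s range; counted from the caller's last syllable they give about 2.02 s.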
workflow.routing.batch contains extraction time (extraction is awaited
inside it). Never add workflow.extraction.batch on top when summing a
turn's budget.
What dominates here: the configured timers
(0.2 s stop_secs + 0.6 s waitSeconds floor) plus the LLM
stage; everything else is rounding error on a warm turn.
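One way to honor the double-counting rule when adding up spans by hand is to skip any span whose parent is already in the sum. A minimal sketch: sum_turn_spans and NESTED_IN are hypothetical, the durations are illustrative, and "tts.ttfb" is a made-up span name used only as a third entry.

```python
# Child span -> parent span that already includes its duration.
NESTED_IN = {"workflow.extraction.batch": "workflow.routing.batch"}


def sum_turn_spans(spans: dict) -> float:
    """Sum span durations, skipping children whose parent is present."""
    total = 0.0
    for name, seconds in spans.items():
        parent = NESTED_IN.get(name)
        if parent is not None and parent in spans:
            continue  # already counted inside the parent span
        total += seconds
    return total


spans = {
    "workflow.routing.batch": 0.50,
    "workflow.extraction.batch": 0.35,  # awaited inside routing.batch
    "tts.ttfb": 0.27,                   # hypothetical span name
}
total = sum_turn_spans(spans)
print(f"{total:.2f} s")
```

A naive sum of the same three values would report 1.12 s instead of 0.77 s.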
Checkpoint: STT on your on-prem box finalizes in 120 ms. You drop in a 2× faster STT model. How much faster does the turn get?
Zero. The turn closes at
VAD-stop + max(stt_final, waitSeconds) once a finalized transcript exists;
with the runtime default waitSeconds = 0.6, any STT that finalizes within
~600 ms of VAD-stop is fully hidden behind the wait floor
(pipecat speech_timeout_user_turn_stop_strategy.py:236-244). To cash in the
faster STT you must also lower waitSeconds. Try it in the race
visualizer below: drag STT below the wait line and watch the armed timer ignore it.
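The checkpoint answer in numbers, using the rule quoted above. close_delay_s is a hypothetical helper, and the 0.3 s waitSeconds is an illustrative setting, not a recommendation.

```python
def close_delay_s(stt_final_s: float, wait_seconds: float) -> float:
    """Delay from VAD-stop to turn close once a final transcript exists."""
    return max(stt_final_s, wait_seconds)


baseline = close_delay_s(0.12, 0.6)      # today's STT, default wait
faster_stt = close_delay_s(0.06, 0.6)    # 2x faster STT, same wait
faster_both = close_delay_s(0.06, 0.3)   # faster STT and a lower wait

saved_by_stt = baseline - faster_stt
saved_by_both = baseline - faster_both
print(f"STT only: {saved_by_stt:.2f} s saved; with lower wait: {saved_by_both:.2f} s saved")
```

The saving in the second case comes entirely from the lower wait; the STT speed-up only determines how low waitSeconds can go before STT becomes the slower branch.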
Try it: turn-timeline composer
Try it: turn-close race. max(waitSeconds, sttP99 − stop_secs): who wins?
Try it: Langfuse trace decoder