Part 2

VAD & endpointing: the dead-air tax

Before STT can finalize and before the LLM sees a token, the pipeline must decide the caller is done talking. Every knob in that decision is a deliberate wait that fires on every single turn — the cheapest place to win or lose 300 ms platform-wide.

Step 6

Stop Delay: the irreducible floor [VAD]

What it is. How much continuous silence VAD must accumulate before it declares end-of-speech. Nothing downstream can start earlier — this is the hard floor of every turn.

sttConfig.additionalSettings.vadStopSecs → VADParams.stop_secs; slider 0.05–1.0 s, step 0.05; default 0.2

Runtime effect. The Silero state machine consumes 32 ms chunks (512 samples @16 kHz, pipecat audio/vad/silero.py:191-197) and flips SPEAKING → QUIET only after stop_secs of consecutive non-speech chunks (pipecat vad_analyzer.py:237-242). Defaults live in pipecat-agent src/services/base_service.py:1671, dashboard src/components/config-panels/stt-config-panel.tsx:801-812, pipecat audio/vad/vad_analyzer.py:27.

Symptom: mid-sentence cutoffs → raise; sluggish response on every turn → lower. You cannot fix both at once; this is the central trade-off of the domain. Healthy endpointing target: 200–350 ms; anything above 500 ms is suspect.
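To make the chunk arithmetic concrete, here is a minimal sketch that converts stop_secs into a count of 32 ms chunks and the silence actually waited. The helper names and the round-up rule are assumptions for illustration, not the library's code; pipecat may round differently.

```python
import math

SAMPLE_RATE_HZ = 16_000  # from the text: 512 samples @ 16 kHz
CHUNK_SAMPLES = 512
CHUNK_SECS = CHUNK_SAMPLES / SAMPLE_RATE_HZ  # 0.032 s per chunk


def chunks_for_stop(stop_secs: float) -> int:
    """Consecutive non-speech chunks needed before end-of-speech.

    Assumption: round up, so at least stop_secs of silence has elapsed.
    """
    return math.ceil(stop_secs / CHUNK_SECS)


def effective_stop_ms(stop_secs: float) -> float:
    """Silence actually waited once stop_secs is quantized to whole chunks."""
    return chunks_for_stop(stop_secs) * CHUNK_SECS * 1000


for stop_secs in (0.2, 0.5, 1.0):
    print(stop_secs, chunks_for_stop(stop_secs), round(effective_stop_ms(stop_secs)))
```

Under this rounding assumption the 0.2 s default costs 7 chunks, or about 224 ms of silence, so the real floor is quantized to chunk boundaries rather than to the 0.05 s slider step.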

Step 6b

The other three VAD knobs: Confidence, Minimum Volume, Start Delay [VAD]

What they are. Stop Delay is one of four VAD knobs Freya builds into VADParams (base_service.py:1667-1672). The other three don't add a fixed timer the way Stop Delay does, but they decide whether a 32 ms chunk even counts as "speaking" — speaking = confidence >= params.confidence and volume >= params.min_volume (pipecat audio/vad/vad_analyzer.py:206) — so they drive the same failure modes Steps 6–7 discuss. A field engineer's first VAD question is "what about sensitivity / noise threshold?" — these are the answer. All three live in sttConfig.additionalSettings and are constrained only by the dashboard sliders (see Step 9b).

vadConfidence → VADParams.confidence; slider 0.05–1.0, step 0.05; default 0.7

Minimum Silero speech-probability for a chunk to count as voice (dashboard stt-config-panel.tsx:767-777; base_service.py:1668; pipecat vad_analyzer.py:25).

vadMinVolume → VADParams.min_volume; slider 0.0–1.0, step 0.05; default 0.4

Note the deliberate mismatch: Freya's fallback is 0.4, but the pipecat library default is 0.6 (vad_analyzer.py:28) — Freya runs a lower (more sensitive) floor on purpose (stt-config-panel.tsx:778-788; base_service.py:1669).

vadStartSecs → VADParams.start_secs; slider 0.05–1.0, step 0.05; default 0.2

Start Delay does not affect end-of-turn dead air — it gates how much speech is needed before VAD declares the turn STARTED, so it governs turn-start detection and the pure-VAD barge-in path (Step 10), and it shifts the streaming-STT preroll window (src/exts/stt/freya_v3.py:139, preroll_secs=0.3). A too-high Start Delay is the source of the "agent ignored my 'alo'" / didn't-hear-my-opening-word complaint (stt-config-panel.tsx:789-800; base_service.py:1670; vad_analyzer.py:26).

Latency framing (dossier 02, section 1): Confidence and Minimum Volume add no fixed latency, but both extremes hurt. Set too high, VAD is late or blind: a missed turn start makes the caller repeat themselves, and trailing low-energy speech is read as silence, which triggers an early endpoint and fragmentation (the same +600–1000 ms/fragment cost as Step 6). Set too low, false VAD-starts hold the turn open and cause false barge-ins. Start Delay likewise adds no end-of-turn dead air; it only delays turn-START detection.

Try it: VAD state machine, chunk by chunk

Silero scores one 32 ms chunk at a time. A chunk counts as voice only when confidence >= Confidence and volume >= Min Volume (vad_analyzer.py:206). Step through a real-shaped utterance with a breath in the middle and watch the silence counter reset — the reason a thinking-pause does not end the turn so long as Stop Delay is generous.

[Interactive widget: VAD state-machine stepper, 32 ms chunks. Legend: chunk counts as voice, chunk counts as silence, cursor. Initial state: QUIET at t = 0 ms.]

Watch for: when the cursor reaches the breath (a low-energy chunk that drops below your thresholds), the consecutive-silence counter restarts the moment a voice chunk follows, so the mid-thought pause does not end the turn. Raise Confidence/Min Volume high enough and trailing soft speech reads as silence — an early endpoint, the fragmentation failure mode.
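The stepper's behavior can be approximated offline. The sketch below is a simplified model, not pipecat's implementation: the Chunk type, the function names, and the round-up rule are assumptions, and Start Delay is ignored so that the first voice chunk starts the turn immediately.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

CHUNK_SECS = 512 / 16_000  # one 32 ms chunk


@dataclass(frozen=True)
class Chunk:
    confidence: float  # Silero speech probability for this chunk
    volume: float      # normalized volume for this chunk


def is_voice(chunk: Chunk, confidence: float = 0.7, min_volume: float = 0.4) -> bool:
    """The predicate quoted in the text: both thresholds must pass."""
    return chunk.confidence >= confidence and chunk.volume >= min_volume


def end_of_speech_index(
    chunks: Sequence[Chunk],
    stop_secs: float = 0.2,
    confidence: float = 0.7,
    min_volume: float = 0.4,
) -> Optional[int]:
    """Index of the chunk at which SPEAKING -> QUIET fires, or None."""
    needed = math.ceil(stop_secs / CHUNK_SECS)  # assumption: round up
    speaking = False
    silent_run = 0
    for i, chunk in enumerate(chunks):
        if is_voice(chunk, confidence, min_volume):
            speaking = True  # simplification: Start Delay is ignored
            silent_run = 0   # any voice chunk resets the silence counter
        elif speaking:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


VOICE = Chunk(confidence=0.9, volume=0.6)
BREATH = Chunk(confidence=0.2, volume=0.1)
# 10 voice chunks, a 4-chunk breath (128 ms), 10 more voice chunks, then silence.
utterance = [VOICE] * 10 + [BREATH] * 4 + [VOICE] * 10 + [BREATH] * 10

print(end_of_speech_index(utterance))                 # breath survives
print(end_of_speech_index(utterance, stop_secs=0.1))  # breath ends the turn
```

With the default 0.2 s the 128 ms breath is absorbed and the turn ends only in the trailing silence; at 0.1 s the same breath closes the turn mid-thought. A soft chunk such as Chunk(0.75, 0.3) fails the volume test at the 0.4 floor, which is how trailing quiet speech turns into an early endpoint.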

Step 7

Wait seconds: the resume window and the dead-air equation [VAD]

What it is. After VAD-stop, how long the bot waits for the caller to resume before claiming the turn.

ttsConfig.startSpeakingPlan.waitSeconds → user_speech_timeout; range 0–5 s, step 0.1; zod default 0.4, runtime fallback 0.6

Runtime effect. Becomes the SpeechTimeoutUserTurnStopStrategy timeout (base_service.py:1658). It is a floor counted from VAD-stop, concurrent with STT finalization — not additive. The zod default is 0.4 but the runtime fallback is 0.6 when unset (base_service.py:1565-1571; zod src/app/api/v2/agents/validators.ts:60) — a known schema-vs-runtime mismatch.

The dead-air equation: per-turn endpointing dead air = stop_secs + max(stt_final_after_vad_stop, waitSeconds). That is 0.8 s with runtime defaults and ~1.1 s with production stop_secs=0.5.

The coupling trap: the strategy waits max(ttfs_p99 − stop_secs, waitSeconds). With on-prem p99 0.55 and waitSeconds 0.6, raising stop_secs 0.2 → 0.5 moves the STT allowance 0.35 → 0.05, but the gate is still 0.6. The +0.3 s is a pure loss; nothing is "bought back" (pipecat speech_timeout_user_turn_stop_strategy.py:196-203).
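The two formulas above can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names; it treats the STT allowance (ttfs_p99 − stop_secs) as the stt_final_after_vad_stop term, which is an assumption made to connect the two callouts.

```python
def strategy_wait(ttfs_p99: float, stop_secs: float, wait_seconds: float) -> float:
    """Time the stop strategy waits after VAD-stop, per the coupling-trap formula."""
    return max(ttfs_p99 - stop_secs, wait_seconds)


def dead_air(stop_secs: float, stt_final_after_vad_stop: float, wait_seconds: float) -> float:
    """Per-turn endpointing dead air, per the dead-air equation."""
    return stop_secs + max(stt_final_after_vad_stop, wait_seconds)


TTFS_P99 = 0.55      # on-prem p99 quoted in the text
WAIT_SECONDS = 0.6   # runtime fallback

for stop_secs in (0.2, 0.5):
    allowance = TTFS_P99 - stop_secs
    gate = strategy_wait(TTFS_P99, stop_secs, WAIT_SECONDS)
    total = dead_air(stop_secs, allowance, WAIT_SECONDS)
    print(f"stop_secs={stop_secs}: allowance={allowance:.2f} gate={gate:.2f} dead_air={total:.2f}")
```

In both rows the gate stays at 0.6 s because waitSeconds dominates the shrinking allowance, so the whole 0.3 s increase in stop_secs lands on the caller.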

Symptom: "it jumps in before I finish" → raise; "awkward pause before every answer" → lower toward 0.3–0.4 (and verify mid-sentence cutoffs don't return).

Try it: the turn timeline and the cutoff failure mode

Drag stop_secs and waitSeconds, pick an STT-p99, and watch the dead-air segment grow. Then move the "caller resumes after" marker past stop_secs to trigger the cutoff/fragmentation failure — the caller paused longer than the silence floor, so VAD already declared the turn over and the resumed speech becomes a separate fragment.

[Interactive widget: Turn timeline, dead air and the resume cutoff. Legend: stop_secs (silence floor), STT finalization, waitSeconds gate, turn handed to LLM.]

Step 8

Smart Turn's asymmetric bet [VAD]

What it is. An ONNX model that predicts end-of-turn from prosody instead of pure silence timing.

sttConfig.additionalSettings.smartTurnEnabled: default false; disableSmartTurnOnDigitNodes: default true (dashboard stt-config-panel.tsx:731-757)

Runtime effect. Wraps a TurnAnalyzerUserTurnStopStrategy(LocalSmartTurnAnalyzerV3()) and a VAD-only fallback in a per-node NodeAwareUserTurnStopStrategy (base_service.py:1641-1651; src/core/strategies/node_aware_turn_stop_strategy.py:46). On VAD-stop it runs one CPU ONNX inference (measured as e2e_processing_time_ms, pipecat audio/turn/smart_turn/base_smart_turn.py:217-227; no recorded production value — not verified in source).

The asymmetry is why it ships off: a good prediction saves ~0.5 s; a bad one costs ~2.4 s. That lopsided bet is why Smart Turn is default-false and auto-disabled on digit nodes.
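The lopsided payoff implies a break-even accuracy, which a short calculation makes explicit. Only the ~0.5 s saving and ~2.4 s cost come from the text; the accuracy values tried below are hypothetical, and the model assumes every turn is either a clean hit or a clean miss.

```python
SAVE_SECS = 0.5  # latency saved by a correct prediction (from the text)
COST_SECS = 2.4  # latency added by a wrong prediction (from the text)


def expected_gain_secs(accuracy: float) -> float:
    """Average per-turn latency change; positive means Smart Turn helps."""
    return accuracy * SAVE_SECS - (1.0 - accuracy) * COST_SECS


# Solve accuracy * SAVE = (1 - accuracy) * COST for accuracy.
BREAK_EVEN = COST_SECS / (SAVE_SECS + COST_SECS)

print(f"break-even accuracy: {BREAK_EVEN:.3f}")
for accuracy in (0.80, 0.90, 0.95):  # hypothetical accuracies
    print(f"accuracy={accuracy:.2f}: expected gain {expected_gain_secs(accuracy):+.2f} s/turn")
```

Under these assumptions the model has to be right on roughly 83% of turns just to break even on average latency, before counting how much worse a multi-second hang feels than a half-second saving.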

Symptom it fixes: callers who pause mid-thought getting cut off. Symptom it causes: occasional 3 s hangs after a turn the model misread.

Step 9

The backstops: Turn Stop Timeout (the slider that lies) and finalizeTimeout [VAD]

Turn Stop Timeout (sttConfig.additionalSettings.userTurnStopTimeout): slider 0.5–5.0 s in the dashboard, but effectively decorative in voice mode, because the runtime clamps it up to at least max(30.0, finalize_timeout × 10) (_VOICE_BACKSTOP_FLOOR, base_service.py:148,1594-1616). Routine turn-close is fully covered by the stop strategy; a low backstop was the foot-gun that dropped slow-STT turns on call e894a846 (2026-05-20). You can only raise it above 30 s, never lower it.
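A small sketch shows why every slider position ends up at the same value. It assumes the clamp behaves as a floor, that is, effective = max(configured, floor), which matches "only raise, never lower" but is not verified against the source; the function name is hypothetical.

```python
VOICE_BACKSTOP_FLOOR_SECS = 30.0  # the 30 s floor quoted in the text


def effective_turn_stop_timeout(configured_secs: float, finalize_timeout_secs: float = 3.0) -> float:
    """Backstop actually used in voice mode (assumed floor semantics)."""
    floor = max(VOICE_BACKSTOP_FLOOR_SECS, finalize_timeout_secs * 10)
    return max(configured_secs, floor)


for configured in (0.5, 5.0, 45.0):
    print(configured, "->", effective_turn_stop_timeout(configured))
```

Both ends of the dashboard range (0.5 s and 5.0 s) resolve to 30 s, and only a value above the floor, reachable via raw config rather than the slider, changes the result.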

finalizeTimeout (no dashboard slider, runtime-only key); default 3.0 s

finalizeTimeout (sttConfig.additionalSettings.finalizeTimeout) — default 3.0 s (src/exts/stt/_finalize_timeout.py:30,33-45). When VAD said "stopped" but STT delivers nothing (cough, glitch), the timer fires, force-closes the turn, and re-prompts with a localized "could you repeat" (:80-128). Transcripts arriving after the timer are dropped to avoid duplicate turns (2026-05-10 incident).

Latency signature: ~3 s of dead air followed by "could you repeat" is the STT failure path, not LLM slowness. Check STT health and GPU contention, not the model; the LLM never saw the turn.

Step 9b

The range-validation foot-gun: sliders guard the UI, not the API [gotcha]

Every VAD/endpointing knob in this part — vadStopSecs, vadStartSecs, vadConfidence, vadMinVolume, finalizeTimeout, userTurnStopTimeout — lives inside the sttConfig.additionalSettings record, which the dashboard validates as z.record(z.string(), z.any()) (freya-dashboard src/app/api/v2/agents/validators.ts:33,107,138). There is no per-key range rule for any of them — the only superRefine on that record caps keyword/keyterm count, nothing else (validators.ts:140-160).

The "why is THIS one agent slow" surprise: the dashboard sliders (e.g. Stop Delay max 1.0 s) are the only guardrail. A config written via the API or raw additionalSettings JSON can carry an out-of-range value (e.g. vadStopSecs: 2.0), and the runtime applies it verbatim, silently adding seconds of dead air to every turn. Check the agent's raw additionalSettings, not just the slider positions. (waitSeconds is the exception: it is a separately validated field at validators.ts:60, min(0).max(5), not part of the passthrough record.)

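Since nothing server-side enforces the ranges, an audit has to be done by hand or by script. The sketch below is a hypothetical helper, not part of the dashboard or runtime; the ranges are the slider bounds quoted in this part, and finalizeTimeout is omitted because it has no slider to compare against.

```python
from typing import Any, Dict, Tuple

# Slider bounds quoted in the text (key -> (min, max)).
SLIDER_RANGES: Dict[str, Tuple[float, float]] = {
    "vadStopSecs": (0.05, 1.0),
    "vadStartSecs": (0.05, 1.0),
    "vadConfidence": (0.05, 1.0),
    "vadMinVolume": (0.0, 1.0),
    "userTurnStopTimeout": (0.5, 5.0),
}


def out_of_range(additional_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return the keys whose values fall outside the dashboard slider range."""
    findings: Dict[str, Any] = {}
    for key, (low, high) in SLIDER_RANGES.items():
        if key not in additional_settings:
            continue  # unset keys fall back to runtime defaults
        value = additional_settings[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            findings[key] = value
    return findings


raw = {"vadStopSecs": 2.0, "vadConfidence": 0.7, "streaming": True}
print(out_of_range(raw))
```

Run against an agent's raw additionalSettings, this flags the vadStopSecs: 2.0 case from the callout while leaving in-range and unrelated keys alone.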
Step 10

Barge-in latency has different physics [VAD]

Interrupting the bot is a separate path from normal turn-taking, governed by the Stop Speaking Plan:

Dead knobs, re-verified at 5b29206c: stopSpeakingPlan.voiceSeconds and stopSpeakingPlan.backOffSeconds still have no runtime consumer, so don't tune them expecting latency changes.

Try it: the barge-in race

When the caller interrupts, two triggers compete: pure-VAD start (fires ~0.2 s after voice onset) and the word-gated min-words strategy (needs numberOfWords words). With streaming STT the word count comes from interim partials during speech; with batch STT it can't arrive until after the full utterance + stop_secs + the HTTP roundtrip. Pick the config and see which trigger wins — and how long the bot talks over the caller (the double-talk readout).

[Interactive widget: Barge-in race simulator, which trigger wins. Legend: VAD-start trigger, word-gate / STT grace, mute window, bot still speaking (double-talk). Readout: double-talk.]

Checkpoint: an on-prem deployment runs batch STT with numberOfWords=2. Callers complain the bot talks over them for seconds when they try to interrupt. What are the two structural fixes?

Either drop numberOfWords to 0 (pure-VAD barge-in fires ~0.2 s after voice onset, no transcript needed) — at the cost of noise sensitivity — or switch to streaming STT (additionalSettings.streaming), whose interim partials let the min-words strategy count words during speech.

With batch STT and word gating, the interruption can mathematically only fire after the full utterance + stop_secs + STT roundtrip (pipecat services/stt_service.py:596-605).
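The race can be put into rough numbers. This sketch models only the three paths described above; the function name, the 1.5 s utterance, the 0.4 s STT roundtrip, and the 0.6 s time for a streaming partial to reach the word count are all hypothetical inputs, and the mute window and STT grace are not modeled.

```python
def barge_in_delay_secs(
    number_of_words: int,
    streaming: bool,
    *,
    start_secs: float = 0.2,          # Start Delay default from the text
    stop_secs: float = 0.2,           # Stop Delay default from the text
    utterance_secs: float = 1.5,      # hypothetical interruption length
    stt_roundtrip_secs: float = 0.4,  # hypothetical batch STT roundtrip
    partial_ready_secs: float = 0.6,  # hypothetical time to enough interim words
) -> float:
    """Seconds from voice onset until the interruption fires (double-talk time)."""
    if number_of_words == 0:
        return start_secs  # pure-VAD path: no transcript needed
    if streaming:
        return partial_ready_secs  # words counted from interim partials
    # Batch STT: transcript exists only after the utterance ends and is sent off.
    return utterance_secs + stop_secs + stt_roundtrip_secs


print("pure VAD:      ", barge_in_delay_secs(0, streaming=False))
print("streaming gate:", barge_in_delay_secs(2, streaming=True))
print("batch gate:    ", barge_in_delay_secs(2, streaming=False))
```

With these inputs the batch word gate leaves the bot talking over the caller for about 2.1 s, against 0.2 s for pure VAD, and the batch figure grows with the length of the interruption itself.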

Ask Claude Code: "Show me the agent's raw sttConfig.additionalSettings for this workspace — I want to check vadStopSecs, vadConfidence, vadMinVolume and finalizeTimeout against the dashboard slider ranges, because one agent is slow on every turn."

What dominates here

stop_secs + waitSeconds — a configured 0.8–1.1 s tax on every turn, and the only stage whose cost is 100% policy, 0% compute. The next part (STT) shows how that waitSeconds floor usually hides STT finalization — until shared-GPU load drags it back into view.