Stop Delay — the irreducible floor VAD
What it is. How much continuous silence VAD must accumulate before it declares end-of-speech. Nothing downstream can start earlier — this is the hard floor of every turn.
Runtime effect. The Silero state machine consumes 32 ms chunks (512 samples @16 kHz, pipecat audio/vad/silero.py:191-197) and flips SPEAKING → QUIET only after stop_secs of consecutive non-speech chunks (pipecat vad_analyzer.py:237-242). Defaults live in pipecat-agent src/services/base_service.py:1671, dashboard src/components/config-panels/stt-config-panel.tsx:801-812, pipecat audio/vad/vad_analyzer.py:27.
- When to change: it is a pure 1:1 dead-air dial. 0.2 → 0.5 adds exactly +300 ms to every turn (latency-analyzer references/interpreting.md:43-45).
- Why TR production runs 0.5 anyway: the alternative is worse. The Allianz 2026-05-08 batch showed 28% of turns fragmented by early endpointing on thinking pauses, each fragment adding ~600–1000 ms (notes/investigations/2026-05-08-allianz-1600-quality/_sections_5_10.md:175-179). Garanti, Anadolu, and Allianz tr-app2 all deliberately run 0.5 s (Garanti notes/investigations/garanti-stt-pr747/README.md:31).
Symptom: mid-sentence cutoffs → raise; sluggish every-turn response → lower. Both can't win — this is the central trade-off of the domain. Healthy endpointing target: 200–350 ms, suspect >500 ms.
The other three VAD knobs: Confidence, Minimum Volume, Start Delay VAD
What they are. Stop Delay is one of four VAD knobs Freya builds into VADParams (base_service.py:1667-1672). The other three don't add a fixed timer the way Stop Delay does, but they decide whether a 32 ms chunk even counts as "speaking" — speaking = confidence >= params.confidence and volume >= params.min_volume (pipecat audio/vad/vad_analyzer.py:206) — so they drive the same failure modes Steps 6–7 discuss. A field engineer's first VAD question is "what about sensitivity / noise threshold?" — these are the answer. All three live in sttConfig.additionalSettings and are constrained only by the dashboard sliders (see Step 9b).
Confidence. Minimum Silero speech-probability for a chunk to count as voice (dashboard stt-config-panel.tsx:767-777; base_service.py:1668; pipecat vad_analyzer.py:25).
Minimum Volume. Note the deliberate mismatch: Freya's fallback is 0.4, but the pipecat library default is 0.6 (vad_analyzer.py:28). Freya runs a lower (more sensitive) floor on purpose (stt-config-panel.tsx:778-788; base_service.py:1669).
Start Delay does not affect end-of-turn dead air — it gates how much speech is needed before VAD declares the turn STARTED, so it governs turn-start detection and the pure-VAD barge-in path (Step 10), and it shifts the streaming-STT preroll window (src/exts/stt/freya_v3.py:139, preroll_secs=0.3). A too-high Start Delay is the source of the "agent ignored my 'alo'" / didn't-hear-my-opening-word complaint (stt-config-panel.tsx:789-800; base_service.py:1670; vad_analyzer.py:26).
Try it: VAD state machine, chunk by chunk
Silero scores one 32 ms chunk at a time. A chunk counts as voice only when confidence >= Confidence and volume >= Min Volume (vad_analyzer.py:206). Step through a real-shaped utterance with a breath in the middle and watch the silence counter reset — the reason a thinking-pause does not end the turn so long as Stop Delay is generous.
Watch for: when the cursor reaches the breath (a low-energy chunk that drops below your thresholds), the consecutive-silence counter restarts the moment a voice chunk follows, so the mid-thought pause does not end the turn. Raise Confidence/Min Volume high enough and trailing soft speech reads as silence — an early endpoint, the fragmentation failure mode.
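The chunk-by-chunk behaviour described above can be approximated in a few lines of Python. This is a simplified sketch, not the pipecat implementation: `VadParams`, `run_vad`, and the synthetic `utterance` are hypothetical names, the confidence value 0.7 is purely illustrative, and frame counts are derived by rounding `secs / 0.032`, which is an assumption about how the thresholds map to chunks.

```python
from dataclasses import dataclass

CHUNK_SECS = 0.032  # one Silero chunk: 512 samples at 16 kHz


@dataclass
class VadParams:
    """Hypothetical stand-in for the four VAD knobs discussed above."""
    confidence: float
    min_volume: float
    start_secs: float
    stop_secs: float


def run_vad(chunks, params):
    """Replay (confidence, volume) chunks and return (chunk_index, event) pairs."""
    start_needed = max(1, round(params.start_secs / CHUNK_SECS))
    stop_needed = max(1, round(params.stop_secs / CHUNK_SECS))
    speaking = False
    voiced_run = 0   # consecutive chunks that counted as voice
    silent_run = 0   # consecutive chunks that did not
    events = []
    for i, (conf, vol) in enumerate(chunks):
        is_voice = conf >= params.confidence and vol >= params.min_volume
        if is_voice:
            voiced_run += 1
            silent_run = 0  # any voice chunk resets the silence counter
        else:
            silent_run += 1
            voiced_run = 0
        if not speaking and voiced_run >= start_needed:
            speaking = True
            events.append((i, "start"))
        elif speaking and silent_run >= stop_needed:
            speaking = False
            events.append((i, "stop"))
    return events


# Synthetic utterance: speech, a 0.256 s breath, more speech, trailing silence.
utterance = [(0.9, 0.8)] * 20 + [(0.1, 0.1)] * 8 + [(0.9, 0.8)] * 20 + [(0.0, 0.0)] * 20

generous = run_vad(utterance, VadParams(0.7, 0.4, 0.2, 0.5))
tight = run_vad(utterance, VadParams(0.7, 0.4, 0.2, 0.2))
print("stop_secs=0.5:", generous)
print("stop_secs=0.2:", tight)
```

With the generous Stop Delay the breath is shorter than the silence requirement, so the sketch reports a single start/stop pair. With the tight setting the same breath produces a stop followed by a second start, which is the fragmentation case.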
Wait seconds — the resume window and the dead-air equation VAD
What it is. After VAD-stop, how long the bot waits for the caller to resume before claiming the turn.
Runtime effect. Becomes the SpeechTimeoutUserTurnStopStrategy timeout (base_service.py:1658). It is a floor counted from VAD-stop, concurrent with STT finalization — not additive. The zod default is 0.4 but the runtime fallback is 0.6 when unset (base_service.py:1565-1571; zod src/app/api/v2/agents/validators.ts:60) — a known schema-vs-runtime mismatch.
Dead-air equation. Dead air ≈ stop_secs + max(stt_final_after_vad_stop, waitSeconds): ≈ 0.8 s with runtime defaults, ~1.1 s with production stop_secs=0.5.
The gate after VAD-stop is max(ttfs_p99 − stop_secs, waitSeconds). With on-prem p99 0.55 and waitSeconds 0.6, raising stop_secs 0.2 → 0.5 moves the STT allowance 0.35 → 0.05 but the gate is still 0.6, so the +0.3 s is a pure loss and nothing is "bought back" (pipecat speech_timeout_user_turn_stop_strategy.py:196-203).
Symptom: "it jumps in before I finish" → raise; "awkward pause before every answer" → lower toward 0.3–0.4 (and verify mid-sentence cutoffs don't return).
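The arithmetic in this section can be checked with a small Python sketch. The function names are hypothetical; the formulas and the input values (p99 0.55, waitSeconds 0.6, stop_secs 0.2 and 0.5) are the ones quoted above.

```python
def stt_allowance(ttfs_p99: float, stop_secs: float) -> float:
    """Time STT still needs after VAD-stop, given its p99 and the silence floor."""
    return ttfs_p99 - stop_secs


def gate_after_vad_stop(ttfs_p99: float, stop_secs: float, wait_seconds: float) -> float:
    """waitSeconds runs concurrently with STT finalization, so the longer one wins."""
    return max(stt_allowance(ttfs_p99, stop_secs), wait_seconds)


def dead_air(ttfs_p99: float, stop_secs: float, wait_seconds: float) -> float:
    """Silence floor plus whichever post-VAD-stop gate is longer."""
    return stop_secs + gate_after_vad_stop(ttfs_p99, stop_secs, wait_seconds)


for stop_secs in (0.2, 0.5):
    print(
        f"stop_secs={stop_secs}: "
        f"allowance={stt_allowance(0.55, stop_secs):.2f} s, "
        f"gate={gate_after_vad_stop(0.55, stop_secs, 0.6):.2f} s, "
        f"dead air={dead_air(0.55, stop_secs, 0.6):.2f} s"
    )
```

Because the gate stays pinned at waitSeconds in both cases, the whole 0.3 s increase in stop_secs lands in the dead-air total.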
Try it: the turn timeline and the cutoff failure mode
Drag stop_secs and waitSeconds, pick an STT-p99, and watch the dead-air segment grow. Then move the "caller resumes after" marker past stop_secs to trigger the cutoff/fragmentation failure — the caller paused longer than the silence floor, so VAD already declared the turn over and the resumed speech becomes a separate fragment.
Smart Turn's asymmetric bet VAD
What it is. An ONNX model that predicts end-of-turn from prosody instead of pure silence timing (dashboard stt-config-panel.tsx:731-757).
Runtime effect. Wraps a TurnAnalyzerUserTurnStopStrategy(LocalSmartTurnAnalyzerV3()) and a VAD-only fallback in a per-node NodeAwareUserTurnStopStrategy (base_service.py:1641-1651; src/core/strategies/node_aware_turn_stop_strategy.py:46). On VAD-stop it runs one CPU ONNX inference (measured as e2e_processing_time_ms, pipecat audio/turn/smart_turn/base_smart_turn.py:217-227; no recorded production value — not verified in source).
- Predicts COMPLETE → the turn closes as soon as a finalized transcript exists, skipping the waitSeconds floor entirely (pipecat turn_analyzer_user_turn_stop_strategy.py:259-276): saves ~0.4–0.6 s.
- Predicts INCOMPLETE → nothing closes the turn until silence reaches SmartTurnParams.stop_secs = 3 s (pipecat base_smart_turn.py:27,125-131): costs ~2.4 s per misjudgment.
Symptom it fixes: callers who pause mid-thought getting cut off. Symptom it causes: occasional 3 s hangs after a turn the model misread.
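One way to read the asymmetry is as a break-even error rate. The sketch below is an illustration under stated assumptions, not a measured result: it looks only at turns that really were complete, treats every wrong INCOMPLETE prediction as costing the full ~2.4 s, and ignores both the ONNX inference time and the benefit on genuine mid-thought pauses. The function names are hypothetical.

```python
def expected_delta(miss_rate: float, saving_secs: float, penalty_secs: float) -> float:
    """Average change in dead air per complete turn (negative means faster)."""
    return miss_rate * penalty_secs - (1.0 - miss_rate) * saving_secs


def break_even_miss_rate(saving_secs: float, penalty_secs: float) -> float:
    """Miss rate at which the savings and the 3 s hangs cancel out."""
    return saving_secs / (saving_secs + penalty_secs)


for saving in (0.4, 0.6):
    rate = break_even_miss_rate(saving, 2.4)
    print(f"saving={saving} s, penalty=2.4 s -> break-even miss rate {rate:.1%}")
```

Under these assumptions the break-even point falls at roughly 14% to 20% of complete turns misread as INCOMPLETE; above that, the occasional hangs outweigh the per-turn savings.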
The backstops: Turn Stop Timeout (the slider that lies) and finalizeTimeout VAD
Turn Stop Timeout (sttConfig.additionalSettings.userTurnStopTimeout) — slider 0.5–5.0 s in the dashboard, but effectively decorative in voice mode: runtime clamps it to max(30.0, finalize_timeout × 10) (_VOICE_BACKSTOP_FLOOR, base_service.py:148,1594-1616). Routine turn-close is fully covered by the stop strategy; a low backstop was the foot-gun that dropped slow-STT turns on call e894a846 (2026-05-20). You can only raise it above 30 s, never lower.
finalizeTimeout (sttConfig.additionalSettings.finalizeTimeout) — default 3.0 s (src/exts/stt/_finalize_timeout.py:30,33-45). When VAD said "stopped" but STT delivers nothing (cough, glitch), the timer fires, force-closes the turn, and re-prompts with a localized "could you repeat" (:80-128). Transcripts arriving after the timer are dropped to avoid duplicate turns (2026-05-10 incident).
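The clamp on Turn Stop Timeout can be sketched as follows. This is an assumption about the shape of the logic, based on the max(30.0, finalize_timeout × 10) expression and the "raise only" behaviour described above. The function name is hypothetical, and including the configured value in the `max` is an inference from the text rather than a quote from base_service.py.

```python
VOICE_BACKSTOP_FLOOR = 30.0  # seconds; value of the floor constant cited in the text


def effective_turn_stop_timeout(configured_secs: float, finalize_timeout_secs: float) -> float:
    """Backstop actually applied in voice mode: never below the floor or 10x finalizeTimeout."""
    return max(configured_secs, VOICE_BACKSTOP_FLOOR, finalize_timeout_secs * 10)


# Every position of the 0.5-5.0 s slider collapses to the same runtime value.
for slider in (0.5, 2.0, 5.0):
    print(slider, "->", effective_turn_stop_timeout(slider, 3.0))
```

Only a configured value above the floor, or a larger finalizeTimeout, changes the result.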
The range-validation foot-gun: sliders guard the UI, not the API gotcha
Every VAD/endpointing knob in this part — vadStopSecs, vadStartSecs, vadConfidence, vadMinVolume, finalizeTimeout, userTurnStopTimeout — lives inside the sttConfig.additionalSettings record, which the dashboard validates as z.record(z.string(), z.any()) (freya-dashboard src/app/api/v2/agents/validators.ts:33,107,138). There is no per-key range rule for any of them — the only superRefine on that record caps keyword/keyterm count, nothing else (validators.ts:140-160).
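Since nothing validates these keys server-side, an audit has to compare the raw record against the ranges by hand. A minimal sketch of such a check follows. `find_out_of_range` and `PLACEHOLDER_RANGES` are hypothetical, and the numeric range for `vadStopSecs` is a placeholder, not the real slider limit; substitute the limits from stt-config-panel.tsx before relying on it. Only the 0.5–5.0 s range for `userTurnStopTimeout` comes from this document.

```python
# Placeholder ranges: replace with the real dashboard slider limits.
PLACEHOLDER_RANGES = {
    "userTurnStopTimeout": (0.5, 5.0),  # slider range quoted in this document
    "vadStopSecs": (0.1, 1.0),          # hypothetical, for illustration only
}


def find_out_of_range(additional_settings: dict, ranges: dict) -> list:
    """Return (key, value, low, high) for every present key outside its range."""
    findings = []
    for key, (low, high) in ranges.items():
        if key not in additional_settings:
            continue  # unset keys fall back to runtime defaults
        value = additional_settings[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            findings.append((key, value, low, high))
    return findings


raw = {"vadStopSecs": 2.0, "userTurnStopTimeout": 3.0}
for key, value, low, high in find_out_of_range(raw, PLACEHOLDER_RANGES):
    print(f"{key}={value!r} is outside [{low}, {high}]")
```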
additionalSettings JSON can carry an out-of-range value (e.g. vadStopSecs: 2.0) and the runtime applies it verbatim, silently adding seconds of dead air to every turn. Check the agent's raw additionalSettings, not just the slider positions. (waitSeconds is the exception: a separately-validated field at validators.ts:60, min(0).max(5), not part of the passthrough record.)
Barge-in latency has different physics VAD
Interrupting the bot is a separate path from normal turn-taking, governed by the Stop Speaking Plan:
- Number of words = 0 (ttsConfig.stopSpeakingPlan.numberOfWords → DynamicVADUserTurnStartStrategy): interruption fires on VAD-start, ~start_secs (0.2 s) after voice onset, with no transcription needed (base_service.py:1544-1558; src/core/strategies/dynamic_vad_strategy.py:43).
- numberOfWords ≥ 1 + streaming STT: the min-words strategy counts interim partials (use_interim=True, pipecat turns/user_start/min_words_user_turn_start_strategy.py:31,69), so barge-in ≈ one partial-emission latency, sub-second.
- numberOfWords ≥ 1 + batch STT (the on-prem default): FreyaSTTService only transcribes after VAD-stop (pipecat services/stt_service.py:596-605), so a word-gated interruption can fire only after the caller's entire utterance + stop_secs + the HTTP STT roundtrip, and the bot talks over the caller that whole time. The single biggest hidden barge-in cost on batch deployments.
- min_words applies only while the bot is speaking; when silent it drops to 1 (pipecat min_words_user_turn_start_strategy.py:105), so raising it never slows normal turn-taking.
- Bot-speech grace window (sttConfig.additionalSettings.botSpeechGraceSecs, default 0 = off) suppresses barge-in for its window at each bot-utterance start via a mute holder that raises min_words to MUTED_MIN_WORDS = 99 (boot_steps.py:2848-2864; src/core/processors/call_action_trigger.py:564-568; src/core/strategies/dynamic_min_words_strategy.py:65). Anti-echo; costs only interruption responsiveness, never normal-turn dead air.
- DeepFilterNet (deepFilterEnabled + env NC_OPT_URL) puts a network hop inside every VAD decision: each chunk round-trips to nc-opt over WebSocket before Silero scores it (src/exts/audio/deepfilternet_vad.py:18,48-57). Server-side inference is ~3 ms p50 per 50 ms chunk on H100 (nc-opt README); the in-container RTT is unmeasured (not verified in source). Fail-open: VAD runs on raw audio if NC is down.
stopSpeakingPlan.voiceSeconds and stopSpeakingPlan.backOffSeconds still have no runtime consumer, so don't tune them expecting latency changes.
Try it: the barge-in race
When the caller interrupts, two triggers compete: pure-VAD start (fires ~0.2 s after voice onset) and the word-gated min-words strategy (needs numberOfWords words). With streaming STT the word count comes from interim partials during speech; with batch STT it can't arrive until after the full utterance + stop_secs + the HTTP roundtrip. Pick the config and see which trigger wins — and how long the bot talks over the caller (the double-talk readout).
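The race can be approximated with a small model of the three paths. This is a sketch under stated assumptions: `barge_in_delay` is a hypothetical helper, and the utterance length, partial latency, and STT roundtrip values in the example are made-up inputs, not measurements. Only the structure (VAD-start vs. interim partials vs. full utterance + stop_secs + roundtrip) follows the text.

```python
def barge_in_delay(
    number_of_words: int,
    stt_mode: str,
    *,
    start_secs: float,
    stop_secs: float,
    utterance_secs: float,
    partial_latency_secs: float,
    stt_roundtrip_secs: float,
) -> float:
    """Seconds from voice onset until the interruption fires (double-talk time)."""
    if number_of_words == 0:
        return start_secs  # pure-VAD path: no transcript needed
    if stt_mode == "streaming":
        return partial_latency_secs  # word count arrives via interim partials
    if stt_mode == "batch":
        # Transcription only starts after VAD-stop, so the whole utterance comes first.
        return utterance_secs + stop_secs + stt_roundtrip_secs
    raise ValueError(f"unknown stt_mode: {stt_mode!r}")


# Hypothetical inputs: a 2 s interruption, 0.5 s partials, 0.4 s HTTP roundtrip.
timing = dict(
    start_secs=0.2,
    stop_secs=0.5,
    utterance_secs=2.0,
    partial_latency_secs=0.5,
    stt_roundtrip_secs=0.4,
)
for words, mode in ((0, "batch"), (2, "streaming"), (2, "batch")):
    print(f"numberOfWords={words}, {mode}: {barge_in_delay(words, mode, **timing):.1f} s")
```

In this model the batch word-gated case is the only one whose delay grows with the length of the caller's interruption.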
Checkpoint: an on-prem deployment runs batch STT with numberOfWords=2. Callers complain the bot talks over them for seconds when they try to interrupt. What are the two structural fixes?
Either drop numberOfWords to 0 (pure-VAD barge-in fires ~0.2 s after voice onset, no transcript needed) — at the cost of noise sensitivity — or switch to streaming STT (additionalSettings.streaming), whose interim partials let the min-words strategy count words during speech.
With batch STT and word gating, the interruption can mathematically only fire after the full utterance + stop_secs + STT roundtrip (pipecat services/stt_service.py:596-605).
sttConfig.additionalSettings for this workspace: I want to check vadStopSecs, vadConfidence, vadMinVolume and finalizeTimeout against the dashboard slider ranges, because one agent is slow on every turn."
What dominates here
stop_secs + waitSeconds — a configured 0.8–1.1 s tax on every turn, and the only stage whose cost is 100% policy, 0% compute. The next part (STT) shows how that waitSeconds floor usually hides STT finalization — until shared-GPU load drags it back into view.