Part 2 — VAD & endpointing: the dead-air tax

Step 6

Stop Delay: the irreducible floor [VAD]

What it is. How much continuous silence VAD must accumulate before it declares end-of-speech. Nothing downstream can start earlier — this is the hard floor of every turn.

sttConfig.additionalSettings.vadStopSecs → VADParams.stop_secs · slider 0.05–1.0 s, step 0.05 · default 0.2

Runtime effect. The Silero state machine consumes 32 ms chunks (512 samples @16 kHz, pipecat audio/vad/silero.py:191-197) and flips SPEAKING → QUIET only after stop_secs of consecutive non-speech chunks (pipecat vad_analyzer.py:237-242). Defaults live in pipecat-agent src/services/base_service.py:1671, dashboard src/components/config-panels/stt-config-panel.tsx:801-812, pipecat audio/vad/vad_analyzer.py:27.

When to change: it is a pure 1:1 dead-air dial — 0.2 → 0.5 adds exactly +300 ms to every turn (latency-analyzer references/interpreting.md:43-45).
Why TR production runs 0.5 anyway: the alternative is worse. The Allianz 2026-05-08 batch showed 28% of turns fragmented by early endpointing on thinking pauses, each fragment adding ~600–1000 ms (notes/investigations/2026-05-08-allianz-1600-quality/_sections_5_10.md:175-179). Garanti, Anadolu, and Allianz tr-app2 all deliberately run 0.5 s (Garanti notes/investigations/garanti-stt-pr747/README.md:31).

Symptom: mid-sentence cutoffs → raise; sluggish every-turn response → lower. You cannot win both; this is the central trade-off of the domain. Healthy endpointing target: 200–350 ms; suspect above 500 ms.

Step 6b

The other three VAD knobs: Confidence, Minimum Volume, Start Delay [VAD]

What they are. Stop Delay is one of four VAD knobs Freya builds into VADParams (base_service.py:1667-1672). The other three don't add a fixed timer the way Stop Delay does, but they decide whether a 32 ms chunk even counts as "speaking" — speaking = confidence >= params.confidence and volume >= params.min_volume (pipecat audio/vad/vad_analyzer.py:206) — so they drive the same failure modes Steps 6–7 discuss. A field engineer's first VAD question is "what about sensitivity / noise threshold?" — these are the answer. All three live in sttConfig.additionalSettings and are constrained only by the dashboard sliders (see Step 9b).

vadConfidence → VADParams.confidence · slider 0.05–1.0, step 0.05 · default 0.7

Minimum Silero speech-probability for a chunk to count as voice (dashboard stt-config-panel.tsx:767-777; base_service.py:1668; pipecat vad_analyzer.py:25).

vadMinVolume → VADParams.min_volume · slider 0.0–1.0, step 0.05 · default 0.4

Note the deliberate mismatch: Freya's fallback is 0.4, but the pipecat library default is 0.6 (vad_analyzer.py:28) — Freya runs a lower (more sensitive) floor on purpose (stt-config-panel.tsx:778-788; base_service.py:1669).

vadStartSecs → VADParams.start_secs · slider 0.05–1.0, step 0.05 · default 0.2

Start Delay does not affect end-of-turn dead air — it gates how much speech is needed before VAD declares the turn STARTED, so it governs turn-start detection and the pure-VAD barge-in path (Step 10), and it shifts the streaming-STT preroll window (src/exts/stt/freya_v3.py:139, preroll_secs=0.3). A too-high Start Delay is the source of the "agent ignored my 'alo'" / didn't-hear-my-opening-word complaint (stt-config-panel.tsx:789-800; base_service.py:1670; vad_analyzer.py:26).

Latency framing (dossier 02, section 1). Confidence and Minimum Volume add no fixed latency, but set too high they make VAD late or blind: a missed turn start makes the caller repeat themselves, and trailing low-energy speech read as silence causes an early endpoint and fragmentation (the same +600–1000 ms/fragment cost as Step 6). Set too low, they produce false VAD-starts that hold the turn open and cause false barge-ins. Start Delay likewise adds no end-of-turn dead air; it only delays turn-start detection.
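To make the two-gate rule concrete, here is a minimal sketch of the per-chunk predicate quoted above. The `Thresholds` class and `counts_as_voice` function are illustrative names, not Freya or pipecat code; only the threshold values come from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative stand-in for the two VADParams fields discussed above."""
    confidence: float
    min_volume: float


def counts_as_voice(confidence: float, volume: float, t: Thresholds) -> bool:
    # Both gates must pass, mirroring the predicate quoted in the text.
    return confidence >= t.confidence and volume >= t.min_volume


FREYA_FALLBACK = Thresholds(confidence=0.7, min_volume=0.4)
PIPECAT_DEFAULT = Thresholds(confidence=0.7, min_volume=0.6)

# A soft trailing syllable: Silero is fairly sure it is speech, but it is quiet.
soft_chunk = (0.8, 0.5)

print(counts_as_voice(*soft_chunk, FREYA_FALLBACK))   # True
print(counts_as_voice(*soft_chunk, PIPECAT_DEFAULT))  # False
```

The same chunk is voice under Freya's 0.4 volume floor and silence under the library's 0.6, which is the practical meaning of "more sensitive".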

Try it: VAD state machine, chunk by chunk

Silero scores one 32 ms chunk at a time. A chunk counts as voice only when confidence >= Confidence and volume >= Min Volume (vad_analyzer.py:206). Step through a real-shaped utterance with a breath in the middle and watch the silence counter reset — the reason a thinking-pause does not end the turn so long as Stop Delay is generous.

[Interactive widget: VAD state-machine stepper, 32 ms chunks. Controls: Confidence threshold (0.70), Min Volume threshold (0.40), Stop Delay (0.20 s). Legend: chunk counts as voice, chunk counts as silence, cursor. Initial state: QUIET at t = 0 ms.]

Watch for: when the cursor reaches the breath (a low-energy chunk that drops below your thresholds), the consecutive-silence counter restarts the moment a voice chunk follows, so the mid-thought pause does not end the turn. Raise Confidence/Min Volume high enough and trailing soft speech reads as silence — an early endpoint, the fragmentation failure mode.
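For readers without the widget, the behaviour can be approximated offline. The following is a simplified two-state sketch, not pipecat's analyzer: it assumes thresholds are rounded to a whole number of 32 ms chunks and it omits any intermediate states the real state machine may have. `run_vad` and the sample chunk values are hypothetical.

```python
CHUNK_MS = 512 * 1000 // 16_000  # 32 ms per Silero chunk, as stated above


def chunks_for(secs: float) -> int:
    # Assumption: a threshold in seconds is rounded to whole chunks.
    return max(1, round(secs * 1000 / CHUNK_MS))


def run_vad(chunks, confidence=0.7, min_volume=0.4, start_secs=0.2, stop_secs=0.2):
    """Return (time_ms, new_state) transitions for a list of (confidence, volume) chunks."""
    start_needed = chunks_for(start_secs)
    stop_needed = chunks_for(stop_secs)
    speaking = False
    voice_run = 0    # consecutive voice chunks seen while QUIET
    silence_run = 0  # consecutive silence chunks seen while SPEAKING
    events = []
    for i, (conf, vol) in enumerate(chunks):
        is_voice = conf >= confidence and vol >= min_volume
        t_ms = (i + 1) * CHUNK_MS
        if not speaking:
            voice_run = voice_run + 1 if is_voice else 0
            if voice_run >= start_needed:
                speaking, silence_run = True, 0
                events.append((t_ms, "SPEAKING"))
        else:
            # Any voice chunk resets the silence counter.
            silence_run = 0 if is_voice else silence_run + 1
            if silence_run >= stop_needed:
                speaking, voice_run = False, 0
                events.append((t_ms, "QUIET"))
    return events


VOICE, BREATH, SILENCE = (0.9, 0.6), (0.3, 0.2), (0.05, 0.0)
# 10 voice chunks, a 4-chunk breath (128 ms), 10 more voice chunks, then silence.
utterance = [VOICE] * 10 + [BREATH] * 4 + [VOICE] * 10 + [SILENCE] * 20

print(run_vad(utterance, stop_secs=0.2))  # one SPEAKING/QUIET pair
print(run_vad(utterance, stop_secs=0.1))  # the breath splits the turn in two
```

With Stop Delay at 0.2 s the 128 ms breath never fills the silence counter, so the utterance stays one turn. At 0.1 s the breath alone ends the turn and the rest of the sentence becomes a second fragment.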

Step 7

Wait seconds: the resume window and the dead-air equation [VAD]

What it is. After VAD-stop, how long the bot waits for the caller to resume before claiming the turn.

ttsConfig.startSpeakingPlan.waitSeconds → user_speech_timeout · range 0–5 s, step 0.1 · zod default 0.4, runtime fallback 0.6

Runtime effect. Becomes the SpeechTimeoutUserTurnStopStrategy timeout (base_service.py:1658). It is a floor counted from VAD-stop, concurrent with STT finalization — not additive. The zod default is 0.4 but the runtime fallback is 0.6 when unset (base_service.py:1565-1571; zod src/app/api/v2/agents/validators.ts:60) — a known schema-vs-runtime mismatch.

The dead-air equation. Per-turn endpointing dead air = stop_secs + max(stt_final_after_vad_stop, waitSeconds) ≈ 0.8 s with runtime defaults (0.2 + 0.6), ~1.1 s with production stop_secs=0.5 (0.5 + 0.6).

The coupling trap. The strategy waits max(ttfs_p99 − stop_secs, waitSeconds). With on-prem p99 0.55 and waitSeconds 0.6, raising stop_secs 0.2 → 0.5 moves the STT allowance 0.35 → 0.05 but the gate is still 0.6, so the +0.3 s is a pure loss and nothing is "bought back" (pipecat speech_timeout_user_turn_stop_strategy.py:196-203).

Symptom: "it jumps in before I finish" → raise; "awkward pause before every answer" → lower toward 0.3–0.4 (and verify mid-sentence cutoffs don't return).
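The equation and the coupling trap can be checked with a few lines of arithmetic. This is a sketch of the model as stated in this step, not the strategy's actual code; `endpointing_dead_air` is a hypothetical helper, and it treats the STT time still owed after VAD-stop as ttfs_p99 − stop_secs.

```python
def endpointing_dead_air(stop_secs: float, wait_seconds: float, ttfs_p99: float) -> float:
    """Per-turn endpointing dead air in seconds, under the model described above."""
    stt_allowance = ttfs_p99 - stop_secs      # STT time still owed after VAD-stop
    gate = max(stt_allowance, wait_seconds)   # concurrent, so the larger one wins
    return stop_secs + gate


# Runtime defaults vs production stop_secs, with the on-prem p99 quoted above.
for stop in (0.2, 0.5):
    total = endpointing_dead_air(stop, wait_seconds=0.6, ttfs_p99=0.55)
    print(f"stop_secs={stop:.2f} waitSeconds=0.60 -> dead air {total:.2f} s")

# Lowering waitSeconds only helps until the STT allowance becomes the bound.
for stop in (0.2, 0.5):
    total = endpointing_dead_air(stop, wait_seconds=0.3, ttfs_p99=0.55)
    print(f"stop_secs={stop:.2f} waitSeconds=0.30 -> dead air {total:.2f} s")
```

With waitSeconds at 0.6 the totals are 0.80 s and 1.10 s. With waitSeconds at 0.3 they become 0.55 s and 0.80 s, because at stop_secs=0.2 the 0.35 s STT allowance takes over as the gate.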

Try it: the turn timeline and the cutoff failure mode

Drag stop_secs and waitSeconds, pick an STT-p99, and watch the dead-air segment grow. Then move the "caller resumes after" marker past stop_secs to trigger the cutoff/fragmentation failure — the caller paused longer than the silence floor, so VAD already declared the turn over and the resumed speech becomes a separate fragment.

Turn timeline — dead air and the resume cutoff

Stop Delay (stop_secs)

0.20

Wait seconds (waitSeconds)

0.60

STT finalization p99

Caller resumes after pause of (s)

0.00

stop_secs (silence floor) STT finalization waitSeconds gate turn handed to LLM

Adjust the controls to compute the dead-air tax.

Step 8

Smart Turn's asymmetric bet [VAD]

What it is. An ONNX model that predicts end-of-turn from prosody instead of pure silence timing.

sttConfig.additionalSettings.smartTurnEnabled (default false) · disableSmartTurnOnDigitNodes (default true) · dashboard stt-config-panel.tsx:731-757

Runtime effect. Wraps a TurnAnalyzerUserTurnStopStrategy(LocalSmartTurnAnalyzerV3()) and a VAD-only fallback in a per-node NodeAwareUserTurnStopStrategy (base_service.py:1641-1651; src/core/strategies/node_aware_turn_stop_strategy.py:46). On VAD-stop it runs one CPU ONNX inference (measured as e2e_processing_time_ms, pipecat audio/turn/smart_turn/base_smart_turn.py:217-227; no recorded production value — not verified in source).

Predicts COMPLETE → the turn closes as soon as a finalized transcript exists, skipping the waitSeconds floor entirely (pipecat turn_analyzer_user_turn_stop_strategy.py:259-276): saves ~0.4–0.6 s.
Predicts INCOMPLETE → nothing closes the turn until silence reaches SmartTurnParams.stop_secs = 3 s (pipecat base_smart_turn.py:27,125-131): costs ~2.4 s per misjudgment.

The asymmetry is why it ships off. A good prediction saves ~0.5 s; a bad one costs ~2.4 s. That lopsided bet is why Smart Turn is default-false and auto-disabled on digit nodes.

Symptom it fixes: callers who pause mid-thought getting cut off. Symptom it causes: occasional 3 s hangs after a turn the model misread.
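A back-of-envelope way to read the bet: how often may the model misread a finished turn before the average turn gets slower? The sketch below uses the ~0.5 s saving and ~2.4 s penalty from this step. The `miss_rate` parameter is hypothetical (no production value is recorded here), and the model ignores the cutoffs Smart Turn prevents on mid-thought pauses, so it understates the benefit.

```python
def smart_turn_net_saving(miss_rate: float, saving_s: float = 0.5, penalty_s: float = 2.4) -> float:
    """Average seconds saved per finished turn; negative means a net latency loss.

    miss_rate: assumed fraction of finished turns predicted INCOMPLETE.
    """
    return (1.0 - miss_rate) * saving_s - miss_rate * penalty_s


def break_even_miss_rate(saving_s: float = 0.5, penalty_s: float = 2.4) -> float:
    # Solve (1 - q) * saving = q * penalty for q.
    return saving_s / (saving_s + penalty_s)


print(round(break_even_miss_rate(), 3))         # 0.172
print(round(smart_turn_net_saving(0.10), 3))    # 0.21
print(round(smart_turn_net_saving(0.25), 3))    # -0.225
```

Under these assumptions the model has to be right on more than roughly five finished turns in six just to break even on latency.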

Step 9

The backstops: Turn Stop Timeout (the slider that lies) and finalizeTimeout [VAD]

Turn Stop Timeout (sttConfig.additionalSettings.userTurnStopTimeout): slider 0.5–5.0 s in the dashboard, but effectively decorative in voice mode, because the runtime raises it to at least max(30.0, finalize_timeout × 10) (_VOICE_BACKSTOP_FLOOR, base_service.py:148,1594-1616). Routine turn-close is fully covered by the stop strategy; a low backstop was the foot-gun that dropped slow-STT turns on call e894a846 (2026-05-20). You can only raise it above 30 s, never lower it.
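Reading the clamp as a floor (which the "only raise, never lower" behaviour implies), the effective value can be sketched as follows. `effective_turn_stop_timeout` is a hypothetical helper for illustration, not the code in base_service.py.

```python
VOICE_BACKSTOP_FLOOR_SECS = 30.0  # value quoted above for _VOICE_BACKSTOP_FLOOR


def effective_turn_stop_timeout(configured_secs: float, finalize_timeout_secs: float = 3.0) -> float:
    """Backstop actually applied in voice mode, assuming the clamp acts as a floor."""
    floor = max(VOICE_BACKSTOP_FLOOR_SECS, finalize_timeout_secs * 10)
    return max(configured_secs, floor)


print(effective_turn_stop_timeout(2.0))        # 30.0: the slider value is ignored
print(effective_turn_stop_timeout(2.0, 4.0))   # 40.0: finalizeTimeout lifts the floor
print(effective_turn_stop_timeout(45.0))       # 45.0: raising above the floor works
```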

finalizeTimeout (no dashboard slider; runtime-only key) · default 3.0 s

finalizeTimeout (sttConfig.additionalSettings.finalizeTimeout) — default 3.0 s (src/exts/stt/_finalize_timeout.py:30,33-45). When VAD said "stopped" but STT delivers nothing (cough, glitch), the timer fires, force-closes the turn, and re-prompts with a localized "could you repeat" (:80-128). Transcripts arriving after the timer are dropped to avoid duplicate turns (2026-05-10 incident).

Latency signature. ~3 s of dead air then "could you repeat" = STT failure path, not LLM slowness. Check STT health and GPU contention, not the model: the LLM never saw the turn.

Step 9b

The range-validation foot-gun: sliders guard the UI, not the API [gotcha]

Every VAD/endpointing knob in this part — vadStopSecs, vadStartSecs, vadConfidence, vadMinVolume, finalizeTimeout, userTurnStopTimeout — lives inside the sttConfig.additionalSettings record, which the dashboard validates as z.record(z.string(), z.any()) (freya-dashboard src/app/api/v2/agents/validators.ts:33,107,138). There is no per-key range rule for any of them — the only superRefine on that record caps keyword/keyterm count, nothing else (validators.ts:140-160).

The "why is THIS one agent slow" surprise. The dashboard sliders (e.g. Stop Delay max 1.0 s) are the only guardrail. A config written via the API or raw additionalSettings JSON can carry an out-of-range value (e.g. vadStopSecs: 2.0) and the runtime applies it verbatim, silently adding seconds of dead air to every turn. Check the agent's raw additionalSettings, not just the slider positions. (waitSeconds is the exception: a separately validated field at validators.ts:60, min(0).max(5), not part of the passthrough record.)
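Since nothing server-side enforces the ranges, a small audit over the raw settings is one way to catch this. The sketch below is a hypothetical helper, not an existing Freya tool. The ranges are the dashboard slider ranges quoted in this part; finalizeTimeout is left out because it has no slider to compare against.

```python
# Dashboard slider ranges as quoted in this part (min, max).
SLIDER_RANGES = {
    "vadStopSecs": (0.05, 1.0),
    "vadStartSecs": (0.05, 1.0),
    "vadConfidence": (0.05, 1.0),
    "vadMinVolume": (0.0, 1.0),
    "userTurnStopTimeout": (0.5, 5.0),
}


def out_of_range(additional_settings: dict) -> list[str]:
    """Return one finding per key whose value falls outside its slider range."""
    findings = []
    for key, (lo, hi) in SLIDER_RANGES.items():
        if key not in additional_settings:
            continue  # unset keys fall back to runtime defaults
        value = additional_settings[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            findings.append(f"{key}={value!r} is not a number")
        elif not lo <= value <= hi:
            findings.append(f"{key}={value} outside slider range {lo}-{hi}")
    return findings


# Example: a config written through the API with a 2-second Stop Delay.
raw = {"vadStopSecs": 2.0, "vadConfidence": 0.7}
print(out_of_range(raw))
```

Run against an agent's raw sttConfig.additionalSettings, an empty list means every slider-backed key is within what the UI would have allowed.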

Step 10

Barge-in latency has different physics [VAD]

Interrupting the bot is a separate path from normal turn-taking, governed by the Stop Speaking Plan:

Number of words = 0 (ttsConfig.stopSpeakingPlan.numberOfWords → DynamicVADUserTurnStartStrategy): interruption fires on VAD-start, ~start_secs (0.2 s) after voice onset — no transcription needed (base_service.py:1544-1558; src/core/strategies/dynamic_vad_strategy.py:43).
numberOfWords ≥ 1 + streaming STT: the min-words strategy counts interim partials (use_interim=True, pipecat turns/user_start/min_words_user_turn_start_strategy.py:31,69) — barge-in ≈ one partial-emission latency, sub-second.
numberOfWords ≥ 1 + batch STT (the on-prem default): FreyaSTTService only transcribes after VAD-stop (pipecat services/stt_service.py:596-605), so a word-gated interruption can fire only after the caller's entire utterance + stop_secs + the HTTP STT roundtrip — the bot talks over the caller that whole time. The single biggest hidden barge-in cost on batch deployments.
min_words applies only while the bot is speaking; when silent it drops to 1 (pipecat min_words_user_turn_start_strategy.py:105) — raising it never slows normal turn-taking.
Bot-speech grace window (sttConfig.additionalSettings.botSpeechGraceSecs, default 0 = off) suppresses barge-in for its window at each bot-utterance start via a mute holder that raises min_words to MUTED_MIN_WORDS = 99 (boot_steps.py:2848-2864; src/core/processors/call_action_trigger.py:564-568; src/core/strategies/dynamic_min_words_strategy.py:65). Anti-echo; costs only interruption responsiveness, never normal-turn dead air.
DeepFilterNet (deepFilterEnabled + env NC_OPT_URL) puts a network hop inside every VAD decision — each chunk round-trips to nc-opt over WebSocket before Silero scores it (src/exts/audio/deepfilternet_vad.py:18,48-57). Server-side inference is ~3 ms p50 per 50 ms chunk on H100 (nc-opt README); the in-container RTT is unmeasured (not verified in source). Fail-open: VAD runs on raw audio if NC is down.

Dead knobs, re-verified at 5b29206c. stopSpeakingPlan.voiceSeconds and stopSpeakingPlan.backOffSeconds still have no runtime consumer, so don't tune them expecting latency changes.

Try it: the barge-in race

When the caller interrupts, two triggers compete: pure-VAD start (fires ~0.2 s after voice onset) and the word-gated min-words strategy (needs numberOfWords words). With streaming STT the word count comes from interim partials during speech; with batch STT it can't arrive until after the full utterance + stop_secs + the HTTP roundtrip. Pick the config and see which trigger wins — and how long the bot talks over the caller (the double-talk readout).

[Interactive widget: barge-in race simulator, which trigger wins. Controls: numberOfWords, STT mode, botSpeechGraceSecs (0.0). Legend: VAD-start trigger, word-gate / STT, grace mute window, bot still speaking (double-talk). Readout: double-talk.]
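The race can also be estimated on paper. The sketch below encodes the three cases from this step; `barge_in_delay` is a hypothetical helper, and the utterance length, partial-emission latency, and HTTP roundtrip defaults are illustrative assumptions, not measured values. It leaves out the bot-speech grace window.

```python
def barge_in_delay(
    number_of_words: int,
    stt_mode: str,
    *,
    start_secs: float = 0.2,
    stop_secs: float = 0.2,
    utterance_secs: float = 2.0,        # assumed length of the caller's interruption
    partial_latency_secs: float = 0.5,  # assumed streaming partial-emission latency
    stt_roundtrip_secs: float = 0.4,    # assumed batch HTTP STT roundtrip
) -> float:
    """Approximate seconds of double-talk between voice onset and the interruption."""
    if number_of_words == 0:
        return start_secs  # pure VAD-start, no transcript needed
    if stt_mode == "streaming":
        return partial_latency_secs  # words counted from interim partials
    if stt_mode == "batch":
        # Transcription starts only after VAD-stop, so the whole utterance elapses first.
        return utterance_secs + stop_secs + stt_roundtrip_secs
    raise ValueError(f"unknown stt_mode: {stt_mode!r}")


for words, mode in [(0, "batch"), (2, "streaming"), (2, "batch")]:
    print(f"numberOfWords={words} mode={mode}: ~{barge_in_delay(words, mode):.1f} s of double-talk")
```

Under these assumed numbers the word-gated batch case lands at about 2.6 s, against 0.2 s for pure VAD and 0.5 s for streaming, and the gap grows with the length of the caller's utterance.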

Checkpoint: an on-prem deployment runs batch STT with numberOfWords=2. Callers complain the bot talks over them for seconds when they try to interrupt. What are the two structural fixes?

Either drop numberOfWords to 0 (pure-VAD barge-in fires ~0.2 s after voice onset, no transcript needed) — at the cost of noise sensitivity — or switch to streaming STT (additionalSettings.streaming), whose interim partials let the min-words strategy count words during speech.

With batch STT and word gating, the interruption can mathematically only fire after the full utterance + stop_secs + STT roundtrip (pipecat services/stt_service.py:596-605).

Ask Claude Code: "Show me the agent's raw sttConfig.additionalSettings for this workspace — I want to check vadStopSecs, vadConfidence, vadMinVolume and finalizeTimeout against the dashboard slider ranges, because one agent is slow on every turn."

What dominates here

stop_secs + waitSeconds — a configured 0.8–1.1 s tax on every turn, and the only stage whose cost is 100% policy, 0% compute. The next part (STT) shows how that waitSeconds floor usually hides STT finalization — until shared-GPU load drags it back into view.