Part 3

STT: finalization, streaming, and the filters in front of it

The vad_end → stt_final window. On a healthy box STT hides behind the wait floor; under shared-GPU load it degrades linearly and produces the classic ~3 s dead-air-then-"could you repeat" failure.

The window

What this stage measures STT

Healthy band: vad_end → stt_final = 150–400 ms median; suspect above 800 ms. The agent records exactly this segment as stt.latency_from_vad_stop_secs on each VAD span — look there before instrumenting anything.

stt.latency_from_vad_stop_secs: healthy 150–400 ms; suspect >800 ms

Runtime effect. The whole STT cost lands in this one span (pipecat-agent src/exts/diagnostics/pipeline_monitor.py:388-390). Whether you run the batch path or the streaming path, the committed final is a full batch transcription: streaming only adds cheaper interims, never a cheaper final.

Symptom it causes/fixes: ~3 s of dead air after the caller stops, then "could you repeat that?". This is the finalize-timeout signature, and it is almost always GPU contention, not the LLM. Inline cost on a loaded box: 359 ms p50 wall-clock vs ~84 ms idle.
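The band check described above can be sketched in a few lines. This is a minimal illustration, not the agent's own code: it assumes you have already exported per-turn stt.latency_from_vad_stop_secs values as a list of floats, and the function name and the "watch" label for the 400–800 ms gap are hypothetical.

```python
from statistics import median

HEALTHY_MAX_SECS = 0.400  # upper edge of the healthy band quoted above
SUSPECT_MIN_SECS = 0.800  # "suspect above 800 ms"


def classify_stt_latency(samples_secs):
    """Return (median, band) for a list of vad_end -> stt_final latencies in seconds."""
    if not samples_secs:
        raise ValueError("need at least one latency sample")
    p50 = median(samples_secs)
    if p50 > SUSPECT_MIN_SECS:
        band = "suspect"
    elif p50 > HEALTHY_MAX_SECS:
        band = "watch"  # hypothetical label: the text names no band for 400-800 ms
    else:
        band = "healthy"
    return p50, band
```

Classifying on the median matches how the band is stated; a single slow turn will not flip the verdict, so check the individual spans as well when chasing one bad call.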

Step 11

The batch path (on-prem default) STT

What it is. FreyaSTTService (src/exts/stt/freya.py:91), a segmented HTTP STT: the whole utterance is buffered, then POSTed after VAD-stop.

FreyaSTTService → /audio/transcriptions; client timeout: 5.0 s; ttfs_p99: 0.55 s

Runtime effect. pipecat's SegmentedSTTService keeps a rolling 1 s buffer while idle so VAD-detection delay doesn't clip onsets (services/stt_service.py:629), then on VADUserStoppedSpeakingFrame wraps the utterance as WAV and POSTs to /audio/transcriptions with verbose_json for per-word confidences (freya.py:215-271).
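The buffering behaviour described above can be sketched with the standard library alone. This is an illustration of the idea, not pipecat's SegmentedSTTService or freya.py: the class name, the 16 kHz mono 16-bit format, and the method names are assumptions, and the HTTP POST itself is left out.

```python
import io
import wave

SAMPLE_RATE = 16000     # assumption: 16 kHz mono input
SAMPLE_WIDTH = 2        # assumption: 16-bit PCM
IDLE_BUFFER_SECS = 1.0  # rolling idle buffer length quoted in the text


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container, ready to be sent as a file upload."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return out.getvalue()


class SegmentBuffer:
    """Hypothetical stand-in for the segmented buffering logic."""

    def __init__(self):
        self._audio = bytearray()
        self._speaking = False
        self._idle_cap = int(SAMPLE_RATE * SAMPLE_WIDTH * IDLE_BUFFER_SECS)

    def push(self, pcm: bytes) -> None:
        self._audio.extend(pcm)
        if not self._speaking and len(self._audio) > self._idle_cap:
            # While idle, keep only the newest 1 s so a late VAD start
            # still has the speech onset in the buffer.
            del self._audio[: len(self._audio) - self._idle_cap]

    def on_vad_start(self) -> None:
        self._speaking = True  # stop trimming; the utterance now accumulates

    def on_vad_stop(self) -> bytes:
        wav = pcm_to_wav(bytes(self._audio))
        self._audio.clear()
        self._speaking = False
        return wav
```

Because nothing is sent until on_vad_stop, the full transcription cost necessarily lands after VAD-stop, which is why this path is measured entirely inside the vad_end → stt_final window.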

Latency shape: the whole STT cost lands in the vad_end → stt_final window and scales with utterance length and GPU load. Declared budget: ONPREM_FREYA_STT_TTFS_P99 = 0.55 s (freya.py:30-32).

Step 12

The streaming path (opt-in) — interims, not cheaper finals STT

What it is. FreyaSTTStreamingService (src/exts/stt/freya_v3.py:108), one long-lived WebSocket per call to the Qwen-ASR /v1/audio/stream endpoint. Enabled by sttConfig.additionalSettings.streaming.

No dashboard UI toggle exists for this key. Checked stt-config-panel.tsx: streaming is reachable only via raw additionalSettings JSON / API, not a panel switch.

Runtime effect. Audio streams only while VAD reports speech, with a 0.3 s pre-roll flushed at VAD-start so onsets aren't clipped (freya_v3.py:139,366-390). Every 500 ms the server's LocalAgreement tick re-transcribes the rolling buffer and emits agreed-prefix / partial pairs — the client maps both to InterimTranscriptionFrame, never committed (freya_v3.py:445-456; stt-qwen-asr-opt src/streaming.py:718-751).

On VAD-stop the client sends {"type":"finalize"} and the server runs a full-utterance batch_transcribe_vad pass — the same accurate batch pipeline — returning the single committed TranscriptionFrame with finalized=True (freya_v3.py:613-631,458-487; streaming.py:756-803).

The point to internalize: "stt final" is a batch transcription in both Freya modes. Streaming buys you low-latency interims (for barge-in word counting and fallback transcripts), not a cheaper final. Don't promise "streaming = faster finals".

Try it: LocalAgreement — the committed prefix vs the unstable tail

Every 500 ms the streaming server re-transcribes the growing audio buffer. The longest common prefix across ticks is treated as stable (mapped to an interim, never committed); the divergent tail is provisional and keeps changing. On finalize the whole thing is replaced by one accurate batch pass. Watch a prefix stabilize tick by tick.
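The prefix/tail split can be illustrated with a small word-level sketch. It assumes agreement is computed between two consecutive tick hypotheses (the common LocalAgreement-2 variant); the server's actual policy in streaming.py may differ, and the function names here are hypothetical.

```python
def common_prefix(a, b):
    """Longest common prefix of two word lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


def local_agreement(hypotheses):
    """For each tick's hypothesis, return (stable_prefix, unstable_tail).

    Assumption: a word is stable once two consecutive ticks agree on it.
    """
    previous = []
    results = []
    for text in hypotheses:
        words = text.split()
        stable = common_prefix(previous, words)
        tail = words[len(stable):]
        results.append((" ".join(stable), " ".join(tail)))
        previous = words
    return results


ticks = ["hello", "hello word", "hello world how", "hello world how are"]
for stable, tail in local_agreement(ticks):
    print(f"stable={stable!r:22} tail={tail!r}")
```

In this run the third tick revises "word" to "world", so the stable prefix stays at "hello" until the fourth tick confirms the correction. Both parts remain interims; only the batch pass on finalize produces the committed transcript.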

[Interactive demo: LocalAgreement, 500 ms ticks. Legend: committed prefix (stable, interim); unstable partial tail; finalized (batch pass).]

Step 13

The three declared p99 constants are live wait-time inputs STT

What it is. 0.55 s (batch on-prem, freya.py:32) / 1.5 s (streaming, freya_v3.py:171) / 2.01 s (upstream OpenAI default) are not documentation — they parameterize the turn-stop timeout and are broadcast via STTMetadataFrame.

batch on-prem: 0.55 s; streaming: 1.5 s; OpenAI batch: 2.01 s; no constant: 1.0 s + warning

Runtime effect. The turn-stop timeout is max(ttfs_p99 − stop_secs, waitSeconds), broadcast via STTMetadataFrame (stt_service.py:450-454). A provider with no constant gets DEFAULT_TTFS_P99 = 1.0 plus a warning (services/stt_latency.py:31).
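The formula is easy to check by hand. The sketch below reproduces max(ttfs_p99 − stop_secs, waitSeconds) with the three declared constants and the 1.0 s fallback; the provider keys, the function name, and the example stop_secs and waitSeconds values are illustrative assumptions, not names or defaults from the codebase.

```python
import warnings

DEFAULT_TTFS_P99 = 1.0  # fallback quoted in the text for providers with no constant

# Declared p99 constants from the text; the dictionary keys are hypothetical.
TTFS_P99_SECS = {
    "freya_batch": 0.55,
    "freya_streaming": 1.5,
    "openai_batch": 2.01,
}


def turn_stop_timeout(provider: str, stop_secs: float, wait_seconds: float) -> float:
    """Turn-stop timeout = max(ttfs_p99 - stop_secs, waitSeconds)."""
    ttfs_p99 = TTFS_P99_SECS.get(provider)
    if ttfs_p99 is None:
        warnings.warn(f"no ttfs_p99 declared for {provider!r}; using {DEFAULT_TTFS_P99} s")
        ttfs_p99 = DEFAULT_TTFS_P99
    return max(ttfs_p99 - stop_secs, wait_seconds)


# Example inputs (assumed values): VAD stop_secs = 0.2 s, waitSeconds = 0.4 s.
for name in TTFS_P99_SECS:
    print(name, round(turn_stop_timeout(name, stop_secs=0.2, wait_seconds=0.4), 2))
```

With these example inputs the batch on-prem constant leaves the 0.4 s floor in charge (0.55 − 0.2 = 0.35 s), while the streaming and OpenAI constants push the timeout to about 1.3 s and 1.81 s.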

Symptom it explains: a turn that feels slower than your configured waitSeconds on a cloud STT provider — the provider's declared p99 is raising the timeout above the floor.

Try it: where stt_final lands on the turn clock

This stitches the VAD stop, the STT inference, and the wait floor onto one axis. The red marker at +3 s is the finalize-timeout: if no finalized frame arrives before it, the turn force-closes and the agent re-prompts ("could you repeat?"). Toggle "finalized frame?" off to see the timeout fire. Pick a ttfs_p99 preset to see the turn-stop wait move.

[Interactive demo: turn timeline, user_stop → vad_end → stt_final → llm_first_tok. Segments: VAD silence wait; turn-stop wait; STT inference; LLM first token; +3 s finalize_timeout marker.]

Step 14

Audio filters: what actually runs (and the dead toggles) STT

What it is. The transport's audio_in_filter runs synchronously in the audio loop (transports/base_input.py:235-262) — anything slow here delays VAD, STT, and recording at once. Here is what is real on dev (base_service.py:1434-1482):

Toggle | Status
AIC Filter (aicEnabled) | dead: hard-disabled, logs "hard disabled due to crate instability" (base_service.py:1467-1468). NO runtime consumer.
WebRTC APM (webrtcApmEnabled) | dead on dev: PR #625 merged then reverted 2026-04-10 (commit dc90fd7f); the ImportError guard at boot_steps.py:159-163 means WebRTCAPMFilter is always None. NO runtime consumer.
AGC | cloud only: appended as fallback when no other filter is configured and the deployment is NOT on-prem (base_service.py:1471-1472). Sub-ms numpy; its cost is accuracy, not latency: AGC amplifying leading noise caused phantom/dropped words in ~30% of short utterances. On-prem runs no transport audio filter at all.
DeepFilterNet (deepFilterEnabled) | live but VAD-only since PR #747 (2026-05-05): denoise happens privately inside the VAD analyzer; STT always receives raw audio (src/exts/audio/deepfilternet_vad.py:1-6,47-56).
Krisp | a plan (notes/plans/krisp-nc-integration.md), not a feature. Do not present as shipping.

The PR-747 war story: filter placement matters. Pre-747 (DFN feeding both VAD and STT) suppressed speech onsets: Silero fired ~950 ms late, ~260 ms of effective pre-roll reached STT, and the model produced a garbled domain phrase. Post-747 replays were near-correct (notes/investigations/garanti-stt-pr747/README.md). A 100 ms slice shift flips the transcript, so onset context is precious.

Symptom it fixes: garbled or dropped first words. The fix was placement (denoise VAD-only, feed STT raw audio), not a slower or "better" filter.

Step 15

Keyterm biasing: nearly free, with one tail risk STT

What it is. Bias the decoder toward domain vocabulary. Keys parsed at base_service.py:622-645; server consumers at stt-qwen-asr-opt src/server.py:381-412,493-512,604-620:

Key Terms (keyterms); Boost Terms (boostTerms → boost_terms); Boost Strength (boostStrength → boost_strength); Audio-length Gate (keytermMinAudioSec, default 0.8 s); Anti-Parrot Retry (keytermAntiParrot, default true)

Runtime effect / cost profile. Logit-bias construction is tokenize-only (microseconds); the HTTP path no longer injects keyterms into the prompt. The real cost is the anti-parrot retry: a full second transcription pass on the same audio when the biased output parrots the term list (~70–100 ms extra server-side, worse under contention; server.py:611-620). The 0.8 s audio gate is free and skips bias on exactly the turns most likely to trigger that retry.
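The interaction between the audio-length gate and the retry can be sketched as follows. This is a reading of the behaviour described above, not the logic in server.py: the dataclass, the function names, and the choice of >= at the gate boundary are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class KeytermSettings:
    """Hypothetical container for the settings listed above."""
    keyterms: list = field(default_factory=list)
    min_audio_sec: float = 0.8  # keytermMinAudioSec default from the text
    anti_parrot: bool = True    # keytermAntiParrot default from the text


def should_bias(settings: KeytermSettings, audio_sec: float) -> bool:
    """Apply keyterm bias only when there are terms and the audio clears the gate.

    Assumption: audio exactly at the threshold is biased (>=).
    """
    return bool(settings.keyterms) and audio_sec >= settings.min_audio_sec


def max_transcription_passes(settings: KeytermSettings, audio_sec: float) -> int:
    """Worst-case number of full passes for one utterance.

    A second pass can only happen when bias was applied and the
    anti-parrot retry is enabled.
    """
    if should_bias(settings, audio_sec) and settings.anti_parrot:
        return 2
    return 1
```

Read this way, a 0.5 s utterance never pays for a retry because it is never biased, which is the sense in which the gate is free.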

Symptom it causes: with boost ≥5, hallucinated insertions of keyterms; under contention, the anti-parrot retry's extra pass surfaces as an STT tail. Inline cost of a triggered retry: +70–100 ms.

Step 16

Self-hosted STT latency is a contention story, not a model story STT

What it is. The single most important on-prem STT fact: the engine is fast; the GPU is shared.

idle Hopper, 5 s TR audio: ~68–84 ms; KKB, same concurrency: 359 ms; per-GPU plateau: ~13 STT RPS

Runtime effect. Single-flight inference for 5 s Turkish audio is ~68–84 ms on an idle Hopper GPU (isolated H200 84 ms p50). Production KKB wall-clock at the same concurrency: 359 ms — a 4× gap entirely explained by GPU sharing (STT shares KKB GPU 3 with TTS and NC), not engine config (2026-05-12 investigation, notes/reports/detailed/2026-05-12-stt-tts-latency-investigation/).

Symptom it causes: STT p50 climbing from a healthy <400 ms to 800 ms+ to multi-second under campaign load — the second-biggest stage on a contended box, invisible on an idle test box.

Try it: STT capacity calculator

Estimate the vad_end → stt_final p50 you should expect from your concurrency. A single Hopper GPU plateaus around 13 STT RPS; sharing the card with TTS+NC multiplies wall-clock by a contention factor (KKB's measured 4× gap is the canonical example). Compare the result against the 400 / 800 ms thresholds.

[Interactive estimator: vad_end → stt_final p50. Inputs include the per-GPU plateau (13 RPS) and the contention factor (4.0×).]

The model: offered RPS = calls × upm / 60; per-GPU load = offered / GPUs; a base ~84 ms idle inference is multiplied by the contention factor and stretched as load approaches the plateau (queueing), exploding past it (the C=75 → 6.3 s curve). This is illustrative, anchored on the published KKB / H200 numbers — not a substitute for measuring your own stt.latency_from_vad_stop_secs.
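A rough version of this model can be written down directly. The offered-load and per-GPU steps follow the description above; the queueing stretch is not specified in the text, so the 1 / (1 − utilization) term below is an assumed M/M/1-style stand-in, and the function name and example inputs are illustrative. The text does not define upm; it is read here as utterances per minute per call.

```python
import math


def estimate_stt_p50_ms(calls, upm, gpus=1, plateau_rps=13.0,
                        contention=4.0, idle_ms=84.0):
    """Illustrative vad_end -> stt_final p50 estimate in milliseconds.

    calls:       concurrent calls
    upm:         assumed to mean utterances per minute per call
    plateau_rps: per-GPU STT plateau (~13 RPS in the text)
    contention:  wall-clock multiplier from sharing the GPU (KKB: ~4x)
    idle_ms:     idle single-flight inference time (~84 ms on an isolated H200)
    """
    offered_rps = calls * upm / 60.0
    per_gpu_rps = offered_rps / gpus
    utilization = per_gpu_rps / plateau_rps
    if utilization >= 1.0:
        # Past the plateau the queue grows without bound in this simple model.
        return math.inf
    stretch = 1.0 / (1.0 - utilization)  # ASSUMPTION: M/M/1-style queueing stretch
    return idle_ms * contention * stretch


# Example inputs (assumed): 25 calls, 8 utterances/min, one shared GPU.
print(round(estimate_stt_p50_ms(25, 8)))
```

With these inputs the offered load is about 3.3 RPS, roughly a quarter of the plateau, and the estimate lands between the 400 ms and 800 ms thresholds. Real systems show multi-second latency past the plateau rather than infinity; the sketch only marks where the model stops being meaningful.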

Checkpoint: a call shows ~3 s of dead air after the caller stops, then the agent says "could you repeat that?". Which stage failed, and what do you check?

That is the finalizeTimeout signature (_finalize_timeout.py:30): VAD declared stop, STT never delivered a transcript within 3 s, the mixin force-closed the turn and re-prompted.

Check STT server health and GPU contention (the KKB concurrency curve), not the LLM — the LLM never saw the turn. The +3 s red marker in the turn-timeline demo above is exactly this boundary.
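The force-close behaviour can be illustrated with asyncio. This is a sketch of the pattern, not the mixin in _finalize_timeout.py: the function name and the returned tuples are hypothetical, and only the 3 s boundary and the re-prompt come from the text.

```python
import asyncio

FINALIZE_TIMEOUT_SECS = 3.0  # the +3 s boundary described above


async def close_turn(final_transcript, timeout=FINALIZE_TIMEOUT_SECS):
    """Wait for the finalized transcript; force-close the turn if it never arrives.

    final_transcript is any awaitable that resolves to the committed text.
    """
    try:
        text = await asyncio.wait_for(final_transcript, timeout)
    except asyncio.TimeoutError:
        # STT missed the deadline: the LLM never sees this turn.
        return ("reprompt", "could you repeat that?")
    return ("llm", text)


async def _demo():
    async def slow_stt():
        await asyncio.sleep(0.2)  # stands in for a contended GPU
        return "merhaba"

    # A short timeout keeps the demo fast; production uses the 3 s constant.
    print(await close_turn(slow_stt(), timeout=0.05))


if __name__ == "__main__":
    asyncio.run(_demo())
```

The point of the sketch is the branch structure: on timeout the turn goes straight to the re-prompt, which is why LLM metrics for that turn are empty and the investigation belongs on the STT side.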

What dominates here

On a healthy box, nothing — STT hides behind the waitSeconds floor. Under shared-GPU load it degrades linearly and becomes the second-biggest stage; the dead-toggle list (AIC, WebRTC APM, Krisp) saves you from tuning phantoms. The committed final is always a batch pass, so the levers are capacity (add a GPU, stop sharing GPU 3) and filter placement (PR-747: feed STT raw audio), not engine knobs.

Ask Claude Code: "Show me stt.latency_from_vad_stop_secs for this call's turns and compare against the 400/800 ms band — was it contention or a finalize timeout?"