Part 3 — STT: finalization, streaming, and the filters in front of it

The window

What this stage measures

Healthy band: vad_end → stt_final = 150–400 ms median; suspect above 800 ms. The agent records exactly this segment as stt.latency_from_vad_stop_secs on each VAD span — look there before instrumenting anything.

stt.latency_from_vad_stop_secs: healthy 150–400 ms; suspect >800 ms

Runtime effect. The whole STT cost lands in this one span (pipecat-agent src/exts/diagnostics/pipeline_monitor.py:388-390). Whether you run the batch path or the streaming path, the committed final is a full batch transcription: streaming only adds cheaper interims, never a cheaper final.

Symptom it causes/fixes: ~3 s of dead air after the caller stops, then "could you repeat that?". This is the finalize-timeout signature, almost always GPU contention, not the LLM. Inline cost on a loaded box: 359 ms p50 vs ~84 ms idle.
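To make the band concrete, here is a minimal sketch that classifies a call's median against the 400 ms and 800 ms thresholds. The function name and the "elevated" label for the 400–800 ms gap are assumptions for illustration; the text only names the healthy band and the suspect threshold. Values under 150 ms are treated as healthy here, which is also an assumption.

```python
from statistics import median

HEALTHY_MAX_S = 0.400  # upper edge of the healthy band (from the text)
SUSPECT_MIN_S = 0.800  # "suspect above 800 ms" (from the text)


def classify_stt_latency(samples_s):
    """Classify the median of stt.latency_from_vad_stop_secs samples (seconds).

    Hypothetical helper, not part of the agent codebase.
    """
    if not samples_s:
        raise ValueError("need at least one sample")
    p50 = median(samples_s)
    if p50 > SUSPECT_MIN_S:
        label = "suspect"
    elif p50 <= HEALTHY_MAX_S:
        label = "healthy"
    else:
        # Assumption: the text does not name the 400-800 ms gap.
        label = "elevated"
    return p50, label


if __name__ == "__main__":
    print(classify_stt_latency([0.20, 0.30, 0.36]))
    print(classify_stt_latency([0.90, 1.20, 3.00]))
```

Feed it the per-turn values already recorded on the VAD spans rather than adding new instrumentation.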

Step 11

The batch path (on-prem default)

What it is. FreyaSTTService (src/exts/stt/freya.py:91), a segmented HTTP STT: the whole utterance is buffered, then POSTed after VAD-stop.

FreyaSTTService → /audio/transcriptions; client timeout 5.0 s; ttfs_p99 0.55 s

Runtime effect. pipecat's SegmentedSTTService keeps a rolling 1 s buffer while idle so VAD-detection delay doesn't clip onsets (services/stt_service.py:629), then on VADUserStoppedSpeakingFrame wraps the utterance as WAV and POSTs to /audio/transcriptions with verbose_json for per-word confidences (freya.py:215-271).
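The buffering and WAV-wrapping steps can be sketched with the standard library alone. This is an illustration of the idea, not pipecat's implementation: the class and function names are hypothetical, and 16 kHz mono 16-bit PCM is an assumed audio format.

```python
import io
import wave
from collections import deque

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM


class RollingPreBuffer:
    """Keep only the most recent `seconds` of PCM while no speech is active."""

    def __init__(self, seconds=1.0):
        max_bytes = int(seconds * SAMPLE_RATE * BYTES_PER_SAMPLE)
        self._buf = deque(maxlen=max_bytes)  # oldest bytes fall off the left

    def push(self, pcm: bytes) -> None:
        self._buf.extend(pcm)

    def snapshot(self) -> bytes:
        return bytes(self._buf)


def wrap_as_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM in a WAV container, ready to POST as a file upload."""
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(BYTES_PER_SAMPLE)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(pcm)
    return out.getvalue()


if __name__ == "__main__":
    pre = RollingPreBuffer(seconds=1.0)
    pre.push(b"\x00\x00" * 24_000)      # 1.5 s of silence
    print(len(pre.snapshot()))          # only the last 1 s is kept
    print(wrap_as_wav(pre.snapshot())[:4])
```

The rolling buffer is what keeps the speech onset in the upload even though VAD only fires after the caller has already started talking.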

Retry policy: client timeout 5.0 s; one immediate retry on 429/502/503/504, but 500s deliberately not retried — on-prem 500s are slow and the finalize-timeout re-prompt recovers instead (freya.py:37-41,128,193-213).
When to change: rarely a knob — the batch path's cost is utterance length × GPU load, not config.

Latency shape: the whole STT cost lands in the vad_end → stt_final window and scales with utterance length and GPU load. Declared budget: ONPREM_FREYA_STT_TTFS_P99 = 0.55 s (freya.py:30-32).
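The retry policy above fits in a few lines. The sketch below is an illustration under stated assumptions: `send` is a hypothetical callable standing in for the HTTP POST, and the helper names are not taken from freya.py. Only the status codes, the single immediate retry, and the 5.0 s timeout come from the text.

```python
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})  # 500 is deliberately absent
CLIENT_TIMEOUT_S = 5.0
MAX_RETRIES = 1


def should_retry(status_code: int, retries_so_far: int) -> bool:
    """One immediate retry for the listed statuses; everything else is final."""
    return retries_so_far < MAX_RETRIES and status_code in RETRYABLE_STATUSES


def post_with_retry(send):
    """`send(timeout_s)` is a hypothetical callable returning (status, body)."""
    retries = 0
    while True:
        status, body = send(CLIENT_TIMEOUT_S)
        if should_retry(status, retries):
            retries += 1
            continue  # immediate retry, no backoff
        return status, body


if __name__ == "__main__":
    responses = iter([(503, ""), (200, "merhaba")])
    print(post_with_retry(lambda timeout_s: next(responses)))
```

Leaving 500 out of the retryable set means a slow on-prem failure costs one request, and the finalize-timeout re-prompt handles recovery.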

Step 12

The streaming path (opt-in): interims, not cheaper finals

What it is. FreyaSTTStreamingService (src/exts/stt/freya_v3.py:108), one long-lived WebSocket per call to the Qwen-ASR /v1/audio/stream endpoint. Enabled by sttConfig.additionalSettings.streaming.

No dashboard UI toggle exists for this key. Checked stt-config-panel.tsx: streaming is reachable only via raw additionalSettings JSON / API, not a panel switch.

Runtime effect. Audio streams only while VAD reports speech, with a 0.3 s pre-roll flushed at VAD-start so onsets aren't clipped (freya_v3.py:139,366-390). Every 500 ms the server's LocalAgreement tick re-transcribes the rolling buffer and emits agreed-prefix / partial pairs — the client maps both to InterimTranscriptionFrame, never committed (freya_v3.py:445-456; stt-qwen-asr-opt src/streaming.py:718-751).

On VAD-stop the client sends {"type":"finalize"} and the server runs a full-utterance batch_transcribe_vad pass — the same accurate batch pipeline — returning the single committed TranscriptionFrame with finalized=True (freya_v3.py:613-631,458-487; streaming.py:756-803).

The point to internalize: "stt final" is a batch transcription in both Freya modes. Streaming buys you low-latency interims (for barge-in word counting and fallback transcripts), not a cheaper final. Don't promise "streaming = faster finals".

Declared ttfs_p99_latency = 1.5 s (freya_v3.py:171).
Reconnect budget: 5 trouble events / 30 s window, then graceful call end (base_service.py:656-659).
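The client-side gating described in this step (hold a short pre-roll while silent, flush it at VAD-start, stream during speech, send a finalize message at VAD-stop) can be sketched as a small state machine. The class name, the `send` callable, and the 20 ms chunk size are assumptions for illustration; only the 0.3 s pre-roll and the `{"type":"finalize"}` message come from the text.

```python
import json
from collections import deque


class StreamingGate:
    """Hypothetical sketch of VAD-gated streaming with a pre-roll buffer."""

    def __init__(self, send, pre_roll_secs=0.3, chunk_secs=0.02):
        self._send = send  # stands in for the WebSocket send
        # Pre-roll is held as a fixed number of chunks; older chunks drop off.
        self._pre_roll = deque(maxlen=max(1, round(pre_roll_secs / chunk_secs)))
        self._speaking = False

    def on_audio(self, chunk: bytes) -> None:
        if self._speaking:
            self._send(chunk)             # stream only while VAD reports speech
        else:
            self._pre_roll.append(chunk)  # otherwise just remember the tail

    def on_vad_start(self) -> None:
        self._speaking = True
        while self._pre_roll:             # flush pre-roll so the onset is kept
            self._send(self._pre_roll.popleft())

    def on_vad_stop(self) -> None:
        self._speaking = False
        self._send(json.dumps({"type": "finalize"}))


if __name__ == "__main__":
    sent = []
    gate = StreamingGate(sent.append)
    for i in range(20):
        gate.on_audio(bytes([i]))  # silence: buffered, not sent
    gate.on_vad_start()
    gate.on_audio(b"speech")
    gate.on_vad_stop()
    print(len(sent), sent[-1])
```

Everything the server emits before the finalize message is an interim; only the reply to finalize is the committed transcript.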

Try it: LocalAgreement — the committed prefix vs the unstable tail

Every 500 ms the streaming server re-transcribes the growing audio buffer. The longest common prefix across ticks is treated as stable (mapped to an interim, never committed); the divergent tail is provisional and keeps changing. On finalize the whole thing is replaced by one accurate batch pass. Watch a prefix stabilize tick by tick.

LocalAgreement demo (interactive, 500 ms ticks, starting at t = 0.0 s). Legend: committed prefix (stable, interim); unstable partial tail; finalized (batch pass).
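The prefix-versus-tail split can be reproduced offline with a few lines. This sketch assumes agreement is taken between two consecutive ticks at word granularity; the server's exact agreement policy is not described here, so treat that as an illustrative choice. The sample hypotheses are made up.

```python
def common_prefix(a, b):
    """Longest common prefix of two word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def local_agreement(hypotheses):
    """Yield (stable_prefix, unstable_tail) for each re-transcription tick.

    Assumption: the stable prefix is whatever the current tick shares with
    the previous tick; the rest of the current hypothesis is the tail.
    """
    prev = []
    for hyp in hypotheses:
        words = hyp.split()
        stable = common_prefix(prev, words)
        yield stable, words[len(stable):]
        prev = words


if __name__ == "__main__":
    ticks = [
        "I want",
        "I want to check",
        "I want to check my balance",
    ]
    for i, (stable, tail) in enumerate(local_agreement(ticks)):
        print(f"t={0.5 * (i + 1):.1f}s stable={stable} tail={tail}")
```

On finalize, both parts are discarded in favor of the single batch result, which is why neither is ever committed.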

Step 13

The three declared p99 constants are live wait-time inputs

What it is. 0.55 s (batch on-prem, freya.py:32) / 1.5 s (streaming, freya_v3.py:171) / 2.01 s (upstream OpenAI default) are not documentation — they parameterize the turn-stop timeout and are broadcast via STTMetadataFrame.

batch on-prem: 0.55 s; streaming: 1.5 s; OpenAI batch: 2.01 s; no constant → 1.0 s + warning

Runtime effect. The turn-stop timeout is max(ttfs_p99 − stop_secs, waitSeconds), broadcast via STTMetadataFrame (stt_service.py:450-454). A provider with no constant gets DEFAULT_TTFS_P99 = 1.0 plus a warning (services/stt_latency.py:31).

When to change: you don't tune these directly — but know that cloud providers with big p99s (Azure 1.8, OpenAI batch 2.01) can make the turn wait well past the waitSeconds floor when no finalized frame arrives.

Symptom it explains: a turn that feels slower than your configured waitSeconds on a cloud STT provider — the provider's declared p99 is raising the timeout above the floor.
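A worked instance of the formula shows when the declared p99 lifts the wait above the floor. The function name is hypothetical; the formula, the constants, and the 1.0 s default come from the text, and the stop_secs = 0.60 s and waitSeconds = 0.40 s inputs are the values shown in the turn-timeline demo.

```python
DEFAULT_TTFS_P99 = 1.0  # used when a provider declares no constant


def turn_stop_timeout(ttfs_p99, stop_secs, wait_seconds):
    """max(ttfs_p99 - stop_secs, waitSeconds), with the documented default."""
    if ttfs_p99 is None:
        ttfs_p99 = DEFAULT_TTFS_P99
    return max(ttfs_p99 - stop_secs, wait_seconds)


if __name__ == "__main__":
    presets = {
        "batch on-prem": 0.55,
        "streaming": 1.5,
        "OpenAI batch": 2.01,
        "no constant": None,
    }
    for name, p99 in presets.items():
        t = turn_stop_timeout(p99, stop_secs=0.60, wait_seconds=0.40)
        print(f"{name:14s} -> {t:.2f} s")
```

With these inputs the on-prem batch path and the default both sit on the 0.40 s floor, while streaming (0.90 s) and OpenAI batch (1.41 s) raise the wait above it.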

Try it: where `stt_final` lands on the turn clock

This stitches the VAD stop, the STT inference, and the wait floor onto one axis. The red marker at +3 s is the finalize-timeout: if no finalized frame arrives before it, the turn force-closes and the agent re-prompts ("could you repeat?"). Toggle "finalized frame?" off to see the timeout fire. Pick a ttfs_p99 preset to see the turn-stop wait move.

Turn timeline (interactive): user_stop → vad_end → stt_final → llm_first_tok

Controls, with the values shown:
vadStopSecs (silence VAD waits before declaring stop): 0.60 s
STT inference time (vad_end → stt_final): 0.36 s
waitSeconds (turn-stop floor): 0.40 s
ttfs_p99 preset (provider declared)
finalized frame arrives (uncheck to force the timeout)

Timeline segments: VAD silence wait, turn-stop wait, STT inference, LLM first token, and the +3 s finalize_timeout marker.

Step 14

Audio filters: what actually runs (and the dead toggles)

What it is. The transport's audio_in_filter runs synchronously in the audio loop (transports/base_input.py:235-262) — anything slow here delays VAD, STT, and recording at once. Here is what is real on dev (base_service.py:1434-1482):

Toggle	Status
AIC Filter (`aicEnabled`)	dead: hard-disabled, logs "hard disabled due to crate instability" (`base_service.py:1467-1468`). NO runtime consumer.
WebRTC APM (`webrtcApmEnabled`)	dead on dev: PR #625 was merged, then reverted 2026-04-10 (commit `dc90fd7f`); the ImportError guard at `boot_steps.py:159-163` means `WebRTCAPMFilter is None` always. NO runtime consumer.
AGC	cloud only: appended as a fallback when no other filter is active and the deployment is NOT on-prem (`base_service.py:1471-1472`). Sub-ms numpy; its cost is accuracy, not latency: AGC amplifying leading noise caused phantom/dropped words in ~30% of short utterances. On-prem runs no transport audio filter at all.
DeepFilterNet (`deepFilterEnabled`)	live but VAD-only since PR #747 (2026-05-05): denoise happens privately inside the VAD analyzer; STT always receives raw audio (`src/exts/audio/deepfilternet_vad.py:1-6,47-56`).
Krisp	a plan (`notes/plans/krisp-nc-integration.md`), not a feature. Do not present as shipping.
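The table reduces to a short decision procedure for what ends up in the transport filter chain. The sketch below restates it; the function and parameter names are hypothetical and the real logic lives in base_service.py, so read this as a summary of the table rather than a copy of the code. DeepFilterNet does not appear because it runs inside the VAD analyzer, not as a transport filter.

```python
def select_transport_filters(*, on_prem, aic_enabled=False,
                             webrtc_apm_enabled=False,
                             webrtc_apm_importable=False):
    """Return the transport audio filters that would actually run.

    Hypothetical restatement of the toggle table above.
    """
    filters = []
    # AIC is hard-disabled: the toggle is accepted but has no runtime effect.
    _ = aic_enabled
    # WebRTC APM needs its import to succeed; on dev it never does.
    if webrtc_apm_enabled and webrtc_apm_importable:
        filters.append("webrtc_apm")
    # AGC is a fallback: only when nothing else is active and not on-prem.
    if not filters and not on_prem:
        filters.append("agc")
    return filters


if __name__ == "__main__":
    print(select_transport_filters(on_prem=True, aic_enabled=True,
                                   webrtc_apm_enabled=True))
    print(select_transport_filters(on_prem=False, aic_enabled=True))
```

Whatever the dashboard toggles say, an on-prem deployment ends up with an empty chain and a cloud deployment ends up with AGC alone.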

The PR-747 war story: filter placement matters. Pre-747 (DFN feeding both VAD and STT) suppressed speech onsets: Silero fired ~950 ms late, ~260 ms of effective pre-roll reached STT, and the model produced a garbled domain phrase. Post-747 replays were near-correct (notes/investigations/garanti-stt-pr747/README.md). A 100 ms slice shift flips the transcript, so onset context is precious.

Symptom it fixes: garbled or dropped first words. The fix was placement (denoise VAD-only, feed STT raw audio), not a slower or "better" filter.

Step 15

Keyterm biasing: nearly free, with one tail risk

What it is. Bias the decoder toward domain vocabulary. Keys parsed at base_service.py:622-645; server consumers at stt-qwen-asr-opt src/server.py:381-412,493-512,604-620:

Key Terms (keyterms)
Boost Terms (boostTerms → boost_terms)
Boost Strength (boostStrength → boost_strength)
Audio-length Gate (keytermMinAudioSec, default 0.8 s)
Anti-Parrot Retry (keytermAntiParrot, default true)

Runtime effect / cost profile. Logit-bias construction is tokenize-only (microseconds); the HTTP path no longer injects keyterms into the prompt. The real cost is the anti-parrot retry: a full second transcription pass on the same audio when the biased output parrots the term list (~70–100 ms extra server-side, worse under contention; server.py:611-620). The 0.8 s audio gate is free and skips bias on exactly the turns most likely to trigger that retry.

When to change: boost 1–3 is safe, ≥5 hallucinates (dashboard guidance, en.ts:3498).
Version-skew caveat: the checked-out STT server branch ignores the boost / gate / anti-parrot keys on the streaming config path (streaming.py:647-659) — check your deployed STT image.

Symptom it causes: with boost ≥5, hallucinated insertions of keyterms; under contention, the anti-parrot retry's extra pass surfaces as an STT tail. Inline cost of a triggered retry: +70–100 ms.
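The key renames, the audio-length gate, and the boost guidance can be captured in three small helpers. All function names are hypothetical. Two details are assumptions rather than facts from the text: keys without a stated rename are passed through unchanged, and the gate uses "at least 0.8 s" as its comparison. The guidance for a boost of 4 is not stated, so it is reported as unspecified.

```python
# Dashboard key -> server key, for the two renames named in the text.
WIRE_NAMES = {"boostTerms": "boost_terms", "boostStrength": "boost_strength"}


def to_server_keys(settings: dict) -> dict:
    """Rename the documented keys; pass others through (assumption)."""
    return {WIRE_NAMES.get(k, k): v for k, v in settings.items()}


def bias_applies(audio_sec: float, settings: dict) -> bool:
    """Audio-length gate: skip bias on short turns (default 0.8 s)."""
    return audio_sec >= settings.get("keytermMinAudioSec", 0.8)


def boost_guidance(strength: float) -> str:
    """Dashboard guidance: 1-3 is safe, 5 and above hallucinates."""
    if strength >= 5:
        return "hallucination risk"
    if 1 <= strength <= 3:
        return "safe"
    return "unspecified"


if __name__ == "__main__":
    cfg = {"keyterms": ["IBAN"], "boostTerms": ["EFT"], "boostStrength": 2}
    print(to_server_keys(cfg))
    print(bias_applies(0.5, cfg), bias_applies(1.2, cfg))
    print(boost_guidance(cfg["boostStrength"]))
```

The gate matters more than it looks: short turns are where a biased decoder is most likely to parrot the term list and trigger the second pass.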

Step 16

Self-hosted STT latency is a contention story, not a model story

What it is. The single most important on-prem STT fact: the engine is fast; the GPU is shared.

idle Hopper, 5 s TR audio: ~68–84 ms; KKB, same concurrency: 359 ms; per-GPU plateau: ~13 STT RPS

Runtime effect. Single-flight inference for 5 s Turkish audio is ~68–84 ms on an idle Hopper GPU (isolated H200 84 ms p50). Production KKB wall-clock at the same concurrency: 359 ms — a 4× gap entirely explained by GPU sharing (STT shares KKB GPU 3 with TTS and NC), not engine config (2026-05-12 investigation, notes/reports/detailed/2026-05-12-stt-tts-latency-investigation/).

Engine knobs moved nothing (max_num_seqs, batch size) — the audio encoder saturates compute (97% SM).
Capacity planning: ~13 STT RPS per Hopper GPU, scale horizontally; p50 degrades to 6.3 s at concurrency 75.
When to change: don't tune the engine — add a GPU or stop sharing GPU 3 with TTS+NC.

Symptom it causes: STT p50 climbing from a healthy <400 ms to 800 ms+ to multi-second under campaign load — the second-biggest stage on a contended box, invisible on an idle test box.

Try it: STT capacity calculator

Estimate the vad_end → stt_final p50 you should expect from your concurrency. A single Hopper GPU plateaus around 13 STT RPS; sharing the card with TTS+NC multiplies wall-clock by a contention factor (KKB's measured 4× gap is the canonical example). Compare the result against the 400 / 800 ms thresholds.

vad_end → stt_final p50 estimator (interactive). Inputs:
Concurrent calls
Utterances per minute per call
GPU count (STT replicas)
Per-GPU plateau (STT RPS, Hopper preset 13): 13 RPS
Shared-GPU contention multiplier (1 = dedicated, ~4 = KKB GPU 3 sharing): 4.0×

The model: offered RPS = calls × upm / 60; per-GPU load = offered / GPUs; a base ~84 ms idle inference is multiplied by the contention factor and stretched as load approaches the plateau (queueing), exploding past it (the C=75 → 6.3 s curve). This is illustrative, anchored on the published KKB / H200 numbers — not a substitute for measuring your own stt.latency_from_vad_stop_secs.
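For readers without the interactive page, the same shape can be approximated in a few lines. The exact curve used by the calculator is not given, so the 1/(1 − ρ) queueing stretch below is a swapped-in M/M/1-style approximation, and returning infinity at or past the plateau is a simplification of "exploding". The 84 ms idle base, the 13 RPS plateau, and the 4× contention factor are the figures from the text; the function name is hypothetical.

```python
import math


def estimate_stt_p50_ms(calls, utterances_per_min, gpus,
                        plateau_rps=13.0, contention=4.0, idle_ms=84.0):
    """Rough vad_end -> stt_final p50 estimate in milliseconds.

    Assumption: queueing stretch is 1 / (1 - rho), rho = per-GPU load / plateau.
    """
    offered_rps = calls * utterances_per_min / 60.0
    per_gpu_rps = offered_rps / gpus
    rho = per_gpu_rps / plateau_rps
    base_ms = idle_ms * contention
    if rho >= 1.0:
        return math.inf  # past the plateau the queue grows without bound
    return base_ms / (1.0 - rho)


if __name__ == "__main__":
    for calls in (10, 60, 120):
        est = estimate_stt_p50_ms(calls, utterances_per_min=6, gpus=1)
        print(f"{calls:4d} calls -> {est:.0f} ms")
```

Compare the output against the 400 / 800 ms thresholds, then confirm with measured stt.latency_from_vad_stop_secs; the sketch only shows the direction and rough size of the effect.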

Checkpoint: a call shows ~3 s of dead air after the caller stops, then the agent says "could you repeat that?". Which stage failed, and what do you check?

That is the finalizeTimeout signature (_finalize_timeout.py:30): VAD declared stop, STT never delivered a transcript within 3 s, the mixin force-closed the turn and re-prompted.

Check STT server health and GPU contention (the KKB concurrency curve), not the LLM — the LLM never saw the turn. The +3 s red marker in the turn-timeline demo above is exactly this boundary.
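The force-close behavior can be illustrated with a plain asyncio timeout. This is a sketch of the pattern, not the mixin in _finalize_timeout.py: the function name and return shape are hypothetical, and only the 3 s limit and the re-prompt come from the text.

```python
import asyncio

FINALIZE_TIMEOUT_S = 3.0  # the +3 s boundary from the text


async def await_final_transcript(final, timeout_s=FINALIZE_TIMEOUT_S):
    """Wait for the finalized transcript; re-prompt if it never arrives.

    `final` is any awaitable that resolves to the transcript text.
    """
    try:
        text = await asyncio.wait_for(final, timeout=timeout_s)
        return ("transcript", text)
    except asyncio.TimeoutError:
        # The LLM is never invoked on this path: the turn is force-closed.
        return ("reprompt", "could you repeat that?")


async def _stt(delay_s, text):
    await asyncio.sleep(delay_s)
    return text


if __name__ == "__main__":
    # Short timeouts so the demo runs quickly.
    print(asyncio.run(await_final_transcript(_stt(0.01, "merhaba"), 0.5)))
    print(asyncio.run(await_final_transcript(_stt(1.0, "late"), 0.05)))
```

The second call shows why the LLM is the wrong place to look: the timeout branch returns before any transcript exists to send it.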

What dominates here

On a healthy box, nothing: STT hides behind the waitSeconds floor. Under shared-GPU load it degrades steadily, then sharply once load passes the per-GPU plateau, and becomes the second-biggest stage; the dead-toggle list (AIC, WebRTC APM, Krisp) saves you from tuning phantoms. The committed final is always a batch pass, so the levers are capacity (add a GPU, stop sharing GPU 3) and filter placement (PR-747: feed STT raw audio), not engine knobs.

Ask Claude Code: "Show me stt.latency_from_vad_stop_secs for this call's turns and compare against the 400/800 ms band — was it contention or a finalize timeout?"
