What this stage measures STT
Healthy band: vad_end → stt_final = 150–400 ms median; suspect above 800 ms. The agent records exactly this segment as stt.latency_from_vad_stop_secs on each VAD span — look there before instrumenting anything.
Runtime effect. The whole STT cost lands in this one span (pipecat-agent src/exts/diagnostics/pipeline_monitor.py:388-390). Whether you run the batch path or the streaming path, the committed final is a full batch transcription: streaming only adds cheaper interims, never a cheaper final.
Symptom it causes/fixes: ~3 s of dead air after the caller stops, then "could you repeat that?" — the finalize-timeout signature, almost always GPU contention, not the LLM. Inline cost on a loaded box: +359 ms p50 vs ~84 ms idle.
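A minimal sketch of reading the band, assuming you have already pulled the per-turn stt.latency_from_vad_stop_secs values for one call into a list. The function name and the "elevated" label for the 400 to 800 ms gap are hypothetical; only the 400 ms and 800 ms thresholds come from the text above.

```python
from statistics import median

HEALTHY_MAX_S = 0.400  # upper edge of the healthy band (from the text)
SUSPECT_MIN_S = 0.800  # above this the stage is suspect (from the text)


def classify_stt_latency(samples_secs):
    """Return (median, label) for a call's vad_end -> stt_final samples.

    `samples_secs` is a list of stt.latency_from_vad_stop_secs values in seconds.
    """
    if not samples_secs:
        raise ValueError("need at least one latency sample")
    p50 = median(samples_secs)
    if p50 > SUSPECT_MIN_S:
        label = "suspect"
    elif p50 > HEALTHY_MAX_S:
        label = "elevated"  # hypothetical label: the text does not name this gap
    else:
        label = "healthy"
    return p50, label
```

Using the median keeps a single finalize-timeout turn from flipping the verdict for an otherwise healthy call; inspect the slowest turns separately.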
The batch path (on-prem default) STT
What it is. FreyaSTTService (src/exts/stt/freya.py:91), a segmented HTTP STT: the whole utterance is buffered, then POSTed after VAD-stop.
Runtime effect. pipecat's SegmentedSTTService keeps a rolling 1 s buffer while idle so VAD-detection delay doesn't clip onsets (services/stt_service.py:629), then on VADUserStoppedSpeakingFrame wraps the utterance as WAV and POSTs to /audio/transcriptions with verbose_json for per-word confidences (freya.py:215-271).
- Retry policy: client timeout 5.0 s; one immediate retry on 429/502/503/504, but 500s are deliberately not retried: on-prem 500s are slow and the finalize-timeout re-prompt recovers instead (freya.py:37-41,128,193-213).
- When to change: rarely a knob. The batch path's cost is utterance length × GPU load, not config.
Latency shape: the whole STT cost lands in the vad_end → stt_final window and scales with utterance length and GPU load. Declared budget: ONPREM_FREYA_STT_TTFS_P99 = 0.55 s (freya.py:30-32).
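The retry policy described for the batch path can be sketched as follows. This is an illustration of the rule only, not the code in freya.py: `send` is a hypothetical zero-argument callable that performs the POST (with the 5.0 s client timeout applied by the caller) and returns a status code and body.

```python
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})  # from the text
CLIENT_TIMEOUT_S = 5.0  # from the text; the caller's HTTP client would enforce it


def post_with_single_retry(send):
    """Call send() and retry once, immediately, on a retryable status.

    `send` is a hypothetical callable returning (status_code, body).
    A 500 is returned as-is: per the text, on-prem 500s are slow and the
    finalize-timeout re-prompt is the recovery path instead.
    """
    status, body = send()
    if status in RETRYABLE_STATUSES:
        status, body = send()  # one immediate retry, no backoff
    return status, body
```

Capping the policy at one retry bounds the worst case at two client timeouts, which matters because this call sits entirely inside the vad_end → stt_final window.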
The streaming path (opt-in) — interims, not cheaper finals STT
What it is. FreyaSTTStreamingService (src/exts/stt/freya_v3.py:108), one long-lived WebSocket per call to the Qwen-ASR /v1/audio/stream endpoint. Enabled by sttConfig.additionalSettings.streaming.
stt-config-panel.tsx: streaming is reachable only via raw additionalSettings JSON / API, not a panel switch.
Runtime effect. Audio streams only while VAD reports speech, with a 0.3 s pre-roll flushed at VAD-start so onsets aren't clipped (freya_v3.py:139,366-390). Every 500 ms the server's LocalAgreement tick re-transcribes the rolling buffer and emits agreed-prefix / partial pairs; the client maps both to InterimTranscriptionFrame, never committed (freya_v3.py:445-456; stt-qwen-asr-opt src/streaming.py:718-751).
On VAD-stop the client sends {"type":"finalize"} and the server runs a full-utterance batch_transcribe_vad pass — the same accurate batch pipeline — returning the single committed TranscriptionFrame with finalized=True (freya_v3.py:613-631,458-487; streaming.py:756-803).
- Declared ttfs_p99_latency = 1.5 s (freya_v3.py:171).
- Reconnect budget: 5 trouble events / 30 s window, then graceful call end (base_service.py:656-659).
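A reconnect budget of this shape is commonly implemented as a sliding-window counter. The sketch below is an assumption about the mechanics, not a copy of base_service.py: the class name is hypothetical, and it assumes the budget is exhausted when the fifth event lands inside the 30 s window (the source does not say whether the fifth or sixth event ends the call).

```python
from collections import deque


class TroubleBudget:
    """Sliding-window counter for connection trouble events (hypothetical)."""

    def __init__(self, max_events=5, window_secs=30.0):
        self.max_events = max_events
        self.window_secs = window_secs
        self._events = deque()  # timestamps of events still inside the window

    def record(self, now):
        """Record an event at time `now` (seconds); True means end the call."""
        self._events.append(now)
        # Drop events that have aged out of the window.
        while now - self._events[0] > self.window_secs:
            self._events.popleft()
        return len(self._events) >= self.max_events
```

A burst of failures trips the budget quickly, while isolated reconnects spread over minutes never do.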
Try it: LocalAgreement — the committed prefix vs the unstable tail
Every 500 ms the streaming server re-transcribes the growing audio buffer. The longest common prefix across ticks is treated as stable (mapped to an interim, never committed); the divergent tail is provisional and keeps changing. On finalize the whole thing is replaced by one accurate batch pass. Watch a prefix stabilize tick by tick.
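The prefix-agreement idea can be sketched in a few lines. This is a simplified illustration that compares each tick's hypothesis with the previous tick's at word level; the function names are hypothetical and the server's actual LocalAgreement logic in streaming.py may differ (for example in how many ticks must agree).

```python
def common_prefix(a, b):
    """Return the longest common leading run of two word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def local_agreement(hypotheses):
    """Yield (stable_prefix, unstable_tail) for each tick's hypothesis.

    `hypotheses` is an iterable of transcript strings, one per 500 ms tick.
    Words that match the previous tick are treated as stable; the rest is
    the provisional tail. Neither part is a committed transcript.
    """
    previous = None
    for text in hypotheses:
        words = text.split()
        stable = common_prefix(previous, words) if previous is not None else []
        yield stable, words[len(stable):]
        previous = words
```

Feeding it the ticks "hello", "hello word", "hello world how" shows the tail being rewritten ("word" becomes "world") while "hello" stays stable, which is the behaviour the demo animates.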
The three declared p99 constants are live wait-time inputs STT
What it is. 0.55 s (batch on-prem, freya.py:32) / 1.5 s (streaming, freya_v3.py:171) / 2.01 s (upstream OpenAI default) are not documentation — they parameterize the turn-stop timeout and are broadcast via STTMetadataFrame.
Runtime effect. The turn-stop timeout is max(ttfs_p99 − stop_secs, waitSeconds), broadcast via STTMetadataFrame (stt_service.py:450-454). A provider with no constant gets DEFAULT_TTFS_P99 = 1.0 plus a warning (services/stt_latency.py:31).
- When to change: you don't tune these directly, but know that cloud providers with big p99s (Azure 1.8, OpenAI batch 2.01) can make the turn wait well past the waitSeconds floor when no finalized frame arrives.
Symptom it explains: a turn that feels slower than your configured waitSeconds on a cloud STT provider — the provider's declared p99 is raising the timeout above the floor.
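The timeout formula is small enough to check by hand. The sketch below restates max(ttfs_p99 − stop_secs, waitSeconds) as a function; the function name is hypothetical, and the stop_secs and waitSeconds values used in the example loop are placeholders, not values from any real config.

```python
DEFAULT_TTFS_P99 = 1.0  # fallback for providers with no declared constant


def turn_stop_timeout(ttfs_p99, stop_secs, wait_seconds):
    """Turn-stop timeout in seconds: the declared p99 minus the VAD stop
    delay, floored at waitSeconds."""
    return max(ttfs_p99 - stop_secs, wait_seconds)


if __name__ == "__main__":
    # Placeholder settings for illustration only.
    stop_secs, wait_seconds = 0.2, 0.5
    for name, p99 in [("batch on-prem", 0.55), ("streaming", 1.5),
                      ("OpenAI batch", 2.01), ("undeclared", DEFAULT_TTFS_P99)]:
        print(f"{name}: {turn_stop_timeout(p99, stop_secs, wait_seconds):.2f} s")
```

With these placeholder settings the on-prem batch constant stays at the waitSeconds floor, while the larger streaming and cloud constants push the timeout above it.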
Try it: where stt_final lands on the turn clock
This stitches the VAD stop, the STT inference, and the wait floor onto one axis. The red marker at +3 s is the finalize-timeout: if no finalized frame arrives before it, the turn force-closes and the agent re-prompts ("could you repeat?"). Toggle "finalized frame?" off to see the timeout fire. Pick a ttfs_p99 preset to see the turn-stop wait move.
Audio filters: what actually runs (and the dead toggles) STT
What it is. The transport's audio_in_filter runs synchronously in the audio loop (transports/base_input.py:235-262) — anything slow here delays VAD, STT, and recording at once. Here is what is real on dev (base_service.py:1434-1482):
| Toggle | Status |
|---|---|
| AIC Filter (aicEnabled) | dead: hard-disabled, logs "hard disabled due to crate instability" (base_service.py:1467-1468). NO runtime consumer. |
| WebRTC APM (webrtcApmEnabled) | dead on dev: PR #625 merged then reverted 2026-04-10 (commit dc90fd7f); the ImportError guard at boot_steps.py:159-163 means WebRTCAPMFilter is always None. NO runtime consumer. |
| AGC | cloud only: appended as fallback when no other filter and NOT on-prem (base_service.py:1471-1472). Sub-ms numpy; its cost is accuracy, not latency: AGC amplifying leading noise caused phantom/dropped words in ~30% of short utterances. On-prem runs no transport audio filter at all. |
| DeepFilterNet (deepFilterEnabled) | live, but VAD-only since PR #747 (2026-05-05): denoise happens privately inside the VAD analyzer; STT always receives raw audio (src/exts/audio/deepfilternet_vad.py:1-6,47-56). |
| Krisp | a plan (notes/plans/krisp-nc-integration.md), not a feature. Do not present as shipping. |
notes/investigations/garanti-stt-pr747/README.md). A 100 ms slice shift flips the transcript: onset context is precious.
Symptom it fixes: garbled or dropped first words. The fix was placement (denoise VAD-only, feed STT raw audio), not a slower or "better" filter.
Keyterm biasing: nearly free, with one tail risk STT
What it is. Bias the decoder toward domain vocabulary. Keys parsed at base_service.py:622-645; server consumers at stt-qwen-asr-opt src/server.py:381-412,493-512,604-620:
Runtime effect / cost profile. Logit-bias construction is tokenize-only (microseconds); the HTTP path no longer injects keyterms into the prompt. The real cost is the anti-parrot retry: a full second transcription pass on the same audio when the biased output parrots the term list (~70–100 ms extra server-side, worse under contention; server.py:611-620). The 0.8 s audio gate is free and skips bias on exactly the turns most likely to trigger that retry.
- When to change: boost 1–3 is safe, ≥5 hallucinates (dashboard guidance, en.ts:3498).
- Version-skew caveat: the checked-out STT server branch ignores the boost / gate / anti-parrot keys on the streaming config path (streaming.py:647-659), so check your deployed STT image.
Symptom it causes: with boost ≥5, hallucinated insertions of keyterms; under contention, the anti-parrot retry's extra pass surfaces as an STT tail. Inline cost of a triggered retry: +70–100 ms.
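The boost guidance and the audio gate can be combined into one pre-flight check. This is a hedged sketch, not the server's logic in server.py: the function name and return shape are hypothetical, and it assumes the 0.8 s gate skips bias for audio shorter than 0.8 s, which the text implies but does not state outright.

```python
MIN_AUDIO_SECS_FOR_BIAS = 0.8  # gate from the text; direction is an assumption


def plan_keyterm_bias(audio_secs, boost, keyterms):
    """Return (apply_bias, note) for one utterance.

    Hypothetical helper: decides whether to bias and flags risky boosts
    using the dashboard guidance (1-3 safe, >=5 hallucinates).
    """
    if not keyterms:
        return False, "no keyterms configured"
    if audio_secs < MIN_AUDIO_SECS_FOR_BIAS:
        return False, "audio shorter than the gate; bias skipped"
    if boost >= 5:
        return True, "boost >= 5: expect hallucinated keyterm insertions"
    if boost > 3:
        return True, "boost above the documented safe range of 1-3"
    return True, "ok"
```

Skipping bias on very short audio also avoids the anti-parrot retry on those turns, since the retry only runs when biased output parrots the term list.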
Self-hosted STT latency is a contention story, not a model story STT
What it is. The single most important on-prem STT fact: the engine is fast; the GPU is shared.
Runtime effect. Single-flight inference for 5 s Turkish audio is ~68–84 ms on an idle Hopper GPU (isolated H200 84 ms p50). Production KKB wall-clock at the same concurrency: 359 ms — a 4× gap entirely explained by GPU sharing (STT shares KKB GPU 3 with TTS and NC), not engine config (2026-05-12 investigation, notes/reports/detailed/2026-05-12-stt-tts-latency-investigation/).
- Engine knobs moved nothing (max_num_seqs, batch size): the audio encoder saturates compute (97% SM).
- Capacity planning: ~13 STT RPS per Hopper GPU, scale horizontally; p50 degrades to 6.3 s at concurrency 75.
- When to change: don't tune the engine — add a GPU or stop sharing GPU 3 with TTS+NC.
Symptom it causes: STT p50 climbing from a healthy <400 ms to 800 ms+ to multi-second under campaign load — the second-biggest stage on a contended box, invisible on an idle test box.
Try it: STT capacity calculator
Estimate the vad_end → stt_final p50 you should expect from your concurrency. A single Hopper GPU plateaus around 13 STT RPS; sharing the card with TTS+NC multiplies wall-clock by a contention factor (KKB's measured 4× gap is the canonical example). Compare the result against the 400 / 800 ms thresholds.
The model: offered RPS = calls × upm / 60; per-GPU load = offered / GPUs; a base ~84 ms idle inference is multiplied by the contention factor and stretched as load approaches the plateau (queueing), exploding past it (the C=75 → 6.3 s curve). This is illustrative, anchored on the published KKB / H200 numbers — not a substitute for measuring your own stt.latency_from_vad_stop_secs.
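The model above can be sketched as a small function. The offered-RPS and per-GPU arithmetic follow the text; everything else is an assumption made for illustration: "upm" is read as utterances per minute per call, the queueing stretch uses a simple 1 / (1 − utilization) factor, and load at or past the plateau is reported as infinite rather than fitted to the measured 6.3 s point.

```python
import math

IDLE_INFERENCE_S = 0.084    # idle H200 p50 from the text
PLATEAU_RPS_PER_GPU = 13.0  # approximate per-GPU plateau from the text


def estimate_stt_p50(calls, utterances_per_min, gpus, contention_factor=1.0):
    """Return (offered_rps, per_gpu_rps, estimated_p50_secs).

    Illustrative only: the stretch factor is an assumed M/M/1-style curve,
    not the calculator's actual formula.
    """
    if gpus <= 0:
        raise ValueError("gpus must be positive")
    offered_rps = calls * utterances_per_min / 60.0
    per_gpu_rps = offered_rps / gpus
    utilization = per_gpu_rps / PLATEAU_RPS_PER_GPU
    base = IDLE_INFERENCE_S * contention_factor
    if utilization >= 1.0:
        return offered_rps, per_gpu_rps, math.inf  # past the plateau
    return offered_rps, per_gpu_rps, base / (1.0 - utilization)
```

Treat the output as a rough planning figure to compare against the 400 / 800 ms thresholds, and prefer measured stt.latency_from_vad_stop_secs whenever it is available.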
Checkpoint: a call shows ~3 s of dead air after the caller stops, then the agent says "could you repeat that?". Which stage failed, and what do you check?
That is the finalizeTimeout signature (_finalize_timeout.py:30): VAD declared stop, STT never delivered a transcript within 3 s, the mixin force-closed the turn and re-prompted.
Check STT server health and GPU contention (the KKB concurrency curve), not the LLM — the LLM never saw the turn. The +3 s red marker in the turn-timeline demo above is exactly this boundary.
What dominates here
On a healthy box, nothing — STT hides behind the waitSeconds floor. Under shared-GPU load it degrades linearly and becomes the second-biggest stage; the dead-toggle list (AIC, WebRTC APM, Krisp) saves you from tuning phantoms. The committed final is always a batch pass, so the levers are capacity (add a GPU, stop sharing GPU 3) and filter placement (PR-747: feed STT raw audio), not engine knobs.
stt.latency_from_vad_stop_secs for this call's turns and compare against the 400/800 ms band — was it contention or a finalize timeout?"