llm_first_tok → tts_first_aud gap with a healthy TTS TTFB means "blame the LLM stream or the sentence shape", and the fix lives in the prompt, not the TTS server.
The sentence-aggregation rule TTS
What it is. TTS does not synthesize token-by-token. It buffers the LLM token stream in a SimpleTextAggregator running in SENTENCE mode (the default), and only fires an HTTP request once a full sentence is ready.
Runtime effect. A sentence is released only when (1) the buffer ends in sentence punctuation AND (2) at least one non-whitespace lookahead character has arrived after it. That lookahead disambiguates "$29." (a decimal mid-sentence, do not flush) from "$29. Next" (a real sentence boundary, flush). See pipecat utils/text/simple_text_aggregator.py:99-120.
- A final sentence whose lookahead never arrives is flushed only at LLMFullResponseEndFrame (pipecat tts_service.py:676).
- For a one-sentence reply, dispatch waits for the entire LLM stream to finish: there is no character after the closing period to trigger the release.
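The release rule above can be sketched as a toy aggregator. This is a minimal illustration, not pipecat's SimpleTextAggregator: the names LookaheadAggregator and first_dispatch are hypothetical, and the sketch assumes a boundary only counts when whitespace separates the punctuation from the lookahead character, which is a simplification that keeps "$29.99" intact. The real boundary detection may differ.

```python
from typing import Iterable, Iterator, Tuple

SENTENCE_END = ".!?"  # assumption: the punctuation set used by this sketch


class LookaheadAggregator:
    """Toy sentence aggregator: release on punctuation plus a confirmed lookahead."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, token: str) -> Iterator[str]:
        """Append one streamed token and yield any sentence it releases."""
        for ch in token:
            self._buf += ch
            if ch.isspace():
                continue  # whitespace alone never confirms a boundary
            head = self._buf[:-1]      # everything before the lookahead char
            stripped = head.rstrip()
            # Boundary: punctuation, then whitespace, then this lookahead char.
            if stripped and stripped[-1] in SENTENCE_END and stripped != head:
                yield stripped
                self._buf = ch         # the lookahead starts the next sentence

    def flush(self) -> str:
        """End-of-response flush (stands in for the LLMFullResponseEndFrame path)."""
        rest, self._buf = self._buf.strip(), ""
        return rest


def first_dispatch(tokens: Iterable[str]) -> Tuple[int, str]:
    """Return (1-based index of the token that released the first sentence, text).

    If no lookahead ever arrives, the whole stream is consumed and the
    end-of-response flush releases the text.
    """
    agg = LookaheadAggregator()
    count = 0
    for count, tok in enumerate(tokens, start=1):
        for sentence in agg.feed(tok):
            return count, sentence
    return count, agg.flush()


# Short first sentence: released as soon as the next sentence's first char arrives.
print(first_dispatch(["Anladım", ".", " Sipariş", "iniz", " yolda", "."]))
# Single sentence: nothing is released until the stream ends.
print(first_dispatch(["Sipariş", "iniz", " yolda", "."]))
```

Dividing the returned token index by the LLM's token rate gives a rough estimate of how long the first sentence waits in the aggregator.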
Consequences. First audio is proportional to LLM token rate × first-sentence length. Short first sentences ("Anladım.", "Tabii ki.") are a free latency lever — prompt-writing is a latency tool. Static messages bypass all of this: a TTSSpeakFrame (a workflow on_enter message, a static first_message) goes straight to synthesis (pipecat tts_service.py:707-729), which is why a static greeting starts as fast as the server allows while a model-generated one pays LLM TTFT plus a full first sentence.
Symptom it causes/fixes: long single-sentence replies show a fat llm_first_tok → tts_first_aud gap that lands in the tts.text_aggregation span, not in TTS TTFB. Fix in the prompt: instruct shorter, multi-sentence replies (saves hundreds of ms).
tts.text_aggregation span (src/exts/tts/freya_tts_v2.py:248-280; the dashboard pairs it with LLM spans for TTFS, src/core/tracing.py:803-816). A long tts.text_aggregation span means "blame the LLM stream or sentence shape", not TTS.
Try it — first-sentence dispatcher
Type a reply, set the LLM's token rate, and press Stream. Tokens arrive one at a time.
Watch the aggregator hold the buffer until punctuation plus a lookahead character appears,
then fire the first TTS request. The presets show the "$29." decimal trap and the
single-sentence end-flush. Toggle static TTSSpeakFrame to bypass the wait entirely.
The filter chain (and its double execution) TTS
What it is. Filters run per aggregated sentence, before the HTTP request, on the critical path: MarkdownTextFilter → SpeechTextFilter (emoji/email/URL → user regex substitutions → TR number normalization) → LanguageTextFilter.
Runtime effect & costs.
- Built-in layers: sub-millisecond compiled regex. Negligible.
- User substitution rules: the regex engine enforces a 0.1 s wall-clock timeout per rule per sentence (speech_text_filter.py:42,311). A pathological pattern costs exactly 100 ms each time, then is skipped.
- Filters execute twice per sentence under FreyaTTSService. The strip-to-empty probe (freya_tts_v2.py:466-518, a hotfix for the 2026-05-22 permanent-mute bug on call 6eb6969f) runs the chain, then super() runs it again. Harmless for sane rules; doubles the 100 ms hit for bad ones: a single pathological substitution costs +200 ms per sentence.
- The langdetect layer of LanguageTextFilter is deliberately dead code behind an early return ("DISABLE LAYER 2 UNTIL WE CAN BENCHMARK LANGDETECT", language_text_filter.py:210-238).
app/api/routes.py:38,117-122); the VoxCPM2 server runs a GPU NER+ByT5 normalizer on a single-thread executor that serializes all concurrent requests (voxcpm2 server.py:51-61,120,215-216; per-request latency unmeasured, not verified in source).
Symptom it causes/fixes: a customer adds one greedy substitution rule and every sentence gets two surprise 100 ms stalls. The fix is the rule, not the engine; the double execution is a deliberate mute-bug guard.
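The timeout arithmetic in this section can be written down as a small cost model. This is only a worked calculation, assuming every pathological rule hits the full 0.1 s timeout on every pass; the function name filter_stall_ms is hypothetical and nothing here calls the real filter chain.

```python
RULE_TIMEOUT_MS = 100  # per rule, per sentence (the 0.1 s timeout from the text)


def filter_stall_ms(pathological_rules: int, chain_passes: int = 2) -> int:
    """Worst-case latency added to one sentence by rules that hit the timeout.

    chain_passes defaults to 2 because the chain runs twice per sentence
    under FreyaTTSService (probe pass, then the regular pass).
    """
    return pathological_rules * chain_passes * RULE_TIMEOUT_MS


print(filter_stall_ms(1))                  # 200: one bad rule, chain run twice
print(filter_stall_ms(1, chain_passes=1))  # 100: same rule if the chain ran once
```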
The FreyaTTS request path TTS
What it is. FreyaTTSService (src/exts/tts/freya_tts_v2.py:158) is selected on-prem (ENVIRONMENT=on-premise) or under provider freya with a native voice name.
- One HTTP POST per sentence to /audio/speech, no websocket. Sentence 2+ overlaps sentence 1's playback, so per-sentence RTT surfaces only as inter-sentence gaps under load (:233-236,313).
- Requests raw PCM (response_format: "pcm", :297). This is critical: the spark server buffers mp3/opus/aac/flac to completion before yielding (spark-tts app/api/streaming.py:328,365-369); a client requesting mp3 gets whole-utterance latency instead of TTFB.
- Streams without chunk buffering, only sample-aligned, "to minimize TTFB" (:287,350-378); the TTFB metric is the first non-empty body chunk (:311,357).
- Retry only before the first audio byte: 2 attempts, 250 ms backoff, 5 s connect timeout; exhaustion ends the call rather than leaving dead air (:170-171,306-330).
- Fixed tail costs: 100 ms silence pad + ~40 ms transport buffer + ~10–30 ms resampler tail (:21-23,409-410). These are turn-end costs, not first-audio costs.
Symptom it causes/fixes: request mp3/opus from this server and your "TTFB" silently becomes whole-utterance latency. PCM is non-negotiable for streaming.
Try it — TTFA budget waterfall
Stack the contributors to time-to-first-audio. The sentence wait is the LLM-stream cost; the rest is the TTS path. Toggle a bad substitution rule (+100 ms, ×2 under FreyaTTSService), pick an HTTP RTT preset, and slide concurrency to read the server queue off the measured Spark TTFB curve. The 300 ms healthy and 600 ms suspect bands are drawn as overlays.
Server-side knobs and the concurrency curve TTS
- INITIAL_CHUNK_SIZE: the spark server decodes its first audio block only after this many semantic tokens accumulate from the 0.5B vLLM generator; default 100, every Freya on-prem deploy overrides to 50 (spark-tts app/config.py:112; freya-onprem hosts/kkb/docker-compose.gpu.yaml:135). This is the server-side first-audio knob; lowering further trades TTFB against more vocoder calls and fade seams.
- Long-text chunking: texts >400 chars split into ≤250-char chunks processed sequentially (app/config.py:107-108); mostly affects long static on_enter monologues.
- Predefined phrase tokens: common phrases served from pre-baked token sequences skip generation entirely (config.py:101-106).
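The long-text thresholds above can be illustrated with a small splitter. This is a sketch only: the 400 and 250 character limits come from the text, but the split points are an assumption (greedy on whitespace), the name chunk_text is hypothetical, and the server may well split on sentence boundaries instead.

```python
from typing import List

LONG_TEXT_THRESHOLD = 400  # chars; texts at or below this are sent whole
MAX_CHUNK_CHARS = 250      # chars; upper bound for each chunk of a long text


def chunk_text(text: str) -> List[str]:
    """Greedy whitespace splitter (assumed strategy, not the server's code)."""
    if len(text) <= LONG_TEXT_THRESHOLD:
        return [text]
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > MAX_CHUNK_CHARS and current:
            chunks.append(current)
            current = word  # note: a single word longer than the limit is kept whole
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


monologue = "word " * 100  # 500 characters, so it crosses the threshold
print([len(c) for c in chunk_text(monologue)])
```

Because the chunks are processed sequentially, only the first chunk's length matters for first audio; the rest adds to total synthesis time.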
StreamingResponse, not first audio.
Try it — concurrency vs TTFB
The same Spark engine on three deployments. Verda H200 loopback and fin03 dedicated-H200 stay flat to C=20; KKB over WAN starts higher (WAN RTT + GPU sharing) and climbs faster. The dashed line is the ~30 turns/s throughput ceiling where one card saturates; the marker at C=10 flags the measured 5.4 s p95 cold-burst spike (p50 was still 170 ms). Slide to read off each curve.
Selection traps and dead knobs TTS
The ElevenLabs remap trap. Under provider freya, three magic voice UUIDs silently route to ElevenLabs cloud (base_service.py:996-1031) — the latency profile (and its WAN dependence) changes with a voice pick, not a provider pick. No clean internal Freya-vs-ElevenLabs A/B exists; the only internal ElevenLabs numbers are 2026-03 sim-environment pathology (2–10 s greetings) on an old stack.
minTextLength / minWords are read and logged but passed to nothing: the constructor that accepted them is commented out (base_service.py:988-994,1051-1059). speed on the native VoxCPM2 server is "accepted for compatibility; not yet applied" (voxcpm2 server.py:119). NO runtime consumers: do not tune phantoms or promise a customer a latency change via them.
Sample-rate bookkeeping. Spark outputs 16 kHz, the nanovllm shim 24 kHz, native VoxCPM2 48 kHz. A mismatched TTS_SAMPLE_RATE does not add latency but silently corrupts the duration tracker and word timestamps (freya_tts_v2.py:166,394).
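A short calculation shows why a mismatched sample rate corrupts duration tracking. This is a worked example assuming 16-bit mono PCM and a tracker that derives duration from byte count; the function name pcm_duration_s is hypothetical and is not taken from freya_tts_v2.py.

```python
def pcm_duration_s(
    num_bytes: int,
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit samples
    channels: int = 1,          # assumption: mono
) -> float:
    """Duration implied by a PCM byte count at the configured sample rate."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)


one_second_at_48k = 48_000 * 2  # bytes in one second of 48 kHz 16-bit mono audio

print(pcm_duration_s(one_second_at_48k, 48_000))  # 1.0, rate matches the audio
print(pcm_duration_s(one_second_at_48k, 16_000))  # 3.0, rate configured as 16 kHz
```

The audio itself is unchanged in both cases; only the bookkeeping derived from the configured rate is off, here by a factor of three.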
Symptom it causes/fixes: a "TTS got slow when we changed the voice" report is usually the ElevenLabs UUID remap, not the engine. Check the voice ID before profiling the server.
Checkpoint: traces show healthy TTS TTFB (~150 ms) but a fat llm_first_tok → tts_first_aud gap on certain turns. The agent's replies on those turns are long single sentences. What's happening?
The sentence aggregator can't release text: a single-sentence reply has no lookahead character after its final punctuation, so dispatch waits for the entire LLM stream to finish (the LLMFullResponseEndFrame flush, pipecat tts_service.py:676).
The wait shows up in tts.text_aggregation, not in TTS TTFB — which is exactly why TTFB still looks healthy. Fix in the prompt: instruct shorter, multi-sentence replies. The first short sentence dispatches as soon as the second sentence's first character arrives.