Part 9 — TTS: from first token to first audio

The headline of Part 9. What dominates here is the sentence wait (first-sentence length ÷ LLM token rate), not synthesis. TTS itself is ~120–270 ms and only degrades under GPU sharing or past C≈20 on one card. So a fat llm_first_tok → tts_first_aud gap with a healthy TTS TTFB means "blame the LLM stream or the sentence shape", and the fix lives in the prompt, not the TTS server.

Step 45

The sentence-aggregation rule TTS

What it is. TTS does not synthesize token-by-token. It buffers the LLM token stream in a SimpleTextAggregator running in SENTENCE mode (the default), and only fires an HTTP request once a full sentence is ready.

SimpleTextAggregator → SENTENCE mode (pipecat services/tts_service.py:275-276). Release rule: punctuation + lookahead char.

Runtime effect. A sentence is released only when (1) the buffer ends in sentence punctuation AND (2) at least one non-whitespace lookahead character has arrived after it. That lookahead disambiguates "$29." (a decimal mid-sentence, do not flush) from "$29. Next" (a real sentence boundary, flush). See pipecat utils/text/simple_text_aggregator.py:99-120.

A final sentence whose lookahead never arrives is flushed only at LLMFullResponseEndFrame (pipecat tts_service.py:676).
For a one-sentence reply, dispatch waits for the entire LLM stream to finish — there is no character after the closing period to trigger the release.
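The release rule above can be sketched as a small character-level buffer. This is an illustration only, not pipecat's SimpleTextAggregator: the class name SentenceReleaseSketch, the punctuation set, and the assumption that a boundary is confirmed by whitespace followed by a non-whitespace lookahead character are all hypothetical.

```python
from typing import List, Optional

SENTENCE_END = ".!?"  # assumed punctuation set, for illustration only


class SentenceReleaseSketch:
    """Hypothetical model of the 'punctuation + lookahead' release rule."""

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, token: str) -> List[str]:
        """Add one LLM token; return any sentences that became releasable."""
        released = []
        for ch in token:
            self._buf += ch
            sentence = self._try_release()
            if sentence is not None:
                released.append(sentence)
        return released

    def _try_release(self) -> Optional[str]:
        # The newest character is the lookahead candidate; whitespace does not count.
        if self._buf[-1].isspace():
            return None
        head = self._buf[:-1]
        body = head.rstrip()
        # Assumption: a real boundary is punctuation, then whitespace, then lookahead.
        # "$29.9" has no whitespace after the period, so it is not released.
        if body == head or not body or body[-1] not in SENTENCE_END:
            return None
        self._buf = self._buf[-1]  # keep the lookahead for the next sentence
        return body

    def flush(self) -> Optional[str]:
        """Release whatever is left, as happens at the end of the LLM response."""
        rest = self._buf.strip()
        self._buf = ""
        return rest or None
```

Feeding the tokens "It costs $29", ".", "99 today", ".", " Next" releases "It costs $29.99 today." only when the "N" arrives. A one-sentence reply releases nothing until flush() is called.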

Consequences. Time to first audio scales with first-sentence length divided by LLM token rate. Short first sentences ("Anladım." meaning "Understood.", "Tabii ki." meaning "Of course.") are a free latency lever: prompt-writing is a latency tool. Static messages bypass all of this: a TTSSpeakFrame (a workflow on_enter message, a static first_message) goes straight to synthesis (pipecat tts_service.py:707-729), which is why a static greeting starts as fast as the server allows while a model-generated one pays LLM TTFT plus a full first sentence.
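As a rough worked example, the sentence wait can be estimated as the number of tokens that must arrive before release, divided by the token rate. The function name and the one-token lookahead default are assumptions for illustration; real token timing is burstier than a constant rate.

```python
def sentence_wait_ms(first_sentence_tokens: int,
                     tokens_per_second: float,
                     lookahead_tokens: int = 1) -> float:
    """Estimate the wait from first LLM token to first releasable sentence.

    Assumes a constant token rate and that one extra token carries the
    lookahead character (both are simplifications).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return 1000.0 * (first_sentence_tokens + lookahead_tokens) / tokens_per_second
```

Under these assumptions, a 3-token opener at 40 tok/s waits about 100 ms, while a 39-token single sentence waits about a full second before any TTS request is sent.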

Symptom it causes/fixes: long single-sentence replies show a fat llm_first_tok → tts_first_aud gap that lands in the tts.text_aggregation span, not in TTS TTFB. Fix in the prompt: instruct shorter, multi-sentence replies (saves on the order of hundreds of ms).

Measurement. The window LLM-first-token → first-sentence-ready is surfaced as the tts.text_aggregation span (src/exts/tts/freya_tts_v2.py:248-280; the dashboard pairs it with LLM spans for TTFS, src/core/tracing.py:803-816). A long tts.text_aggregation span means "blame the LLM stream or sentence shape", not TTS.

Try it — first-sentence dispatcher

Sentence aggregator & the release moment

Type a reply, set the LLM's token rate, and press Stream. Tokens arrive one at a time. Watch the aggregator hold the buffer until punctuation plus a lookahead character appears, then fire the first TTS request. The presets show the "$29." decimal trap and the single-sentence end-flush. Toggle static TTSSpeakFrame to bypass the wait entirely.

[Interactive widget. Controls: reply text (the LLM's full response); LLM output rate (default 40 tok/s); static TTSSpeakFrame toggle (bypass aggregation). Readout: aggregation time. The first TTS request fires when the first sentence is releasable.]

Step 46

The filter chain (and its double execution) TTS

What it is. Filters run per aggregated sentence, before the HTTP request, on the critical path: MarkdownTextFilter → SpeechTextFilter (emoji/email/URL → user regex substitutions → TR number normalization) → LanguageTextFilter.

Source: base_service.py:884-902; src/exts/filters/speech_text_filter.py:325-364.

Runtime effect & costs.

Built-in layers: sub-millisecond compiled regex. Negligible.
User substitution rules: the regex engine enforces a 0.1 s wall-clock timeout per rule per sentence (speech_text_filter.py:42,311). A pathological pattern costs exactly 100 ms each time, then is skipped.
Filters execute twice per sentence under FreyaTTSService. The strip-to-empty probe (freya_tts_v2.py:466-518, a hotfix for the 2026-05-22 permanent-mute bug on call 6eb6969f) runs the chain, then super() runs it again. Harmless for sane rules; doubles the 100 ms hit for bad ones — a single pathological substitution costs +200 ms per sentence.
The langdetect layer of LanguageTextFilter is deliberately dead code behind an early return ("DISABLE LAYER 2 UNTIL WE CAN BENCHMARK LANGDETECT", language_text_filter.py:210-238).

A second, server-side normalization pass also defaults on. spark-tts attempts an HTTP sidecar call per request (fast connection-refused when absent, up to 30 s if deployed-but-hung; spark-tts app/api/routes.py:38,117-122); the VoxCPM2 server runs a GPU NER+ByT5 normalizer on a single-thread executor that serializes all concurrent requests (voxcpm2 server.py:51-61,120,215-216; per-request latency unmeasured, not verified in source).

Symptom it causes/fixes: a customer adds one greedy substitution rule and every sentence gets two surprise 100 ms stalls. The fix is the rule, not the engine — the double-execution is a deliberate mute-bug guard.
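The cost of pathological substitution rules is simple arithmetic: each bad rule burns its full timeout on every pass over every sentence. A minimal sketch, using the 100 ms timeout and the two passes described above; the function and constant names are hypothetical.

```python
RULE_TIMEOUT_MS = 100  # per-rule, per-sentence timeout quoted in the text
FILTER_PASSES = 2      # strip-to-empty probe + the regular pass under FreyaTTSService


def worst_case_filter_stall_ms(pathological_rules: int,
                               sentences: int = 1,
                               passes: int = FILTER_PASSES,
                               timeout_ms: int = RULE_TIMEOUT_MS) -> int:
    """Upper bound on the stall added by rules that always hit the timeout."""
    return pathological_rules * passes * timeout_ms * sentences
```

One bad rule adds 200 ms to a single sentence; over a five-sentence reply the same rule accounts for a full second of filter time, although only the first sentence's share delays first audio.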

Step 47

The FreyaTTS request path TTS

What it is. FreyaTTSService (src/exts/tts/freya_tts_v2.py:158) is selected on-prem (ENVIRONMENT=on-premise) or under provider freya with a native voice name.

One HTTP POST per sentence to /audio/speech — no websocket. Sentence 2+ overlaps sentence 1's playback, so per-sentence RTT surfaces only as inter-sentence gaps under load (:233-236,313).
Requests raw PCM (response_format: "pcm", :297) — critical: the spark server buffers mp3/opus/aac/flac to completion before yielding (spark-tts app/api/streaming.py:328,365-369); a client requesting mp3 gets whole-utterance latency instead of TTFB.
Streams without chunk buffering, only sample-aligned, "to minimize TTFB" (:287,350-378); the TTFB metric is the first non-empty body chunk (:311,357).
Retry only before the first audio byte: 2 attempts, 250 ms backoff, 5 s connect timeout; exhaustion ends the call rather than leaving dead air (:170-171,306-330).
Fixed tail costs: 100 ms silence pad + ~40 ms transport buffer + ~10–30 ms resampler tail (:21-23,409-410) — these are turn-end, not first-audio.
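The "retry only before the first audio byte" policy from the list above can be sketched independently of any HTTP library. This is not the FreyaTTSService code: open_stream is a hypothetical callable standing in for the POST to /audio/speech, and treating "2 attempts" as two total tries is an assumption.

```python
import time
from typing import Callable, Iterable, Iterator


def stream_with_pre_audio_retry(open_stream: Callable[[], Iterable[bytes]],
                                attempts: int = 2,
                                backoff_s: float = 0.25,
                                sleep: Callable[[float], None] = time.sleep,
                                ) -> Iterator[bytes]:
    """Yield PCM chunks, retrying only while no audio has been emitted yet."""
    last_error = None
    for attempt in range(attempts):
        audio_started = False
        try:
            for chunk in open_stream():
                if not chunk:
                    continue  # empty body chunks do not count as first audio
                audio_started = True
                yield chunk  # forwarded immediately, no chunk buffering
            return
        except Exception as exc:
            if audio_started:
                raise  # a retry would replay audio the listener already heard
            last_error = exc
            if attempt + 1 < attempts:
                sleep(backoff_s)
    # Exhausted before any audio: the caller is expected to end the call
    # rather than leave dead air.
    raise RuntimeError("TTS request failed before first audio") from last_error
```

Injecting sleep keeps the sketch testable without real delays. A production client would also apply the connect timeout when opening the stream.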

Symptom it causes/fixes: request mp3/opus from this server and your "TTFB" silently becomes whole-utterance latency. PCM is non-negotiable for streaming.

Try it — TTFA budget waterfall

From llm_first_tok to first audible sample

Stack the contributors to time-to-first-audio. The sentence wait is the LLM-stream cost; the rest is the TTS path. Toggle a bad substitution rule (+100 ms, ×2 under FreyaTTSService), pick an HTTP RTT preset, and slide concurrency to read the server queue off the measured Spark TTFB curve. The 300 ms healthy and 600 ms suspect bands are drawn as overlays.

[Interactive widget. Controls: sentence wait from the LLM stream (default 180 ms); filter-chain toggle for one bad substitution rule (+100 ms × 2); HTTP round-trip presets (loopback 1 ms, LAN 5 ms, WAN 150 ms); server concurrency (default C=1, mapped to the measured Spark TTFB). Stacked segments: sentence wait (LLM stream), filters, HTTP RTT, server queue, first decode + transport. Readout: total time-to-first-audio.]
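The waterfall is a sum of contributors compared against the two bands. A minimal sketch under stated assumptions: the class and field names are hypothetical, and treating 300 ms and 600 ms as inclusive thresholds on the total is a guess at how the overlays are meant to be read.

```python
from dataclasses import dataclass

HEALTHY_MS = 300  # band quoted in the text
SUSPECT_MS = 600  # band quoted in the text


@dataclass
class TtfaBudget:
    """Contributors from llm_first_tok to first audible sample, in ms."""
    sentence_wait_ms: float
    filters_ms: float
    http_rtt_ms: float
    server_ttfb_ms: float
    decode_transport_ms: float = 0.0

    def total_ms(self) -> float:
        return (self.sentence_wait_ms + self.filters_ms + self.http_rtt_ms
                + self.server_ttfb_ms + self.decode_transport_ms)

    def band(self) -> str:
        total = self.total_ms()
        if total <= HEALTHY_MS:
            return "healthy"
        if total >= SUSPECT_MS:
            return "suspect"
        return "in between"
```

For example, a 180 ms sentence wait with no bad rules, loopback RTT, and the C=1 Spark TTFB of 113 ms totals 294 ms. Adding one bad rule (200 ms), a WAN round-trip (150 ms), and the C=20 TTFB (244 ms) pushes the same turn to 774 ms.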

Step 48

Server-side knobs and the concurrency curve TTS

INITIAL_CHUNK_SIZE — the spark server decodes its first audio block only after this many semantic tokens accumulate from the 0.5B vLLM generator; default 100, every Freya on-prem deploy overrides to 50 (spark-tts app/config.py:112; freya-onprem hosts/kkb/docker-compose.gpu.yaml:135). This is the server-side first-audio knob; lowering further trades TTFB against more vocoder calls and fade seams.
Long-text chunking — texts >400 chars split into ≤250-char chunks processed sequentially (app/config.py:107-108); mostly affects long static on_enter monologues.
Predefined phrase tokens — common phrases served from pre-baked token sequences skip generation entirely (config.py:101-106).
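The long-text rule above can be illustrated with a small splitter. The thresholds come from the text; the splitting strategy (cut at the last space that fits, hard cut otherwise) is an assumption, since the server's actual boundary logic is not described here, and the function name is hypothetical.

```python
from typing import List


def split_long_text(text: str, threshold: int = 400, max_chunk: int = 250) -> List[str]:
    """Split text longer than `threshold` chars into chunks of at most `max_chunk`."""
    if len(text) <= threshold:
        return [text]
    chunks = []
    rest = text.strip()
    while len(rest) > max_chunk:
        # Assumed policy: prefer the last space that keeps the chunk within the limit.
        cut = rest.rfind(" ", 0, max_chunk + 1)
        if cut <= 0:
            cut = max_chunk  # no usable space: hard cut
        chunks.append(rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks
```

Because the chunks are processed sequentially, a long static monologue pays one generation per chunk, which is why this knob matters mostly for long on_enter messages.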

Concurrency (Spark engine, measured). Dedicated H200 loopback TTFB p50: 113 ms @C=1 → 244 ms @C=20 → 1,332 ms @C=75; throughput saturates at ~30 turns/s (analysis/verda-h200-tts-latency, 2026-05-11). KKB over public HTTPS: 269 ms @C=1; the KKB gap is GPU contention (TTS+NC sharing GPU 3), not engine config.
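For quick capacity estimates, the three loopback points can be joined with straight lines. This is only a reading aid: linear interpolation between measured points is an assumption, the real curve between them was not measured here, and the function name is hypothetical.

```python
from bisect import bisect_right

# (concurrency, TTFB p50 in ms) from the dedicated H200 loopback sweep
MEASURED_P50 = [(1, 113.0), (20, 244.0), (75, 1332.0)]


def interpolated_ttfb_p50_ms(concurrency: float) -> float:
    """Piecewise-linear estimate between measured points; no extrapolation."""
    xs = [c for c, _ in MEASURED_P50]
    if concurrency < xs[0] or concurrency > xs[-1]:
        raise ValueError("outside the measured concurrency range")
    i = bisect_right(xs, concurrency) - 1
    if i == len(xs) - 1:
        return MEASURED_P50[-1][1]
    (c0, t0), (c1, t1) = MEASURED_P50[i], MEASURED_P50[i + 1]
    return t0 + (t1 - t0) * (concurrency - c0) / (c1 - c0)
```

The function refuses to extrapolate past C=75, since the sweep says nothing about behavior beyond the saturation point.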

Caveat: the benchmarks are Spark, production may not be. KKB production TTS swapped to VoxCPM2 Leyla on 2026-06-04 (memory note; the in-tree compose still shows Spark) and no VoxCPM2 TTFB benchmark exists yet, so all sweep numbers here are the Spark engine. Also ignore the VoxCPM2 docs' "TTFB ~3 ms": that is the HTTP response-header TTFB of a StreamingResponse, not first audio.

Try it — concurrency vs TTFB

Three real Spark sweeps, plotted

The same Spark engine on three deployments. Verda H200 loopback and fin03 dedicated-H200 stay flat to C=20; KKB over WAN starts higher (WAN RTT + GPU sharing) and climbs faster. The dashed line is the ~30 turns/s throughput ceiling where one card saturates; the marker at C=10 flags the measured 5.4 s p95 cold-burst spike (p50 was still 170 ms). Slide to read off each curve.

[Interactive plot. Control: concurrency slider (default C=20). Curves: Verda H200 (loopback); fin03 (dedicated H200); KKB (public WAN, GPU shared). Readout: TTFB on each curve at the selected concurrency.]

Step 49

Selection traps and dead knobs TTS

The ElevenLabs remap trap. Under provider freya, three magic voice UUIDs silently route to ElevenLabs cloud (base_service.py:996-1031) — the latency profile (and its WAN dependence) changes with a voice pick, not a provider pick. No clean internal Freya-vs-ElevenLabs A/B exists; the only internal ElevenLabs numbers are 2026-03 sim-environment pathology (2–10 s greetings) on an old stack.

Dead knobs (no runtime consumer found). minTextLength / minWords are read and logged but passed to nothing: the constructor that accepted them is commented out (base_service.py:988-994,1051-1059). speed on the native VoxCPM2 server is "accepted for compatibility; not yet applied" (voxcpm2 server.py:119). Neither has a runtime consumer, so do not tune phantoms or promise a customer a latency change via them.

Sample-rate bookkeeping. Spark outputs 16 kHz, the nanovllm shim 24 kHz, native VoxCPM2 48 kHz. A mismatched TTS_SAMPLE_RATE does not add latency but silently corrupts the duration tracker and word timestamps (freya_tts_v2.py:166,394).
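The corruption is easy to see in the duration arithmetic. A minimal sketch, assuming 16-bit mono PCM (an assumption; the sample format is not stated above) and a hypothetical helper name.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM


def pcm_duration_s(num_bytes: int, sample_rate_hz: int, channels: int = 1) -> float:
    """Duration implied by a PCM byte count at the configured sample rate."""
    return num_bytes / (sample_rate_hz * BYTES_PER_SAMPLE * channels)
```

One second of native VoxCPM2 audio at 48 kHz is 96,000 bytes under this assumption. With TTS_SAMPLE_RATE left at 16 kHz the same bytes are booked as 3 seconds, so every word timestamp derived from the tracker drifts by the same factor.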

Symptom it causes/fixes: a "TTS got slow when we changed the voice" report is usually the ElevenLabs UUID remap, not the engine. Check the voice ID before profiling the server.

Checkpoint: traces show healthy tts TTFB (~150 ms) but a fat llm_first_tok → tts_first_aud gap on certain turns. The agent's replies on those turns are long single sentences. What's happening?

The sentence aggregator can't release text: a single-sentence reply has no lookahead character after its final punctuation, so dispatch waits for the entire LLM stream to finish (the LLMFullResponseEndFrame flush, pipecat tts_service.py:676).

The wait shows up in tts.text_aggregation, not in TTS TTFB — which is exactly why TTFB still looks healthy. Fix in the prompt: instruct shorter, multi-sentence replies. The first short sentence dispatches as soon as the second sentence's first character arrives.

Ask Claude Code: "Show me where tts.text_aggregation is emitted in pipecat-agent and how the dashboard pairs it with the LLM span to compute TTFS — file:line."