Part 9

TTS: from first token to first audio

The window from llm_first_tok to tts_first_aud is dominated by the sentence-aggregation wait, not by synthesis. Healthy band: 100–300 ms median, suspect above 600 ms. On healthy self-hosted infra TTS time-to-first-byte is the smallest big-stage cost (~120–135 ms, ~18% of perceived turn latency) — check the LLM's token rate before blaming TTS.

The headline of Part 9. What dominates here is the sentence wait (first-sentence length ÷ LLM token rate), not synthesis. TTS itself is ~120–270 ms and only degrades under GPU sharing or past C≈20 on one card. So a fat llm_first_tok → tts_first_aud gap with a healthy TTS TTFB means "blame the LLM stream or the sentence shape", and the fix lives in the prompt, not the TTS server.
Step 45

The sentence-aggregation rule TTS

What it is. TTS does not synthesize token-by-token. It buffers the LLM token stream in a SimpleTextAggregator running in SENTENCE mode (the default), and only fires an HTTP request once a full sentence is ready.

SimpleTextAggregator → SENTENCE mode (pipecat services/tts_service.py:275-276). Release rule: punctuation + lookahead char.

Runtime effect. A sentence is released only when (1) the buffer ends in sentence punctuation AND (2) at least one non-whitespace lookahead character has arrived after it. That lookahead disambiguates "$29." (a decimal mid-sentence, do not flush) from "$29. Next" (a real sentence boundary, flush). See pipecat utils/text/simple_text_aggregator.py:99-120.
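The release rule can be illustrated with a toy aggregator. This is a minimal sketch, not pipecat's SimpleTextAggregator: the class name, the punctuation set, and the "whitespace then non-whitespace" boundary test are all assumptions chosen to reproduce the "$29." behaviour described above.

```python
from typing import List, Optional

SENTENCE_END = ".!?"  # assumed punctuation set, for illustration only


class SentenceAggregatorSketch:
    """Toy aggregator (hypothetical): holds text until a sentence is releasable."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, token: str) -> Optional[str]:
        """Append one token; return the first sentence if it became releasable."""
        self._buffer += token
        for i, ch in enumerate(self._buffer):
            if ch not in SENTENCE_END:
                continue
            tail = self._buffer[i + 1:]
            # Boundary only if whitespace follows the punctuation AND a
            # non-whitespace lookahead character has already arrived.
            if tail and tail[0].isspace() and tail.strip():
                sentence = self._buffer[: i + 1].strip()
                self._buffer = tail.lstrip()
                return sentence
        return None

    def flush(self) -> Optional[str]:
        """End of the LLM response: release whatever is still buffered."""
        rest = self._buffer.strip()
        self._buffer = ""
        return rest or None


agg = SentenceAggregatorSketch()
tokens: List[str] = ["It costs $29.", "99 today.", " Next", " step."]
released = [agg.feed(tok) for tok in tokens]
remaining = agg.flush()
```

In this run the first sentence is held through "$29." and through the trailing "today." and is released only when "Next" arrives; the final sentence has no lookahead and only comes out on flush, which mirrors the single-sentence case.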

Consequences. The wait before first audio scales with first-sentence length divided by LLM token rate. Short first sentences ("Anladım.", "Tabii ki."; Turkish for "Understood.", "Of course.") are a free latency lever: prompt-writing is a latency tool. Static messages bypass all of this: a TTSSpeakFrame (a workflow on_enter message, a static first_message) goes straight to synthesis (pipecat tts_service.py:707-729), which is why a static greeting starts as fast as the server allows while a model-generated one pays LLM TTFT plus a full first sentence.
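A rough worked estimate shows how sentence length and token rate trade off. The sketch assumes a steady token rate and that the lookahead character arrives in the very next token; the function name, the 40 tokens/s rate, and the token counts are illustrative, not measurements.

```python
def sentence_wait_ms(first_sentence_tokens: int, tokens_per_second: float) -> float:
    """Estimated wait from llm_first_tok until the first sentence is releasable.

    Token 1 arrives at t=0, token n at (n-1)/rate, and the lookahead token
    (first token of the next sentence) at n/rate.
    """
    if first_sentence_tokens < 1 or tokens_per_second <= 0:
        raise ValueError("need at least one token and a positive token rate")
    return 1000.0 * first_sentence_tokens / tokens_per_second


short_reply_ms = sentence_wait_ms(4, 40.0)   # a short acknowledgement
long_reply_ms = sentence_wait_ms(32, 40.0)   # one long opening sentence
```

Under these assumptions the short opener waits about 100 ms and the long one about 800 ms at the same token rate, before any TTS work has started.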

Symptom it causes/fixes: long single-sentence replies show a fat llm_first_tok → tts_first_aud gap that lands in the tts.text_aggregation span, not in TTS TTFB. Fix in the prompt: instruct shorter, multi-sentence replies (saves hundreds of ms).

Measurement. The window LLM-first-token → first-sentence-ready is surfaced as the tts.text_aggregation span (src/exts/tts/freya_tts_v2.py:248-280; the dashboard pairs it with LLM spans for TTFS, src/core/tracing.py:803-816). A long tts.text_aggregation span means "blame the LLM stream or sentence shape", not TTS.
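The triage logic can be written down as a small helper. This is a sketch under assumptions: the 300 ms and 600 ms bands come from the text, but the function name, the return labels, and the idea of comparing the two spans directly are hypothetical, not part of the tracing code.

```python
from typing import Optional, Tuple

HEALTHY_GAP_MS = 300.0   # top of the healthy band for llm_first_tok -> tts_first_aud
SUSPECT_GAP_MS = 600.0   # above this the gap is suspect


def triage(gap_ms: float, text_aggregation_ms: float,
           tts_ttfb_ms: float) -> Tuple[str, Optional[str]]:
    """Return (band, where_to_look) for one turn's spans."""
    if gap_ms <= HEALTHY_GAP_MS:
        return "healthy", None
    band = "suspect" if gap_ms > SUSPECT_GAP_MS else "between"
    # Whichever span is larger owns most of the gap.
    if text_aggregation_ms > tts_ttfb_ms:
        return band, "llm_stream_or_sentence_shape"
    return band, "tts_path"


long_sentence_turn = triage(gap_ms=900.0, text_aggregation_ms=700.0, tts_ttfb_ms=150.0)
contended_gpu_turn = triage(gap_ms=900.0, text_aggregation_ms=120.0, tts_ttfb_ms=650.0)
```

The first turn points at the LLM stream or sentence shape; the second, with a short aggregation span and a large TTFB, points at the TTS path.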

Try it — first-sentence dispatcher

Sentence aggregator & the release moment

Type a reply, set the LLM's token rate, and press Stream. Tokens arrive one at a time. Watch the aggregator hold the buffer until punctuation plus a lookahead character appears, then fire the first TTS request. The presets show the "$29." decimal trap and the single-sentence end-flush. Toggle static TTSSpeakFrame to bypass the wait entirely.

The first TTS request fires when the first sentence is releasable.
Step 46

The filter chain (and its double execution) TTS

What it is. Filters run per aggregated sentence, before the HTTP request, on the critical path: MarkdownTextFilter → SpeechTextFilter (emoji/email/URL → user regex substitutions → TR number normalization) → LanguageTextFilter.

base_service.py:884-902 src/exts/filters/speech_text_filter.py:325-364
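Because the chain sits on the critical path, timing each filter per sentence is the quickest way to find a slow rule. The sketch below uses stand-in filters; strip_markdown, normalize_spaces, and run_chain are hypothetical names and do not reproduce the real MarkdownTextFilter or SpeechTextFilter.

```python
import re
import time
from typing import Callable, List, Tuple

TextFilter = Callable[[str], str]


def strip_markdown(text: str) -> str:
    """Stand-in markdown filter: drop emphasis, code, and heading markers."""
    return re.sub(r"[*_`#]+", "", text)


def normalize_spaces(text: str) -> str:
    """Stand-in cleanup filter: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def run_chain(sentence: str,
              filters: List[TextFilter]) -> Tuple[str, List[Tuple[str, float]]]:
    """Apply filters in order; record each filter's wall time in milliseconds."""
    timings: List[Tuple[str, float]] = []
    for text_filter in filters:
        start = time.perf_counter()
        sentence = text_filter(sentence)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings.append((text_filter.__name__, elapsed_ms))
    return sentence, timings


spoken, timings = run_chain("**Your  total** is `$29`.",
                            [strip_markdown, normalize_spaces])
```

A customer-supplied substitution rule would appear as one more entry in timings, so a greedy pattern shows up by name rather than as unexplained latency.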

Runtime effect & costs.

A second, server-side normalization pass also defaults on. spark-tts attempts an HTTP sidecar call per request (fast connection-refused when absent, up to 30 s if deployed-but-hung; spark-tts app/api/routes.py:38,117-122); the VoxCPM2 server runs a GPU NER+ByT5 normalizer on a single-thread executor that serializes all concurrent requests (voxcpm2 server.py:51-61,120,215-216; per-request latency unmeasured, not verified in source).
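The cost of a single-thread executor is easiest to see in a small simulation. This is an illustration of the queueing effect only: the 20 ms service time is hypothetical (the real per-request latency is unmeasured) and fake_normalize is a stand-in, not the VoxCPM2 normalizer.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

SERVICE_TIME_S = 0.02  # hypothetical per-request normalization time


def fake_normalize(text: str, t0: float) -> float:
    """Pretend to normalize text; return seconds elapsed since t0 on completion."""
    time.sleep(SERVICE_TIME_S)
    return time.perf_counter() - t0


# One worker: four "concurrent" requests are processed strictly one at a time.
with ThreadPoolExecutor(max_workers=1) as executor:
    t0 = time.perf_counter()
    futures = [executor.submit(fake_normalize, f"sentence {i}", t0) for i in range(4)]
    latencies: List[float] = [future.result() for future in futures]
```

Each request finishes roughly one service time after the previous one, so the last of N simultaneous requests waits about N times the service time even though every individual call is fast.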

Symptom it causes/fixes: a customer adds one greedy substitution rule and every sentence gets two surprise 100 ms stalls. The fix is the rule, not the engine — the double-execution is a deliberate mute-bug guard.

Step 47

The FreyaTTS request path TTS

What it is. FreyaTTSService (src/exts/tts/freya_tts_v2.py:158) is selected on-prem (ENVIRONMENT=on-premise) or under provider freya with a native voice name.

Symptom it causes/fixes: request mp3/opus from this server and your "TTFB" silently becomes whole-utterance latency. PCM is non-negotiable for streaming.

Try it — TTFA budget waterfall

From llm_first_tok to first audible sample

Stack the contributors to time-to-first-audio. The sentence wait is the LLM-stream cost; the rest is the TTS path. Toggle a bad substitution rule (+100 ms, ×2 under FreyaTTSService), pick an HTTP RTT preset, and slide concurrency to read the server queue off the measured Spark TTFB curve. The 300 ms healthy and 600 ms suspect bands are drawn as overlays.
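The same budget can be stacked in a few lines. The contributor names, the +100 ms bad-rule cost with its ×2 under FreyaTTSService, and the 300/600 ms bands follow the text; the class name and every example value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TtfaBudget:
    """Contributors to time-to-first-audio, all in milliseconds."""
    sentence_wait_ms: float
    filters_ms: float
    http_rtt_ms: float
    server_queue_ms: float
    first_decode_ms: float
    bad_rule: bool = False
    filter_passes: int = 1  # 2 models the double execution

    def total_ms(self) -> float:
        # A bad substitution rule costs ~100 ms on every filter pass.
        penalty = 100.0 * self.filter_passes if self.bad_rule else 0.0
        return (self.sentence_wait_ms + self.filters_ms + self.http_rtt_ms
                + self.server_queue_ms + self.first_decode_ms + penalty)

    def band(self) -> str:
        total = self.total_ms()
        if total <= 300.0:
            return "healthy"
        return "suspect" if total > 600.0 else "between"


clean = TtfaBudget(100.0, 5.0, 10.0, 0.0, 113.0)
with_bad_rule = TtfaBudget(100.0, 5.0, 10.0, 0.0, 113.0, bad_rule=True, filter_passes=2)
```

With these illustrative numbers the clean turn totals 228 ms and stays in the healthy band, while the same turn with one bad rule executed twice totals 428 ms and leaves it.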

Chart segments: sentence wait (LLM stream), filters, HTTP RTT, server queue, first decode + transport. Total: time-to-first-audio.
Step 48

Server-side knobs and the concurrency curve TTS

Concurrency (Spark engine, measured). Dedicated H200 loopback TTFB p50: 113 ms @C=1 → 244 ms @C=20 → 1,332 ms @C=75; throughput saturates at ~30 turns/s (analysis/verda-h200-tts-latency, 2026-05-11). KKB over public HTTPS: 269 ms @C=1, and the KKB gap is GPU contention (TTS+NC sharing GPU 3), not engine config.
Caveat: the benchmarks are Spark, production may not be. KKB production TTS swapped to VoxCPM2 Leyla on 2026-06-04 (memory note; the in-tree compose still shows Spark) and no VoxCPM2 TTFB benchmark exists yet, so all sweep numbers here are for the Spark engine. Also ignore the VoxCPM2 docs' "TTFB ~3 ms": that is the HTTP response-header TTFB of a StreamingResponse, not first audio.
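The difference between header TTFB and first audio can be shown with a simulated stream. This is a sketch only: fake_streaming_body stands in for a streamed PCM response, and its 50 ms decode delay and chunk contents are invented for illustration.

```python
import time
from typing import Iterable, Iterator, Optional


def fake_streaming_body(decode_delay_s: float) -> Iterator[bytes]:
    """Hypothetical streamed body: nothing is produced until decoding finishes."""
    time.sleep(decode_delay_s)
    yield b"\x00\x01" * 160
    yield b"\x00\x01" * 160


def time_to_first_audio(chunks: Iterable[bytes], t0: float) -> Optional[float]:
    """Seconds from t0 until the first non-empty audio chunk arrives."""
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - t0
    return None


t0 = time.perf_counter()
body = fake_streaming_body(0.05)            # returns at once, like response headers
header_ttfb_s = time.perf_counter() - t0    # what a header-based TTFB measures
first_audio_s = time_to_first_audio(body, t0)
```

The header-style number is near zero because creating the stream does no work, while the first audio chunk arrives only after the simulated decode delay. A TTFB figure is comparable to the sweeps above only if it is taken at the first non-empty body chunk.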

Try it — concurrency vs TTFB

Three real Spark sweeps, plotted

The same Spark engine on three deployments. Verda H200 loopback and fin03 dedicated-H200 stay flat to C=20; KKB over WAN starts higher (WAN RTT + GPU sharing) and climbs faster. The dashed line is the ~30 turns/s throughput ceiling where one card saturates; the marker at C=10 flags the measured 5.4 s p95 cold-burst spike (p50 was still 170 ms). Slide to read off each curve.
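To read a value between measured points, the sketch below interpolates linearly over the three Verda H200 loopback p50 figures quoted above. Linear interpolation is an assumption (the real curve stays flat to C=20 and then climbs), and the function and constant names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

# (concurrency, TTFB p50 in ms) for the Verda H200 loopback sweep, Spark engine.
VERDA_LOOPBACK_P50_MS: List[Tuple[int, float]] = [(1, 113.0), (20, 244.0), (75, 1332.0)]


def ttfb_p50_ms(concurrency: float,
                points: List[Tuple[int, float]] = VERDA_LOOPBACK_P50_MS) -> float:
    """Linearly interpolate TTFB p50 between measured concurrency points."""
    xs = [c for c, _ in points]
    if not xs[0] <= concurrency <= xs[-1]:
        raise ValueError("concurrency is outside the measured range")
    # Index of the right-hand neighbour, clamped so the last point stays valid.
    i = min(bisect_right(xs, concurrency), len(xs) - 1)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (concurrency - x0) / (x1 - x0)


estimate_at_40 = ttfb_p50_ms(40)
```

The helper refuses to extrapolate past C=75, since nothing in the sweep says what happens beyond the last measured point.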

Curves: Verda H200 (loopback), fin03 (dedicated H200), KKB (public WAN, GPU shared).
Step 49

Selection traps and dead knobs TTS

The ElevenLabs remap trap. Under provider freya, three magic voice UUIDs silently route to ElevenLabs cloud (base_service.py:996-1031) — the latency profile (and its WAN dependence) changes with a voice pick, not a provider pick. No clean internal Freya-vs-ElevenLabs A/B exists; the only internal ElevenLabs numbers are 2026-03 sim-environment pathology (2–10 s greetings) on an old stack.

Dead knobs (no runtime consumer found). minTextLength / minWords are read and logged but passed to nothing: the constructor that accepted them is commented out (base_service.py:988-994,1051-1059). speed on the native VoxCPM2 server is "accepted for compatibility; not yet applied" (voxcpm2 server.py:119). Neither has a runtime consumer, so do not tune phantoms or promise a customer a latency change via them.

Sample-rate bookkeeping. Spark outputs 16 kHz, the nanovllm shim 24 kHz, native VoxCPM2 48 kHz. A mismatched TTS_SAMPLE_RATE does not add latency but silently corrupts the duration tracker and word timestamps (freya_tts_v2.py:166,394).
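The size of the bookkeeping error follows from simple arithmetic. The sketch assumes 16-bit mono PCM and a tracker that derives duration from byte count and the configured rate; both are assumptions for illustration, and pcm_duration_s is a hypothetical helper, not the code in freya_tts_v2.py.

```python
def pcm_duration_s(num_bytes: int, sample_rate_hz: int,
                   bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio, given its byte count and format."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


# One second of 48 kHz, 16-bit mono audio from a native VoxCPM2-style server.
one_second_at_48k = 48_000 * 2

actual_s = pcm_duration_s(one_second_at_48k, 48_000)
# The same bytes interpreted with TTS_SAMPLE_RATE left at Spark's 16 kHz.
tracked_s = pcm_duration_s(one_second_at_48k, 16_000)
```

Under these assumptions one real second of audio is booked as three, so any word timestamps derived from the tracked duration drift by the same factor.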

Symptom it causes/fixes: a "TTS got slow when we changed the voice" report is usually the ElevenLabs UUID remap, not the engine. Check the voice ID before profiling the server.

Checkpoint: traces show healthy tts TTFB (~150 ms) but a fat llm_first_tok → tts_first_aud gap on certain turns. The agent's replies on those turns are long single sentences. What's happening?

The sentence aggregator can't release text: a single-sentence reply has no lookahead character after its final punctuation, so dispatch waits for the entire LLM stream to finish (the LLMFullResponseEndFrame flush, pipecat tts_service.py:676).

The wait shows up in tts.text_aggregation, not in TTS TTFB — which is exactly why TTFB still looks healthy. Fix in the prompt: instruct shorter, multi-sentence replies. The first short sentence dispatches as soon as the second sentence's first character arrives.

Ask Claude Code: "Show me where tts.text_aggregation is emitted in pipecat-agent and how the dashboard pairs it with the LLM span to compute TTFS — file:line."