Part 11

The long tail: API calls, fillers, guardrails, voicemail, DTMF, transfers, call start

External latency landing inside a turn, one-shot boot costs, and the masking machinery that hides latency rather than removing it.

Framing

Buy masking time with audio

Most of what follows is not a per-turn tax. It is external latency landing inside a turn, a one-shot cost at call start, or masking machinery that hides latency rather than removing it. The platform's recurring design pattern repeats at every layer:

Buy masking time with audio: entry message before pre-actions, filler before tools, warmup behind entry playback, farewell before transfer. None of these make the slow thing faster; they spend speech to cover the gap so the caller never hears silence.

What dominates here: external API round-trip time — the one latency source Freya does not control — and whether you spent audio to mask it. Everything else in the long tail is bounded, rare, or off-path by design.

Step 55

api_call: where external latency lands in the turn [net]

What it is. An api_call can run two different ways, and the two shapes put the vendor's round-trip time in completely different places on the critical path.

blocking → awaited pre-action. Default: blocking: true. Masked by entry message.

Hands-on lever for testing: every mock-api worker ships GET /test/slow?delay=ms (default 5000) and GET /test/fail?status=&after= (mock-api lib/router.js:140-167). Point a node's api_call at /test/slow?delay=4000 to reproduce a 4 s vendor and watch the entry message cover it.

Symptom it causes/fixes: dead air mid-flow on a specific node — a blocking pre-action with no entry message to hide behind. The cost is the vendor RTT itself, often +2–4 s of in-turn silence.

Try it — Where does the API call hide?

Turn timeline: API RTT vs the audio that masks it

The caller only hears silence when the API is still running and nothing is being spoken. Spend entry-message or filler duration to cover the vendor's round trip.
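The relationship the timeline shows can be written as simple arithmetic. The sketch below is a simplification: it assumes the masking audio starts at the same moment as the API call and plays without interruption. The function name `caller_heard_silence_ms` is hypothetical and is not a platform API.

```python
def caller_heard_silence_ms(api_rtt_ms: int, masking_audio_ms: int) -> int:
    """Silence heard while the API is still running and nothing is spoken.

    Assumption: masking audio and the API call start together, so only the
    part of the round trip that outlasts the audio is exposed.
    """
    return max(0, api_rtt_ms - masking_audio_ms)


if __name__ == "__main__":
    # A 4 s vendor behind a 3 s entry message leaves 1 s exposed.
    print(caller_heard_silence_ms(4000, 3000))
    # A 4 s entry message covers the same vendor completely.
    print(caller_heard_silence_ms(4000, 4000))
```

Under these assumptions, lengthening the masking audio past the vendor's typical round trip buys nothing further; the exposed silence is already zero.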

[Interactive timeline, not reproduced here. Initial slider values: 4000 ms and 3000 ms. Series on the clock: API RTT, masking audio, caller-heard silence.]

Step 56

The Timeout / Max-retries gotcha: real vs dead consumer [gotcha]

What it is. The same two fields, Timeout (config.timeoutMs → timeout_ms) and Max retries (config.maxRetries → max_retries), behave completely differently depending on whether the api_call is a node action or an LLM-invoked tool.

LLM-invoked tools: dead knob. The same ToolExecutionConfig fields exist (src/core/types.py:286-293), but tool registration consumes only always_runs_at and pre_speech (src/tools/handler_utils.py:1378-1394). timeout_ms / max_retries on an LLM-callable api_call tool are parsed and ignored; there is no runtime consumer. Only api_call's hardcoded 30 s total budget across all redirect hops applies (src/tools/api_call.py:68). SSRF DNS pre-validation runs serially per hop in prod, bounded at 10 s (api_call.py:181-202). Do not promise a customer a per-tool timeout on an LLM-invoked api_call.

Symptom it causes: a hung vendor riding the full 33 s (node action) or 30 s budget (LLM tool) as dead air. The fix on node actions is a tight timeoutMs; on LLM tools there is no fix but the 30 s ceiling.

Try it — Retry / backoff cost calculator

Worst-case in-turn dead air for a node-action api_call

Models _invoke_with_retry exactly: each attempt either waits up to timeoutMs (timeout mode) or returns fast, then sleeps min(2**attempt, 8) s of backoff before the next attempt. The 30 s api_call budget is drawn as the ceiling.
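The stacked cost can be reproduced with a few lines of arithmetic. This is a sketch, not the body of _invoke_with_retry, and it rests on assumptions the text does not confirm: total attempts equal max_retries + 1, the backoff exponent counts attempts from 0, no backoff follows the final attempt, and every attempt rides the full timeout. The function name `worst_case_dead_air_s` is hypothetical.

```python
def worst_case_dead_air_s(timeout_ms: int, max_retries: int) -> float:
    """Worst-case seconds of dead air when every attempt times out.

    Assumptions (not confirmed by the source): attempts = max_retries + 1,
    backoff after 0-based attempt i is min(2**i, 8) seconds, and there is
    no backoff after the last attempt.
    """
    attempts = max_retries + 1
    total_s = 0.0
    for attempt in range(attempts):
        total_s += timeout_ms / 1000.0       # attempt waits out the timeout
        if attempt < attempts - 1:
            total_s += min(2 ** attempt, 8)  # sleep before the next attempt
    return total_s


if __name__ == "__main__":
    print(worst_case_dead_air_s(10_000, 2))
    print(worst_case_dead_air_s(3_000, 1))
```

With a 10000 ms timeout and 2 retries, this model gives 10 + 1 + 10 + 2 + 10 = 33 s, which is consistent with the 33 s figure quoted for node actions, although whether that figure is derived this way is an assumption. Tightening the timeout to 3000 ms with one retry brings the worst case down to 7 s.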

[Interactive calculator, not reproduced here. Initial slider values: 10000 ms and 2. Series: attempt (timeout / RTT), backoff sleep, 30 s budget ceiling.]

Step 57

Guardrails: nearly free until they fire [LLM]

What it is. Output guardrails run on the streaming text path. Almost everything they do is cheap; only two outcomes cost real latency.

When disabled, it is gone: when guardrails_enabled is false, the processor is removed from the pipeline entirely (types.py:437-465). That is zero cost, not a no-op pass-through.

Symptom it causes: an occasional +500 ms hitch (prefix holdback) on a turn whose wording grazed a blocked prefix, or a multi-second stall on the rare regenerate path (extra LLM round trip × retries).

Try it — Guardrail stream simulator

Prefix-holdback buffer and the regenerate round trip

Type a sentence. If a blocked word appears, watch the holdback buffer fill once a risky prefix forms: it either flushes at prefix_holdback_max_ms (safe) or triggers refuse / regenerate, which costs a second LLM round trip.
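The holdback behaviour described above can be approximated with a small stream simulator. This is an illustrative sketch of the general prefix-holdback technique, not the platform's guardrail processor. The names `stream_with_holdback`, `HoldbackResult` and `token_interval_ms` are hypothetical, tokens are assumed to arrive at a fixed interval, and the cost of the regenerate round trip is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HoldbackResult:
    emitted: str      # text released downstream (toward TTS)
    blocked: bool     # True if a blocked phrase completed while held
    max_hold_ms: int  # longest time any text sat in the holdback buffer


def _risky_suffix_len(text: str, phrases: List[str]) -> int:
    """Length of the longest suffix of text that is a proper prefix of a phrase."""
    best = 0
    for phrase in phrases:
        for k in range(min(len(text), len(phrase) - 1), 0, -1):
            if phrase.startswith(text[-k:]):
                best = max(best, k)
                break
    return best


def stream_with_holdback(tokens: List[str], blocked_phrases: List[str],
                         token_interval_ms: int = 100,
                         prefix_holdback_max_ms: int = 500) -> HoldbackResult:
    phrases = [p.lower() for p in blocked_phrases]
    emitted: List[str] = []
    held, hold_started, max_hold, now = "", None, 0, 0
    for tok in tokens:
        now += token_interval_ms
        held += tok
        if hold_started is not None:
            max_hold = max(max_hold, now - hold_started)
        lowered = held.lower()  # assumes ASCII-like text (lower() keeps length)
        if any(p in lowered for p in phrases):
            # Blocked phrase completed: the caller would refuse or regenerate.
            return HoldbackResult("".join(emitted), True, max_hold)
        keep = _risky_suffix_len(lowered, phrases)
        timed_out = (hold_started is not None
                     and now - hold_started >= prefix_holdback_max_ms)
        if keep == 0 or timed_out:
            emitted.append(held)            # nothing risky, or cap reached: flush
            held, hold_started = "", None
        else:
            emitted.append(held[:-keep])    # release the safe head
            held = held[-keep:]             # keep only the risky tail
            if hold_started is None:
                hold_started = now
    emitted.append(held)
    return HoldbackResult("".join(emitted), False, max_hold)


if __name__ == "__main__":
    print(stream_with_holdback(["The ", "gua", "rd ", "left"], ["guarantee"]))
    print(stream_with_holdback(["I ", "can ", "gua", "rantee ", "it"], ["guarantee"]))
```

In this model the first sentence is delayed by one token interval while "gua" is held and then released untouched; the second never releases the blocked word, and everything after "I can " would be replaced by the refuse or regenerate path.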

[Interactive simulator, not reproduced here. Initial slider values: 500 ms and 0 ms. Series: LLM #1 stream, prefix holdback / wait msg, regenerate LLM #2.]

Step 58

Pre-tool speech and the idle handler: masking, never reduction [TTS]

config.preSpeech → PreToolSpeech. Auto: avg >1000 ms → speak. Saves up to ~1.4 s perceived.

Symptom it fixes: dead air while an LLM-invoked tool runs. Pre-tool speech does not make the tool faster — it dispatches a filler ~1.4 s earlier than waiting for the full response would, masking the gap.
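The Auto rule (speak when the tool's average latency exceeds 1000 ms) can be sketched as a small decision function. The mode strings, the function name `should_speak_before_tool`, and the behaviour when no latency history exists are all assumptions made for illustration; only the 1000 ms threshold comes from the text.

```python
from typing import Optional

AUTO_THRESHOLD_MS = 1000  # "avg >1000 ms -> speak", as stated in the text


def should_speak_before_tool(mode: str,
                             avg_tool_ms: Optional[float] = None) -> bool:
    """Decide whether to dispatch a filler line before running a tool."""
    if mode == "static":
        # Assumption: a statically configured filler is always spoken.
        return True
    if mode == "auto":
        # Assumption: with no latency history yet, stay silent.
        return avg_tool_ms is not None and avg_tool_ms > AUTO_THRESHOLD_MS
    return False  # "off" or any unrecognised mode


if __name__ == "__main__":
    print(should_speak_before_tool("auto", 4000))  # slow vendor: mask it
    print(should_speak_before_tool("auto", 300))   # fast tool: no filler
```

Either way the tool takes just as long; the function only decides whether the gap is filled with speech.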

Step 59

Voicemail, DTMF, transfers [net]

Symptom it causes/fixes: a ~2 s tail at the end of a DTMF entry (lower the timeout, or train callers to press #), and a several-second farewell-then-handoff window on transfer (by design — the caller hears the goodbye, not silence).
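The DTMF tail can be illustrated with a minimal collector. It is a sketch under assumptions: an inter-digit timeout of 2000 ms (taken from the "~2 s" figure above), `#` as the terminator, and key presses given as (timestamp in ms, key) pairs. The function name `collect_dtmf` is hypothetical.

```python
from typing import List, Tuple


def collect_dtmf(presses: List[Tuple[int, str]],
                 timeout_ms: int = 2000,
                 terminator: str = "#") -> Tuple[str, int]:
    """Return (digits, time_ms at which the entry is considered complete)."""
    digits = ""
    last_ms = None
    for t_ms, key in presses:
        if last_ms is not None and t_ms - last_ms > timeout_ms:
            break                      # the timeout fired before this key
        if key == terminator:
            return digits, t_ms        # terminator ends the entry immediately
        digits += key
        last_ms = t_ms
    if last_ms is None:
        raise ValueError("no digits pressed")
    return digits, last_ms + timeout_ms  # otherwise wait out the full timeout


if __name__ == "__main__":
    keys = [(0, "1"), (400, "2"), (800, "3"), (1200, "4")]
    print(collect_dtmf(keys))                  # entry ends after the timeout
    print(collect_dtmf(keys + [(1500, "#")]))  # '#' ends it at once
```

In this example, pressing `#` at 1500 ms ends the entry 1700 ms sooner than waiting for the timeout, which is the tail the text refers to.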

Step 60

Call start: prefetch waves and who pays the boot [net]

What it is. The boot DAG (bot.py:197-305) splits into waves. Serializable waves 0–2 (resolve_config → lifecycle_tools + register_call → generate_first_message + warmup_llm_cache) run at ring time via the telephony /prefetch (src/routes/telephony.py:355-449); only waves 3–6 run at connect.

Telephony rings; web does not. On telephony, start-of-call API tools (CRM lookups), MODEL_GENERATED first-message generation, and the boot prefix warmup are all masked by the ring. Web calls have no prefetch, so the whole DAG is user-visible at connect (plus the 1.1 s early-media delay from Part 10).

Workflow agents skip the boot warmup and warm per node instead (Part 5); a first node without entry speech eats the cold prefill on turn 1.

Symptom it causes: slow time-to-first-hello on web calls only — the boot work that telephony hides behind the ring is fully exposed. Fix with a static first message and trimmed boot work (Parts 10, 11).
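The difference between the two transports can be put into numbers with a rough model. Everything here is a sketch: the wave durations are invented for illustration, waves are assumed to run one after another, and prefetch work that outlasts the ring is assumed to spill into the user-visible window. Only the 1.1 s web early-media delay comes from the text. The function name `time_to_first_hello_ms` is hypothetical.

```python
from typing import List, Optional

EARLY_MEDIA_DELAY_MS = 1100  # web-only delay cited in the text (Part 10)


def time_to_first_hello_ms(prefetch_waves_ms: List[int],
                           connect_waves_ms: List[int],
                           ring_ms: Optional[int] = None) -> int:
    """User-visible boot time. ring_ms=None models a web call (no prefetch)."""
    prefetch = sum(prefetch_waves_ms)  # waves 0-2
    connect = sum(connect_waves_ms)    # waves 3-6
    if ring_ms is None:
        # Web: the whole DAG runs at connect, plus the early-media delay.
        return prefetch + connect + EARLY_MEDIA_DELAY_MS
    # Telephony: the ring hides prefetch work; only the overflow is visible.
    return max(0, prefetch - ring_ms) + connect


if __name__ == "__main__":
    waves_0_2 = [200, 1500, 900]     # hypothetical durations in ms
    waves_3_6 = [100, 150, 100, 50]  # hypothetical durations in ms
    print(time_to_first_hello_ms(waves_0_2, waves_3_6, ring_ms=4000))
    print(time_to_first_hello_ms(waves_0_2, waves_3_6))
```

With these made-up numbers the telephony caller waits 400 ms after connect, while the web caller waits 4100 ms for the same boot work, which is why trimming boot work and using a static first message matters most on web.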

Checkpoint: a workflow node calls a vendor API that averages 4 s. Callers hear dead air. List the masking levers in the order you would apply them.
  1. Give the node a 3–4 s entry message — blocking pre-actions execute behind it by design (processor.py:2653-2659).
  2. If the call is LLM-invoked instead, configure Pre-tool speech (static, or Auto which will trigger since avg >1000 ms) — dispatched ~1.4 s earlier thanks to the early-fire frame.
  3. Consider blocking: false if the flow can proceed and re-route when the result arrives.
  4. Set a node-action timeoutMs so a hung vendor cannot ride the 30 s budget — remembering that on LLM-invoked tools that field is dead and only the 30 s budget applies.
Ask Claude Code: "Show me every place blocking on a workflow action is read in pipecat-agent, and confirm whether timeout_ms / max_retries have a runtime consumer for LLM-invoked api_call tools."