Part 11 — The long tail: API calls, fillers, guardrails, voicemail, DTMF, transfers, call start

Framing

Most of what follows is not a per-turn tax. It is external latency landing inside a turn, a one-shot cost at call start, or masking machinery that hides latency rather than removing it. The platform's recurring design pattern repeats at every layer:

Buy masking time with audio. Entry message before pre-actions. Filler before tools. Warmup behind entry playback. Farewell before transfer. None of these make the slow thing faster; they spend speech to cover the gap so the caller never hears silence.

What dominates here: external API round-trip time — the one latency source Freya does not control — and whether you spent audio to mask it. Everything else in the long tail is bounded, rare, or off-path by design.

Step 55

api_call: where external latency lands in the turn [net]

What it is. An api_call can run two different ways, and the two shapes put the vendor's round-trip time in completely different places on the critical path.

Workflow pre-action. blocking: true (the default) actions are awaited at pipecat-agent processor.py:3065; routing does not proceed until the API returns, so the vendor's RTT is inside the felt turn — but it is masked by the node's entry message, which is deliberately spoken first (processor.py:2653-2659). blocking: false actions run in the background and re-trigger routing when the result arrives (processor.py:3145-3190). Results are injected as [Action Result] system messages (processor.py:3192-3211) — which also grow the speaking prompt. The idle timer is suppressed during blocking actions (processor.py:3045-3047).
LLM-invoked tool. The turn becomes LLM call #1 (decides the tool) → api_call (external RTT) → LLM call #2 (speaks the result). The API sits between two LLM calls on the critical path; only pre-tool speech masks it (src/tools/api_call.py:266-499).

blocking → awaited pre-action · default blocking: true · masked by entry message

Hands-on lever for testing. Every mock-api worker ships GET /test/slow?delay=ms (default 5000) and GET /test/fail?status=&after= (mock-api lib/router.js:140-167). Point a node's api_call at /test/slow?delay=4000 to reproduce a 4 s vendor and watch the entry message cover it.
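A small helper can make the slow-vendor experiment repeatable. The sketch below only builds the two test URLs named above and times a GET against one of them. The base URL is a hypothetical placeholder, and the meaning of the `status` and `after` parameters is not described in this section, so they are passed through untouched.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL: replace with the address of your own mock-api worker.
MOCK_API_BASE = "https://mock-api.example.invalid"


def slow_url(delay_ms: int = 5000, base: str = MOCK_API_BASE) -> str:
    """Build the GET /test/slow URL; delay_ms mirrors the documented default."""
    return f"{base}/test/slow?{urlencode({'delay': delay_ms})}"


def fail_url(status: int, after: int, base: str = MOCK_API_BASE) -> str:
    """Build the GET /test/fail URL; parameter semantics are left to mock-api."""
    return f"{base}/test/fail?{urlencode({'status': status, 'after': after})}"


def time_get_s(url: str, timeout_s: float = 30.0) -> float:
    """Issue one GET and return the wall-clock round trip in seconds."""
    start = time.monotonic()
    with urlopen(url, timeout=timeout_s) as response:
        response.read()
    return time.monotonic() - start
```

With a reachable worker, `time_get_s(slow_url(4000))` should come back at roughly four seconds, which is the gap the node's entry message has to cover.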

Symptom it causes/fixes: dead air mid-flow on a specific node — a blocking pre-action with no entry message to hide behind. The cost is the vendor RTT itself, often +2–4 s of in-turn silence.

Try it: Where does the API call hide?

[Interactive turn timeline: API RTT vs the audio that masks it. Controls: API delay (the vendor's RTT, like /test/slow?delay=), shown at 4000 ms; Entry message / filler duration (the audio you spend), shown at 3000 ms; options Blocking pre-action and Mask with entry / filler audio. Output: what lands on the clock, with series API RTT, Masking audio, and Caller-heard silence.]

The caller only hears silence when the API is still running and nothing is being spoken. Spend entry-message or filler duration to cover the vendor's round trip.
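The arithmetic behind that timeline is small enough to write down. This is a minimal sketch, not platform code: the function name is made up for illustration, and it assumes the API call and the masking audio start at the same instant.

```python
def caller_heard_silence_ms(api_delay_ms: int, masking_audio_ms: int = 0) -> int:
    """Silence the caller hears: the part of the API round trip that outlasts the audio.

    Assumes the request and the entry message / filler begin together.
    """
    return max(0, api_delay_ms - masking_audio_ms)


# A 4 s vendor behind a 3 s entry message leaves 1 s of dead air.
print(caller_heard_silence_ms(4000, 3000))
# The same vendor with no masking audio exposes the full round trip.
print(caller_heard_silence_ms(4000))
```

Lengthening the audio beyond the round trip buys nothing further: once the result is back, the remaining speech is just speech.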

Step 56

The Timeout / Max-retries gotcha: real vs dead consumer [gotcha]

What it is. The same two fields — Timeout (config.timeoutMs → timeout_ms) and Max retries (config.maxRetries → max_retries) — behave completely differently depending on whether the api_call is a node action or an LLM-invoked tool.

Node actions (real). Timeout and Max retries are honored: _invoke_with_retry (processor.py:3072-3143) applies the timeout and retries on exceptions and HTTP {408, 429, 500, 502, 503, 504} with backoff min(2**attempt, 8) s. Worst case with maxRetries=2 and a 10 s timeout: 10 + 1 + 10 + 2 + 10 = 33 s of in-turn dead air minus whatever the entry message masks.

LLM-invoked tools: dead knob. The same ToolExecutionConfig fields exist (src/core/types.py:286-293), but tool registration consumes only always_runs_at and pre_speech (src/tools/handler_utils.py:1378-1394). timeout_ms / max_retries on an LLM-callable api_call tool are parsed and ignored; nothing reads them at runtime. Only api_call's hardcoded 30 s total budget across all redirect hops applies (src/tools/api_call.py:68). SSRF DNS pre-validation runs serially per hop in prod, bounded at 10 s (api_call.py:181-202). Do not promise a customer a per-tool timeout on an LLM-invoked api_call.

Symptom it causes: a hung vendor riding the full 33 s (node action) or 30 s budget (LLM tool) as dead air. The fix on node actions is a tight timeoutMs; on LLM tools there is no fix but the 30 s ceiling.
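To make the node-action behavior concrete, here is a hedged sketch of a retry loop with the properties described above: a per-attempt timeout, retries on exceptions and on the listed HTTP statuses, and backoff of min(2**attempt, 8) s. It is not the body of _invoke_with_retry. The function name, its signature, and the injectable `sleep` parameter are assumptions made for illustration, and the attempt counter is assumed to start at 0 so that the backoffs match the 1 s and 2 s in the worked example.

```python
import asyncio
from typing import Awaitable, Callable, Optional

RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


async def invoke_with_retry(
    call: Callable[[], Awaitable[int]],
    timeout_s: float,
    max_retries: int,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[int]:
    """Return the last HTTP status, or None if the last attempt raised or timed out."""
    status: Optional[int] = None
    for attempt in range(max_retries + 1):
        try:
            # Each attempt gets its own timeout window.
            status = await asyncio.wait_for(call(), timeout=timeout_s)
        except Exception:
            status = None  # timeout or transport error: treat as retryable
        else:
            if status not in RETRYABLE_STATUSES:
                return status
        if attempt < max_retries:
            # Backoff doubles per attempt and is capped at 8 s.
            await sleep(min(2 ** attempt, 8))
    return status
```

Passing `sleep` in as a parameter is only a testing convenience: it lets the backoff schedule be checked without actually waiting.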

Try it: Retry / backoff cost calculator

[Interactive calculator: worst-case in-turn dead air for a node-action api_call. Controls: Timeout (config.timeoutMs), shown at 10000 ms; Max retries (config.maxRetries); Failure mode. Output: stacked attempt + backoff cost, with series Attempt (timeout / RTT), Backoff sleep, and 30 s budget ceiling.]

The calculator models _invoke_with_retry exactly: each attempt waits up to timeoutMs (timeout mode) or returns fast, then sleeps min(2**attempt, 8) s of backoff before the next attempt. The 30 s api_call budget is drawn as the ceiling.
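The worst-case figure for a node action can be reproduced with a few lines of arithmetic. This sketch covers only the timeout failure mode, where every attempt runs to its full timeout, and it assumes the attempt counter in the backoff formula starts at 0. The function name is illustrative.

```python
def worst_case_dead_air_s(timeout_ms: int, max_retries: int) -> float:
    """Upper bound on in-turn waiting when every attempt times out.

    Entry-message masking is not subtracted here.
    """
    attempts = max_retries + 1
    waiting_s = attempts * timeout_ms / 1000
    # One backoff sleep sits between consecutive attempts, capped at 8 s each.
    backoff_s = sum(min(2 ** attempt, 8) for attempt in range(max_retries))
    return waiting_s + backoff_s


# maxRetries=2 with a 10 s timeout: 10 + 1 + 10 + 2 + 10
print(worst_case_dead_air_s(10_000, 2))  # 33.0
```

Dropping the timeout to 3 s with the same two retries brings the bound down to 12 s, which is why a tight timeoutMs is the first fix on node actions.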

Step 57

Guardrails: nearly free until they fire [LLM]

What it is. Output guardrails run on the streaming text path. Almost everything they do is cheap; only two outcomes cost real latency.

Per-chunk deterministic checks — regex 1–5 ms, structured 1–2 ms, dictionary <1 ms (in-source estimates, src/guardrails/engine.py:27-34) run inline on every LLMTextFrame. Cheap.
Prefix holdback — the one streaming feature that delays audio: chunks ending in a risky prefix are buffered up to prefix_holdback_max_ms (default 500 ms, src/guardrails/models.py:101-103).
The output classifier does NOT gate streaming audio — it runs at response end, after safe chunks already went to TTS; ≤400 ms timeout, fail-open (src/guardrails/processor.py:192-204; models.py:94-97).
The expensive outcome is regenerate — wait-message TTS plus a full extra LLM round trip, up to max_regenerate_retries (default 2) before falling back to refuse (processor.py:308-361; src/core/types.py:458).
Input guardrails, when enabled (default off), add normalization + injection regex (<1 ms) and optionally one ≤400 ms classifier call serially before the LLM (src/guardrails/input_processor.py:140-179). Refused input turns skip the LLM entirely — they are faster than normal turns.

When disabled, it is gone. When guardrails_enabled is false, the processor is removed from the pipeline entirely (types.py:437-465): zero cost, not a no-op pass-through.

Symptom it causes: an occasional +500 ms hitch (prefix holdback) on a turn whose wording grazed a blocked prefix, or a multi-second stall on the rare regenerate path (extra LLM round trip × retries).
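Prefix holdback is easier to reason about with a toy version in hand. The sketch below is an assumption about what "ending in a risky prefix" means: the tail of the streamed text matches the beginning of a blocked word, so the word might still complete in the next chunk. The function names and the matching rule are illustrative and are not taken from src/guardrails.

```python
from typing import Iterable, Tuple


def risky_suffix_len(text: str, blocked_words: Iterable[str]) -> int:
    """Length of the longest tail of text that is a proper prefix of a blocked word."""
    lowered = text.lower()
    best = 0
    for word in blocked_words:
        candidate = word.lower()
        # Proper prefixes only: a fully formed blocked word is a violation, not a prefix.
        for n in range(1, len(candidate)):
            if lowered.endswith(candidate[:n]):
                best = max(best, n)
    return best


def split_chunk(text: str, blocked_words: Iterable[str]) -> Tuple[str, str]:
    """Split text into (safe to send to TTS now, tail to hold back)."""
    held = risky_suffix_len(text, list(blocked_words))
    cut = len(text) - held
    return text[:cut], text[cut:]


def must_flush(held_for_ms: int, prefix_holdback_max_ms: int = 500) -> bool:
    """The held tail is released once it has waited the maximum holdback time."""
    return held_for_ms >= prefix_holdback_max_ms
```

In this toy model, `split_chunk("reset your pass", ["password"])` holds back only `"pass"`, so the delay applies to the risky tail and not to the whole chunk.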

Try it: Guardrail stream simulator

[Interactive simulator: prefix-holdback buffer and the regenerate round trip. Inputs: Blocked words (comma-separated); Bot sentence to stream; prefix_holdback_max_ms, shown at 500 ms; Outcome when a blocked word commits. Output: latency cost, with series LLM #1 stream, Prefix holdback / wait msg, and Regenerate LLM #2.]

Type a sentence. If a blocked word appears, watch the holdback buffer fill once a risky prefix forms: it either flushes at prefix_holdback_max_ms (safe) or triggers refuse / regenerate, which costs a second LLM round trip.

Step 58

Pre-tool speech and the idle handler: masking, never reduction [TTS]

Pre-tool speech (config.preSpeech → PreToolSpeech; modes None / Static / Auto / Audio) is dispatched the moment the tool name appears mid-stream (EarlyFunctionCallStartedFrame, src/core/processors/call_action_trigger.py:368-398). The in-source docstring says this "saves up to ~1.4 s of perceived voice latency" compared with waiting for the full response. Auto mode keeps a 10-sample history per tool: avg <500 ms → silent, >1000 ms or first call → speak (src/core/pre_tool_speech.py:16-18,115-135). Fillers are pushed directly downstream rather than through the pipeline head, where they would serialize behind the awaited tool (call_action_trigger.py:488-494), and are never committed to LLM context (no token cost). Pure masking: the tool runs on its own schedule regardless.
Idle handler — fires after 45 s of user silence (env default USER_AWAY_TIMEOUT_S, src/core/settings.py:277; per-node/agent overrides). A custom check_in_message costs only TTS; the default LLM check-in is one extra LLM+TTS round. It never delays a live turn, and it is explicitly held while blocking workflow actions run so the agent does not check in on itself.

config.preSpeech → PreToolSpeech · Auto: avg >1000 ms → speak · saves up to ~1.4 s perceived

Symptom it fixes: dead air while an LLM-invoked tool runs. Pre-tool speech does not make the tool faster; it dispatches a filler up to ~1.4 s earlier than waiting for the full response would, masking the gap.
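The Auto-mode rule described above can be sketched as a small per-tool tracker. The class and method names are hypothetical. One behavior is an explicit assumption: the text gives no rule for averages between 500 ms and 1000 ms, so this sketch simply repeats the previous decision in that band.

```python
from collections import deque


class AutoPreSpeech:
    """Tracks recent durations for one tool and decides whether to speak a filler."""

    def __init__(self, history: int = 10,
                 silent_below_ms: float = 500, speak_above_ms: float = 1000) -> None:
        self.samples = deque(maxlen=history)  # only the last `history` runs count
        self.silent_below_ms = silent_below_ms
        self.speak_above_ms = speak_above_ms
        self._last_decision = True

    def record(self, duration_ms: float) -> None:
        """Store how long the tool took on its latest run."""
        self.samples.append(duration_ms)

    def should_speak(self) -> bool:
        if not self.samples:
            decision = True  # first call: no history yet, so speak
        else:
            avg = sum(self.samples) / len(self.samples)
            if avg < self.silent_below_ms:
                decision = False
            elif avg > self.speak_above_ms:
                decision = True
            else:
                # ASSUMPTION: the middle band is unspecified; keep the last decision.
                decision = self._last_decision
        self._last_decision = decision
        return decision
```

Because the history is per tool, a deployment would hold one tracker per tool name. A vendor that averages 4 s always lands on the speak side of the rule.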

Step 59

Voicemail, DTMF, transfers [net]

Voicemail detection is a race, not a tax. The classifier and the response LLM run in parallel; the gate buffers LLM-origin frames only, so first-turn cost = max(classifier, response_LLM), bounded by a 5.0 s fail-open timeout (src/core/processors/voicemail_detector.py:1-22,109,277-353). The 2.0 s voicemail_response_delay is intentional beep-clearing, post-verdict.
DTMF flush delay. Every DTMF turn ends at lastKey + timeout (default 2.0 s) unless the caller presses the termination digit (#). Freya's FreyaDTMFAggregator subclass exists purely for latency: it emits TranscriptionFrame(finalized=True) so the turn closes immediately on flush instead of stalling until the user speaks (src/exts/dtmf_aggregator.py:1-38). Lower the timeout for fixed-length codes; DTMF remains the latency- AND accuracy-optimal channel for digits.
Transfers speak first, act second. transfer_call is a deferred terminal action: handoff delay = farewell duration (one extra LLM call if message_mode: "llm") + an optional interruptible grace window (default ≥3.0 s, user speech restarts it) + the REFER POST to the dashboard/PBX (src/tools/transfer_call.py:150-171; call_action_trigger.py:688-806; src/core/freya_client.py:584-624). No measured REFER round-trip numbers exist (not verified in source).

Symptom it causes/fixes: a ~2 s tail at the end of a DTMF entry (lower the timeout, or train callers to press #), and a several-second farewell-then-handoff window on transfer (by design — the caller hears the goodbye, not silence).
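Two of the timings above reduce to short formulas. The sketch below is one reading of them, with illustrative function names: the voicemail gate is modeled as waiting for the classifier for at most the fail-open timeout while the response LLM runs in parallel, and the DTMF turn is modeled as ending at the termination digit or at the last key plus the timeout.

```python
from typing import List, Tuple


def voicemail_first_turn_cost_s(classifier_s: float, response_llm_s: float,
                                fail_open_s: float = 5.0) -> float:
    """Race, not sum: the gate waits on the classifier no longer than fail_open_s."""
    return max(min(classifier_s, fail_open_s), response_llm_s)


def dtmf_turn_end_s(presses: List[Tuple[float, str]], timeout_s: float = 2.0,
                    terminator: str = "#") -> float:
    """Time at which the DTMF turn closes, given (timestamp_s, key) presses in order."""
    if not presses:
        raise ValueError("at least one key press is required")
    last_t = None
    for t, key in presses:
        if last_t is not None and t > last_t + timeout_s:
            break  # the flush timeout fired before this key arrived
        if key == terminator:
            return t  # termination digit closes the turn immediately
        last_t = t
    return last_t + timeout_s


four_digits = [(0.0, "1"), (0.5, "2"), (1.0, "3"), (1.5, "4")]
print(dtmf_turn_end_s(four_digits))                 # 3.5: last key + 2.0 s
print(dtmf_turn_end_s(four_digits + [(1.8, "#")]))  # 1.8: closed by '#'
```

The 1.7 s difference between the two printed values is the tail that a lower timeout or a pressed # removes.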

Step 60

Call start: prefetch waves and who pays the boot [net]

What it is. The boot DAG (bot.py:197-305) splits into waves. Serializable waves 0–2 (resolve_config → lifecycle_tools + register_call → generate_first_message + warmup_llm_cache) run at ring time via the telephony /prefetch route (src/routes/telephony.py:355-449); only waves 3–6 run at connect.

Telephony rings; web does not. On telephony, start-of-call API tools (CRM lookups), MODEL_GENERATED first-message generation, and the boot prefix warmup are all masked by the ring. Web calls have no prefetch: the whole DAG is user-visible at connect (plus the 1.1 s early-media delay from Part 10).

Workflow agents skip the boot warmup and warm per node instead (Part 5); a first node without entry speech eats the cold prefill on turn 1.

Symptom it causes: slow time-to-first-hello on web calls only — the boot work that telephony hides behind the ring is fully exposed. Fix with a static first message and trimmed boot work (Parts 10, 11).
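The telephony versus web difference can be put into a rough model. This is a sketch under stated assumptions, not a measurement: waves are assumed to run one after another, any prefetch work that outlasts the ring is assumed to spill into the visible window, and the per-wave durations in the example are invented. Only the wave split (0–2 at ring time, 3–6 at connect) and the 1.1 s early-media delay come from the text.

```python
from typing import Sequence

EARLY_MEDIA_DELAY_MS = 1100  # web-only delay cited from Part 10


def visible_boot_ms(wave_ms: Sequence[int], channel: str, ring_ms: int = 0) -> int:
    """Boot time the caller actually waits through after connect."""
    if len(wave_ms) != 7:
        raise ValueError("expected durations for waves 0 through 6")
    prefetch_ms = sum(wave_ms[:3])  # waves 0-2
    connect_ms = sum(wave_ms[3:])   # waves 3-6
    if channel == "telephony":
        # ASSUMPTION: prefetch left unfinished when the ring ends becomes visible.
        return max(0, prefetch_ms - ring_ms) + connect_ms
    if channel == "web":
        # No prefetch on web: the whole DAG plus the early-media delay is exposed.
        return prefetch_ms + connect_ms + EARLY_MEDIA_DELAY_MS
    raise ValueError(f"unknown channel: {channel}")


waves = [200, 500, 1000, 100, 100, 100, 100]  # invented durations, in ms
print(visible_boot_ms(waves, "telephony", ring_ms=4000))  # 400
print(visible_boot_ms(waves, "web"))                      # 3200
```

With these invented numbers the same boot work costs the web caller 2.8 s more, all of it work the ring would have hidden.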

Checkpoint: a workflow node calls a vendor API that averages 4 s. Callers hear dead air. List the masking levers in the order you would apply them.

Give the node a 3–4 s entry message — blocking pre-actions execute behind it by design (processor.py:2653-2659).
If the call is LLM-invoked instead, configure Pre-tool speech (Static, or Auto, which will trigger because avg >1000 ms). The filler is dispatched up to ~1.4 s earlier thanks to the early-fire frame.
Consider blocking: false if the flow can proceed and re-route when the result arrives.
Set a node-action timeoutMs so a hung vendor cannot ride the 30 s budget — remembering that on LLM-invoked tools that field is dead and only the 30 s budget applies.

Ask Claude Code: "Show me every place blocking on a workflow action is read in pipecat-agent, and confirm whether timeout_ms / max_retries have a runtime consumer for LLM-invoked api_call tools."