Buy masking time with audio
Most of what follows is not a per-turn tax. It is external latency landing inside a turn, a one-shot cost at call start, or masking machinery that hides latency rather than removing it. The platform's recurring design pattern repeats at every layer: buy masking time with audio.
What dominates here: external API round-trip time — the one latency source Freya does not control — and whether you spent audio to mask it. Everything else in the long tail is bounded, rare, or off-path by design.
api_call: where external latency lands in the turn [net]
What it is. An api_call can run two different ways, and the two shapes put the vendor's round-trip time in completely different places on the critical path.
- Workflow pre-action. blocking: true (the default) actions are awaited at pipecat-agent processor.py:3065; routing does not proceed until the API returns, so the vendor's RTT is inside the felt turn, but it is masked by the node's entry message, which is deliberately spoken first (processor.py:2653-2659). blocking: false actions run in the background and re-trigger routing when the result arrives (processor.py:3145-3190). Results are injected as [Action Result] system messages (processor.py:3192-3211), which also grow the speaking prompt. The idle timer is suppressed during blocking actions (processor.py:3045-3047).
- LLM-invoked tool. The turn becomes LLM call #1 (decides the tool) → api_call (external RTT) → LLM call #2 (speaks the result). The API sits between two LLM calls on the critical path; only pre-tool speech masks it (src/tools/api_call.py:266-499).
To reproduce a slow or failing vendor, the mock API exposes GET /test/slow?delay=ms (default 5000) and GET /test/fail?status=&after= (mock-api lib/router.js:140-167). Point a node's api_call at /test/slow?delay=4000 to reproduce a 4 s vendor and watch the entry message cover it.

Symptom it causes/fixes: dead air mid-flow on a specific node, that is, a blocking pre-action with no entry message to hide behind. The cost is the vendor RTT itself, often +2–4 s of in-turn silence.
Try it — Where does the API call hide?
The caller only hears silence when the API is still running and nothing is being spoken. Spend entry-message or filler duration to cover the vendor's round trip.
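The masking rule above reduces to simple arithmetic. The sketch below is illustrative only: `dead_air_ms` is a hypothetical helper, not a platform function, and it assumes the API call and the masking audio start at the same moment.

```python
def dead_air_ms(api_rtt_ms: int, masking_audio_ms: int = 0) -> int:
    """Silence the caller hears while an api_call is in flight.

    masking_audio_ms is the duration of whatever is being spoken while the
    call runs: the node's entry message for a blocking pre-action, or the
    pre-tool speech filler for an LLM-invoked tool.
    Assumption: audio and API call start together.
    """
    # Only the part of the round trip that outlasts the audio is audible.
    return max(0, api_rtt_ms - masking_audio_ms)


if __name__ == "__main__":
    # A 4 s vendor with no entry message: the full round trip is silence.
    print(dead_air_ms(4000))
    # The same vendor behind a 3.5 s entry message.
    print(dead_air_ms(4000, 3500))
```

The model makes the trade explicit: the API is never faster, and the only variable you control is how much audio you spend in front of it.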
The Timeout / Max-retries gotcha: real vs dead consumer [gotcha]
What it is. The same two fields — Timeout (config.timeoutMs → timeout_ms) and Max retries (config.maxRetries → max_retries) — behave completely differently depending on whether the api_call is a node action or an LLM-invoked tool.
- Node actions (real). Timeout and Max retries are honored: _invoke_with_retry (processor.py:3072-3143) applies the timeout and retries on exceptions and HTTP {408, 429, 500, 502, 503, 504} with backoff min(2**attempt, 8) s. Worst case with maxRetries=2 and a 10 s timeout: 10 + 1 + 10 + 2 + 10 = 33 s of in-turn dead air, minus whatever the entry message masks.
- LLM-invoked tools (dead). ToolExecutionConfig fields exist (src/core/types.py:286-293) but tool registration consumes only always_runs_at and pre_speech (src/tools/handler_utils.py:1378-1394). timeout_ms / max_retries on an LLM-callable api_call tool are parsed and ignored: there is no runtime consumer. Only api_call's hardcoded 30 s total budget across all redirect hops applies (src/tools/api_call.py:68). SSRF DNS pre-validation runs serially per hop in prod, bounded at 10 s (api_call.py:181-202).

Do not promise a customer a per-tool timeout on an LLM-invoked api_call.

Symptom it causes: a hung vendor riding the full 33 s (node action) or 30 s budget (LLM tool) as dead air. The fix on node actions is a tight timeoutMs; on LLM tools there is no fix but the 30 s ceiling.
Try it — Retry / backoff cost calculator
Models _invoke_with_retry exactly: each attempt waits up to timeoutMs (timeout mode) or returns fast then sleeps min(2**attempt, 8) s of backoff before the next. The 30 s api_call budget is drawn as the ceiling.
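The same arithmetic can be written down as a small calculator. This is a sketch of the retry cost model described above, not the code in processor.py; `worst_case_s` and its `attempt_s` parameter are hypothetical names, and the backoff index is assumed to start at 0, which matches the 10 + 1 + 10 + 2 + 10 example.

```python
from typing import Optional


def backoff_s(attempt: int) -> int:
    """Sleep before the next attempt: min(2**attempt, 8) seconds."""
    return min(2 ** attempt, 8)


def worst_case_s(timeout_s: float, max_retries: int,
                 attempt_s: Optional[float] = None) -> float:
    """Total in-turn wait for a node-action api_call that never succeeds.

    timeout_s   -- per-attempt timeout (timeoutMs / 1000)
    max_retries -- retries after the first attempt (maxRetries)
    attempt_s   -- how long each attempt takes before failing; None means
                   every attempt rides the full timeout (timeout mode)
    """
    per_attempt = timeout_s if attempt_s is None else min(attempt_s, timeout_s)
    attempts = max_retries + 1
    # A backoff sleep follows every attempt except the last one.
    sleeps = sum(backoff_s(a) for a in range(max_retries))
    return attempts * per_attempt + sleeps


if __name__ == "__main__":
    print(worst_case_s(10, 2))  # hung vendor, 10 s timeout, 2 retries
    print(worst_case_s(2, 2))   # same vendor with a tight 2 s timeout
```

Tightening the timeout shrinks the dominant term; the backoff sleeps stay fixed for a given retry count.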
Guardrails: nearly free until they fire [LLM]
What it is. Output guardrails run on the streaming text path. Almost everything they do is cheap; only two outcomes cost real latency.
- Per-chunk deterministic checks (regex 1–5 ms, structured 1–2 ms, dictionary <1 ms; in-source estimates, src/guardrails/engine.py:27-34) run inline on every LLMTextFrame. Cheap.
- Prefix holdback is the one streaming feature that delays audio: chunks ending in a risky prefix are buffered up to prefix_holdback_max_ms (default 500 ms, src/guardrails/models.py:101-103).
- The output classifier does NOT gate streaming audio: it runs at response end, after safe chunks already went to TTS, with a ≤400 ms timeout, fail-open (src/guardrails/processor.py:192-204; models.py:94-97).
- The expensive outcome is regenerate: wait-message TTS plus a full extra LLM round trip, up to max_regenerate_retries (default 2) times before falling back to refuse (processor.py:308-361; src/core/types.py:458).
- Input guardrails, when enabled (default off), add normalization + injection regex (<1 ms) and optionally one ≤400 ms classifier call serially before the LLM (src/guardrails/input_processor.py:140-179). Refused input turns skip the LLM entirely, so they are faster than normal turns.
When guardrails_enabled is false, the processor is removed from the pipeline entirely (types.py:437-465): zero cost, not a no-op pass-through.

Symptom it causes: an occasional +500 ms hitch (prefix holdback) on a turn whose wording grazed a blocked prefix, or a multi-second stall on the rare regenerate path (extra LLM round trip × retries).
Try it — Guardrail stream simulator
Type a sentence. If a blocked word appears, watch the holdback buffer fill once a risky prefix forms: it either flushes at prefix_holdback_max_ms (safe) or triggers refuse / regenerate, which costs a second LLM round trip.
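A minimal sketch of the holdback behaviour described above, under stated assumptions: the function names, the (arrival_ms, text) chunk format, and the case-insensitive prefix test are all hypothetical and are not taken from src/guardrails. It only shows why a risky tail delays audio by at most the holdback window.

```python
def ends_in_risky_prefix(text, blocked_terms):
    """True if the tail of text is a proper prefix of any blocked term."""
    lowered = text.lower()
    for term in blocked_terms:
        t = term.lower()
        if any(lowered.endswith(t[:k]) for k in range(1, len(t))):
            return True
    return False


def simulate(chunks, blocked_terms, holdback_max_ms=500):
    """chunks: list of (arrival_ms, text). Returns (released, verdict).

    released is a list of (release_ms, text) that reached TTS; verdict is
    "ok" or "blocked" (where the real engine would refuse or regenerate).
    """
    released, buffer, held_since, last_ms = [], "", None, 0
    for arrival_ms, text in chunks:
        last_ms = arrival_ms
        # Holdback timer expired: flush what was held and treat it as safe.
        if held_since is not None and arrival_ms - held_since >= holdback_max_ms:
            released.append((held_since + holdback_max_ms, buffer))
            buffer, held_since = "", None
        buffer += text
        if any(term.lower() in buffer.lower() for term in blocked_terms):
            return released, "blocked"
        if ends_in_risky_prefix(buffer, blocked_terms):
            if held_since is None:
                held_since = arrival_ms  # start the holdback timer
        else:
            released.append((arrival_ms, buffer))
            buffer, held_since = "", None
    if buffer:
        # Assumption: end of stream flushes whatever is still held.
        released.append((last_ms, buffer))
    return released, "ok"


if __name__ == "__main__":
    stream = [(0, "Your code is "), (100, "pass"), (200, "ed along.")]
    print(simulate(stream, ["password"]))
```

In the example stream, the chunk "pass" is held for 100 ms until the next chunk shows the word is harmless; had nothing arrived, it would have flushed at the 500 ms ceiling.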
Pre-tool speech and the idle handler: masking, never reduction [TTS]
- Pre-tool speech (config.preSpeech → PreToolSpeech; modes None / Static / Auto / Audio) is dispatched the moment the tool name appears mid-stream (EarlyFunctionCallStartedFrame, src/core/processors/call_action_trigger.py:368-398). The in-source docstring quantifies "saves up to ~1.4 s of perceived voice latency" vs waiting for the full response. Auto mode keeps a 10-sample history per tool: avg <500 ms → silent, >1000 ms or first call → speak (src/core/pre_tool_speech.py:16-18,115-135). Fillers are pushed directly downstream (not through the pipeline head, where they would serialize behind the awaited tool; call_action_trigger.py:488-494) and never committed to LLM context (no token cost). Pure masking: the tool runs on its own schedule regardless.
- Idle handler: fires after 45 s of user silence (env default USER_AWAY_TIMEOUT_S, src/core/settings.py:277; per-node/agent overrides). A custom check_in_message costs only TTS; the default LLM check-in is one extra LLM+TTS round. It never delays a live turn, and it is explicitly held while blocking workflow actions run so the agent does not check in on itself.
Symptom it fixes: dead air while an LLM-invoked tool runs. Pre-tool speech does not make the tool faster — it dispatches a filler ~1.4 s earlier than waiting for the full response would, masking the gap.
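The Auto-mode rule can be sketched as a rolling average per tool. This is an illustration under assumptions, not the code in src/core/pre_tool_speech.py: the class and method names are hypothetical, and because the text gives no rule for averages between 500 ms and 1000 ms, the sketch returns "unspecified" for that band rather than guessing.

```python
from collections import deque
from statistics import mean

HISTORY_SIZE = 10       # samples kept per tool
SILENT_BELOW_MS = 500   # average under this: stay silent
SPEAK_ABOVE_MS = 1000   # average over this: speak a filler


class AutoPreSpeech:
    """Decides whether to speak a filler before a tool call (hypothetical)."""

    def __init__(self):
        self._history = {}

    def record(self, tool, duration_ms):
        """Store one observed tool duration, keeping the last 10 per tool."""
        samples = self._history.setdefault(tool, deque(maxlen=HISTORY_SIZE))
        samples.append(duration_ms)

    def decide(self, tool):
        samples = self._history.get(tool)
        if not samples:
            return "speak"  # first call: no history yet
        avg = mean(samples)
        if avg < SILENT_BELOW_MS:
            return "silent"
        if avg > SPEAK_ABOVE_MS:
            return "speak"
        return "unspecified"  # band not described in the source text


if __name__ == "__main__":
    auto = AutoPreSpeech()
    print(auto.decide("lookup_order"))  # first call
    for _ in range(3):
        auto.record("lookup_order", 200)
    print(auto.decide("lookup_order"))  # fast tool
```

A bounded deque keeps the decision responsive: ten slow calls in a row fully displace an earlier run of fast ones.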
Voicemail, DTMF, transfers [net]
- Voicemail detection is a race, not a tax. The classifier and the response LLM run in parallel; the gate buffers LLM-origin frames only, so first-turn cost = max(classifier, response_LLM), bounded by a 5.0 s fail-open timeout (src/core/processors/voicemail_detector.py:1-22,109,277-353). The 2.0 s voicemail_response_delay is intentional beep-clearing, post-verdict.
- DTMF flush delay. Every DTMF turn ends at lastKey + timeout (default 2.0 s) unless the caller presses the termination digit (#). Freya's FreyaDTMFAggregator subclass exists purely for latency: it emits TranscriptionFrame(finalized=True) so the turn closes immediately on flush instead of stalling until the user speaks (src/exts/dtmf_aggregator.py:1-38). Lower the timeout for fixed-length codes; DTMF remains the latency- AND accuracy-optimal channel for digits.
- Transfers speak first, act second. transfer_call is a deferred terminal action: handoff delay = farewell duration (one extra LLM call if message_mode: "llm") + an optional interruptible grace window (default ≥3.0 s, user speech restarts it) + the REFER POST to the dashboard/PBX (src/tools/transfer_call.py:150-171; call_action_trigger.py:688-806; src/core/freya_client.py:584-624). No measured REFER round-trip numbers exist (not verified in source).
Symptom it causes/fixes: a ~2 s tail at the end of a DTMF entry (lower the timeout, or train callers to press #), and a several-second farewell-then-handoff window on transfer (by design — the caller hears the goodbye, not silence).
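The three timing rules above are short formulas, sketched here with hypothetical function names. Two assumptions are baked in: the 5.0 s fail-open timeout is taken to cap only the classifier wait, and the REFER round trip must be supplied by the caller because no measured value exists.

```python
def voicemail_first_turn_s(classifier_s, response_llm_s, timeout_s=5.0):
    """First-turn cost when the classifier and response LLM race.

    Assumption: the gate releases buffered LLM frames when the classifier
    answers or when the fail-open timeout expires, whichever is first.
    """
    return max(min(classifier_s, timeout_s), response_llm_s)


def dtmf_turn_end_s(last_key_s, timeout_s=2.0, terminated=False):
    """Time at which a DTMF turn closes.

    terminated=True means the caller pressed the termination digit (#),
    which ends the turn without waiting for the flush timeout.
    """
    return last_key_s if terminated else last_key_s + timeout_s


def transfer_handoff_s(farewell_s, grace_s, refer_s):
    """Handoff delay: farewell audio, then grace window, then REFER POST."""
    return farewell_s + grace_s + refer_s


if __name__ == "__main__":
    print(voicemail_first_turn_s(0.8, 1.2))          # LLM is the slower leg
    print(dtmf_turn_end_s(4.0))                       # waits out the timeout
    print(dtmf_turn_end_s(4.0, terminated=True))      # caller pressed #
```

The DTMF case shows the cheapest lever: pressing # or lowering the timeout removes the tail entirely, with no change to the rest of the turn.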
Call start: prefetch waves and who pays the boot [net]
What it is. The boot DAG (bot.py:197-305) splits into waves. Serializable waves 0–2 (resolve_config → lifecycle_tools + register_call → generate_first_message + warmup_llm_cache) run at ring time via the telephony /prefetch (src/routes/telephony.py:355-449); only waves 3–6 run at connect.
Workflow agents skip the boot warmup and warm per node instead (Part 5); a first node without entry speech eats the cold prefill on turn 1.
Symptom it causes: slow time-to-first-hello on web calls only — the boot work that telephony hides behind the ring is fully exposed. Fix with a static first message and trimmed boot work (Parts 10, 11).
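Who pays the boot can be sketched as a single function. The name `time_to_first_hello_ms`, the wave durations in the usage example, and the rule that any prefetch work outlasting the ring spills into connect time are all assumptions for illustration; none are measured values.

```python
def time_to_first_hello_ms(prefetch_waves_ms, connect_waves_ms, ring_ms=None):
    """Boot latency the caller actually experiences.

    prefetch_waves_ms -- total duration of waves 0-2
    connect_waves_ms  -- total duration of waves 3-6
    ring_ms           -- ring duration on telephony calls; None for web
                         calls, which have no ring to hide behind
    """
    if ring_ms is None:
        # Web call: every wave is exposed.
        return prefetch_waves_ms + connect_waves_ms
    # Telephony: waves 0-2 overlap the ring; only the overflow is felt.
    overflow = max(0, prefetch_waves_ms - ring_ms)
    return overflow + connect_waves_ms


if __name__ == "__main__":
    # Illustrative durations only.
    print(time_to_first_hello_ms(1500, 400))                # web call
    print(time_to_first_hello_ms(1500, 400, ring_ms=4000))  # telephony
```

With identical boot work, the web call pays both terms while the telephony call pays only the connect-time waves, which is the asymmetry the symptom describes.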
Checkpoint: a workflow node calls a vendor API that averages 4 s. Callers hear dead air. List the masking levers in the order you would apply them.
- Give the node a 3–4 s entry message; blocking pre-actions execute behind it by design (processor.py:2653-2659).
- If the call is LLM-invoked instead, configure Pre-tool speech (static, or Auto, which will trigger since avg >1000 ms); it is dispatched up to ~1.4 s earlier thanks to the early-fire frame.
- Consider blocking: false if the flow can proceed and re-route when the result arrives.
- Set a node-action timeoutMs so a hung vendor cannot ride the 30 s budget, remembering that on LLM-invoked tools that field is dead and only the 30 s budget applies.
blocking on a workflow action is read in pipecat-agent, and confirm whether timeout_ms / max_retries have a runtime consumer for LLM-invoked api_call tools."