Part 10

Network & telephony: the invisible 150–300 ms

The legs outside the AI pipeline add a mostly-fixed per-turn cost that never appears in Langfuse traces, plus large one-time setup costs that hit "time to first hello", not per-turn latency.

Overview

The media path is uninstrumented NET

Steps 1–9 of a turn (VAD, STT, the LLM bundle, TTS) all show up in the trace. The legs between the caller's handset and the agent process do not: RTP packetization, carrier jitter buffers, the Asterisk relay, the agent's output chunking, and any geographic round trip. Together they are a fixed floor of roughly 150–300 ms per turn that you will hunt for forever in the span timeline and never find.

The triage rule. If the trace timeline is clean but the caller feels lag, suspect the media path. debug-call-audio references/latency.md:122-126

This part also covers the one-time setup costs (Wait(1), the WebRTC early-media delay) which hit time-to-first-word once per call, and the open barge-in question on Asterisk. SIP signalling anatomy and coturn credential matching live in the sibling sip-telephony and coturn-webrtc tutorials — this part does not re-derive them.

Step 50

The fixed per-turn floor: packetization, jitter, relay NET

What it is. The irreducible, mostly-protocol-fixed cost every turn pays just to move audio across the wire and through the PBX. None of it is a tunable Freya knob.

packetization: 20 ms × 2 · jitter buffers: 100–200 ms · relay + transcode: single-digit ms
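As a rough sanity check, the three components can be summed into a low and a high bound. The sketch below is illustrative arithmetic only: the 1–9 ms range for relay + transcode is an assumed reading of "single-digit ms", and the names `FLOOR_COMPONENTS_MS` and `media_floor_ms` are hypothetical.

```python
# Rough per-turn media-path floor, in milliseconds, as (low, high) bounds.
# The 1-9 ms relay range is an assumed reading of "single-digit ms".
FLOOR_COMPONENTS_MS = {
    "packetization (20 ms x 2)": (40, 40),
    "jitter buffers": (100, 200),
    "relay + transcode": (1, 9),
}


def media_floor_ms(components=FLOOR_COMPONENTS_MS):
    """Sum the low bounds and the high bounds of every component."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


low, high = media_floor_ms()
print(f"media-path floor: {low}-{high} ms per turn")
```

The sum lands slightly under the 150–300 ms quoted earlier; the 40 ms output chunk fill and any geographic round trip are not included in it.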

Quality thresholds (read via RTCP / pjsip show channelstats): jitter >30 ms audible, loss >1% degrades, RTT >300 ms produces talk-over. sip-telephony guide.md:124-133
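The thresholds can be applied mechanically once the three numbers have been read off the channel stats. This is a minimal sketch: `ChannelStats` and `media_warnings` are hypothetical names, and the code does not parse `pjsip show channelstats` output, so the values have to be extracted by hand or by your own tooling.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
JITTER_AUDIBLE_MS = 30.0
LOSS_DEGRADES_PCT = 1.0
RTT_TALK_OVER_MS = 300.0


@dataclass
class ChannelStats:
    """Hypothetical container for values read from RTCP or channel stats."""
    jitter_ms: float
    loss_pct: float
    rtt_ms: float


def media_warnings(stats: ChannelStats) -> list[str]:
    """Return one warning per threshold the channel exceeds."""
    warnings = []
    if stats.jitter_ms > JITTER_AUDIBLE_MS:
        warnings.append("jitter audible")
    if stats.loss_pct > LOSS_DEGRADES_PCT:
        warnings.append("loss degrades audio")
    if stats.rtt_ms > RTT_TALK_OVER_MS:
        warnings.append("RTT causes talk-over")
    return warnings


print(media_warnings(ChannelStats(jitter_ms=45, loss_pct=2.5, rtt_ms=120)))
```

An empty list means the media path is within the quoted limits; it does not mean the fixed floor is absent.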

Try it — round-trip budget stacker

One user turn, end to end. Toggle the deployment and call type, drag the carrier jitter slider, and watch the network share change while the AI share keeps dominating. The point: even the worst-case media path is small next to the model.

One turn — where the milliseconds live
Step 51

Output pacing and the open barge-in question NET

What it is. The agent's outbound audio leaves in 40 ms chunks (4 × 10 ms default, not overridden anywhere in pipecat-agent) written every 20 ms, so the agent sends at 2× real time. Asterisk's chan_websocket buffer (cap 1000 frames = 20 s) re-times the bursty stream into clean 20 ms RTP. Steady-state added latency: about one frame; first audio waits for one 40 ms chunk to fill.

audio_out_10ms_chunks → 4 (= 40 ms) · pace: 40 ms chunk / 20 ms · Asterisk buffer cap: 1000 frames = 20 s

Runtime effect. A 40 ms chunk every 20 ms means a 10 s TTS utterance leaves ~5 s of audio sitting in Asterisk's buffer by the end. pipecat transports/base_transport.py:63, pipecat transports/websocket/fastapi.py:394,518-527.
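The 5 s figure follows from simple arithmetic, sketched below under a simplified model: the agent writes one chunk per interval without stalling, and Asterisk plays out at exactly 1× real time. The function name and parameters are hypothetical, and what Asterisk does when the cap is exceeded is not stated in the source, so the sketch only flags that case.

```python
def asterisk_backlog_s(utterance_s, chunk_ms=40.0, write_interval_ms=20.0,
                       frame_ms=20.0, buffer_cap_frames=1000):
    """Estimate audio still queued in Asterisk when the agent finishes writing.

    Returns (backlog_seconds, exceeds_cap).
    """
    send_ratio = chunk_ms / write_interval_ms        # 2.0 with the defaults
    send_time_s = utterance_s / send_ratio           # wall time spent writing
    # Audio played during send_time_s equals send_time_s (1x playout).
    backlog_s = max(0.0, utterance_s - send_time_s)
    cap_s = buffer_cap_frames * frame_ms / 1000.0    # 1000 x 20 ms = 20 s
    return backlog_s, backlog_s > cap_s


print(asterisk_backlog_s(10.0))
```

With the defaults, a 10 s utterance leaves 5 s queued, and the backlog only approaches the 20 s cap for utterances around 40 s long.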

Two findings worth flagging:

The live serializer never sends FLUSH_MEDIA. The Kartik serializer handles only audio frames and HANGUP (src/serializers/kartik_serializer.py:81-87). On barge-in, pipecat clears its own queue, but audio already shipped to Asterisk (up to seconds, given the 2× drain) has no in-band flush on the live path. Whether Asterisk drops or plays it is not verifiable in these repos. If users report "the agent keeps talking after I interrupt" on Asterisk deployments, look here first.
The 80 ms jitter buffer is dead code. AsteriskWSServerTransport with its initial_jitter_buffer_ms: 80 has NO runtime consumer (src/transports/chan_ws_server.py:157; nothing imports it). The live path is the un-buffered FastAPIWebsocketTransport + Kartik serializer (bot.py:611-620). Do not present the 80 ms buffer as a live cost.
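To make the first finding concrete, here is a sketch of what an in-band flush on barge-in could look like. Everything in it is an assumption for illustration: the frame classes are stand-ins rather than pipecat's own types, `FlushAwareSerializer` is hypothetical and is not the Kartik serializer, and the real `serialize()` may be asynchronous. Whether the deployed chan_websocket version honours `FLUSH_MEDIA` as expected would need to be verified separately.

```python
from dataclasses import dataclass


# Stand-in frame types for illustration; these are NOT pipecat's classes.
@dataclass
class AudioFrame:
    pcm: bytes


@dataclass
class InterruptionFrame:
    pass


@dataclass
class EndCallFrame:
    pass


class FlushAwareSerializer:
    """Hypothetical serializer that adds an in-band flush on barge-in."""

    def serialize(self, frame):
        if isinstance(frame, AudioFrame):
            return frame.pcm          # binary message: raw media
        if isinstance(frame, EndCallFrame):
            return "HANGUP"           # text command the live path already sends
        if isinstance(frame, InterruptionFrame):
            return "FLUSH_MEDIA"      # text command the live path never sends
        return None                   # every other frame type is ignored


serializer = FlushAwareSerializer()
print(serializer.serialize(InterruptionFrame()))
```

The design point is only that the interruption has to cross the WebSocket as its own message; clearing the agent-side queue cannot reach audio that Asterisk already holds.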

Try it — packetization & pacing visualizer

Inbound: 20 ms frames arrive at 50 pps. Outbound: the agent emits 40 ms chunks every 20 ms (2× real-time), so the Asterisk buffer fills during long TTS and drains at 1×. Press Barge-in mid-utterance: it clears the agent's queue but the Asterisk gauge stays full — the missing FLUSH_MEDIA made tangible.

20 ms in @ 50 pps · 40 ms out every 20 ms · Asterisk re-times to 20 ms RTP
Step 52

Setup latency ≠ turn latency NET

What it is. The big configured numbers all hit time-to-first-word, once per call — not per-turn latency. Don't confuse them.

| Cost | Where |
| --- | --- |
| +1,000 ms | dialplan Wait(1) between Answer() and Dial(WebSocket); the largest configured fixed delay in the telephony setup path; remove once early-media races are confirmed solved. openshift/garanti/cpu/configmaps/asterisk-config.yaml:354-358 |
| +1,100 ms | WEBRTC_EARLY_MEDIA_DELAY before bot start on web calls. src/utils/general.py:33; bot.py:690-691 |
| 91–450 ms; 0.4–44 s spike | Twilio media-WS setup (cloud); spike root cause left open. notes/investigations/call-investigation-2026-03-03.md:393 |
| 63–443 ms; 2,773 ms miss | inbound config resolve (cached / cache-miss); 332–625 ms pipeline init. same investigation :386-396 |

Runtime effect. On telephony, the inbound webhook + /prefetch runs config resolution during the ring (src/routes/telephony.py:355-449,453-594), so most of this is masked. Web calls have no prefetch and pay the whole boot DAG at connect, plus the 1.1 s early-media delay.

Ask Claude Code: "Show me every Wait( in the freya-onprem Asterisk dialplans and which contexts they sit in."
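The same question can be answered offline with a short script. This is a minimal sketch, assuming standard Asterisk dialplan syntax with `[context]` headers and `;` comment lines; `find_waits` is a hypothetical helper and `SAMPLE` is made-up dialplan text, not the contents of the repo's ConfigMap.

```python
import re

CONTEXT_RE = re.compile(r"^\[([^\]]+)\]")
WAIT_RE = re.compile(r"\bWait\(([^)]*)\)")


def find_waits(dialplan_text):
    """Return (context, line_number, wait_argument) for every Wait( call."""
    hits = []
    context = None
    for number, raw in enumerate(dialplan_text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith(";"):
            continue                      # full-line dialplan comment
        header = CONTEXT_RE.match(line)
        if header:
            context = header.group(1)     # remember the enclosing context
            continue
        for match in WAIT_RE.finditer(line):
            hits.append((context, number, match.group(1)))
    return hits


# Made-up dialplan fragment for illustration only.
SAMPLE = """
[from-carrier]
exten => _X.,1,Answer()
 same => n,Wait(1)
 same => n,Dial(WebSocket/agent)
"""

for context, number, seconds in find_waits(SAMPLE):
    print(f"[{context}] line {number}: Wait({seconds})")
```

When the dialplan is embedded in a YAML ConfigMap, pass the whole file's text; the reported line numbers are then relative to the YAML file. Trailing comments on a code line are not stripped, so a `Wait(` inside one would be reported too.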
Step 53

Web calls: ICE, TURN, and when media hairpins NET

What it is. Browser calls run SmallWebRTCTransport (bot.py:578-582); ICE servers come from the dashboard route built from TURN_PUBLIC_HOST/TURN_PORT/TURN_USERNAME/TURN_PASSWORD, falling back to public Google STUN only when unset. freya-dashboard src/app/api/token/webrtc/ice-servers/route.ts:8-35
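The selection logic can be sketched as follows. The real route is TypeScript and is not reproduced here; this Python version is an illustration under stated assumptions: the fallback triggers when `TURN_PUBLIC_HOST` is unset, the default port 3478 is the standard TURN port rather than a value taken from the route, and `build_ice_servers` is a hypothetical name.

```python
import os

GOOGLE_STUN = "stun:stun.l.google.com:19302"


def build_ice_servers(env=os.environ):
    """Return a list shaped like the iceServers field of an RTCConfiguration."""
    host = env.get("TURN_PUBLIC_HOST")
    if not host:
        # Assumed fallback condition: no TURN host configured.
        return [{"urls": GOOGLE_STUN}]
    port = env.get("TURN_PORT", "3478")   # assumed default, standard TURN port
    return [{
        "urls": f"turn:{host}:{port}",
        "username": env.get("TURN_USERNAME", ""),
        "credential": env.get("TURN_PASSWORD", ""),
    }]


print(build_ice_servers({}))
```

With STUN only, a browser behind a strict NAT has no relay candidate to fall back on, which is why the fallback is worth noticing in production.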

Cross-link, don't re-derive. SIP/coturn details live in the sip-telephony and coturn-webrtc tutorials in this repo.

Try it — ICE path picker

Pick the NAT type on each side and whether a TURN relay is configured. The picker runs the candidate-pair priority logic (host → srflx → relay) and shows which path media takes — including the silent-call failure when strict NAT meets no relay.
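The priority ordering can be sketched in a few lines. The candidate priority formula and the type preferences (126, 100, 0) are the recommended values from RFC 8445. The reachability rules in `pick_path` are a deliberately toy model with hypothetical NAT labels (`open`, `moderate`, `strict`); real ICE ranks candidate pairs and runs connectivity checks rather than applying fixed rules.

```python
# Recommended type preferences from RFC 8445.
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


def candidate_priority(kind, local_pref=65535, component_id=1):
    """RFC 8445 candidate priority formula."""
    return (2**24) * TYPE_PREFERENCE[kind] + (2**8) * local_pref + (256 - component_id)


def pick_path(browser_nat, agent_nat, turn_configured):
    """Toy model: return the highest-priority usable path, or None."""
    nats = {browser_nat, agent_nat}
    usable = []
    if nats == {"open"}:
        usable.append("host")      # direct only when neither side is NATed
    if "strict" not in nats:
        usable.append("srflx")     # STUN-derived address works without strict NAT
    if turn_configured:
        usable.append("relay")     # a TURN relay works regardless of NAT type
    if not usable:
        return None                # no working pair: the silent-call case
    return max(usable, key=candidate_priority)


print(pick_path("strict", "moderate", turn_configured=False))
```

The `None` branch is the case to watch for: signalling succeeds, no media path exists, and the caller hears nothing.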

browser → coturn → agent · direct vs srflx vs relay
Step 54

Geography: the cloud round trip and what removing it revealed NET

What it is. On EU cloud (eu-central-1), a Turkish caller's audio crosses TR↔Frankfurt twice per turn: once inbound with the caller's speech and once outbound with the agent's reply. No measured TR↔Frankfurt RTT exists in source, so the size of this leg is unverified.

The indirect evidence. When Garanti moved to a TR on-prem pod, total turn latency dropped to 0.49–0.60 s — fast enough that the agent started interrupting callers mid-digit-dictation; "cloud models were slow enough to mask this". notes/customers/garanti/current-state.md:104-108

Latency wins can create tuning work. The on-prem move removed the geographic leg and cloud model queueing at once, so the delta conflates network and model serving. But the lesson holds: getting faster surfaced VAD over-triggering that the cloud's slowness had hidden.

What dominates here: nothing tunable per-turn — it's a fixed ~150–300 ms floor. The actionable items are the one-time setup delays (Wait(1), early-media delay) and knowing the floor exists so you stop hunting for it in traces.

Checkpoint: a customer reports "+latency" on production calls; you pull the Langfuse trace and every span is healthy — total span time ~1.1 s, but callers swear it's 1.5 s+. Where did the missing 400 ms go?

The media path: ~20 ms packetization × 2, carrier/SBC jitter buffers (100–200 ms), Asterisk relay + transcode, the 40 ms output chunk fill, and any geographic RTT — none of which is instrumented in the trace (debug-call-audio references/latency.md:122-126).

To see the true gap, measure audio-to-audio with the separated recording tracks (Part 12), not span timestamps.
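A crude version of that measurement is sketched below. It assumes two time-aligned tracks of signed PCM samples at the same sample rate, containing a single caller turn followed by the agent's reply; the function names and the fixed magnitude threshold are hypothetical, and a real recording would call for windowed energy detection rather than a per-sample check.

```python
def last_active(samples, threshold):
    """Index of the last sample whose magnitude exceeds threshold, or None."""
    for i in range(len(samples) - 1, -1, -1):
        if abs(samples[i]) > threshold:
            return i
    return None


def first_active(samples, threshold, start=0):
    """Index of the first sample at or after start above threshold, or None."""
    for i in range(start, len(samples)):
        if abs(samples[i]) > threshold:
            return i
    return None


def response_gap_ms(caller, agent, sample_rate, threshold=500):
    """Gap between the end of caller speech and the start of agent speech."""
    caller_end = last_active(caller, threshold)
    if caller_end is None:
        return None
    agent_start = first_active(agent, threshold, start=caller_end + 1)
    if agent_start is None:
        return None
    return (agent_start - caller_end) * 1000.0 / sample_rate


# Synthetic 8 kHz tracks: caller speaks for 0.5 s, agent starts 1.5 s later.
caller = [1000] * 4000 + [0] * 12000
agent = [0] * 15999 + [1000] * 4001
print(response_gap_ms(caller, agent, sample_rate=8000))
```

Comparing this gap with the span total for the same turn gives the uninstrumented remainder directly.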

Ask Claude Code: "Confirm the live Asterisk transport in bot.py and show me every place the Kartik serializer's serialize() can emit a command frame."