The media path is uninstrumented NET
Steps 1–9 of a turn (VAD, STT, the LLM bundle, TTS) all show up in the trace. The legs between the caller's handset and the agent process do not: RTP packetization, carrier jitter buffers, the Asterisk relay, the agent's output chunking, and any geographic round trip. Together they are a fixed floor of roughly 150–300 ms per turn that you will hunt for forever in the span timeline and never find.
debug-call-audio references/latency.md:122-126
This part also covers the one-time setup costs (Wait(1), the WebRTC early-media delay), which hit time-to-first-word once per call, and the open barge-in question on Asterisk. SIP signalling anatomy and coturn credential matching live in the sibling sip-telephony and coturn-webrtc tutorials; this part does not re-derive them.
The fixed per-turn floor: packetization, jitter, relay NET
What it is. The irreducible, mostly-protocol-fixed cost every turn pays just to move audio across the wire and through the PBX. None of it is a tunable Freya knob.
- RTP packetization. PSTN audio moves in 20 ms G.711 frames (~50 pps, 160 bytes payload). Each direction has a ~20 ms framing floor; a round trip pays it twice. Protocol-fixed, no knob. sip-telephony guide.md:85-93,852-862
- Carrier jitter buffers. The triage rule of thumb: "RTP jitter buffers add 100–200 ms, transcoding adds more". debug-call-audio references/latency.md:122-124
- Our side adds no jitter buffer. The rendered rtp.conf has no jbenable/JITTERBUFFER() anywhere; receive-side smoothing is the carrier's, never ours. openshift/garanti/cpu/configmaps/asterisk-config.yaml:47-57
- Asterisk media relay. direct_media = no on every endpoint is non-negotiable (recording + AI need the media), costing one hop and an alaw↔ulaw transcode per direction: single-digit ms on a healthy box. asterisk-config.yaml:158
Quality thresholds (read via RTCP / pjsip show channelstats): jitter >30 ms audible, loss >1% degrades, RTT >300 ms produces talk-over. sip-telephony guide.md:124-133
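The thresholds above can be turned into a small triage helper. This is a minimal sketch: the `MediaStats` type and `triage` function are hypothetical names, and it assumes you have already read jitter, loss, and RTT out of RTCP or `pjsip show channelstats` by some other means.

```python
from dataclasses import dataclass


@dataclass
class MediaStats:
    """One sample of media-path statistics (hypothetical container)."""
    jitter_ms: float  # interarrival jitter
    loss_pct: float   # packet loss, in percent
    rtt_ms: float     # round-trip time


def triage(stats: MediaStats) -> list[str]:
    """Return the thresholds from the text that this sample exceeds."""
    findings = []
    if stats.jitter_ms > 30:
        findings.append("jitter >30 ms: audible")
    if stats.loss_pct > 1:
        findings.append("loss >1%: degraded audio")
    if stats.rtt_ms > 300:
        findings.append("RTT >300 ms: talk-over")
    return findings
```

An empty list means the sample is inside all three thresholds; it does not mean the media path adds no latency.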
Try it — round-trip budget stacker
One user turn, end to end. Toggle the deployment and call type, drag the carrier jitter slider, and watch the network share change while the AI share keeps dominating. The point: even the worst-case media path is small next to the model.
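As a text stand-in for the stacker, the floor can be summed by hand. This is a rough sketch, not a measurement: the function names are hypothetical, the 5 ms per-direction relay cost is an assumed value inside the "single-digit ms" range quoted above, and the geographic round trip defaults to zero because no measured figure exists in source.

```python
SAMPLE_RATE_HZ = 8000   # G.711 narrowband sampling rate
FRAME_MS = 20           # RTP packetization interval from the text
BYTES_PER_SAMPLE = 1    # G.711 encodes 8 bits per sample


def frame_payload_bytes() -> int:
    """Payload size of one 20 ms G.711 frame."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def packets_per_second() -> int:
    """Packet rate implied by the 20 ms framing."""
    return 1000 // FRAME_MS


def media_floor_ms(carrier_jitter_ms: float,
                   relay_ms_per_direction: float = 5.0,  # assumption
                   chunk_fill_ms: float = 40.0,          # output chunk fill
                   geo_rtt_ms: float = 0.0) -> float:    # unknown, see text
    """Sum the uninstrumented per-turn legs of the media path."""
    packetization = 2 * FRAME_MS          # one framing floor per direction
    relay = 2 * relay_ms_per_direction    # Asterisk hop + transcode, both ways
    return packetization + carrier_jitter_ms + relay + chunk_fill_ms + geo_rtt_ms


def media_share(ai_ms: float, floor_ms: float) -> float:
    """Fraction of the whole turn spent on the media path."""
    return floor_ms / (ai_ms + floor_ms)
```

Under these assumptions a 100 ms carrier jitter buffer gives a 190 ms floor and a 200 ms buffer gives 290 ms, which sits inside the 150–300 ms range quoted at the top of this part.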
Output pacing and the open barge-in question NET
What it is. The agent's outbound audio leaves in 40 ms chunks (4 × 10 ms default, not overridden anywhere in pipecat-agent) written every 20 ms — i.e. draining at 2× real-time. Asterisk's chan_websocket buffer (cap 1000 frames = 20 s) re-times the bursty stream into clean 20 ms RTP. Steady-state added latency: ~one frame; first audio waits for one 40 ms chunk fill.
Runtime effect. A 40 ms chunk every 20 ms means a 10 s TTS utterance is fully sent after ~5 s, at which point ~5 s of unplayed audio is still sitting in Asterisk's buffer. pipecat transports/base_transport.py:63, pipecat transports/websocket/fastapi.py:394,518-527.
Two findings worth flagging:
- src/serializers/kartik_serializer.py:81-87). On barge-in, pipecat clears its own queue, but audio already shipped to Asterisk (up to seconds, given the 2× drain) has no in-band flush on the live path. Whether Asterisk drops or plays it is not verifiable in these repos. If users report "the agent keeps talking after I interrupt" on Asterisk deployments, look here first.
- AsteriskWSServerTransport with its initial_jitter_buffer_ms: 80 has NO runtime consumer (src/transports/chan_ws_server.py:157; nothing imports it). The live path is the un-buffered FastAPIWebsocketTransport + Kartik serializer (bot.py:611-620). Do not present the 80 ms buffer as a live cost.
Try it — packetization & pacing visualizer
Inbound: 20 ms frames arrive at 50 pps. Outbound: the agent emits 40 ms chunks every 20 ms (2× real-time), so the Asterisk buffer fills during long TTS and drains at 1×. Press Barge-in mid-utterance: it clears the agent's queue but the Asterisk gauge stays full — the missing FLUSH_MEDIA made tangible.
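The fill-and-drain behaviour can be written down as a simple model. This is a sketch under stated assumptions: the function names are hypothetical, playback is assumed to start immediately (the one-chunk fill is ignored), and nothing flushes the downstream buffer, which is exactly the open question on the live path.

```python
FRAME_MS = 20
BUFFER_CAP_FRAMES = 1000  # chan_websocket cap quoted in the text


def backlog_s(elapsed_s: float, utterance_s: float,
              send_speed: float = 2.0, play_speed: float = 1.0) -> float:
    """Seconds of audio sent by the agent but not yet played out."""
    sent = min(utterance_s, elapsed_s * send_speed)
    played = min(sent, elapsed_s * play_speed)
    return sent - played


def stranded_on_barge_in_s(barge_in_s: float, utterance_s: float) -> float:
    """Audio already downstream when the caller interrupts at barge_in_s."""
    return backlog_s(barge_in_s, utterance_s)


def buffer_cap_s() -> float:
    """Buffer capacity expressed in seconds of audio."""
    return BUFFER_CAP_FRAMES * FRAME_MS / 1000


def peak_backlog_s(utterance_s: float, send_speed: float = 2.0) -> float:
    """Backlog at the moment the agent finishes sending."""
    return utterance_s * (1 - 1 / send_speed)
```

For the 10 s utterance in the text, `backlog_s(5, 10)` is the ~5 s peak. An interruption 2 s into the utterance finds about 2 s of audio already past the agent's queue, where clearing the queue cannot reach it.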
Setup latency ≠ turn latency NET
What it is. The big configured numbers all hit time-to-first-word, once per call — not per-turn latency. Don't confuse them.
| Cost | Where |
|---|---|
| +1,000 ms | dialplan Wait(1) between Answer() and Dial(WebSocket) — the largest configured fixed delay in the telephony setup path; remove once early-media races are confirmed solved. openshift/garanti/cpu/configmaps/asterisk-config.yaml:354-358 |
| +1,100 ms | WEBRTC_EARLY_MEDIA_DELAY before bot start on web calls. src/utils/general.py:33; bot.py:690-691 |
| 91–450 ms (0.4–44 s spike) | Twilio media-WS setup (cloud); spike root cause left open. notes/investigations/call-investigation-2026-03-03.md:393 |
| 63–443 ms (2,773 ms miss) | inbound config resolve (cached / cache-miss); 332–625 ms pipeline init. same investigation :386-396 |
Runtime effect. On telephony, the inbound webhook + /prefetch runs config resolution during the ring (src/routes/telephony.py:355-449,453-594), so most of this is masked. Web calls have no prefetch and pay the whole boot DAG at connect, plus the 1.1 s early-media delay.
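Because the dialplan Wait(1) is the largest configured delay, it helps to know where every such wait lives before removing any of them. The scanner below is a minimal sketch: `find_waits` is a hypothetical helper, and `SAMPLE` is an invented dialplan fragment for illustration, not the contents of the real asterisk-config.yaml.

```python
import re

CONTEXT_RE = re.compile(r"^\[([^\]]+)\]")            # [context-name]
WAIT_RE = re.compile(r"\bWait\(\s*([0-9.]+)\s*\)")   # Wait(1), not WaitExten(5)


def find_waits(dialplan_text: str) -> list[tuple]:
    """Return (context, line_number, seconds) for every Wait() call."""
    context = None
    hits = []
    for lineno, raw in enumerate(dialplan_text.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()  # drop ';' comments
        if not line:
            continue
        header = CONTEXT_RE.match(line)
        if header:
            context = header.group(1)
            continue
        for wait in WAIT_RE.finditer(line):
            hits.append((context, lineno, float(wait.group(1))))
    return hits


# Hypothetical dialplan fragment, for illustration only.
SAMPLE = """
[from-carrier]
exten => _X.,1,Answer()
 same => n,Wait(1)        ; early-media guard
 same => n,Dial(WebSocket/agent)
[internal]
exten => 100,1,WaitExten(5)
"""

if __name__ == "__main__":
    for context, lineno, seconds in find_waits(SAMPLE):
        print(f"{context}: line {lineno}: +{seconds * 1000:.0f} ms setup delay")
```

Each hit is a one-time cost on time-to-first-word for calls entering that context, not a per-turn cost.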
Web calls: ICE, TURN, and when media hairpins NET
What it is. Browser calls run SmallWebRTCTransport (bot.py:578-582); ICE servers come from the dashboard route built from TURN_PUBLIC_HOST/TURN_PORT/TURN_USERNAME/TURN_PASSWORD, falling back to public Google STUN only when unset. freya-dashboard src/app/api/token/webrtc/ice-servers/route.ts:8-35
- A second web-call transport exists. When context.transport_type == "web" with a room_url, the agent builds a DailyTransport instead (bot.py:592-607), routing media through Daily's SFU: an extra hop whose latency is not measurable from these repos (not verified in source). Which transport a web call uses is a deploy/dashboard choice, not determinable from source, so when debugging a slow web call, first establish whether it ran on SmallWebRTC or Daily.
- When media hairpins. It relays through coturn exactly when no direct/srflx candidate pair survives: symmetric NAT / strict corporate firewall, "i.e. every bank". coturn-webrtc guide.md:42-52
- On-prem the relay is a DMZ LAN hop, single-digit ms. The crucial fact: TURN misconfigs cause silence, not slowness.
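The ICE-server fallback described at the top of this section can be sketched as follows. This is a Python illustration of the idea only, not a port of the dashboard's route.ts: the function name is hypothetical, the rule that all four variables must be set is an assumption, and so is the `turn:host:port` URL shape. The returned dictionaries follow the standard WebRTC `RTCIceServer` fields (`urls`, `username`, `credential`).

```python
GOOGLE_STUN = "stun:stun.l.google.com:19302"  # public Google STUN server


def build_ice_servers(env: dict) -> list[dict]:
    """Build an ICE server list from TURN_* settings, else fall back to STUN."""
    host = env.get("TURN_PUBLIC_HOST")
    port = env.get("TURN_PORT")
    username = env.get("TURN_USERNAME")
    password = env.get("TURN_PASSWORD")

    # Assumption: any missing value counts as "unset" and triggers the fallback.
    if not (host and port and username and password):
        return [{"urls": [GOOGLE_STUN]}]

    return [{
        "urls": [f"turn:{host}:{port}"],  # assumed URL shape
        "username": username,
        "credential": password,
    }]
```

With STUN only, a caller behind symmetric NAT has no relay candidate, which is the silent-call case rather than a slow one.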
Try it — ICE path picker
Pick the NAT type on each side and whether a TURN relay is configured. The picker runs the candidate-pair priority logic (host → srflx → relay) and shows which path media takes — including the silent-call failure when strict NAT meets no relay.
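The picker's outcome can be approximated with a lookup. This is a deliberately coarse sketch: real ICE runs connectivity checks on every candidate pair rather than consulting a table, and the `Nat` enum, the `pick_path` function, and the rule "any symmetric NAT defeats srflx" are simplifying assumptions made for illustration.

```python
from enum import Enum


class Nat(Enum):
    NONE = "none"            # directly reachable, no NAT in the way
    CONE = "cone"            # NAT that keeps a stable public mapping
    SYMMETRIC = "symmetric"  # per-destination mapping / strict firewall


def pick_path(a: Nat, b: Nat, turn_configured: bool) -> str:
    """Return the first path type that survives, in host, srflx, relay order."""
    if a is Nat.NONE and b is Nat.NONE:
        return "host"
    if Nat.SYMMETRIC not in (a, b):
        return "srflx"
    if turn_configured:
        return "relay"
    return "no-media"  # call connects but stays silent
```

The last branch is the failure worth remembering: strict NAT with no working relay produces a silent call, not a slow one.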
Geography: the cloud round trip and what removing it revealed NET
What it is. On EU cloud (eu-central-1), every audio frame of a Turkish caller crosses TR↔Frankfurt twice per turn. No measured TR↔Frankfurt RTT exists in source (not verified in source).
The indirect evidence. When Garanti moved to a TR on-prem pod, total turn latency dropped to 0.49–0.60 s — fast enough that the agent started interrupting callers mid-digit-dictation; "cloud models were slow enough to mask this". notes/customers/garanti/current-state.md:104-108
What dominates here: nothing tunable per-turn — it's a fixed ~150–300 ms floor. The actionable items are the one-time setup delays (Wait(1), early-media delay) and knowing the floor exists so you stop hunting for it in traces.
Checkpoint: a customer reports "+latency" on production calls; you pull the Langfuse trace and every span is healthy — total span time ~1.1 s, but callers swear it's 1.5 s+. Where did the missing 400 ms go?
The media path: ~20 ms packetization × 2, carrier/SBC jitter buffers (100–200 ms), Asterisk relay + transcode, the 40 ms output chunk fill, and any geographic RTT — none of which is instrumented in the trace (debug-call-audio references/latency.md:122-126).
To see the true gap, measure audio-to-audio with the separated recording tracks (Part 12), not span timestamps.
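A naive version of that audio-to-audio measurement is sketched below. It assumes the two tracks are time-aligned lists of 16-bit PCM samples at 8 kHz, which may not match the recording format covered in Part 12, and it uses a fixed energy threshold where a real measurement would use a proper VAD. The function names and the threshold value are hypothetical.

```python
def frame_levels(samples, sample_rate=8000, frame_ms=20):
    """Mean absolute amplitude of each full frame."""
    n = sample_rate * frame_ms // 1000
    return [sum(abs(s) for s in samples[i:i + n]) / n
            for i in range(0, len(samples) - n + 1, n)]


def response_gap_ms(caller, agent, sample_rate=8000, frame_ms=20, threshold=500):
    """Gap between the caller going quiet and the agent's first audio.

    Returns None when either edge cannot be found. Treats the first
    silent frame after speech as end of utterance, with no hangover.
    """
    caller_voiced = [lvl > threshold
                     for lvl in frame_levels(caller, sample_rate, frame_ms)]
    agent_voiced = [lvl > threshold
                    for lvl in frame_levels(agent, sample_rate, frame_ms)]
    if True not in caller_voiced:
        return None
    start = caller_voiced.index(True)
    try:
        end = caller_voiced.index(False, start)
    except ValueError:
        return None  # caller never stops talking in this track
    for i in range(end, len(agent_voiced)):
        if agent_voiced[i]:
            return (i - end) * frame_ms
    return None


# Synthetic tracks: caller speaks for 1 s, agent answers at 2.5 s.
caller_track = [1000] * 8000 + [0] * 16000
agent_track = [0] * 20000 + [1000] * 4000
gap_ms = response_gap_ms(caller_track, agent_track)
uninstrumented_ms = gap_ms - 1100  # subtract the span total from the trace
```

On these synthetic tracks the measured gap is 1500 ms, so subtracting the 1.1 s of healthy spans leaves the 400 ms the checkpoint asks about.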