Every fix is downstream of a measurement
Every lever in Parts 2–11 is downstream of a measurement. The platform gives you two complementary instruments; the discipline of this part is learning which question each answers, and not wasting an hour looking in the wrong place.
What dominates here: discipline. The instruments already exist; wrong-instance lookups, double-counted spans, and idle-metrics exoneration are how a 30-minute diagnosis becomes a day.
Measure before fixing: the two instruments method
- Audio-anchored 5-stage timeline. debug-call-audio's latency.py reconstructs, per turn: user_stop_ts → vad_end_ts → stt_final_ts → llm_first_tok → tts_first_aud from the separated audio tracks plus the trace (freya-skills plugins/freya/skills/debug-call-audio/scripts/latency.py:8-26). Answers which stage is slow, and it is the only instrument that measures the true caller-perceived gap.
- Trace-anchored serving decomposition. The latency-analyzer skill: per-call-type table, prefill/decode verdict, eviction curve, serving-config math, optional live probes. Answers why the LLM stage is slow.
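As a rough illustration of how the five timestamps answer "which stage is slow", the sketch below turns one turn's timeline into per-boundary gaps. It is not the code in latency.py: the `TurnTimeline` class, the gap labels, and the sample values are assumptions made for this example; only the five timestamp names come from the guide.

```python
from dataclasses import dataclass


@dataclass
class TurnTimeline:
    """Five audio-anchored timestamps for one turn, in seconds (hypothetical container)."""
    user_stop_ts: float
    vad_end_ts: float
    stt_final_ts: float
    llm_first_tok: float
    tts_first_aud: float


def stage_gaps(t: TurnTimeline) -> dict[str, float]:
    """Gap between each pair of consecutive timestamps, plus the full perceived gap."""
    return {
        "user_stop->vad_end": t.vad_end_ts - t.user_stop_ts,
        "vad_end->stt_final": t.stt_final_ts - t.vad_end_ts,
        "stt_final->llm_first_tok": t.llm_first_tok - t.stt_final_ts,
        "llm_first_tok->tts_first_aud": t.tts_first_aud - t.llm_first_tok,
        # What the caller actually hears: end of their speech to first bot audio.
        "perceived_gap": t.tts_first_aud - t.user_stop_ts,
    }


def slowest_stage(t: TurnTimeline) -> str:
    """Name of the largest single gap, ignoring the overall perceived gap."""
    gaps = stage_gaps(t)
    gaps.pop("perceived_gap")
    return max(gaps, key=gaps.get)


# Made-up sample values, chosen only to exercise the functions.
turn = TurnTimeline(10.0, 10.75, 11.0, 12.5, 12.75)
print(stage_gaps(turn))
print(slowest_stage(turn))
```

The per-boundary gaps sum to the perceived gap, which is what makes this timeline safe from the double-counting that trace spans invite.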
The Langfuse instance gotcha (the #1 time-waster)
What it is. Every call is a Langfuse session keyed on the call_id (the trace id is different — lookup is session → trace → observations). Freya runs multiple Langfuse instances, resolved from the environment, never from the URL printed in the call record.
| Dashboard host | Langfuse instance |
|---|---|
| staging-app.freyavoice.ai / app.freyavoice.ai | cloud (cloud.langfuse.com) |
| tr-app.freyavoice.ai | kkb self-hosted (kkb-langfuse.freyavoice.ai) |

(references/langfuse-instances.md:3-19)

The dashboard can print a cloud.langfuse.com session URL even for an on-prem call; that URL is wrong. The bundled langfuse-obs MCP is hardwired to cloud, so "trace not found" for a tr-app call means wrong instance, not no trace. Auth is HTTP Basic pk:sk; Cloudflare 403s the default urllib User-Agent on *.freyavoice.ai; cloud retention is ~30 days.

Symptom it causes: "trace not found" on a real, recent call; almost always you queried cloud for a tr-app call. The fix is to resolve the instance from the dashboard host, not the printed URL.
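The host-to-instance rule can be sketched as a lookup that ignores any printed Langfuse URL and builds an authenticated request against the right instance. This is an illustrative sketch, not the fetch script: the function names, the example call URL path, and the User-Agent string are hypothetical, and the `/api/public/sessions/{id}` path is an assumption about the Langfuse public API that should be checked against the instance you query. The request is only constructed here, never sent.

```python
import base64
import urllib.request
from urllib.parse import urlparse

# Dashboard host -> Langfuse base URL, copied from the table above.
LANGFUSE_BY_DASHBOARD_HOST = {
    "staging-app.freyavoice.ai": "https://cloud.langfuse.com",
    "app.freyavoice.ai": "https://cloud.langfuse.com",
    "tr-app.freyavoice.ai": "https://kkb-langfuse.freyavoice.ai",
}


def resolve_langfuse_base(call_url: str) -> str:
    """Pick the Langfuse instance from the dashboard host of the call URL."""
    host = urlparse(call_url).hostname
    try:
        return LANGFUSE_BY_DASHBOARD_HOST[host]
    except KeyError:
        raise ValueError(f"unknown dashboard host: {host!r}") from None


def session_request(call_url: str, call_id: str, public_key: str, secret_key: str):
    """Build (but do not send) a session lookup with HTTP Basic pk:sk auth."""
    base = resolve_langfuse_base(call_url)
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        f"{base}/api/public/sessions/{call_id}",  # assumed API path
        headers={
            "Authorization": f"Basic {token}",
            # Explicit User-Agent, since the default urllib one gets a 403.
            "User-Agent": "latency-debug/0.1",  # hypothetical value
        },
    )


req = session_request("https://tr-app.freyavoice.ai/calls/abc", "abc", "pk", "sk")
print(req.full_url)
```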
The observation-name dictionary
What it is. You cannot read a trace without this map (emitters in pipecat-agent). Several spans are renamed at export or are two names for one wall-clock — this is where double-counting starts.
| Observation | What it measures | Source |
|---|---|---|
| turn (attr turn.user_bot_latency_seconds) | the felt gap: VAD-stop → bot audio start | utils/tracing/turn_trace_observer.py:194; observers/user_bot_latency_observer.py:87-88 |
| vad | the speech segment including user speech; NOT pure endpointing | src/exts/diagnostics/pipeline_monitor.py:316-328 |
| workflow.routing.batch | the whole extract+route DAG wall-clock (contains extraction) | src/core/workflow/processor.py:1258-1261 |
| workflow.intent_matching | the single batched intent-leaf LLM call (child of routing.batch: same wall-clock, two names) | src/core/workflow/router.py:1137 |
| workflow.extraction / .batch / _retry | extraction LLM calls / fan-out wall / retry | src/core/workflow/extractor.py:215-216,362,449 |
| llm.<node> / tts.<node> / stt.<node> | speaking LLM / TTS / STT, renamed at export by appending workflow.node | src/core/tracing.py:110-111,222-227 |
| tts.text_aggregation | LLM first token → first full sentence (the TTFS span) | src/exts/tts/freya_tts_v2.py:271-277 |
| llm_warmup.node_entry / .router / .extractor | background prefix primes, off the critical path | processor.py:2883-2895,2945-2991 |

turn.user_bot_latency_seconds ≈ (waitSeconds/STT overlap) + workflow.routing.batch + llm.<node> TTFB + tts.text_aggregation + tts TTFB. Because routing.batch already contains extraction, never add workflow.extraction.batch on top.
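To make the decomposition rule concrete, here is a small sketch that sums only the components the rule names and compares the result with a naive sum over every span. The dictionary keys `wait_stt_overlap`, `llm.ttfb`, and `tts.ttfb` are hypothetical labels for this example, and the durations are invented.

```python
# Components of the felt turn, per the decomposition rule.
FELT_TURN_COMPONENTS = (
    "wait_stt_overlap",        # hypothetical key for waitSeconds/STT overlap
    "workflow.routing.batch",  # already contains extraction and intent matching
    "llm.ttfb",                # hypothetical key for llm.<node> TTFB
    "tts.text_aggregation",
    "tts.ttfb",                # hypothetical key for tts TTFB
)


def felt_turn_seconds(spans: dict[str, float]) -> float:
    """Sum only the spans that belong to the decomposition rule."""
    return sum(spans.get(name, 0.0) for name in FELT_TURN_COMPONENTS)


def naive_sum_seconds(spans: dict[str, float]) -> float:
    """The double-counting mistake: add every span in the trace."""
    return sum(spans.values())


# Invented durations in seconds.
spans = {
    "wait_stt_overlap": 0.5,
    "workflow.routing.batch": 0.75,
    "workflow.intent_matching": 0.75,   # same wall-clock as routing.batch
    "workflow.extraction.batch": 0.25,  # already inside routing.batch
    "llm.ttfb": 0.5,
    "tts.text_aggregation": 0.25,
    "tts.ttfb": 0.25,
}

correct = felt_turn_seconds(spans)
inflated = naive_sum_seconds(spans)
print(f"felt turn: {correct:.2f} s, naive sum: {inflated:.2f} s, "
      f"inflation: {inflated - correct:.2f} s")
```

With these numbers the naive sum overstates the turn by exactly the two trap spans, intent_matching and extraction.batch.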
The layered diagnosis method
What it is. The latency-analyzer recipe (SKILL.md:53-72) — run in order, stop when one layer fully explains the latency.
- Layer 0: fetch from the RIGHT instance (fetch_latency_data.py "<call-url>"; main trace = most observations).
- Layer 2: latency_breakdown.py prints the three core artifacts:
  - the per-call-type table (find the dominant Σ time);
  - the verdict: PREFILL-BOUND if median input ≥8000 tok AND output ≤200, DECODE-BOUND if output ≥300, else mixed/contention;
  - the eviction curve for the dominant speaking node (warm threshold default 1.5 s; a heuristic, tune it for slower boxes). Reading: warm throughout = cache fine; warm-then-cold-stays-cold = classic eviction (no mid-call re-warm); cold from turn 1 = never warmed.
- Layer 3: serving_config.py --endpoint <bare base URL, NO /v1> pulls /v1/models + vLLM /metrics and runs the Part 6 capacity math. Pasting the agent's LLM_URL env var verbatim yields /v1/v1/models.
- Layer 4 (optional): prompt composition. Needs a dashboard FAK matching the call's workspace; without it, stop and say so: the Layer 0–3 diagnosis is complete (exactly what the eb4a83f7 report did for workspace 69016551).
- Layer 5 (opt-in only): live probes send real production load; never without explicit OK. Cold/warm/concurrent modes isolate prefill cost, cache savings, and contention. Caveat: an idle-box contention probe reading 1–2× inflation does NOT exonerate the live call; the ~4× collisions happen under campaign load a 3-request probe cannot recreate. The trace eviction curve remains the authoritative record.
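The Layer 3 endpoint pitfall is easy to guard against before calling the script. The helper below is a hypothetical pre-processing step, not part of serving_config.py: it strips a trailing /v1 from an LLM_URL-style value so that the /v1/models and /metrics paths are appended to a bare base URL. The host name is invented.

```python
def bare_base_url(llm_url: str) -> str:
    """Strip whitespace, trailing slashes, and one trailing /v1 segment."""
    url = llm_url.strip().rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url


def models_url(llm_url: str) -> str:
    """Model listing endpoint built from the bare base."""
    return bare_base_url(llm_url) + "/v1/models"


def metrics_url(llm_url: str) -> str:
    """vLLM metrics endpoint, served at the root rather than under /v1."""
    return bare_base_url(llm_url) + "/metrics"


# Hypothetical LLM_URL value as an agent might have it configured.
llm_url = "http://llm.internal:8000/v1"
print(bare_base_url(llm_url))
print(models_url(llm_url))
print(metrics_url(llm_url))
```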
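The Layer-2 verdict and the eviction-curve reading can be written down as two small functions. This is a sketch of the stated thresholds, not the logic inside latency_breakdown.py. Two assumptions are made here: the output-token thresholds are applied to the median output, and a turn counts as warm when its TTFB is at or below the threshold. Patterns the guide does not name are reported as unclassified rather than guessed.

```python
from statistics import median


def verdict(input_tokens: list[int], output_tokens: list[int]) -> str:
    """Prefill/decode verdict from the dominant node's token counts."""
    med_in = median(input_tokens)
    med_out = median(output_tokens)  # assumption: median, like the input side
    if med_in >= 8000 and med_out <= 200:
        return "PREFILL-BOUND"
    if med_out >= 300:
        return "DECODE-BOUND"
    return "mixed/contention"


def eviction_reading(ttfb_seconds: list[float], warm_threshold: float = 1.5) -> str:
    """Classify the per-turn TTFB curve into one of the three named patterns."""
    if not ttfb_seconds:
        raise ValueError("need at least one turn")
    warm = [t <= warm_threshold for t in ttfb_seconds]
    if all(warm):
        return "warm throughout: cache fine"
    if not any(warm):
        return "cold from turn 1: never warmed"
    first_cold = warm.index(False)
    if first_cold > 0 and not any(warm[first_cold:]):
        return "warm then cold, stays cold: classic eviction"
    return "unclassified pattern"


# Invented sample data.
print(verdict([9000, 8500, 12000], [120, 150, 90]))
print(eviction_reading([0.4, 0.5, 3.2, 3.5]))
```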
Separated recording tracks: the prerequisite for true-gap measurement
What it is. GET /api/v2/call/{callId}/recording?track=user|assistant — track is the only recognized parameter (freya-dashboard src/app/api/v2/call/[callId]/recording/route.ts:38-47); no track, or wrong params like channel=, fall through to the same mixed file.
The fetch_call.py helper downloads user.wav, assistant.wav, mixed.wav plus call/transcript/workflow/agent JSON in one pass.

Symptom it fixes: a latency analysis that cannot locate the true caller-perceived gap because echo and speech are tangled in one waveform. Always pull the separated tracks before timing endpointing.
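Because unrecognized parameters silently fall through to the mixed file, it helps to build the recording URL through a function that refuses anything other than the two documented track values. The sketch below is illustrative and separate from fetch_call.py; the function name and the dashboard base URL passed in are assumptions.

```python
from typing import Optional
from urllib.parse import quote, urlencode

VALID_TRACKS = ("user", "assistant")


def recording_url(dashboard_base: str, call_id: str, track: Optional[str] = None) -> str:
    """Build the recording URL; track=None requests the mixed file."""
    url = f"{dashboard_base.rstrip('/')}/api/v2/call/{quote(call_id, safe='')}/recording"
    if track is None:
        return url
    if track not in VALID_TRACKS:
        # Fail loudly instead of silently receiving the mixed recording.
        raise ValueError(f"track must be one of {VALID_TRACKS}, got {track!r}")
    return f"{url}?{urlencode({'track': track})}"


# Hypothetical base URL and call id.
for track in (None, "user", "assistant"):
    print(recording_url("https://app.freyavoice.ai", "abc123", track))
```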
Reading gotchas and statistical hygiene
What it is. The traps that produce wrong diagnoses:
- Idle /metrics counters ≈ 0 prove nothing: they reflect the probe instant (Part 6, Step 32).
- Never sum workflow.intent_matching + workflow.routing.batch: they are parent/child spans of one routing pass; summing double-counts.
- input_tokens: 0 rows = Langfuse usage-logging gaps; ignore for warm/cold reads.
- Cumulative prefix hit rate hides cold big prompts (86.5% during three cold turns).
- The vad span includes user speech: its 1.7–4.5 s rows are not dead air; use the audio-anchored timeline for true endpointing.
- timeToFirstToken is often absent, and when present sometimes in nanoseconds (divide by 1e6 when >1e6).
- Hygiene rules: no latency stats on <3-turn calls; a turn is "slow" only if total >2× the call's own median AND >2 s absolute; a dominant segment must be ≥40% of the turn's total, else label it "general".
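The hygiene rules translate directly into code. The sketch below applies the three stated thresholds; the function names, the segment labels, and the sample numbers are invented for illustration.

```python
from statistics import median
from typing import Optional


def slow_turns(turn_totals: list[float]) -> Optional[list[int]]:
    """Indices of slow turns, or None when the call is too short for stats."""
    if len(turn_totals) < 3:
        return None  # no latency stats on <3-turn calls
    med = median(turn_totals)
    # Slow only if BOTH conditions hold: >2x the call's median and >2 s absolute.
    return [i for i, t in enumerate(turn_totals) if t > 2 * med and t > 2.0]


def dominant_segment(segments: dict[str, float]) -> str:
    """Largest segment if it is at least 40% of the turn, else 'general'."""
    total = sum(segments.values())
    if total <= 0:
        return "general"
    name, value = max(segments.items(), key=lambda kv: kv[1])
    return name if value / total >= 0.40 else "general"


# Invented per-turn totals (seconds) and one turn's segment breakdown.
print(slow_turns([1.0, 1.2, 4.0, 1.1]))
print(dominant_segment({"llm": 1.5, "tts": 0.5, "routing": 0.5}))
```

Requiring both the relative and the absolute condition keeps a uniformly fast call from producing "slow" turns that are only slow relative to a tiny median.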
timeToFirstToken arrives in seconds, milliseconds, or nanoseconds depending on the row. The rule of thumb: if the value is > 1e6, it is nanoseconds; divide by 1e6 to get milliseconds. Applying it blindly to a value already in ms turns 1.4 s into 0.0000014 s.
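A minimal normalizer for the nanosecond rule might look like the following. It implements only the stated rule: values above 1e6 are treated as nanoseconds and converted to milliseconds. Values at or below the threshold are returned unchanged, because the rule does not say whether such a value is in seconds or milliseconds; the function name is hypothetical.

```python
from typing import Optional

NS_THRESHOLD = 1e6  # above this, the raw value is assumed to be nanoseconds


def ttft_by_guide_rule(raw: Optional[float]) -> Optional[float]:
    """Apply the >1e6 rule; leave absent or smaller values untouched."""
    if raw is None:
        return None  # timeToFirstToken is often absent
    if raw > NS_THRESHOLD:
        return raw / 1e6  # nanoseconds -> milliseconds
    # Could be seconds or milliseconds; do not divide again.
    return raw


print(ttft_by_guide_rule(1.4e9))  # nanosecond row, converted to ms
print(ttft_by_guide_rule(1400))   # left as-is
print(ttft_by_guide_rule(None))
```

Guarding the division with the threshold is what prevents the blind-application error, where an already-converted value gets divided a second time.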
The symptom → fix playbook
What it is. One complaint per row, the likely stage, the first thing to check, and the single lever that moves it (with the part that explains the lever).
| Symptom | Likely stage | First check | Fix lever (part) |
|---|---|---|---|
| Constant ~1 s pause before every answer, all turns alike | endpointing timers | agent's vadStopSecs + waitSeconds | lower Stop Delay / Wait seconds within the cutoff trade-off (Part 2) |
| Mid-sentence cutoffs after lowering timers | endpointing | fragmentation rate in transcript | raise stop_secs back, or Smart Turn for pause-heavy flows (Part 2) |
| ~3 s dead air then "could you repeat" | STT failure path | finalizeTimeout signature; STT server health/GPU contention | fix STT serving, not the LLM (Part 3) |
| Bot talks over interrupting callers for seconds | barge-in path | batch STT + numberOfWords ≥1? | numberOfWords 0 or streaming STT (Part 2) |
| Some turns instant, some 3–4 s, same node | prefix-cache eviction | eviction curve (Layer 2) | shrink prompt; FP8 KV; admission control (Parts 6, 8) |
| Every turn in one node slow, others fine | cold prefix / no warmup | does the node have an entry message? llm_warmup.node_entry present? | add entry message; merge nodes (Parts 5, 8) |
| Every turn slow, prompt huge, input ≥8K out ≤200 | PREFILL-BOUND | Layer-2 verdict | shrink prompt (the lever board, Part 6) |
| Long replies feel slow despite fast TTFT | DECODE-BOUND / sentence shape | output tokens; tts.text_aggregation | shorten spoken turns; short first sentences (Parts 5, 9) |
| Routing >2 s with fat tail | GPU contention | warmup collisions; num_preemptions | reduce warmup pressure = shrink prompts; never warm harder (Part 5) |
| New customers slower than repeat customers (cloud) | cross-call cache | cached-token % per call | move dynamic vars to a trailing context block (Part 7) |
| Cloud agent 2.4 s TTFT after a model swap | reasoning default | reasoning_effort in request, service_tier in response | pin reasoning to none/minimal (Part 7) |
| Dead air mid-flow on a specific node | blocking api_call | node pre_actions + vendor RTT | entry message / pre-tool speech / non-blocking (Part 11) |
| Trace clean but caller feels lag | media path | nothing — it's not in the trace | accept the 150–300 ms floor; check carrier jitter, setup delays (Part 10) |
| Slow time-to-first-hello on web calls only | setup path | no prefetch + 1.1 s early-media delay | static first message; trim boot work (Parts 10, 11) |
| "Agent keeps talking after I interrupt" (Asterisk) | shipped-audio buffer | the missing FLUSH_MEDIA on the live serializer | open engineering question — escalate, don't tune (Part 10) |
The no-runtime-consumer list (latency edition)
What it is. Knobs that look like latency levers but do nothing today (verified against pipecat-agent@dev 5b29206c unless noted).
| Knob | Status |
|---|---|
| aicEnabled (AIC Filter) | hard-disabled ("crate instability", base_service.py:1467-1468) |
| webrtcApmEnabled | merged then reverted (commit dc90fd7f, 2026-04-10); ImportError guard means it never loads on dev |
| stopSpeakingPlan.voiceSeconds / backOffSeconds | no runtime consumer |
| userTurnStopTimeout slider (voice mode) | floored to ≥30 s at runtime; slider effectively decorative |
| timeout_ms / max_retries on LLM-invoked tools | parsed, never consumed (only node actions honor them) |
| TTS minTextLength / minWords | read and logged, passed to nothing |
| speed on the native VoxCPM2 server | "accepted for compatibility; not yet applied" |
| LanguageTextFilter langdetect layer | dead code behind an early return, deliberately |
| AsteriskWSServerTransport (initial_jitter_buffer_ms: 80) | whole class has no runtime consumer; live path is un-buffered |
| prompt_cache_key | not sent anywhere in production code; measured worse |
| serviceTier on non-OpenAI / on-prem providers | no runtime consumer (OpenAI branch only) |
| freya-235b-enhanced | removed (PR #803) + catalog-collapsed; stale test fixtures only |
| DTMF ignore_speech / ignore_speech_timeout | defined; consumer not traced, possibly unwired (not verified either way) |
Checkpoint: you're handed a tr-app call URL and the complaint "it answers late" (Turkish original: "geç cevap veriyor"). Name the first three commands/artifacts you produce, in order, and the one mistake that would waste an hour.
1. fetch_latency_data.py "<call-url>", which resolves tr-app → the kkb Langfuse instance. The hour-wasting mistake is querying cloud.langfuse.com because the dashboard printed a cloud URL; "not found" there means wrong instance, not no trace.
2. latency_breakdown.py on the observations: per-call-type table, prefill/decode verdict, eviction curve.
3. If the verdict is PREFILL-BOUND, serving_config.py --endpoint <bare base URL> for the KV-capacity math. Then, only if needed and with explicit approval, live probes.