Part 12: Measurement, and the symptom → fix playbook

Framing

Every fix is downstream of a measurement

Every lever in Parts 2–11 is downstream of a measurement. The platform gives you two complementary instruments; the discipline of this part is learning which question each answers, and not wasting an hour looking in the wrong place.

What dominates here: discipline. The instruments already exist; wrong-instance lookups, double-counted spans, and idle-metrics exoneration are how a 30-minute diagnosis becomes a day.

Step 61

Measure before fixing: the two instruments method

Audio-anchored 5-stage timeline. debug-call-audio's latency.py reconstructs, per turn: user_stop_ts → vad_end_ts → stt_final_ts → llm_first_tok → tts_first_aud from the separated audio tracks plus the trace (freya-skills plugins/freya/skills/debug-call-audio/scripts/latency.py:8-26). Answers which stage is slow — and it is the only instrument that measures the true caller-perceived gap.
Trace-anchored serving decomposition. The latency-analyzer skill: per-call-type table, prefill/decode verdict, eviction curve, serving-config math, optional live probes. Answers why the LLM stage is slow.

They compose, not compete. The targets table from the intro is the shared yardstick for both. Use the audio-anchored timeline to locate the slow stage, then the trace-anchored decomposition to explain it when the LLM is the culprit.
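As a rough illustration of what the audio-anchored timeline yields, the sketch below turns the five per-turn timestamps into per-stage gaps and picks the slowest stage. It is not latency.py itself: the class, the function names, and the stage labels (`endpointing`, `stt`, `llm`, `tts`) are hypothetical, and the timestamps are assumed to be seconds on one shared clock.

```python
from dataclasses import dataclass


@dataclass
class TurnTimeline:
    # Field names follow the five timestamps listed above; seconds on one clock (assumed).
    user_stop_ts: float
    vad_end_ts: float
    stt_final_ts: float
    llm_first_tok: float
    tts_first_aud: float


def stage_gaps(t: TurnTimeline) -> dict[str, float]:
    """Gap between consecutive timestamps, plus the caller-perceived total."""
    return {
        "endpointing": t.vad_end_ts - t.user_stop_ts,   # hypothetical stage label
        "stt": t.stt_final_ts - t.vad_end_ts,           # hypothetical stage label
        "llm": t.llm_first_tok - t.stt_final_ts,        # hypothetical stage label
        "tts": t.tts_first_aud - t.llm_first_tok,       # hypothetical stage label
        "perceived_gap": t.tts_first_aud - t.user_stop_ts,
    }


def slowest_stage(gaps: dict[str, float]) -> str:
    """Name of the stage with the largest gap, ignoring the total."""
    stages = {k: v for k, v in gaps.items() if k != "perceived_gap"}
    return max(stages, key=stages.get)


# Made-up turn: the gap between STT final and first LLM token dominates.
turn = TurnTimeline(10.0, 10.5, 10.75, 12.25, 12.5)
gaps = stage_gaps(turn)
print(slowest_stage(gaps), gaps["perceived_gap"])
```

If the slowest stage is the LLM one, that is the cue to switch to the trace-anchored decomposition.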

Step 62

The Langfuse instance gotcha (the #1 time-waster) gotcha

What it is. Every call is a Langfuse session keyed on the call_id (the trace id is different — lookup is session → trace → observations). Freya runs multiple Langfuse instances, resolved from the environment, never from the URL printed in the call record.

Dashboard host	Langfuse instance
`staging-app.freyavoice.ai` / `app.freyavoice.ai`	cloud (`cloud.langfuse.com`)
`tr-app.freyavoice.ai`	kkb self-hosted (`kkb-langfuse.freyavoice.ai`)

The traps (latency-analyzer references/langfuse-instances.md:3-19). The dashboard can print a cloud.langfuse.com session URL even for an on-prem call; that URL is wrong. The bundled langfuse-obs MCP is hardwired to cloud, so "trace not found" for a tr-app call means wrong instance, not no trace. Auth is HTTP Basic with pk:sk. Cloudflare returns 403 for the default urllib User-Agent on *.freyavoice.ai, so send an explicit one. Cloud retention is ~30 days.

Symptom it causes: "trace not found" on a real, recent call — almost always you queried cloud for a tr-app call. The fix is to resolve the instance from the dashboard host, not the printed URL.
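A minimal sketch of the resolution step, assuming the call URL is a dashboard URL whose host is one of the three in the table. The function names, the example call path, and the User-Agent string are hypothetical; only the host mapping and the Basic pk:sk scheme come from the text above.

```python
import base64
from urllib.parse import urlparse

# Dashboard host -> Langfuse host, copied from the table above.
LANGFUSE_BY_DASHBOARD = {
    "staging-app.freyavoice.ai": "cloud.langfuse.com",
    "app.freyavoice.ai": "cloud.langfuse.com",
    "tr-app.freyavoice.ai": "kkb-langfuse.freyavoice.ai",
}


def resolve_langfuse_host(call_url: str) -> str:
    """Pick the Langfuse instance from the dashboard host, never from a printed URL."""
    host = urlparse(call_url).hostname
    try:
        return LANGFUSE_BY_DASHBOARD[host]
    except KeyError:
        raise ValueError(f"unknown dashboard host: {host!r}") from None


def langfuse_headers(public_key: str, secret_key: str) -> dict[str, str]:
    """HTTP Basic pk:sk plus an explicit User-Agent (the urllib default gets a 403)."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "Authorization": f"Basic {token}",
        "User-Agent": "latency-debug/0.1",  # hypothetical value; any non-default string
    }


# The path below is made up; only the host matters for resolution.
print(resolve_langfuse_host("https://tr-app.freyavoice.ai/calls/example-call-id"))
```

Failing loudly on an unknown host is a deliberate choice here: silently defaulting to cloud would reproduce the trap.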

Step 63

The observation-name dictionary LLM

What it is. You cannot read a trace without this map (emitters in pipecat-agent). Several spans are renamed at export or are two names for one wall-clock — this is where double-counting starts.

Observation	What it measures	Source
`turn` (attr `turn.user_bot_latency_seconds`)	the felt gap: VAD-stop → bot audio start	`utils/tracing/turn_trace_observer.py:194`; `observers/user_bot_latency_observer.py:87-88`
`vad`	the speech segment including user speech — NOT pure endpointing	`src/exts/diagnostics/pipeline_monitor.py:316-328`
`workflow.routing.batch`	the whole extract+route DAG wall-clock (contains extraction)	`src/core/workflow/processor.py:1258-1261`
`workflow.intent_matching`	the single batched intent-leaf LLM call (child of routing.batch — same wall-clock, two names)	`src/core/workflow/router.py:1137`
`workflow.extraction` / `.batch` / `_retry`	extraction LLM calls / fan-out wall / retry	`src/core/workflow/extractor.py:215-216,362,449`
`llm.<node>` / `tts.<node>` / `stt.<node>`	speaking LLM / TTS / STT, renamed at export by appending `workflow.node`	`src/core/tracing.py:110-111,222-227`
`tts.text_aggregation`	LLM first token → first full sentence (the TTFS span)	`src/exts/tts/freya_tts_v2.py:271-277`
`llm_warmup.node_entry` / `.router` / `.extractor`	background prefix primes — off the critical path	`processor.py:2883-2895,2945-2991`

Decomposition rule

turn.user_bot_latency_seconds ≈ (waitSeconds/STT overlap) + workflow.routing.batch + llm.<node> TTFB + tts.text_aggregation + tts TTFB

Because routing.batch already contains extraction, never add workflow.extraction.batch on top.

Try it: Decompose a turn (and avoid double-counting)

A worked example of turn.user_bot_latency_seconds built from its component spans. The felt turn sums the components per the decomposition rule.

Span	Wall-clock
waitSeconds / STT overlap	400 ms
`workflow.routing.batch` (contains extraction)	1200 ms
`workflow.extraction.batch` (inside routing, not added)	600 ms
`llm.<node>` TTFB	500 ms
`tts.text_aggregation` + TTS TTFB	350 ms

Felt turn = 400 + 1200 + 500 + 350 = 2450 ms. The two classic double-count traps inflate that number. Adding extraction.batch on top gives 3050 ms, although routing.batch already contains extraction. Adding intent_matching on top counts the routing pass a second time, because it is the same pass as routing.batch.
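The decomposition rule can also be written as a small function that sums only the additive spans and deliberately ignores the nested ones. This is a sketch: the dictionary keys are hypothetical labels for illustration, not fields of any real export, and the values are the example figures from this step.

```python
def felt_turn_ms(spans: dict[str, float]) -> float:
    """Sum only the spans the decomposition rule names as additive."""
    additive = (
        "wait_stt_overlap",            # hypothetical key: waitSeconds / STT overlap
        "workflow.routing.batch",      # already contains extraction and intent matching
        "llm.ttfb",                    # hypothetical key: llm.<node> TTFB
        "tts.aggregation_plus_ttfb",   # hypothetical key: tts.text_aggregation + TTS TTFB
    )
    return sum(spans[name] for name in additive)


spans = {
    "wait_stt_overlap": 400,
    "workflow.routing.batch": 1200,
    "workflow.extraction.batch": 600,  # nested inside routing.batch: never added
    "llm.ttfb": 500,
    "tts.aggregation_plus_ttfb": 350,
}

correct = felt_turn_ms(spans)
double_counted = correct + spans["workflow.extraction.batch"]  # the WRONG sum
print(correct, double_counted)  # 2450 3050
```

Keeping the additive list explicit means a nested span can sit in the input without ever leaking into the total.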

Step 64

The layered diagnosis method LLM

What it is. The latency-analyzer recipe (SKILL.md:53-72) — run in order, stop when one layer fully explains the latency.

Layer 0 — fetch from the RIGHT instance (fetch_latency_data.py "<call-url>"; main trace = most observations).
Layer 2 — latency_breakdown.py prints the three core artifacts: per-call-type table (find the dominant Σ time); the verdict — PREFILL-BOUND if median input ≥8000 tok AND output ≤200, DECODE-BOUND if output ≥300, else mixed/contention; the eviction curve for the dominant speaking node (warm threshold default 1.5 s — a heuristic, tune it for slower boxes). Reading: warm throughout = cache fine; warm-then-cold-stays-cold = classic eviction (no mid-call re-warm); cold from turn 1 = never warmed.
Layer 3 — serving_config.py --endpoint <bare base URL, NO /v1> pulls /v1/models + vLLM /metrics and runs the Part 6 capacity math. Pasting the agent's LLM_URL env var verbatim yields /v1/v1/models.
Layer 4 (optional) — prompt composition: needs a dashboard FAK matching the call's workspace; without it, stop and say so — the Layer 0–3 diagnosis is complete (exactly what the eb4a83f7 report did for workspace 69016551).
Layer 5 (opt-in only) — live probes send real production load; never without explicit OK. Cold/warm/concurrent modes isolate prefill cost, cache savings, and contention. Caveat: an idle-box contention probe reading 1–2× inflation does NOT exonerate the live call — the ~4× collisions happen under campaign load a 3-request probe cannot recreate. The trace eviction curve remains the authoritative record.
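The Layer 3 endpoint pitfall is easy to guard against. The sketch below strips a trailing /v1 before the /v1/models and /metrics paths are appended; the helper names and the example host are hypothetical, and this is not how serving_config.py itself is known to behave.

```python
def bare_base_url(llm_url: str) -> str:
    """Strip trailing slashes and one trailing /v1 so paths are appended once."""
    url = llm_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url


def models_url(llm_url: str) -> str:
    return bare_base_url(llm_url) + "/v1/models"


def metrics_url(llm_url: str) -> str:
    return bare_base_url(llm_url) + "/metrics"


# An LLM_URL-style value that already ends in /v1 (hypothetical host).
llm_url = "http://llm.internal:8000/v1"
print(models_url(llm_url))   # http://llm.internal:8000/v1/models
print(metrics_url(llm_url))  # http://llm.internal:8000/metrics
```

Passing the output of `bare_base_url` to `--endpoint` avoids the doubled /v1/v1/models path described in Layer 3.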

PREFILL-BOUND: in ≥8000 & out ≤200 | DECODE-BOUND: out ≥300 | warm threshold: 1.5 s

Try it: Layer-2 verdict & eviction curve

Mirrors latency_breakdown.py: the verdict thresholds and the warm/cold read. Take the dominant speaking node's median input and output tokens for the verdict, then read the per-turn speaking TTFB against the warm threshold for the eviction story.

Input	Example value
Median input tokens	9000 tok
Median output tokens	120 tok
Warm threshold (tune for slower boxes)	1500 ms (default)

With these values the verdict is PREFILL-BOUND (input ≥8000 and output ≤200). On the per-turn curve, a turn whose TTFB exceeds the warm threshold is cold; any other turn is warm.
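The verdict thresholds and the three named eviction readings can be sketched as two small functions. This mirrors the rules as stated in this step, not the actual latency_breakdown.py source; the function names are hypothetical, and the "unclassified" fallback for patterns the text does not name is an assumption of this sketch.

```python
from statistics import median


def verdict(input_tokens: list[int], output_tokens: list[int]) -> str:
    """Prefill/decode verdict from the median token counts of one node."""
    med_in, med_out = median(input_tokens), median(output_tokens)
    if med_in >= 8000 and med_out <= 200:
        return "PREFILL-BOUND"
    if med_out >= 300:
        return "DECODE-BOUND"
    return "mixed/contention"


def eviction_read(ttfb_s: list[float], warm_threshold_s: float = 1.5) -> str:
    """Classify a per-turn TTFB curve; a turn is cold when TTFB > threshold."""
    cold = [t > warm_threshold_s for t in ttfb_s]
    if not cold:
        return "no turns"
    if not any(cold):
        return "warm throughout"            # cache fine
    if all(cold):
        return "cold from turn 1"           # never warmed
    first_cold = cold.index(True)
    if all(cold[first_cold:]):
        return "warm-then-cold-stays-cold"  # classic eviction
    return "unclassified"                   # assumption: pattern not named in the text


print(verdict([9000], [120]))
print(eviction_read([0.5, 0.75, 2.0, 3.0]))
```

The threshold is a parameter because the 1.5 s default is described as a heuristic to tune for slower boxes.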

Step 65

Separated recording tracks: the prerequisite for true-gap measurement TTS

What it is. GET /api/v2/call/{callId}/recording?track=user|assistant — track is the only recognized parameter (freya-dashboard src/app/api/v2/call/[callId]/recording/route.ts:38-47); no track, or wrong params like channel=, fall through to the same mixed file.

Mixed audio cannot measure the gap. It cannot distinguish echo from speech or give clean per-speaker boundaries, and using it when separated tracks exist is a named pitfall in the debug-call-audio skill. The fetch_call.py helper downloads user.wav, assistant.wav, mixed.wav plus call/transcript/workflow/agent JSON in one pass.

Symptom it fixes: a latency analysis that cannot locate the true caller-perceived gap because echo and speech are tangled in one waveform. Always pull the separated tracks before timing endpointing.
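Because a wrong or missing parameter silently returns the mixed file, it can help to build the URL through a helper that rejects anything but the two recognized track values. A minimal sketch: the function name is hypothetical, and serving the API from the dashboard host is an assumption; only the path and the `track` parameter come from the text.

```python
from urllib.parse import quote, urlencode

VALID_TRACKS = ("user", "assistant")


def recording_url(base_url: str, call_id: str, track: str) -> str:
    """URL for one separated track; refuses values that would fall back to mixed audio."""
    if track not in VALID_TRACKS:
        raise ValueError(f"track must be one of {VALID_TRACKS}, got {track!r}")
    path = f"/api/v2/call/{quote(call_id, safe='')}/recording"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'track': track})}"


# Hypothetical base URL and call id, for illustration only.
for track in VALID_TRACKS:
    print(recording_url("https://tr-app.freyavoice.ai", "example-call-id", track))
```

Raising on an unknown track turns the silent fall-through into an explicit error at the call site.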

Step 66

Reading gotchas and statistical hygiene gotcha

What it is. The traps that produce wrong diagnoses:

Idle /metrics counters ≈ 0 prove nothing — they reflect the probe instant (Part 6, Step 32).
Never sum workflow.intent_matching + workflow.routing.batch — parent/child spans of one routing pass; summing double-counts.
input_tokens: 0 rows = Langfuse usage-logging gaps; ignore for warm/cold reads.
Cumulative prefix hit rate hides cold big prompts (86.5% during three cold turns).
The vad span includes user speech — its 1.7–4.5 s rows are not dead air; use the audio-anchored timeline for true endpointing.
timeToFirstToken is often absent, and when present sometimes in nanoseconds (divide by 1e6 when >1e6).
Hygiene rules: no latency stats on <3-turn calls; a turn is "slow" only if total >2× the call's own median AND >2 s absolute; a dominant segment must be ≥40% of the turn's total, else label it "general".
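The three hygiene rules translate directly into code. The sketch below is one way to apply them, with hypothetical function names and made-up example numbers; returning `None` for short calls is a design choice of this sketch, so that "no stats" cannot be confused with "no slow turns".

```python
from statistics import median

MIN_TURNS = 3  # no latency stats on calls with fewer turns


def slow_turns(totals_s: list[float]):
    """Indexes of slow turns: > 2x the call's own median AND > 2 s absolute."""
    if len(totals_s) < MIN_TURNS:
        return None  # too few turns: report nothing rather than a misleading stat
    med = median(totals_s)
    return [i for i, t in enumerate(totals_s) if t > 2 * med and t > 2.0]


def dominant_segment(segments_s: dict[str, float]) -> str:
    """Largest segment if it is >= 40% of the turn total, else 'general'."""
    total = sum(segments_s.values())
    if total <= 0:
        return "general"
    name, value = max(segments_s.items(), key=lambda kv: kv[1])
    return name if value >= 0.4 * total else "general"


print(slow_turns([1.0, 1.0, 1.5, 4.0]))                            # [3]
print(dominant_segment({"routing": 1.5, "llm": 0.5, "tts": 0.5}))  # routing
```

Both conditions of the slow-turn rule must hold, which keeps a uniformly fast call from flagging a 1.2 s turn and a uniformly slow call from flagging every turn.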

The nanosecond unit trap. timeToFirstToken arrives in seconds, milliseconds, or nanoseconds depending on the row. The rule of thumb: a value > 1e6 is in nanoseconds, so divide it by 1e6 to get milliseconds. Apply the division only above that threshold: dividing a value that is already in milliseconds turns 1.4 s (1400 ms) into 0.0000014 s.

Try it: timeToFirstToken unit normalizer

The >1e6 → divide-by-1e6 rule, made concrete: take a raw timeToFirstToken value as it appears in a Langfuse row, together with the unit the row states (if any), apply the rule, and reason only about the resulting millisecond value.
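A minimal sketch of the >1e6 rule, with a hypothetical function name. It only separates nanoseconds from everything else: a value at or below the threshold is passed through untouched, and whether it is seconds or milliseconds still has to be judged from the row.

```python
NS_THRESHOLD = 1e6  # values above this are treated as nanoseconds


def apply_ns_rule(raw: float) -> tuple[float, bool]:
    """Return (value, was_nanoseconds); converts ns to ms, leaves other values as-is."""
    if raw > NS_THRESHOLD:
        return raw / 1e6, True   # nanoseconds -> milliseconds
    return raw, False            # unit unchanged: could be s or ms


print(apply_ns_rule(1_400_000_000))  # (1400.0, True)
print(apply_ns_rule(1400))           # (1400, False)
```

Returning the flag alongside the value makes it visible when a conversion happened, which guards against dividing a millisecond value a second time.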

Step 67

The symptom → fix playbook net

What it is. One complaint per row, the likely stage, the first thing to check, and the single lever that moves it (with the part that explains the lever).

Symptom	Likely stage	First check	Fix lever (part)
Constant ~1 s pause before every answer, all turns alike	endpointing timers	agent's `vadStopSecs` + `waitSeconds`	lower Stop Delay / Wait seconds within the cutoff trade-off (Part 2)
Mid-sentence cutoffs after lowering timers	endpointing	fragmentation rate in transcript	raise `stop_secs` back, or Smart Turn for pause-heavy flows (Part 2)
~3 s dead air then "could you repeat"	STT failure path	finalizeTimeout signature; STT server health/GPU contention	fix STT serving, not the LLM (Part 3)
Bot talks over interrupting callers for seconds	barge-in path	batch STT + numberOfWords ≥1?	numberOfWords 0 or streaming STT (Part 2)
Some turns instant, some 3–4 s, same node	prefix-cache eviction	eviction curve (Layer 2)	shrink prompt; FP8 KV; admission control (Parts 6, 8)
Every turn in one node slow, others fine	cold prefix / no warmup	does the node have an entry message? `llm_warmup.node_entry` present?	add entry message; merge nodes (Parts 5, 8)
Every turn slow, prompt huge, input ≥8K out ≤200	PREFILL-BOUND	Layer-2 verdict	shrink prompt (the lever board, Part 6)
Long replies feel slow despite fast TTFT	DECODE-BOUND / sentence shape	output tokens; `tts.text_aggregation`	shorten spoken turns; short first sentences (Parts 5, 9)
Routing >2 s with fat tail	GPU contention	warmup collisions; `num_preemptions`	reduce warmup pressure = shrink prompts; never warm harder (Part 5)
New customers slower than repeat customers (cloud)	cross-call cache	cached-token % per call	move dynamic vars to a trailing context block (Part 7)
Cloud agent 2.4 s TTFT after a model swap	reasoning default	`reasoning_effort` in request, `service_tier` in response	pin reasoning to none/minimal (Part 7)
Dead air mid-flow on a specific node	blocking api_call	node pre_actions + vendor RTT	entry message / pre-tool speech / non-blocking (Part 11)
Trace clean but caller feels lag	media path	nothing — it's not in the trace	accept the 150–300 ms floor; check carrier jitter, setup delays (Part 10)
Slow time-to-first-hello on web calls only	setup path	no prefetch + 1.1 s early-media delay	static first message; trim boot work (Parts 10, 11)
"Agent keeps talking after I interrupt" (Asterisk)	shipped-audio buffer	the missing FLUSH_MEDIA on the live serializer	open engineering question — escalate, don't tune (Part 10)

Appendix

The no-runtime-consumer list (latency edition) dead

What it is. Knobs that look like latency levers but do nothing today (verified against pipecat-agent@dev 5b29206c unless noted).

Knob	Status
`aicEnabled` (AIC Filter)	hard-disabled ("crate instability", `base_service.py:1467-1468`)
`webrtcApmEnabled`	merged then reverted (commit `dc90fd7f`, 2026-04-10); ImportError guard means it never loads on dev
`stopSpeakingPlan.voiceSeconds` / `backOffSeconds`	no runtime consumer
`userTurnStopTimeout` slider (voice mode)	floored to ≥30 s at runtime — slider effectively decorative
`timeout_ms` / `max_retries` on LLM-invoked tools	parsed, never consumed (only node actions honor them)
TTS `minTextLength` / `minWords`	read and logged, passed to nothing
`speed` on the native VoxCPM2 server	"accepted for compatibility; not yet applied"
`LanguageTextFilter` langdetect layer	dead code behind an early return, deliberately
`AsteriskWSServerTransport` (`initial_jitter_buffer_ms: 80`)	whole class has no runtime consumer; live path is un-buffered
`prompt_cache_key`	not sent anywhere in production code — measured worse
`serviceTier` on non-OpenAI / on-prem providers	no runtime consumer (OpenAI branch only)
`freya-235b-enhanced`	removed (PR #803) + catalog-collapsed; stale test fixtures only
DTMF `ignore_speech` / `ignore_speech_timeout`	defined; consumer not traced — possibly unwired (not verified either way)

Re-verify after a pull. When in doubt whether a knob is wired in a newer branch, say so rather than guessing, and re-verify after a pull: features ship, and dead knobs go live (the confidence-reask group did exactly that).

Checkpoint: you're handed a tr-app call URL and the complaint "geç cevap veriyor" (Turkish for "it answers late"). Name the first three commands/artifacts you produce, in order, and the one mistake that would waste an hour.

fetch_latency_data.py "<call-url>" — which resolves tr-app → the kkb Langfuse instance (the hour-wasting mistake is querying cloud.langfuse.com because the dashboard printed a cloud URL — "not found" there means wrong instance, not no trace).
latency_breakdown.py on the observations — per-call-type table, prefill/decode verdict, eviction curve.
If the verdict is PREFILL-BOUND, serving_config.py --endpoint <bare base URL> for the KV-capacity math. Then, only if needed and with explicit approval, live probes.

Ask Claude Code: "For this tr-app call URL, fetch the latency data from the correct Langfuse instance, print the per-call-type table and the prefill/decode verdict, and tell me whether the dominant speaking node's eviction curve shows classic mid-call eviction."