Part 12

Measurement, and the symptom → fix playbook

The two instruments, the observation-name dictionary, the layered diagnosis method, and a symptom-to-fix table that maps each complaint to the one knob that moves it.

Framing

Every fix is downstream of a measurement

Every lever in Parts 2–11 is downstream of a measurement. The platform gives you two complementary instruments; the discipline of this part is learning which question each answers, and not wasting an hour looking in the wrong place.

What dominates here: discipline. The instruments already exist; wrong-instance lookups, double-counted spans, and idle-metrics exoneration are how a 30-minute diagnosis becomes a day.

Step 61

Measure before fixing: the two instruments method

They compose, not compete. The targets table from the intro is the shared yardstick for both. Use the audio-anchored timeline to locate the slow stage, then the trace-anchored decomposition to explain it when the LLM is the culprit.

Step 62

The Langfuse instance gotcha (the #1 time-waster) gotcha

What it is. Every call is a Langfuse session keyed on the call_id (the trace id is different — lookup is session → trace → observations). Freya runs multiple Langfuse instances, resolved from the environment, never from the URL printed in the call record.

| Dashboard host | Langfuse instance |
| --- | --- |
| staging-app.freyavoice.ai / app.freyavoice.ai | cloud (cloud.langfuse.com) |
| tr-app.freyavoice.ai | kkb self-hosted (kkb-langfuse.freyavoice.ai) |

The traps (latency-analyzer references/langfuse-instances.md:3-19). The dashboard can print a cloud.langfuse.com session URL even for an on-prem call; that URL is wrong. The bundled langfuse-obs MCP is hardwired to cloud, so "trace not found" for a tr-app call means wrong instance, not no trace. Auth is HTTP Basic pk:sk. Cloudflare returns 403 for the default urllib User-Agent on *.freyavoice.ai. Cloud retention is ~30 days.

Symptom it causes: "trace not found" on a real, recent call — almost always you queried cloud for a tr-app call. The fix is to resolve the instance from the dashboard host, not the printed URL.
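
The host-based resolution can be sketched in a few lines of Python. This is a minimal illustration, not the code inside fetch_latency_data.py: the function names, the example call URL path, and the User-Agent string are all hypothetical. Only the host-to-instance mapping, the Basic pk:sk scheme, and the need for a non-default User-Agent come from the text above.

```python
import base64
from urllib.parse import urlparse

# Dashboard host -> Langfuse base URL, copied from the instance table.
LANGFUSE_BY_DASHBOARD_HOST = {
    "staging-app.freyavoice.ai": "https://cloud.langfuse.com",
    "app.freyavoice.ai": "https://cloud.langfuse.com",
    "tr-app.freyavoice.ai": "https://kkb-langfuse.freyavoice.ai",
}


def resolve_langfuse_base(call_url: str) -> str:
    """Pick the Langfuse instance from the dashboard host, never from a printed URL."""
    host = urlparse(call_url).hostname
    try:
        return LANGFUSE_BY_DASHBOARD_HOST[host]
    except KeyError:
        raise ValueError(f"unknown dashboard host: {host!r}") from None


def langfuse_headers(public_key: str, secret_key: str) -> dict:
    """Build HTTP Basic pk:sk auth plus an explicit User-Agent."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return {
        "Authorization": f"Basic {token}",
        # Hypothetical value: anything other than the urllib default.
        "User-Agent": "latency-debug/0.1",
    }


# Hypothetical call URL; only the host matters for resolution.
call_url = "https://tr-app.freyavoice.ai/calls/example-call-id"
print(resolve_langfuse_base(call_url))
```

Failing loudly on an unknown host is a deliberate choice in this sketch: a silent fallback to cloud would reproduce the exact "trace not found" symptom described above.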

Step 63

The observation-name dictionary LLM

What it is. You cannot read a trace without this map (emitters in pipecat-agent). Several spans are renamed at export or are two names for one wall-clock — this is where double-counting starts.

| Observation | What it measures | Source |
| --- | --- | --- |
| turn (attr turn.user_bot_latency_seconds) | the felt gap: VAD-stop → bot audio start | utils/tracing/turn_trace_observer.py:194; observers/user_bot_latency_observer.py:87-88 |
| vad | the speech segment including user speech, NOT pure endpointing | src/exts/diagnostics/pipeline_monitor.py:316-328 |
| workflow.routing.batch | the whole extract+route DAG wall-clock (contains extraction) | src/core/workflow/processor.py:1258-1261 |
| workflow.intent_matching | the single batched intent-leaf LLM call (child of routing.batch: same wall-clock, two names) | src/core/workflow/router.py:1137 |
| workflow.extraction / .batch / _retry | extraction LLM calls / fan-out wall / retry | src/core/workflow/extractor.py:215-216,362,449 |
| llm.<node> / tts.<node> / stt.<node> | speaking LLM / TTS / STT, renamed at export by appending workflow.node | src/core/tracing.py:110-111,222-227 |
| tts.text_aggregation | LLM first token → first full sentence (the TTFS span) | src/exts/tts/freya_tts_v2.py:271-277 |
| llm_warmup.node_entry / .router / .extractor | background prefix primes, off the critical path | processor.py:2883-2895,2945-2991 |

Decomposition rule. turn.user_bot_latency_seconds ≈ (waitSeconds/STT overlap) + workflow.routing.batch + llm.<node> TTFB + tts.text_aggregation + tts TTFB. Because routing.batch already contains extraction, never add workflow.extraction.batch on top.
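
As a worked example of the decomposition rule, the sketch below sums only the critical-path terms and reports separately how much a naive sum of every span would double-count. The dictionary keys for the wait/STT overlap and the two TTFB terms are hypothetical labels, and the millisecond values are illustrative, not measurements.

```python
# Terms on the critical path, per the decomposition rule.
CRITICAL_PATH = (
    "wait_stt_overlap",        # hypothetical key: the waitSeconds/STT overlap term
    "workflow.routing.batch",
    "llm.ttfb",                # hypothetical key: llm.<node> TTFB
    "tts.text_aggregation",
    "tts.ttfb",                # hypothetical key: tts.<node> TTFB
)

# Spans whose wall-clock is already inside workflow.routing.batch.
ALREADY_INSIDE_ROUTING = ("workflow.extraction.batch", "workflow.intent_matching")


def decompose_turn(spans_ms):
    """Return (estimated felt turn in ms, ms a naive sum would double-count)."""
    felt = sum(spans_ms.get(name, 0) for name in CRITICAL_PATH)
    double_counted = sum(spans_ms.get(name, 0) for name in ALREADY_INSIDE_ROUTING)
    return felt, double_counted


# Illustrative values only.
spans = {
    "wait_stt_overlap": 300,
    "workflow.routing.batch": 900,
    "workflow.extraction.batch": 500,   # contained in routing.batch
    "workflow.intent_matching": 900,    # same wall-clock as routing.batch
    "llm.ttfb": 450,
    "tts.text_aggregation": 250,
    "tts.ttfb": 200,
}

felt, extra = decompose_turn(spans)
print(f"felt turn ~ {felt} ms; a naive sum would report {felt + extra} ms")
```

With these numbers the felt turn is 2100 ms, while adding every span would claim 3500 ms, which is the inflation the two double-count traps produce.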

Try it: Decompose a turn (and avoid double-counting)

[Interactive widget: turn.user_bot_latency_seconds built from its component spans.] Set each span's wall-clock; the felt turn sums the components per the decomposition rule. Toggle the two classic double-count traps to see how they inflate the number: routing.batch already contains extraction, and intent_matching is the same pass as routing.batch. The chart breaks the felt turn into wait / STT overlap, routing + LLM, TTS, and double-counted time.

Step 64

The layered diagnosis method LLM

What it is. The latency-analyzer recipe (SKILL.md:53-72) — run in order, stop when one layer fully explains the latency.

Layer-2 thresholds:
PREFILL-BOUND: in ≥8000 & out ≤200
DECODE-BOUND: out ≥300
Warm threshold: 1.5 s
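
A minimal sketch of how these thresholds might be applied is shown below. It is not the code in latency_breakdown.py: the function names and the "NO-VERDICT" fallback label are assumptions, and the token counts and TTFB values are illustrative. Only the three thresholds and the "cold means above the warm threshold" reading come from the guide.

```python
from statistics import median

PREFILL_MIN_INPUT_TOKENS = 8000
PREFILL_MAX_OUTPUT_TOKENS = 200
DECODE_MIN_OUTPUT_TOKENS = 300
WARM_THRESHOLD_S = 1.5


def layer2_verdict(input_tokens, output_tokens):
    """Classify a speaking node from its median input/output token counts."""
    in_med = median(input_tokens)
    out_med = median(output_tokens)
    if in_med >= PREFILL_MIN_INPUT_TOKENS and out_med <= PREFILL_MAX_OUTPUT_TOKENS:
        return "PREFILL-BOUND"
    if out_med >= DECODE_MIN_OUTPUT_TOKENS:
        return "DECODE-BOUND"
    # Assumption: the guide names no label for the in-between case.
    return "NO-VERDICT"


def eviction_curve(ttfb_s, warm_threshold_s=WARM_THRESHOLD_S):
    """Label each turn warm or cold by comparing its TTFB to the warm threshold."""
    return ["cold" if t > warm_threshold_s else "warm" for t in ttfb_s]


# Illustrative per-turn values for one speaking node.
print(layer2_verdict([9000, 9100, 8900], [120, 110, 130]))
print(eviction_curve([0.6, 0.7, 3.2, 0.8]))
```

A run of warm turns interrupted by a cold one, as in the second example, is the shape to look for when reading the per-turn curve for eviction.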

Try it: Layer-2 verdict & eviction curve

[Interactive widget: mirrors latency_breakdown.py, covering the verdict thresholds and the warm/cold read.] Set the dominant speaking node's median input/output tokens for the verdict, then read the per-turn speaking TTFB curve against the warm threshold (default 1.5 s) for the eviction story. Turns above the threshold are plotted as cold, the rest as warm.

Step 65

Separated recording tracks: the prerequisite for true-gap measurement TTS

What it is. GET /api/v2/call/{callId}/recording?track=user|assistant. track is the only recognized parameter (freya-dashboard src/app/api/v2/call/[callId]/recording/route.ts:38-47); no track, or wrong params like channel=, fall through to the same mixed file.

Mixed audio cannot measure the gap. Mixed audio cannot distinguish echo from speech or give clean per-speaker boundaries; using it when separated tracks exist is a named pitfall in the debug-call-audio skill. The fetch_call.py helper downloads user.wav, assistant.wav, mixed.wav plus call/transcript/workflow/agent JSON in one pass.

Symptom it fixes: a latency analysis that cannot locate the true caller-perceived gap because echo and speech are tangled in one waveform. Always pull the separated tracks before timing endpointing.
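
Because an unrecognized parameter silently returns the mixed file, it can help to build the recording URL through a small helper that rejects anything other than the two valid track values. The sketch below is hypothetical: the helper name is invented, the dashboard base URL is an assumption, and authentication is not shown. It is not how fetch_call.py is implemented.

```python
from urllib.parse import quote, urlencode

VALID_TRACKS = ("user", "assistant")


def recording_url(dashboard_base, call_id, track=None):
    """Build the recording URL; track=None requests the mixed file."""
    path = f"{dashboard_base.rstrip('/')}/api/v2/call/{quote(call_id, safe='')}/recording"
    if track is None:
        return path
    if track not in VALID_TRACKS:
        # Fail here rather than let the server fall through to mixed audio.
        raise ValueError(f"track must be one of {VALID_TRACKS}, got {track!r}")
    return f"{path}?{urlencode({'track': track})}"


# Hypothetical base URL and call id.
base = "https://tr-app.freyavoice.ai"
for track in VALID_TRACKS:
    print(recording_url(base, "example-call-id", track))
```

Raising on a bad track value turns the silent fall-through into a visible error before any audio is downloaded.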

Step 66

Reading gotchas and statistical hygiene gotcha

What it is. The traps that produce wrong diagnoses:

The nanosecond unit trap. timeToFirstToken arrives in seconds, milliseconds, or nanoseconds depending on the row. The rule of thumb: if the value is > 1e6, it is nanoseconds, so divide by 1e6 to get milliseconds. Applying the division blindly to a value already in ms turns 1.4 s into 0.0000014 s.
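
The rule of thumb can be written as a guarded conversion. This is a minimal sketch with a hypothetical function name; it applies only the > 1e6 rule stated above. The guide gives no rule for telling seconds from milliseconds, so smaller values are returned unchanged with a note rather than guessed at.

```python
NS_CUTOFF = 1e6  # values above this are treated as nanoseconds


def normalize_ttft(raw):
    """Apply the > 1e6 rule. Returns (value, note about its unit)."""
    if raw > NS_CUTOFF:
        return raw / 1e6, "ms (converted from ns)"
    # Not covered by the rule: could be seconds or ms, so leave it alone.
    return raw, "unchanged (seconds or ms; check the row)"


print(normalize_ttft(1_400_000_000))  # a 1.4 s TTFT reported in nanoseconds
print(normalize_ttft(1400))           # already in ms, must not be divided
```

Guarding on the cutoff before dividing is what prevents the blind-division mistake: a value of 1400 ms passes through untouched.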

Try it: timeToFirstToken unit normalizer

[Interactive widget: the >1e6 → divide-by-1e6 rule, made concrete.] Paste a raw timeToFirstToken value as it appears in a Langfuse row. The normalizer applies the guide rule and shows the millisecond value you should actually reason about.

Step 67

The symptom → fix playbook net

What it is. One complaint per row, the likely stage, the first thing to check, and the single lever that moves it (with the part that explains the lever).

| Symptom | Likely stage | First check | Fix lever (part) |
| --- | --- | --- | --- |
| Constant ~1 s pause before every answer, all turns alike | endpointing timers | agent's vadStopSecs + waitSeconds | lower Stop Delay / Wait seconds within the cutoff trade-off (Part 2) |
| Mid-sentence cutoffs after lowering timers | endpointing | fragmentation rate in transcript | raise stop_secs back, or Smart Turn for pause-heavy flows (Part 2) |
| ~3 s dead air then "could you repeat" | STT failure path | finalizeTimeout signature; STT server health/GPU contention | fix STT serving, not the LLM (Part 3) |
| Bot talks over interrupting callers for seconds | barge-in path | batch STT + numberOfWords ≥1? | numberOfWords 0 or streaming STT (Part 2) |
| Some turns instant, some 3–4 s, same node | prefix-cache eviction | eviction curve (Layer 2) | shrink prompt; FP8 KV; admission control (Parts 6, 8) |
| Every turn in one node slow, others fine | cold prefix / no warmup | does the node have an entry message? llm_warmup.node_entry present? | add entry message; merge nodes (Parts 5, 8) |
| Every turn slow, prompt huge, input ≥8K out ≤200 | PREFILL-BOUND | Layer-2 verdict | shrink prompt (the lever board, Part 6) |
| Long replies feel slow despite fast TTFT | DECODE-BOUND / sentence shape | output tokens; tts.text_aggregation | shorten spoken turns; short first sentences (Parts 5, 9) |
| Routing >2 s with fat tail | GPU contention | warmup collisions; num_preemptions | reduce warmup pressure = shrink prompts; never warm harder (Part 5) |
| New customers slower than repeat customers (cloud) | cross-call cache | cached-token % per call | move dynamic vars to a trailing context block (Part 7) |
| Cloud agent 2.4 s TTFT after a model swap | reasoning default | reasoning_effort in request, service_tier in response | pin reasoning to none/minimal (Part 7) |
| Dead air mid-flow on a specific node | blocking api_call | node pre_actions + vendor RTT | entry message / pre-tool speech / non-blocking (Part 11) |
| Trace clean but caller feels lag | media path | nothing: it's not in the trace | accept the 150–300 ms floor; check carrier jitter, setup delays (Part 10) |
| Slow time-to-first-hello on web calls only | setup path | no prefetch + 1.1 s early-media delay | static first message; trim boot work (Parts 10, 11) |
| "Agent keeps talking after I interrupt" (Asterisk) | shipped-audio buffer | the missing FLUSH_MEDIA on the live serializer | open engineering question: escalate, don't tune (Part 10) |

Appendix

The no-runtime-consumer list (latency edition) dead

What it is. Knobs that look like latency levers but do nothing today (verified against pipecat-agent@dev 5b29206c unless noted).

| Knob | Status |
| --- | --- |
| aicEnabled (AIC Filter) | hard-disabled ("crate instability", base_service.py:1467-1468) |
| webrtcApmEnabled | merged then reverted (commit dc90fd7f, 2026-04-10); ImportError guard means it never loads on dev |
| stopSpeakingPlan.voiceSeconds / backOffSeconds | no runtime consumer |
| userTurnStopTimeout slider (voice mode) | floored to ≥30 s at runtime; slider effectively decorative |
| timeout_ms / max_retries on LLM-invoked tools | parsed, never consumed (only node actions honor them) |
| TTS minTextLength / minWords | read and logged, passed to nothing |
| speed on the native VoxCPM2 server | "accepted for compatibility; not yet applied" |
| LanguageTextFilter langdetect layer | dead code behind an early return, deliberately |
| AsteriskWSServerTransport (initial_jitter_buffer_ms: 80) | whole class has no runtime consumer; live path is un-buffered |
| prompt_cache_key | not sent anywhere in production code; measured worse |
| serviceTier on non-OpenAI / on-prem providers | no runtime consumer (OpenAI branch only) |
| freya-235b-enhanced | removed (PR #803) + catalog-collapsed; stale test fixtures only |
| DTMF ignore_speech / ignore_speech_timeout | defined; consumer not traced, possibly unwired (not verified either way) |

Re-verify after a pull. When in doubt whether a knob is wired in a newer branch, say so rather than guessing, and re-verify after a pull: features ship, and dead knobs go live (the confidence-reask group did exactly that).

Checkpoint: you're handed a tr-app call URL and the complaint "geç cevap veriyor" (Turkish for "it answers late"). Name the first three commands/artifacts you produce, in order, and the one mistake that would waste an hour.
  1. fetch_latency_data.py "<call-url>" — which resolves tr-app → the kkb Langfuse instance (the hour-wasting mistake is querying cloud.langfuse.com because the dashboard printed a cloud URL — "not found" there means wrong instance, not no trace).
  2. latency_breakdown.py on the observations — per-call-type table, prefill/decode verdict, eviction curve.
  3. If the verdict is PREFILL-BOUND, serving_config.py --endpoint <bare base URL> for the KV-capacity math. Then, only if needed and with explicit approval, live probes.
Ask Claude Code: "For this tr-app call URL, fetch the latency data from the correct Langfuse instance, print the per-call-type table and the prefill/decode verdict, and tell me whether the dominant speaking node's eviction curve shows classic mid-call eviction."