Part 7 — The cloud provider path · Freya End-to-End Latency

Step 34

How model routing actually picks a model LLM

What it is. The chain for debugging "which model answered." The dashboard model catalog (models/providers tables) maps a dashboard-facing model id to a provider plus an upstream model string; the resolve endpoint enriches it with credentials and the context window; the agent's provider switch consumes the result.

freya-dashboard: src/db/schema/models.ts · resolve: api/agent/v1/resolve/route.ts:18-91 · switch: base_service.py:355-545

Runtime effect. enrichServiceConfig() attaches catalogModel, baseUrl, credentials and maxContextWindow (resolve route.ts:18-91); the agent's provider switch reads those (base_service.py:355-545). Three silent overrides bend the rule that "the dashboard model picks the model":

On-prem master override — is_on_premise forces env LLM_URL/LLM_MODEL/LLM_API_KEY regardless of dashboard config (base_service.py:422-434). On an on-prem box the dashboard model choice is decorative.
Simulation override — sims force non-Freya providers to the self-hosted model (base_service.py:1131-1150), so sim latency measures the self-hosted stack, not the configured cloud model.
Fallback LLM race — with Fallback LLM (additionalSettings.fallbackLlmEnabled) on, a Gemini 2.5 Flash backup races the primary via ParallelPipeline; it is released only if the primary streams nothing within ttfbTimeoutSeconds (default 1.0 s, boot_steps.py:1965-1998). Converts provider P99 stalls into a bounded ~1 s + Gemini-TTFT worst case. Skipped on-prem.
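
The fallback race can be pictured with a minimal asyncio sketch. It is not the ParallelPipeline implementation: fake_llm, race_with_fallback and the delays are hypothetical, and abandoning the primary after the timeout is an assumption made for the sketch.

```python
import asyncio


async def fake_llm(name: str, ttft_s: float) -> str:
    """Hypothetical stand-in for a streaming LLM call: resolves when its first token arrives."""
    await asyncio.sleep(ttft_s)
    return name


async def race_with_fallback(primary, backup, ttfb_timeout_s: float = 1.0) -> str:
    """Return the name of the model whose output is released to the caller."""
    # Both requests start immediately, so the backup is already in flight if needed.
    primary_task = asyncio.ensure_future(primary)
    backup_task = asyncio.ensure_future(backup)
    try:
        # shield() keeps the timeout from cancelling the primary task itself.
        winner = await asyncio.wait_for(asyncio.shield(primary_task), ttfb_timeout_s)
        backup_task.cancel()  # primary streamed in time: the backup is discarded
        return winner
    except asyncio.TimeoutError:
        primary_task.cancel()  # assumption: a silent primary is abandoned
        return await backup_task


async def demo() -> tuple:
    fast = await race_with_fallback(fake_llm("primary", 0.01), fake_llm("backup", 0.02), 0.2)
    stalled = await race_with_fallback(fake_llm("primary", 0.5), fake_llm("backup", 0.02), 0.1)
    return fast, stalled


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the stalled case the caller waits for the timeout and then for whatever remains of the backup's first token, which is the bounded worst case described above.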

Historical footnote (do not cargo-cult): freya-235b-enhanced was a branded alias that rewrote to provider=openai, model=gpt-4.1 + priority tier (added PR #406 2026-03-17, removed PR #803 2026-05-19, catalog-collapsed by the 2026-05-22 models_unify migration). No runtime consumer exists today.

Symptom it explains: "the dashboard says model X but the latency / token rate looks like model Y." Check the three overrides before blaming the catalog.
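
As a debugging aid, the three overrides can be written down as a checklist. This is a sketch under assumptions, not the code in base_service.py: CallContext, routing_suspects and the provider id "freya" are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CallContext:
    """Hypothetical summary of one call; field names are illustrative only."""
    dashboard_provider: str
    dashboard_model: str
    is_on_premise: bool = False
    is_simulation: bool = False
    fallback_llm_enabled: bool = False


def routing_suspects(ctx: CallContext) -> List[str]:
    """List the overrides that could make another model answer than the dashboard one."""
    suspects = []
    if ctx.is_on_premise:
        suspects.append("on-prem override: env LLM_URL/LLM_MODEL/LLM_API_KEY win")
    # Assumption: "freya" is the provider id of the self-hosted model.
    if ctx.is_simulation and ctx.dashboard_provider != "freya":
        suspects.append("simulation override: self-hosted model answered")
    # The fallback race is skipped on-prem, so it is only a suspect in the cloud.
    if ctx.fallback_llm_enabled and not ctx.is_on_premise:
        suspects.append("fallback race: the backup may have answered a stalled turn")
    return suspects


if __name__ == "__main__":
    ctx = CallContext("openai", "gpt-4.1", is_simulation=True, fallback_llm_enabled=True)
    for line in routing_suspects(ctx):
        print(line)
```

An empty list means none of the three overrides applies and the catalog mapping is the next thing to inspect.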

Step 35

The model picker IS the latency dial LLM

What it is. Experiment 1 ran the same ~1.1K-token prompt, the same code, streaming, across 19 models. The spread is an order of magnitude purely from the model pick — geography (TR-hosted vs EU vs US) and reasoning behaviour dominate.

analysis/llm-latency-benchmark/1-model-comparison/findings.md ~1.1K-token prompt, streaming

Model	TTFT P50	P95	Verdict
Ubicloud Qwen3-235B (TR)	82 ms	243 ms	fastest; server retired 2026-04-24, valid as "same-region self-hosted vLLM"
Gemma 4 31B-it (TR self-hosted)	98 ms	112 ms	most consistent (StDev 29 ms)
Freya 235B (self-hosted EU)	453 ms	651 ms	best EU
GPT-5-chat	590 ms	832 ms	best external
GPT-4o	633 ms	987 ms	solid baseline
GPT-4.1	754 ms	1200 ms	—
GPT-5.x reasoning (low)	1700–1950 ms	3000–3500 ms	not viable for voice

Same prompt, same code: 82 ms → 2.4 s purely by model pick. Reasoning models are categorically out. Real-agent overhead above raw benchmark TTFT is ~160–350 ms (extraction/routing pre-calls, per-call HTTP client, event-loop contention; findings.md:38-77).

Symptom it fixes: "every speaking turn feels ~600 ms slower than the benchmark says." A model swap is the single biggest cloud lever, worth an order of magnitude: 82 ms to 2.4 s.
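
To turn the table into a per-turn expectation, the benchmark P50 can be combined with the measured real-agent overhead range. A minimal sketch, assuming the overhead simply adds to raw TTFT; the dictionary and function names are hypothetical, the numbers are the ones quoted above.

```python
from typing import Tuple

# Experiment 1 TTFT P50 in milliseconds, copied from the table above.
TTFT_P50_MS = {
    "Ubicloud Qwen3-235B (TR)": 82,
    "Gemma 4 31B-it (TR self-hosted)": 98,
    "Freya 235B (self-hosted EU)": 453,
    "GPT-5-chat": 590,
    "GPT-4o": 633,
    "GPT-4.1": 754,
}

# Real-agent overhead above raw benchmark TTFT (~160 to 350 ms).
AGENT_OVERHEAD_MS = (160, 350)


def estimated_turn_ttft_ms(model: str) -> Tuple[int, int]:
    """Return a (low, high) estimate of in-agent TTFT for one speaking turn."""
    p50 = TTFT_P50_MS[model]
    low, high = AGENT_OVERHEAD_MS
    return (p50 + low, p50 + high)


if __name__ == "__main__":
    for name in TTFT_P50_MS:
        print(name, estimated_turn_ttft_ms(name))
```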

Step 36

The reasoning-effort trap landmine

What it is. Leaving reasoning_effort unset on GPT-5.3 maps to high, not medium.

default (unset = high): P50 2,395 ms (+224% vs GPT-4o) · minimal: P50 994 ms · findings.md:89-97 (2026-03-03)

Runtime effect. Any model swap must pin reasoning to none/minimal and verify it in the response payload. GPT-5.3 also silently ignores the priority tier — the API echoes service_tier: "default" regardless of what you sent.

Symptom it causes: "cloud agent went from snappy to ~2.4 s TTFT after a model swap, nothing else changed." First check reasoning_effort in the request and the echoed service_tier in the response. +1.4 s just from a default.

Ask Claude Code: "Pull this call's generation from Langfuse and show me the reasoning_effort sent and the service_tier echoed in the response."
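
The same check can be scripted once the request and response payloads are in hand. The sketch below assumes both are plain dicts carrying the reasoning_effort and service_tier fields; how they are exported from Langfuse is not covered, and audit_generation is a hypothetical helper. The reasoning_effort check only makes sense for reasoning-capable models.

```python
from typing import List


def audit_generation(request: dict, response: dict) -> List[str]:
    """Flag the two silent defaults described above for one generation."""
    findings = []

    effort = request.get("reasoning_effort")
    if effort is None:
        findings.append("reasoning_effort unset: the provider default applies")
    elif effort not in ("none", "minimal"):
        findings.append(f"reasoning_effort={effort!r}: not pinned to none/minimal")

    sent_tier = request.get("service_tier")
    echoed_tier = response.get("service_tier")
    if sent_tier is not None and echoed_tier != sent_tier:
        findings.append(f"service_tier sent {sent_tier!r} but echoed {echoed_tier!r}")

    return findings


if __name__ == "__main__":
    req = {"model": "gpt-5.3", "service_tier": "priority"}
    resp = {"service_tier": "default"}
    for finding in audit_generation(req, resp):
        print(finding)
```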

Step 37

Priority processing LLM

What it is. Processing tier (additionalSettings.serviceTier → service_tier), shown only for provider OpenAI in the config panel (llm-config-panel.tsx:226,515-547); consumed in the openai branch only (base_service.py:436-440 → injected into every chat request, pipecat services/openai/base_llm.py:332).

No runtime consumer on Freya-provider or on-prem: vLLM has no priority queue. The tier is OpenAI-only; on a self-hosted config it is read by nothing.

Measured effect (exp 3, GPT-4o, 100 calls, 2026-02-28): TTFT P50 −59 ms, P95 −147 ms, StDev halved (154 → 73 ms), throughput +58%, cost premium 1.7×. The variance / tail win is the real value — callers feel P95.
The catch (exp 4): on 22K+-token prompts the TTFT edge evaporates (+24 ms, noise) while the P95 and throughput wins remain. Support is per-model: always check the echoed service_tier.

Symptom it fixes: "P50 is fine but the tail is ugly." Priority buys tail control on small prompts, not raw first-token speed on big ones.
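
The OpenAI-only gating can be sketched in a few lines. This illustrates the behaviour described above and is not the code in base_service.py; chat_request_extras and the provider ids are hypothetical.

```python
def chat_request_extras(provider: str, additional_settings: dict) -> dict:
    """Build the extra chat-request parameters derived from additionalSettings."""
    extras = {}
    tier = additional_settings.get("serviceTier")
    # Only the openai branch forwards the tier; every other provider ignores it.
    if provider == "openai" and tier:
        extras["service_tier"] = tier
    return extras


if __name__ == "__main__":
    settings = {"serviceTier": "priority"}
    print(chat_request_extras("openai", settings))
    print(chat_request_extras("freya", settings))
```

Because the setting is dropped silently for other providers, the echoed service_tier in the response remains the only reliable confirmation.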

Try it: the model picker as a latency dial

Turn timeline: pick a model, toggle priority

Numbers are Experiment 1 P50/P95 TTFT (findings.md). Priority is an OpenAI-only tier; for non-OpenAI models and GPT-5.3 the checkbox is disabled and the request is silently ignored.

[Interactive widget: select a speaking model and toggle the priority tier. It draws one speaking turn at P50, with priority applied where valid, as agent overhead (~0.25 s) followed by LLM TTFT.]

Step 38

Provider prompt caching: what it does and does not buy LLM

What it is. Four experiments, each killing an assumption about OpenAI prompt caching.

Below 1,024 prompt tokens caching never triggers — 0/50 hits on an 809-token prompt (exp 2, 2026-02-28).
Cache hits do not cut TTFT on OpenAI — 93% hit rate at 22,752 tokens, zero TTFT improvement (exp 4). It saves cost, not compute time. Do not promise TTFT wins from cached_tokens.
prompt_cache_key made things worse — +~500 ms P50 on Freya's workload despite equal-or-better cache rates; the plain 1-token warmup was the best condition (exp 6, GPT-4.1, ~36K-token real prompt, 2026-03-16). So production sends no prompt_cache_key (a grep over pipecat-agent src/ finds nothing) — warmups are the chosen mechanism.
Cross-call caching is governed by variable POSITION (exp 7, real 31.5K-token Allianz BES prompt, 2026-03-18/19). With inline {{variables}}, the earliest per-customer var sits at 5.4% into the prompt — every new customer pays a near-full cold prefill: GPT-4.1 inline new-customer TTFT avg 3,161 ms / 1.5% tokens cached, P95 9.8 s. Moving all dynamic data into a trailing <context> block (divergence at 98.8%) restored 96.1% token reuse and same-customer TTFT (1,439 ms avg). This is the single biggest cloud-LLM cache lever Freya has — and it is an authoring convention, not a platform feature (no code enforces the layout; substitution is still inline).

The metric trap: gpt-4o inline showed a 93% "cache hit rate" with only 3% of tokens cached. A "hit" is just cached_tokens > 0, and the first ~1K shared tokens always match. Track cached-token percentage, never the hit-rate flag.
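
The gap between the two metrics is easy to reproduce. A minimal sketch, assuming each call's usage has been flattened to a dict with prompt_tokens and cached_tokens; the sample numbers are made up for illustration and cache_metrics is a hypothetical helper.

```python
from typing import Dict, List


def cache_metrics(usages: List[Dict[str, int]]) -> Dict[str, float]:
    """Compute the misleading hit-rate flag and the cached-token percentage."""
    if not usages:
        raise ValueError("need at least one call")
    hits = sum(1 for u in usages if u["cached_tokens"] > 0)
    prompt_tokens = sum(u["prompt_tokens"] for u in usages)
    cached_tokens = sum(u["cached_tokens"] for u in usages)
    return {
        "hit_rate": hits / len(usages),
        "cached_token_pct": 100.0 * cached_tokens / prompt_tokens,
    }


if __name__ == "__main__":
    # Illustrative data: nine calls reuse only a ~1K shared prefix, one call is cold.
    calls = [{"prompt_tokens": 31_500, "cached_tokens": 1_024}] * 9
    calls.append({"prompt_tokens": 31_500, "cached_tokens": 0})
    print(cache_metrics(calls))
```

On this sample the hit rate is 90% while under 3% of prompt tokens are cached, which is the shape of the trap described above.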

Symptom it fixes: "new customers are 2–3 s slower than repeat customers on the cloud." Move dynamic vars to a trailing context block. +1.7 s avg / +7.6 s P95 recovered on GPT-4.1.

Try it: prefill cost & the cross-call cache estimator

Where does the first dynamic variable sit?

Cloud prefill is ~15 ms per 1,000 prompt tokens. Cross-call reuse dies at the first divergent token: everything after the first dynamic variable re-prefills on every new customer. The exp-7 GPT-4.1 anchors: inline (var at 5.4%) = 1.5% cached / 3,161 ms new-customer TTFT; context block (divergence at 98.8%) = 96.1% cached / 1,439 ms.

[Interactive widget: two sliders, prompt tokens (default 31,500) and position of the first dynamic variable as a percentage of the prompt (default 5.4%). It draws the prompt layout as a cached prefix (reused) followed by the tail that is re-prefilled on a new customer.]
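
The estimator's arithmetic can be sketched as follows. It assumes the cache is token-exact up to the divergence point and that prefill cost is linear at the ~15 ms per 1,000 tokens quoted above; reprefill_estimate is a hypothetical helper.

```python
from typing import Dict

PREFILL_MS_PER_1K_TOKENS = 15.0  # rule of thumb quoted above


def reprefill_estimate(prompt_tokens: int, first_dynamic_pct: float) -> Dict[str, float]:
    """Split a prompt at the first dynamic variable and cost the re-prefilled tail."""
    if not 0.0 <= first_dynamic_pct <= 100.0:
        raise ValueError("first_dynamic_pct must be between 0 and 100")
    cached_prefix = round(prompt_tokens * first_dynamic_pct / 100.0)
    reprefilled = prompt_tokens - cached_prefix
    return {
        "cached_prefix_tokens": cached_prefix,
        "reprefilled_tokens": reprefilled,
        "reprefill_ms": reprefilled * PREFILL_MS_PER_1K_TOKENS / 1000.0,
    }


if __name__ == "__main__":
    print(reprefill_estimate(31_500, 5.4))   # inline variable near the top
    print(reprefill_estimate(31_500, 98.8))  # trailing context block
```

For the exp-7 prompt this gives roughly 447 ms of re-prefill inline versus about 6 ms with the trailing block. The measured exp-7 gap is larger than the linear estimate, so treat the output as a rough guide rather than a prediction.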

Step 39

Cold start is mostly a myth (with one real case) LLM

What it is. Exp 2 ruled out the usual cold-start fears: a new HTTPS client per call costs nothing measurable (−23 ms, noise), and turn 1 was the fastest turn — no provider-level first-message penalty.

The real per-process cost is hidden by the greeting. Most agents use MODEL_GENERATED first messages, and that LLM call during setup warms the TCP connection (and the prefix cache on self-hosted). Of 6 production calls inspected, only the one without that warm path showed a 4.51 s turn-1 TTFT (findings.md:47-59).

Symptom it explains: "first turn is glacial but the rest are fine" usually means a static greeting bypassed the connection/cache warmup, not a provider cold start.

Try it: the cross-call cache-hit race

Two consecutive calls: Customer A, then a brand-new Customer B

Customer A's call warms a shared prefix on the provider. Customer B's prefill reuses that prefix only up to the first divergent token (the red marker). Drag the toggle to move divergence from inline (5%) to a trailing context block (99%) and watch B's prefill replay with a live TTFT counter.

[Interactive widget: a toggle sets the divergence point for the new customer (default 5%, inline). Call A runs cold and warms the prefix; Call B, the new customer, replays from the divergence marker with a live TTFT counter.]

Checkpoint: a cloud agent's prompt embeds the customer's CIF at the top ("Müşteri No: {{data.cif}} …"). Calls for repeat customers are fast; every first-time customer's first turns are 2–3 s slower. Why, and what's the fix?

Cross-call prefix caching dies at the first divergent token. The inline CIF at ~5% of the prompt means a new customer shares only the first ~5% with any cached prefix — the remaining ~95% (30K tokens) re-prefills: +1.7 s avg / +7.6 s P95 measured on GPT-4.1 (exp 7).

Fix: replace inline per-customer variables with static markers and move all dynamic data into a single <context> JSON block at the end of the prompt — restores ~96% token reuse and new-customer TTFT equal to repeat customers (analysis/7-cross-call-cache/cross-call-experiment.py:228-281).
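
The layout change can be sketched as two render functions and a divergence check. This is not the experiment script: the template text, function names and the character-based divergence measure (a proxy for token position) are assumptions made for illustration.

```python
import json

# Hypothetical prompt body; a real prompt would be tens of thousands of tokens.
BODY = "Follow the verification rules before discussing the policy. " * 40
INLINE_TEMPLATE = "Müşteri No: {{data.cif}}\n" + BODY
STATIC_TEMPLATE = "Müşteri No: see the context block.\n" + BODY


def render_inline(data: dict) -> str:
    """Inline substitution: the per-customer value lands near the top."""
    return INLINE_TEMPLATE.replace("{{data.cif}}", data["cif"])


def render_with_context_block(data: dict) -> str:
    """Static prompt first, all dynamic data in one trailing <context> JSON block."""
    context = json.dumps({"data": data}, ensure_ascii=False, sort_keys=True)
    return f"{STATIC_TEMPLATE}\n<context>\n{context}\n</context>"


def divergence_pct(a: str, b: str) -> float:
    """Percent into prompt a of the first character where a and b differ."""
    limit = min(len(a), len(b))
    first_diff = next((i for i in range(limit) if a[i] != b[i]), limit)
    return 100.0 * first_diff / max(len(a), 1)


if __name__ == "__main__":
    cust_a, cust_b = {"cif": "1001"}, {"cif": "2002"}
    print(divergence_pct(render_inline(cust_a), render_inline(cust_b)))
    print(divergence_pct(render_with_context_block(cust_a), render_with_context_block(cust_b)))
```

With inline substitution the two customers' prompts diverge within the first line; with the trailing block they share almost the whole prompt, which is what lets the provider reuse the cached prefix.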

What dominates here: Model choice (an order of magnitude across the table) and prompt layout for cross-call reuse. Priority tier and caching flags are second-order; reasoning defaults are a landmine.

Ask Claude Code: "Compute the cached-token percentage for the last 20 calls of this cloud workspace and flag any with the first dynamic variable in the top 20% of the prompt."