Part 7

The cloud provider path

On cloud agents the serving box is someone else's fleet — prefill is ~15 ms/1K tokens and eviction is rare — so the levers move: model choice, priority tier, and prompt layout for cross-call caching.

Step 34

How model routing actually picks a model LLM

What it is. The chain for debugging "which model answered." The dashboard model catalog (models/providers tables) maps a dashboard-facing model id to a provider plus an upstream model string; the resolve endpoint enriches it with credentials and the context window; the agent's provider switch consumes the result.

freya-dashboard src/db/schema/models.ts → resolve: api/agent/v1/resolve/route.ts:18-91 → switch: base_service.py:355-545

Runtime effect. enrichServiceConfig() attaches catalogModel, baseUrl, credentials and maxContextWindow (resolve route.ts:18-91); the agent's provider switch reads those (base_service.py:355-545). Three silent overrides bend the rule that "the dashboard model picks the model":

Historical footnote (do not cargo-cult): freya-235b-enhanced was a branded alias that rewrote to provider=openai, model=gpt-4.1 + priority tier (added PR #406 2026-03-17, removed PR #803 2026-05-19, catalog-collapsed by the 2026-05-22 models_unify migration). No runtime consumer exists today.

Symptom it explains: "the dashboard says model X but the latency / token rate looks like model Y." Check the three overrides before blaming the catalog.
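One way to act on this symptom is to compare the upstream model string the catalog resolved with the `model` field the provider echoes in its response. The sketch below is a minimal illustration, not the agent's actual code: the helper name `model_mismatch` is hypothetical, and stripping a trailing date suffix is an assumption made to tolerate dated snapshot names.

```python
import re

# Assumption: providers may echo a dated snapshot such as "gpt-4.1-2025-04-14"
# when the request named "gpt-4.1", so a trailing -YYYY-MM-DD is ignored.
_DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def _normalize(name: str) -> str:
    """Lowercase the model name and drop a trailing date suffix, if any."""
    return _DATE_SUFFIX.sub("", name.strip().lower())


def model_mismatch(resolved_model: str, response_model: str) -> bool:
    """Return True when the response came from a different model family
    than the one the resolve step handed to the agent (hypothetical helper)."""
    return _normalize(resolved_model) != _normalize(response_model)


# Hypothetical values: what the catalog resolved vs. what the response echoed.
print(model_mismatch("gpt-4.1", "gpt-4.1-2025-04-14"))  # same family
print(model_mismatch("gpt-4.1", "gpt-4.1-mini"))        # different model
```

A mismatch here points at one of the overrides rather than at the catalog row itself.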

Step 35

The model picker IS the latency dial LLM

What it is. Experiment 1 ran the same ~1.1K-token prompt, the same code, streaming, across 19 models. The spread is an order of magnitude purely from the model pick — geography (TR-hosted vs EU vs US) and reasoning behaviour dominate.

Source: analysis/llm-latency-benchmark/1-model-comparison/findings.md (~1.1K-token prompt, streaming)

| Model | TTFT P50 | P95 | Verdict |
| --- | --- | --- | --- |
| Ubicloud Qwen3-235B (TR) | 82 ms | 243 ms | fastest; server retired 2026-04-24, valid as "same-region self-hosted vLLM" |
| Gemma 4 31B-it (TR self-hosted) | 98 ms | 112 ms | most consistent (StDev 29 ms) |
| Freya 235B (self-hosted EU) | 453 ms | 651 ms | best EU |
| GPT-5-chat | 590 ms | 832 ms | best external |
| GPT-4o | 633 ms | 987 ms | solid baseline |
| GPT-4.1 | 754 ms | 1200 ms | |
| GPT-5.x reasoning (low) | 1700–1950 ms | 3000–3500 ms | not viable for voice |

Same prompt, same code: 82 ms → 2.4 s purely by model pick. Reasoning models are categorically out. Real-agent overhead above raw benchmark TTFT is ~160–350 ms (extraction/routing pre-calls, per-call HTTP client, event-loop contention; findings.md:38-77).
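The overhead figure can be turned into a rough per-model expectation. The sketch below simply adds the ~160–350 ms real-agent overhead to the benchmark P50 values from the table; the function name `expected_agent_ttft` is hypothetical, and the result is an estimate under those two inputs, not a measurement.

```python
# P50 TTFT in ms, copied from the Experiment 1 table.
BENCHMARK_P50_MS = {
    "Gemma 4 31B-it (TR self-hosted)": 98,
    "Freya 235B (self-hosted EU)": 453,
    "GPT-5-chat": 590,
    "GPT-4o": 633,
    "GPT-4.1": 754,
}

# Real-agent overhead above raw benchmark TTFT, in ms (findings.md:38-77).
OVERHEAD_MS = (160, 350)


def expected_agent_ttft(model: str) -> tuple[int, int]:
    """Return the (low, high) expected in-agent P50 TTFT in ms for a model."""
    base = BENCHMARK_P50_MS[model]
    return base + OVERHEAD_MS[0], base + OVERHEAD_MS[1]


for name in BENCHMARK_P50_MS:
    low, high = expected_agent_ttft(name)
    print(f"{name}: {low}-{high} ms")
```

If an agent lands well above the high end of its range, the gap is worth investigating before swapping models.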

Symptom it fixes: "every speaking turn feels ~600 ms slower than the benchmark says." A model swap is the single biggest cloud lever, worth an order of magnitude (82 ms to 2.4 s).

Step 36

The reasoning-effort trap landmine

What it is. Leaving reasoning_effort unset on GPT-5.3 maps to high, not medium.

default (unset = high): P50 2,395 ms (+224% vs GPT-4o)
minimal: P50 994 ms
Source: findings.md:89-97 (2026-03-03)

Runtime effect. Any model swap must pin reasoning to none/minimal and verify it in the response payload. GPT-5.3 also silently ignores the priority tier — the API echoes service_tier: "default" regardless of what you sent.

Symptom it causes: "cloud agent went from snappy to ~2.4 s TTFT after a model swap, nothing else changed." First check reasoning_effort in the request and the echoed service_tier in the response. +1.4 s just from a default.
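The two checks above can be scripted against a logged generation. This is a minimal sketch that assumes the request parameters and the response payload are available as plain dicts; `audit_generation` is a hypothetical helper, and only the `reasoning_effort` and `service_tier` field names come from the text.

```python
def audit_generation(request: dict, response: dict) -> list[str]:
    """Return warnings for one LLM generation (hypothetical helper).

    `request` holds the parameters sent to the provider and `response`
    the payload it returned, both as plain dicts.
    """
    warnings = []

    # An unset reasoning_effort maps to high on GPT-5.3, so treat it as a finding.
    effort = request.get("reasoning_effort")
    if effort is None:
        warnings.append("reasoning_effort unset: GPT-5.3 treats this as high")
    elif effort not in ("none", "minimal"):
        warnings.append(f"reasoning_effort={effort!r}: pin to none/minimal")

    # The provider echoes the tier it actually applied; compare it to what was sent.
    sent_tier = request.get("service_tier")
    echoed_tier = response.get("service_tier")
    if sent_tier is not None and echoed_tier != sent_tier:
        warnings.append(
            f"service_tier sent {sent_tier!r} but echoed {echoed_tier!r}"
        )
    return warnings


# Illustrative payloads, not real logs.
for line in audit_generation(
    {"model": "gpt-5.3", "service_tier": "priority"},
    {"service_tier": "default"},
):
    print(line)
```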

Ask Claude Code: "Pull this call's generation from Langfuse and show me the reasoning_effort sent and the service_tier echoed in the response."

Step 37

Priority processing LLM

What it is. Processing tier (additionalSettings.serviceTier → service_tier), shown only for provider OpenAI in the config panel (llm-config-panel.tsx:226,515-547); consumed in the openai branch only (base_service.py:436-440 → injected into every chat request, pipecat services/openai/base_llm.py:332).

No runtime consumer on Freya-provider or on-prem: vLLM has no priority queue. The tier is OpenAI-only; on a self-hosted config it is read by nothing.

Symptom it fixes: "P50 is fine but the tail is ugly." Priority buys tail control on small prompts, not raw first-token speed on big ones.
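The OpenAI-only behaviour can be pictured as a guard around the request parameters. The sketch below is an illustration of that rule, not the code in base_service.py: `build_chat_kwargs` is a hypothetical function, and the settings are assumed to arrive as a plain dict keyed by `serviceTier`.

```python
def build_chat_kwargs(provider: str, model: str, additional_settings: dict) -> dict:
    """Build chat request parameters, forwarding the tier only for OpenAI.

    Hypothetical helper: on any other provider the configured tier is dropped,
    which mirrors the "read by nothing" behaviour described above.
    """
    kwargs = {"model": model}
    tier = additional_settings.get("serviceTier")
    if provider == "openai" and tier:
        kwargs["service_tier"] = tier
    return kwargs


# Same settings, two providers: only the OpenAI request carries the tier.
settings = {"serviceTier": "priority"}
print(build_chat_kwargs("openai", "gpt-4.1", settings))
print(build_chat_kwargs("freya", "freya-235b", settings))
```

The model name in the second call is a placeholder; any non-OpenAI provider value takes the same path.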

Try it — the model picker as a latency dial

Turn timeline: pick a model, toggle priority

Numbers are Experiment 1 P50/P95 TTFT (findings.md). Priority is an OpenAI-only tier; for non-OpenAI models and GPT-5.3 the checkbox is disabled and the request is silently ignored.

[Interactive chart: one speaking turn at P50, priority applied where valid, split into agent overhead (~0.25 s) and LLM TTFT.]

Step 38

Provider prompt caching: what it does and does not buy LLM

What it is. Four experiments, each killing an assumption about OpenAI prompt caching.

  1. Below 1,024 prompt tokens caching never triggers — 0/50 hits on an 809-token prompt (exp 2, 2026-02-28).
  2. Cache hits do not cut TTFT on OpenAI — 93% hit rate at 22,752 tokens, zero TTFT improvement (exp 4). It saves cost, not compute time. Do not promise TTFT wins from cached_tokens.
  3. prompt_cache_key made things worse — +~500 ms P50 on Freya's workload despite equal-or-better cache rates; the plain 1-token warmup was the best condition (exp 6, GPT-4.1, ~36K-token real prompt, 2026-03-16). So production sends no prompt_cache_key (a grep over pipecat-agent src/ finds nothing) — warmups are the chosen mechanism.
  4. Cross-call caching is governed by variable POSITION (exp 7, real 31.5K-token Allianz BES prompt, 2026-03-18/19). With inline {{variables}}, the earliest per-customer var sits at 5.4% into the prompt — every new customer pays a near-full cold prefill: GPT-4.1 inline new-customer TTFT avg 3,161 ms / 1.5% tokens cached, P95 9.8 s. Moving all dynamic data into a trailing <context> block (divergence at 98.8%) restored 96.1% token reuse and same-customer TTFT (1,439 ms avg). This is the single biggest cloud-LLM cache lever Freya has — and it is an authoring convention, not a platform feature (no code enforces the layout; substitution is still inline).
The metric trap: gpt-4o inline showed a 93% "cache hit rate" with only 3% of tokens cached. A "hit" is just cached_tokens > 0, and the first ~1K shared tokens always match. Track cached-token percentage, never the hit-rate flag.
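The difference between the two metrics is easy to see in code. The sketch below computes both from a list of usage records, assuming each record follows the OpenAI Chat Completions usage shape (`prompt_tokens` and `prompt_tokens_details.cached_tokens`); the function name `cache_stats` and the sample numbers are illustrative only.

```python
def cache_stats(usages: list[dict]) -> dict:
    """Compute the hit rate and the cached-token percentage over usage records."""
    hits = 0
    prompt_total = 0
    cached_total = 0
    for usage in usages:
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens", 0)
        prompt_total += usage.get("prompt_tokens", 0)
        cached_total += cached
        if cached > 0:  # this is all a "hit" means
            hits += 1
    count = len(usages)
    return {
        "hit_rate": hits / count if count else 0.0,
        "cached_token_pct": (
            100.0 * cached_total / prompt_total if prompt_total else 0.0
        ),
    }


# Illustrative records: every call "hits", yet only the first ~1K tokens match.
sample = [
    {"prompt_tokens": 31500, "prompt_tokens_details": {"cached_tokens": 1024}},
    {"prompt_tokens": 31500, "prompt_tokens_details": {"cached_tokens": 1024}},
]
print(cache_stats(sample))
```

In this made-up sample the hit rate is 100% while only about 3% of prompt tokens were served from cache, which is the shape of the trap described above.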

Symptom it fixes: "new customers are 2–3 s slower than repeat customers on the cloud." Move dynamic vars to a trailing context block. +1.7 s avg / +7.6 s P95 recovered on GPT-4.1.
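The layout convention can be sketched as a small render step: keep the prompt body fully static and append per-customer data as one trailing `<context>` block. This is a minimal illustration under assumptions, since substitution in the platform is still inline; `render_prompt`, the sample prompt text, and the customer fields are all hypothetical.

```python
import json
import os

# Hypothetical static prompt body: no per-customer values appear in it.
STATIC_PROMPT = (
    "You are a support agent.\n"
    "Customer details are provided in the context block at the end.\n"
)


def render_prompt(static_prompt: str, customer: dict) -> str:
    """Append all dynamic data as a single trailing <context> JSON block."""
    context = json.dumps(customer, ensure_ascii=False, sort_keys=True)
    return f"{static_prompt}\n<context>\n{context}\n</context>"


prompt_a = render_prompt(STATIC_PROMPT, {"cif": "1001", "name": "A"})
prompt_b = render_prompt(STATIC_PROMPT, {"cif": "2002", "name": "B"})

# os.path.commonprefix compares strings character by character.
shared = os.path.commonprefix([prompt_a, prompt_b])
print(f"shared prefix: {len(shared)} of {len(prompt_b)} characters")
```

Because the two prompts only diverge inside the context block, the whole static body stays shareable between customers. Character counts are used here as a stand-in for token counts.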

Try it — prefill cost & the cross-call cache estimator

Where does the first dynamic variable sit?

Cloud prefill is ~15 ms per 1,000 prompt tokens. Cross-call reuse dies at the first divergent token — everything after the first dynamic variable re-prefills on every new customer. The exp-7 GPT-4.1 anchors: inline (var at 5.4%) = 1.5% cached / 3,161 ms new-customer TTFT; context block (divergence at 98.8%) = 96.1% cached / 1,439 ms.
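The arithmetic behind the estimator can be written out directly. The sketch below applies the ~15 ms per 1,000 tokens figure to the tokens that sit after the first divergent position; `reprefill_estimate` is a hypothetical helper, and the result is an idealized prefill-only estimate, so the measured exp-7 numbers remain the reference.

```python
PREFILL_MS_PER_1K = 15.0  # cloud prefill cost quoted above


def reprefill_estimate(prompt_tokens: int, divergence_pct: float) -> tuple[int, float]:
    """Return (tokens re-prefilled, estimated prefill ms) for a new customer.

    divergence_pct is how far into the prompt the first dynamic value sits,
    as a percentage; everything after that point is assumed to be recomputed.
    """
    reprefilled = round(prompt_tokens * (1 - divergence_pct / 100))
    return reprefilled, reprefilled * PREFILL_MS_PER_1K / 1000


# The two exp-7 layouts on the 31,500-token prompt.
for label, pct in (("inline", 5.4), ("context block", 98.8)):
    tokens, ms = reprefill_estimate(31_500, pct)
    print(f"{label}: {tokens} tokens re-prefilled, ~{ms:.1f} ms prefill")
```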

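[Interactive estimator: sliders for prompt size (default 31,500 tokens) and position of the first dynamic variable (default 5.4%); the chart shows the prompt layout as cached prefix (reused) vs tail re-prefilled on a new customer.]

Step 39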

Cold start is mostly a myth (with one real case) LLM

What it is. Exp 2 ruled out the usual cold-start fears: a new HTTPS client per call costs nothing measurable (−23 ms, noise), and turn 1 was the fastest turn — no provider-level first-message penalty.

The real per-process cost is hidden by the greeting. Most agents use MODEL_GENERATED first messages, and that LLM call during setup warms the TCP connection (and the prefix cache on self-hosted). Of 6 production calls inspected, only the one without that warm path showed a 4.51 s turn-1 TTFT (findings.md:47-59).

Symptom it explains: "first turn is glacial but the rest are fine" usually means a static greeting bypassed the connection/cache warmup, not a provider cold start.
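A quick screen for this symptom is to compare turn 1 against the rest of the call. The sketch below flags a call when its first-turn TTFT is well above the median of later turns; `first_turn_is_outlier` and the 2x factor are assumptions chosen for illustration, and the later-turn values in the example are made up.

```python
from statistics import median


def first_turn_is_outlier(ttfts_ms: list[float], factor: float = 2.0) -> bool:
    """Return True when turn 1 is more than `factor` times the median of later turns.

    Hypothetical heuristic: the factor is an arbitrary starting point, not a
    tuned threshold.
    """
    if len(ttfts_ms) < 2:
        return False  # nothing to compare against
    return ttfts_ms[0] > factor * median(ttfts_ms[1:])


# A 4.51 s first turn followed by illustrative steady-state turns.
print(first_turn_is_outlier([4510, 620, 640, 600]))
# A call whose greeting already warmed the connection.
print(first_turn_is_outlier([580, 620, 640, 600]))
```

A flagged call is a prompt to check whether its greeting was static, not proof of a provider cold start.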

Try it — the cross-call cache-hit race

Two consecutive calls: Customer A, then a brand-new Customer B

Customer A's call warms a shared prefix on the provider. Customer B's prefill reuses that prefix only up to the first divergent token (the red marker). Drag the toggle to move divergence from inline (5%) to a trailing context block (99%) and watch B's prefill replay with a live TTFT counter.

[Interactive replay: Call A (cold, warms the prefix) followed by Call B (new customer, replays from the divergence marker), with a live TTFT counter; the toggle starts at 5% (inline).]

Checkpoint: a cloud agent's prompt embeds the customer's CIF at the top ("Müşteri No: {{data.cif}} …"). Calls for repeat customers are fast; every first-time customer's first turns are 2–3 s slower. Why, and what's the fix?

Cross-call prefix caching dies at the first divergent token. The inline CIF at ~5% of the prompt means a new customer shares only the first ~5% with any cached prefix — the remaining ~95% (30K tokens) re-prefills: +1.7 s avg / +7.6 s P95 measured on GPT-4.1 (exp 7).

Fix: replace inline per-customer variables with static markers and move all dynamic data into a single <context> JSON block at the end of the prompt — restores ~96% token reuse and new-customer TTFT equal to repeat customers (analysis/7-cross-call-cache/cross-call-experiment.py:228-281).
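Since no code enforces the layout, a small lint over prompt templates can catch an early variable before it ships. The sketch below finds the first `{{...}}` placeholder and reports how far into the template it sits; the helper names and the 20% threshold are hypothetical, and character offsets are used as a proxy for token offsets.

```python
import re
from typing import Optional

# Matches inline placeholders such as {{data.cif}}.
VAR_PATTERN = re.compile(r"\{\{.*?\}\}")


def first_variable_pct(template: str) -> Optional[float]:
    """Return the position of the first placeholder as a percentage of the
    template length, or None when the template has no placeholders."""
    match = VAR_PATTERN.search(template)
    if match is None:
        return None
    return 100.0 * match.start() / len(template)


def has_early_variable(template: str, threshold_pct: float = 20.0) -> bool:
    """Flag templates whose first placeholder sits in the top of the prompt."""
    position = first_variable_pct(template)
    return position is not None and position < threshold_pct


# Illustrative templates: one with the CIF near the top, one fully static.
early = "Müşteri No: {{data.cif}}\n" + "static instructions\n" * 50
static = "static instructions\n" * 50
print(has_early_variable(early), has_early_variable(static))
```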

What dominates here: model choice (an order of magnitude across the table) and prompt layout for cross-call reuse. Priority tier and caching flags are second-order; reasoning defaults are a landmine.

Ask Claude Code: "Compute the cached-token percentage for the last 20 calls of this cloud workspace and flag any with the first dynamic variable in the top 20% of the prompt."