How model routing actually picks a model LLM
What it is. The chain for debugging "which model answered." The dashboard
model catalog (models/providers tables) maps a
dashboard-facing model id to a provider plus an upstream model string; the resolve
endpoint enriches it with credentials and the context window; the agent's provider
switch consumes the result.
src/db/schema/models.ts
resolve api/agent/v1/resolve/route.ts:18-91
switch base_service.py:355-545
Runtime effect. enrichServiceConfig() attaches
catalogModel, baseUrl, credentials and
maxContextWindow (resolve route.ts:18-91); the agent's
provider switch reads those (base_service.py:355-545). Three silent
overrides bend the rule that "the dashboard model picks the model":
- On-prem master override: is_on_premise forces env LLM_URL/LLM_MODEL/LLM_API_KEY regardless of dashboard config (base_service.py:422-434). On an on-prem box the dashboard model choice is decorative.
- Simulation override: sims force non-Freya providers to the self-hosted model (base_service.py:1131-1150), so sim latency measures the self-hosted stack, not the configured cloud model.
- Fallback LLM race: with Fallback LLM (additionalSettings.fallbackLlmEnabled) on, a Gemini 2.5 Flash backup races the primary via ParallelPipeline; it is released only if the primary streams nothing within ttfbTimeoutSeconds (default 1.0 s, boot_steps.py:1965-1998). This converts provider P99 stalls into a bounded worst case of ~1 s plus Gemini TTFT. Skipped on-prem.
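The fallback race in the last bullet can be sketched in plain asyncio. This is a minimal illustration under stated assumptions, not the ParallelPipeline implementation: race_with_fallback, fake_stream and the sample chunks are hypothetical names, and the sketch collects chunks into a list instead of streaming them onward.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

_DONE = object()  # sentinel: the stream has ended


async def _pump(stream: AsyncIterator[str], queue: asyncio.Queue) -> None:
    """Copy chunks from a stream into a queue, then mark the end."""
    try:
        async for chunk in stream:
            queue.put_nowait(chunk)
    finally:
        queue.put_nowait(_DONE)


async def race_with_fallback(
    primary: AsyncIterator[str],
    fallback: AsyncIterator[str],
    ttfb_timeout_s: float = 1.0,
) -> Tuple[str, List[str]]:
    """Start both streams; release the fallback only if the primary
    produces no chunk within ttfb_timeout_s."""
    p_q: asyncio.Queue = asyncio.Queue()
    f_q: asyncio.Queue = asyncio.Queue()
    p_task = asyncio.create_task(_pump(primary, p_q))
    f_task = asyncio.create_task(_pump(fallback, f_q))

    try:
        first = await asyncio.wait_for(p_q.get(), timeout=ttfb_timeout_s)
    except asyncio.TimeoutError:
        first = _DONE  # primary streamed nothing in time

    if first is not _DONE:
        source, queue, loser = "primary", p_q, f_task
    else:
        source, queue, loser = "fallback", f_q, p_task

    # Stop the stream that lost the race before draining the winner.
    loser.cancel()
    await asyncio.gather(loser, return_exceptions=True)

    if source == "fallback":
        first = await f_q.get()

    chunks: List[str] = []
    while first is not _DONE:
        chunks.append(first)
        first = await queue.get()
    return source, chunks


async def fake_stream(delay_s: float, chunks: List[str]):
    """Hypothetical stand-in for an LLM stream with a given time to first chunk."""
    await asyncio.sleep(delay_s)
    for chunk in chunks:
        yield chunk
```

With a stalled primary the worst case is the timeout plus the fallback's own time to first chunk, which is the bound described above.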
freya-235b-enhanced was a branded alias that rewrote to
provider=openai, model=gpt-4.1 + priority tier (added PR #406
2026-03-17, removed PR #803 2026-05-19, catalog-collapsed by the 2026-05-22
models_unify migration). No runtime consumer exists today.
Symptom it explains: "the dashboard says model X but the latency / token rate looks like model Y." Check the three overrides before blaming the catalog.
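As a debugging aid, the three override checks can be written down as a small checklist function. This is a sketch only: active_overrides, its parameters and the "freya" provider id are hypothetical and do not mirror the actual logic in base_service.py; the rules encoded are just the ones listed in this section.

```python
from typing import List

FREYA_PROVIDER = "freya"  # hypothetical provider id for the self-hosted model


def active_overrides(
    *,
    is_on_premise: bool,
    is_simulation: bool,
    provider: str,
    fallback_llm_enabled: bool,
) -> List[str]:
    """Return which silent overrides could explain 'model X configured, model Y answered'."""
    hits: List[str] = []
    if is_on_premise:
        # env LLM_URL / LLM_MODEL / LLM_API_KEY replace the dashboard choice
        hits.append("on_prem")
    if is_simulation and provider != FREYA_PROVIDER:
        # sims force non-Freya providers to the self-hosted model
        hits.append("simulation")
    if fallback_llm_enabled and not is_on_premise:
        # the backup may answer if the primary stalls; skipped on-prem
        hits.append("fallback_race")
    return hits
```

An empty list means none of the three overrides applies and the catalog mapping is the next thing to inspect.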
The model picker IS the latency dial LLM
What it is. Experiment 1 ran the same ~1.1K-token prompt, the same code, streaming, across 19 models. The spread is an order of magnitude purely from the model pick — geography (TR-hosted vs EU vs US) and reasoning behaviour dominate.
analysis/llm-latency-benchmark/1-model-comparison/findings.md
~1.1K-token prompt, streaming
| Model | TTFT P50 | P95 | Verdict |
|---|---|---|---|
| Ubicloud Qwen3-235B (TR) | 82 ms | 243 ms | fastest; server retired 2026-04-24, valid as "same-region self-hosted vLLM" |
| Gemma 4 31B-it (TR self-hosted) | 98 ms | 112 ms | most consistent (StDev 29 ms) |
| Freya 235B (self-hosted EU) | 453 ms | 651 ms | best EU |
| GPT-5-chat | 590 ms | 832 ms | best external |
| GPT-4o | 633 ms | 987 ms | solid baseline |
| GPT-4.1 | 754 ms | 1200 ms | — |
| GPT-5.x reasoning (low) | 1700–1950 ms | 3000–3500 ms | not viable for voice |
Same prompt, same code: 82 ms → 2.4 s purely by model pick.
Reasoning models are categorically out. Real-agent overhead above raw benchmark TTFT is
~160–350 ms (extraction/routing pre-calls, per-call HTTP client, event-loop
contention; findings.md:38-77).
Symptom it fixes: "every speaking turn feels ~600 ms slower than the benchmark says." A model swap is the single biggest cloud lever: the spread is an order of magnitude, from 82 ms to 2.4 s.
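The gap between benchmark and in-agent TTFT is simple arithmetic. A minimal sketch, assuming the ~160–350 ms overhead range quoted above simply adds to the benchmark P50; the function name is illustrative.

```python
from typing import Tuple

# Real-agent overhead above raw benchmark TTFT, as quoted in this section (ms).
OVERHEAD_MS = (160, 350)


def real_agent_ttft_ms(benchmark_p50_ms: float) -> Tuple[float, float]:
    """Expected in-agent TTFT range for a given benchmark P50."""
    low, high = OVERHEAD_MS
    return (benchmark_p50_ms + low, benchmark_p50_ms + high)


# GPT-4o, benchmark P50 633 ms -> (793, 983)
gpt_4o_range = real_agent_ttft_ms(633)
```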
The reasoning-effort trap landmine
What it is. Leaving reasoning_effort unset on GPT-5.3 maps to
high, not medium.
findings.md:89-97 (2026-03-03)
Runtime effect. Any model swap must pin reasoning to none/minimal
and verify it in the response payload. GPT-5.3 also silently ignores the priority
tier — the API echoes service_tier: "default" regardless of what you
sent.
Symptom it causes: "cloud agent went from snappy to ~2.4 s
TTFT after a model swap, nothing else changed." First check
reasoning_effort in the request and the echoed service_tier in
the response. +1.4 s just from a default.
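The two checks can be sketched as an audit over the request and response payloads. This is an illustrative helper working on plain dicts, with no network call; audit_llm_call is a hypothetical name, and the allowed effort values are the ones this section recommends.

```python
from typing import Any, Dict, List

ALLOWED_EFFORTS = {"none", "minimal"}  # values recommended above for voice


def audit_llm_call(request: Dict[str, Any], response: Dict[str, Any]) -> List[str]:
    """Flag the two traps: an unpinned reasoning effort and an ignored service tier."""
    problems: List[str] = []

    effort = request.get("reasoning_effort")
    if effort is None:
        problems.append("reasoning_effort is unset; the provider default applies")
    elif effort not in ALLOWED_EFFORTS:
        problems.append(f"reasoning_effort={effort!r} is too high for voice")

    sent_tier = request.get("service_tier")
    echoed_tier = response.get("service_tier")
    if sent_tier is not None and echoed_tier != sent_tier:
        problems.append(
            f"service_tier {sent_tier!r} was sent but {echoed_tier!r} was echoed"
        )
    return problems
```

Running it on a swapped-in model's first request/response pair makes both failure modes visible before they show up as latency.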
Priority processing LLM
What it is. Processing tier
(additionalSettings.serviceTier → service_tier), shown
only for provider OpenAI in the config panel
(llm-config-panel.tsx:226,515-547); consumed in the openai branch only
(base_service.py:436-440 → injected into every chat request,
pipecat services/openai/base_llm.py:332).
- Measured effect (exp 3, GPT-4o, 100 calls, 2026-02-28): TTFT P50 −59 ms, P95 −147 ms, StDev halved (154 → 73 ms), throughput +58%, cost premium 1.7×. The variance / tail win is the real value — callers feel P95.
- The catch (exp 4): on 22K+-token prompts the TTFT edge evaporates
(+24 ms, noise) while the P95 and throughput wins remain. Support is per-model:
always check the echoed
service_tier.
Symptom it fixes: "P50 is fine but the tail is ugly." Priority buys tail control on small prompts, not raw first-token speed on big ones.
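Tail metrics like the ones above can be computed from raw TTFT samples with the standard library. A minimal sketch; summarize_ttft and tier_delta are hypothetical helpers, not the benchmark's own code, and the percentile method (inclusive interpolation) is an assumption.

```python
import statistics
from typing import Dict, Sequence


def summarize_ttft(samples_ms: Sequence[float]) -> Dict[str, float]:
    """P50, P95 and sample standard deviation of TTFT samples in ms."""
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": pct[49],
        "p95": pct[94],
        "stdev": statistics.stdev(samples_ms),
    }


def tier_delta(default_ms: Sequence[float], priority_ms: Sequence[float]) -> Dict[str, float]:
    """Priority minus default for each metric; negative means priority is better."""
    base, prio = summarize_ttft(default_ms), summarize_ttft(priority_ms)
    return {name: prio[name] - base[name] for name in base}
```

Comparing the P95 and stdev deltas, not only P50, is what surfaces the tail win described above.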
Try it — the model picker as a latency dial
Numbers are Experiment 1 P50/P95 TTFT (findings.md). Priority is an
OpenAI-only tier; for non-OpenAI models and GPT-5.3 the checkbox is disabled and the
request is silently ignored.
Provider prompt caching: what it does and does not buy LLM
What it is. Four experiments, each killing an assumption about OpenAI prompt caching.
- Below 1,024 prompt tokens caching never triggers — 0/50 hits on an 809-token prompt (exp 2, 2026-02-28).
- Cache hits do not cut TTFT on OpenAI: 93% hit rate at 22,752 tokens, zero TTFT improvement (exp 4). It saves cost, not compute time. Do not promise TTFT wins from cached_tokens.
- prompt_cache_key made things worse: +~500 ms P50 on Freya's workload despite equal-or-better cache rates; the plain 1-token warmup was the best condition (exp 6, GPT-4.1, ~36K-token real prompt, 2026-03-16). So production sends no prompt_cache_key (a grep over pipecat-agent src/ finds nothing); warmups are the chosen mechanism.
- Cross-call caching is governed by variable POSITION (exp 7, real 31.5K-token Allianz BES prompt, 2026-03-18/19). With inline {{variables}}, the earliest per-customer var sits at 5.4% into the prompt, so every new customer pays a near-full cold prefill: GPT-4.1 inline new-customer TTFT avg 3,161 ms / 1.5% tokens cached, P95 9.8 s. Moving all dynamic data into a trailing <context> block (divergence at 98.8%) restored 96.1% token reuse and same-customer TTFT (1,439 ms avg). This is the single biggest cloud-LLM cache lever Freya has, and it is an authoring convention, not a platform feature (no code enforces the layout; substitution is still inline).
A call counts as a cache hit whenever cached_tokens > 0, and the first ~1K shared tokens always match. Track cached-token percentage, never the hit-rate flag.
Symptom it fixes: "new customers are 2–3 s slower than repeat customers on the cloud." Move dynamic vars to a trailing context block. +1.7 s avg / +7.6 s P95 recovered on GPT-4.1.
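Tracking cached-token percentage instead of a hit flag can be sketched as below. The field names (prompt_tokens, prompt_tokens_details.cached_tokens) follow the OpenAI chat completions usage object; the function names and the sample numbers are illustrative, not measurements.

```python
from typing import Any, Dict


def cached_token_pct(usage: Dict[str, Any]) -> float:
    """Share of prompt tokens served from the provider cache, in percent."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return 100.0 * cached / prompt_tokens if prompt_tokens else 0.0


def is_hit(usage: Dict[str, Any]) -> bool:
    """The misleading flag: true as soon as any token at all was cached."""
    return cached_token_pct(usage) > 0


# Illustrative usage object: a 'hit' by the flag, yet almost nothing was reused.
sample_usage = {"prompt_tokens": 31500, "prompt_tokens_details": {"cached_tokens": 1024}}
```

The sample counts as a hit while reusing only about 3% of the prompt, which is why the percentage is the number worth logging.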
Try it — prefill cost & the cross-call cache estimator
Cloud prefill is ~15 ms per 1,000 prompt tokens. Cross-call reuse dies at the first divergent token — everything after the first dynamic variable re-prefills on every new customer. The exp-7 GPT-4.1 anchors: inline (var at 5.4%) = 1.5% cached / 3,161 ms new-customer TTFT; context block (divergence at 98.8%) = 96.1% cached / 1,439 ms.
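The rule of thumb above can be turned into a rough estimator. This sketch uses only the ~15 ms per 1,000 tokens rate and the divergence position; it is not calibrated against the measured exp-7 TTFT anchors, and the function names are hypothetical.

```python
PREFILL_MS_PER_1K_TOKENS = 15.0  # rule-of-thumb rate quoted above


def reprefill_tokens(total_tokens: int, divergence_fraction: float) -> int:
    """Tokens after the first divergent position, which a new customer re-prefills."""
    return round(total_tokens * (1.0 - divergence_fraction))


def prefill_ms(tokens: int) -> float:
    """Estimated prefill time for a number of uncached tokens."""
    return tokens / 1000.0 * PREFILL_MS_PER_1K_TOKENS


# 31.5K-token prompt: inline var at 5.4% vs. context block diverging at 98.8%
inline_tokens = reprefill_tokens(31_500, 0.054)    # 29799 tokens
trailing_tokens = reprefill_tokens(31_500, 0.988)  # 378 tokens
```

The useful output is the token count: moving the divergence point from 5.4% to 98.8% shrinks the re-prefilled span from roughly 29.8K tokens to a few hundred.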
Cold start is mostly a myth (with one real case) LLM
What it is. Exp 2 ruled out the usual cold-start fears: a new HTTPS client per call costs nothing measurable (−23 ms, noise), and turn 1 was the fastest turn — no provider-level first-message penalty.
The real per-process cost is hidden by the greeting. Most agents use
MODEL_GENERATED first messages, and that LLM call during setup warms the
TCP connection (and the prefix cache on self-hosted). Of 6 production calls inspected,
only the one without that warm path showed a 4.51 s turn-1 TTFT
(findings.md:47-59).
Symptom it explains: "first turn is glacial but the rest are fine" usually means a static greeting bypassed the connection/cache warmup, not a provider cold start.
Try it — the cross-call cache-hit race
Customer A's call warms a shared prefix on the provider. Customer B's prefill reuses that prefix only up to the first divergent token (the red marker). Drag the toggle to move divergence from inline (5%) to a trailing context block (99%) and watch B's prefill replay with a live TTFT counter.
Checkpoint: a cloud agent's prompt embeds the customer's CIF at the top ("Müşteri No: {{data.cif}} …"). Calls for repeat customers are fast; every first-time customer's first turns are 2–3 s slower. Why, and what's the fix?
Cross-call prefix caching dies at the first divergent token. The inline CIF at ~5% of the prompt means a new customer shares only the first ~5% with any cached prefix — the remaining ~95% (30K tokens) re-prefills: +1.7 s avg / +7.6 s P95 measured on GPT-4.1 (exp 7).
Fix: replace inline per-customer variables with static markers and move all
dynamic data into a single <context> JSON block at the
end of the prompt — restores ~96% token reuse and new-customer TTFT equal
to repeat customers (analysis/7-cross-call-cache/cross-call-experiment.py:228-281).
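The layout change in the fix can be illustrated with a toy prompt. This is a sketch under stated assumptions: the template text, function names and JSON shape are hypothetical, and the shared prefix is measured in characters as a stand-in for tokens.

```python
import json

# Stand-in for a long static prompt body (hypothetical content).
STATIC_BODY = "Follow the policy below when answering the caller.\n" * 200


def render_inline(cif: str) -> str:
    """Per-customer value near the top: prompts diverge almost immediately."""
    return f"Customer No: {cif}\n{STATIC_BODY}"


def render_trailing_context(cif: str) -> str:
    """Static marker up top, all dynamic data in one trailing <context> block."""
    context = json.dumps({"cif": cif})
    return f"Customer No: see context.cif\n{STATIC_BODY}<context>{context}</context>"


def shared_prefix_fraction(a: str, b: str) -> float:
    """Fraction of the longer prompt that two prompts share as a common prefix."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


inline_share = shared_prefix_fraction(render_inline("1001"), render_inline("2002"))
trailing_share = shared_prefix_fraction(
    render_trailing_context("1001"), render_trailing_context("2002")
)
```

For two different customers the inline layout shares only the opening label, while the trailing-context layout shares nearly the whole prompt, which is the property provider prefix caching depends on.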