From "the prefill is slow" to "the cache can't hold it" LLM
Part 5 explained why a cold prefill is expensive and how the prefix cache and warmups hide it. This part answers the follow-up: on the KKB on-prem box, can the cache physically hold every concurrent call's prompt? When it can't, the latency you measured is not a model-speed problem and not a network problem — it is a capacity problem, and the math tells you exactly which lever fixes it.
The KV-cache capacity formulas LLM
What it is. Three formulas decide whether a prompt fits in the prefix cache and survives until the call's next turn. They come straight from the latency-analyzer reference.
KV cache total tokens = num_gpu_blocks × block_size
prompts cacheable = KV total tokens ÷ per-request prompt tokens
prompt % of context = prompt tokens ÷ max_model_len
The thrash condition. If prompts_cacheable < max_num_seqs, the cache cannot hold every admitted call's prefix. The LRU policy then thrashes: a call's prefix can be evicted before that call's next turn, forcing a cold prefill mid-conversation.
- num_gpu_blocks — how many fixed-size KV blocks vLLM carved out of leftover VRAM after loading weights.
- block_size — tokens per block (16 on KKB).
- max_num_seqs — admission control: how many sequences vLLM will run concurrently. This is the "concurrent load" the cache must keep up with.
Symptom it predicts: when the ratio prompts_cacheable ÷ max_num_seqs drops below 1, random calls pay +2.6–4.3 s cold turns that the cumulative hit-rate metric never reveals.
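The three formulas and the thrash condition can be written as a few small functions. This is a minimal sketch: the function names are illustrative, and the numbers in the example are toy values chosen for easy arithmetic, not KKB measurements.

```python
def kv_total_tokens(num_gpu_blocks: int, block_size: int) -> int:
    """KV cache total tokens = num_gpu_blocks x block_size."""
    return num_gpu_blocks * block_size


def prompts_cacheable(kv_tokens: int, prompt_tokens: int) -> float:
    """How many prompts of this size fit in the KV pool at once."""
    return kv_tokens / prompt_tokens


def prompt_pct_of_context(prompt_tokens: int, max_model_len: int) -> float:
    """Share of the context window the prompt alone consumes, in percent."""
    return 100.0 * prompt_tokens / max_model_len


def will_thrash(kv_tokens: int, prompt_tokens: int, max_num_seqs: int) -> bool:
    """Thrash condition: fewer prompts fit than sequences are admitted."""
    return prompts_cacheable(kv_tokens, prompt_tokens) < max_num_seqs


# Toy values for illustration only (NOT the KKB config).
pool = kv_total_tokens(num_gpu_blocks=1_000, block_size=16)   # 16,000 tokens
fits = prompts_cacheable(pool, prompt_tokens=4_000)           # 4.0 prompts
thrash = will_thrash(pool, prompt_tokens=4_000, max_num_seqs=8)
print(pool, fits, thrash)
```

With these toy values only four prompts fit while eight sequences are admitted, so the condition fires. Halving the prompt to 2,000 tokens would bring the ratio back to exactly 1.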
KKB live numbers and where the flags live LLM
What it is. The capacity formulas, instantiated with the real KKB serving config confirmed via live /metrics during the case study.
Confirmed config. google/gemma-4-31B-it, max_model_len=32,768, TP=2, max_num_seqs=32, gpu_memory_utilization=0.9, prefix caching ON, cache_dtype=auto (bf16). FP8 KV is NOT enabled in any Freya deploy — that is free capacity left on the table.
Where the flags live. freya-onprem kkb/docker-compose.yml:657–689 (the freya-llm service). The GPU layout comment at :643–651 spells out the 4×H100 split: GPU 0+1 = LLM (TP=2), GPU 2 = TTS, GPU 3 = STT+NC. gpu_memory_utilization is the master capacity knob: vLLM claims that VRAM fraction, loads weights (~58.9 GB for Gemma 31B bf16), and everything left becomes KV cache. The generic start script drops it to 0.70 when sharing a card (gpu/llm/gemma-4-31b/start.sh:18–21).
| Prompt size | Prompts cacheable | Rough cold prefill | % of 32K context |
|---|---|---|---|
| 30K (Anadolu anket) | ~22 | ~5 s | 92% |
| 15.5K (Garanti case) | ~44 | ~2.6–4.3 s | 47% |
| 10K | ~68 | ~1.7 s | 31% |
| 7.5K (halved Garanti prompt) | ~88 | ~1.8–2.2 s cold turn | 23% |
The structural guarantee at 30K: only ~22 prompts fit but max_num_seqs=32 calls are admitted — eviction is built in under campaign load. The compose file's own inline comment says "reduce to 4-8" (kkb/docker-compose.yml:680–681), not yet applied.
Try it: KV-cache capacity calculator
Move the knobs and watch the thrash condition flip. The presets snap you to the real KKB config, the same config with FP8 KV enabled, and the halved-prompt Garanti fix.
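The same presets can be approximated offline. The sketch below assumes a pool of 42,500 blocks, which is a hypothetical value back-derived from the table's ~44 prompts at 15.5K; the real num_gpu_blocks has to be read from the live box. It also takes "FP8 KV doubles the pool's token capacity" as a flat 2× scale factor.

```python
BLOCK_SIZE = 16            # tokens per block on KKB (from the text)
MAX_NUM_SEQS = 32          # admitted concurrency on KKB (from the text)
NUM_GPU_BLOCKS = 42_500    # HYPOTHETICAL: back-derived from the table, not measured


def cacheable(prompt_tokens: int, kv_capacity_scale: float = 1.0) -> float:
    """Prompts that fit in the pool; scale=2.0 models FP8 KV as a flat doubling."""
    pool_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE * kv_capacity_scale
    return pool_tokens / prompt_tokens


# (prompt tokens, KV capacity scale) per preset.
PRESETS = {
    "30K anket, bf16 KV": (30_000, 1.0),
    "30K anket, FP8 KV": (30_000, 2.0),
    "15.5K Garanti, bf16 KV": (15_500, 1.0),
    "7.5K halved Garanti, bf16 KV": (7_500, 1.0),
}

rows = []
for name, (tokens, scale) in PRESETS.items():
    n = cacheable(tokens, scale)
    thrashes = n < MAX_NUM_SEQS
    rows.append((name, n, thrashes))
    print(f"{name:<30} ~{n:5.1f} prompts  {'THRASH' if thrashes else 'ok'}")
```

Under these assumptions only the 30K bf16 preset falls below max_num_seqs=32. The exact counts differ slightly from the table because the pool size here is a rounded guess.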
The worked example: call eb4a83f7's eviction curve LLM
What it is. Garanti tr-app call eb4a83f7 (2026-06-05), speaking node borc_sorgulama, 15.5K-token prompt. The single most teachable row in the corpus lives here.
| Turn | +start | Latency | in_tok | Gap since prev | State |
|---|---|---|---|---|---|
| 1 | 0.0 s | 2.58 s | 14,575 | – | COLD |
| 2 | 7.0 s | 0.77 s | 14,626 | 7 s | warm |
| 3 | 27.8 s | 0.96 s | 14,749 | 21 s | warm |
| 4 | 32.0 s | 4.33 s | 15,480 | 4 s | COLD |
| 5 | 38.6 s | 3.56 s | 15,542 | 7 s | COLD |
| 6 | 85.3 s | 0.33 s | 15,758 | 47 s | warm |
The lesson. The prefix survived a 21 s gap but evicted inside a 4 s gap. Eviction is driven by other traffic flushing the LRU cache, not by elapsed time — and the prefix came back warm after 47 s, so the load was bursty. Turns 4–5 paid back-to-back cold prefills, and (per Step 26) nothing re-warms a node mid-call. The growing in_tok column is conversation history accumulating as a cache-friendly suffix.
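A toy LRU replay makes the "traffic, not time" point concrete. It is a simplification: it tracks whole prefixes as single entries, whereas vLLM evicts at block granularity. The turn start times come from the table; the competing traffic (two quiet arrivals, then a burst of five) and the capacity of 4 are invented for illustration.

```python
from collections import OrderedDict


class PrefixLRU:
    """Toy LRU keyed by whole prefixes (vLLM itself evicts per block)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()

    def touch(self, key: str) -> bool:
        """Return True on a warm hit; on a miss, insert and evict if over capacity."""
        hit = key in self._entries
        if hit:
            self._entries.move_to_end(key)          # mark as most recently used
        else:
            self._entries[key] = True
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)   # drop the least recently used
        return hit


def replay(events, capacity: int):
    """Replay (time, key) events in time order; report our call's state per turn."""
    cache = PrefixLRU(capacity)
    states = []
    for t, key in sorted(events):
        hit = cache.touch(key)
        if key == "our-call":
            states.append((t, "warm" if hit else "COLD"))
    return states


# Turn start times of eb4a83f7 (turns 1-4) from the table.
our_turns = [(0.0, "our-call"), (7.0, "our-call"), (27.8, "our-call"), (32.0, "our-call")]
# INVENTED competing traffic: a quiet 21 s gap, then a burst inside the 4 s gap.
quiet_gap = [(10.0, "other-a"), (20.0, "other-b")]
burst = [(28.0 + 0.5 * i, f"burst-{i}") for i in range(5)]

results = replay(our_turns + quiet_gap + burst, capacity=4)
print(results)
```

In this replay the prefix survives the long quiet gap and is evicted inside the short busy one, the same pattern as turns 3 and 4 of the real call.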
Per-call-type medians on the same call: speaking 1.77 s (p90 4.33), routing 0.85 s, extraction 0.52 s, TTS 0.73 s. Routing and extraction are healthy; the speaking LLM is the dominant variable cost.
Try it: turn-timeline eviction simulator
The six real eb4a83f7 turns. Drag the concurrent-load slider: as other traffic rises, warm green bars flip to cold red. The once-per-node warmup is a one-shot shield — it covers turn 1, but it never re-fires mid-call.
Try it: eviction-curve explorer
Same real eb4a83f7 latencies, plotted as a curve. Drag the warm-threshold slider (default 1.5 s): every turn above the line is relabelled COLD, every turn below is warm, and the warm↔cold transition marker moves live.
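The threshold relabelling is simple enough to reproduce offline. The sketch below uses the six turns from the table and the 1.5 s default; the function names are illustrative.

```python
import statistics

# (turn, start_s, latency_s, in_tok, recorded_state) from the eb4a83f7 table.
TURNS = [
    (1, 0.0, 2.58, 14_575, "COLD"),
    (2, 7.0, 0.77, 14_626, "warm"),
    (3, 27.8, 0.96, 14_749, "warm"),
    (4, 32.0, 4.33, 15_480, "COLD"),
    (5, 38.6, 3.56, 15_542, "COLD"),
    (6, 85.3, 0.33, 15_758, "warm"),
]


def classify(turns, warm_threshold_s: float = 1.5):
    """Label each turn COLD if its latency is above the threshold, else warm."""
    return [
        (turn, "COLD" if latency > warm_threshold_s else "warm")
        for turn, _start, latency, _tok, _state in turns
    ]


def transitions(labels):
    """List every warm<->cold flip between consecutive turns."""
    flips = []
    for (prev_turn, prev_state), (turn, state) in zip(labels, labels[1:]):
        if prev_state != state:
            flips.append((prev_turn, turn, f"{prev_state}->{state}"))
    return flips


labels = classify(TURNS)
flips = transitions(labels)
median_latency = statistics.median(latency for _, _, latency, _, _ in TURNS)

print(labels)
print(flips)
print(round(median_latency, 2))
```

At 1.5 s the labels agree with the State column, and the median works out to the 1.77 s speaking figure quoted above. Raising the threshold to 3.0 s would relabel turn 1 (2.58 s) as warm, which is why the threshold should be chosen against known cold-prefill costs rather than by eye.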
The two misleading metrics LLM
What it is. Two numbers a field engineer naturally reaches for — both of which actively hide this failure.
- Cumulative prefix-cache hit rate hides cold big prompts. KKB read 86.5% during the same window this call ate three cold turns. The counter is dominated by the small routing and extraction prompts, which hit nearly always (serving-config.md:18–20; report.md:53). Track the eviction curve, not the hit rate.
- Idle-time /metrics counters prove nothing. num_preemptions, kv_cache_usage_perc, and num_requests_running reflect the probe instant; the box read zeros minutes after a call that demonstrably evicted at turn 4 (report.md:75–77). The trace eviction curve is the authoritative record.
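For the serving flags themselves, the container's recorded launch arguments are a more reliable source than a metrics probe. The sketch below is one way to read them, assuming the container is named freya-llm like the compose service and that the flags are passed as plain command-line arguments. If the compose command is a single shell string or the values come from environment variables, the parsing would need adjusting. The sample argv is made up for illustration.

```python
import json
import subprocess


def container_args(name: str = "freya-llm"):
    """Return the launch argv Docker recorded for a container (needs docker access)."""
    out = subprocess.run(
        ["docker", "inspect", name], check=True, capture_output=True, text=True
    ).stdout
    info = json.loads(out)[0]            # docker inspect prints a JSON array
    return [info.get("Path", "")] + list(info.get("Args") or [])


def extract_flags(argv, wanted):
    """Pick `--flag value` and `--flag=value` pairs out of an argv list."""
    found = {}
    for i, token in enumerate(argv):
        if not token.startswith("--"):
            continue
        name, sep, value = token[2:].partition("=")
        if name not in wanted:
            continue
        if sep:
            found[name] = value                       # --flag=value
        elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
            found[name] = argv[i + 1]                 # --flag value
        else:
            found[name] = True                        # bare switch
    return found


WANTED = {"tensor-parallel-size", "max-num-seqs", "dtype", "kv-cache-dtype"}

# Sample argv for illustration only, NOT the real KKB command line.
sample = ["vllm", "serve", "--tensor-parallel-size", "2", "--max-num-seqs=32"]
flags = extract_flags(sample, WANTED)
print(flags)
```

On the box itself the call would be extract_flags(container_args(), WANTED). A flag missing from the result means it was not set explicitly, so the engine default applies.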
tensor-parallel-size, max-num-seqs, and --dtype are not exposed by /metrics. Get them via docker inspect over SSH (serving-config.md:23–29).
docker inspect freya-llm: pull out tensor-parallel-size, max-num-seqs, --dtype, and --kv-cache-dtype so I can run the Part 6 capacity math."
The lever board (ranked, quantified) LLM
What it is. The fixes, ordered by leverage. From serving-config.md:53–108 plus deploy-config grounding.
| # | Lever | Effect | Cost / risk |
|---|---|---|---|
| 1 | Shrink the prompt | Highest leverage, no infra change. Helps cold-prefill cost (~0.25 s/1K on the loaded box), cache residency (linear: 15.5K → 7.5K doubles ~44 → ~88), and context headroom (47% → 23%) all at once. | Authorial work. Part 8 is the how. |
| 2 | FP8 KV cache (--kv-cache-dtype fp8) | Halves KV bytes/token → ~2× more prompts resident (~44 → ~88). H100-native. | Near-zero quality risk. Enabled nowhere, so it is free capacity. |
| 3 | FP8 weights | Gemma 31B 62 → 31 GB; freed VRAM becomes KV (~+30% capacity) and H100 FP8 tensor cores roughly double prefill throughput. | ~1–2% quality hit; must be re-validated for TR tool-calling (projection from a Blackwell Colab sweep, not validated on KKB H100). |
| 4 | CPU prefix-cache offload (LMCache-style) | Spill evicted prefixes to host RAM; for a 30K prefix, loading from CPU beats recomputing. | More moving parts. |
| 5 | Chunked prefill | Interleaves a big prefill with other requests' decode, softening contention spikes. | No deploy sets the flag; vLLM V1 default-on assumed but not verified against vllm/vllm-openai:v0.20.1. |
| 6 | TP 2 → 4 | ~2× KV cache. | Consumes the GPUs running STT/TTS — cross-stack tradeoff. |
| 7 | max_num_seqs | Admission control; lowering toward prompts-cacheable trades queueing for cache stability. | Calls may queue under burst. |
| 8 | max_model_len | Does NOT size the KV pool (the pool is leftover VRAM). Colab sweep grew 32K → 64K with single-stream TTFS only 5–15% slower and better conc-25 stability. | No capacity gain; analysis/llm-latency-benchmark/8-colab-fp8-ctx-sweep/findings.md:10–12,64–68. |
Context-ceiling side effect: the same 15.5K prompt that causes latency is 47% of the 32K window (30K is 92%). Bloat also caps conversation length as history grows.
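The conversation-length cap can be estimated roughly from the in_tok column of call eb4a83f7. This is a back-of-envelope sketch with three assumptions: history grows linearly at the call's average rate, the 30K prompt would grow at the same rate, and no tokens are reserved for the model's output. Real growth is lumpy (turn 4 alone added about 730 tokens), so treat the results as order-of-magnitude only.

```python
MAX_MODEL_LEN = 32_768

# in_tok per speaking turn of call eb4a83f7, from the turn table.
IN_TOK = [14_575, 14_626, 14_749, 15_480, 15_542, 15_758]


def avg_growth_per_turn(in_tok) -> float:
    """Average tokens of history added per turn across the observed call."""
    return (in_tok[-1] - in_tok[0]) / (len(in_tok) - 1)


def turns_until_full(current_tokens: int, growth_per_turn: float,
                     max_model_len: int = MAX_MODEL_LEN) -> int:
    """Whole turns left before the prompt alone would exceed the context window."""
    headroom = max_model_len - current_tokens
    if headroom <= 0:
        return 0
    return int(headroom // growth_per_turn)


growth = avg_growth_per_turn(IN_TOK)
garanti_left = turns_until_full(IN_TOK[-1], growth)
# ASSUMPTION: the 30K anket prompt accumulates history at the same rate.
anket_left = turns_until_full(30_000, growth)

print(f"growth ~{growth:.0f} tok/turn, Garanti ~{garanti_left} turns, 30K ~{anket_left} turns")
```

Under these assumptions the 15.5K prompt leaves room for several dozen more turns, while a 30K prompt leaves room for only about a dozen.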
Checkpoint: a campaign is running 25 concurrent calls against the KKB box; agents use the 30K anket prompt. Without touching the workflow, what single vLLM flag most directly attacks the eviction, and what does the math say?
--kv-cache-dtype fp8. Halving KV bytes per token doubles the pool's effective token capacity, so prompts-cacheable goes ~22 → ~45 — finally above the 25 concurrent calls, breaking the structural prompts_cacheable < admitted_load thrash condition (serving-config.md:67–73).
It is one flag, H100-native, and currently unused in every deploy. (The bigger long-term lever is still halving the prompt itself — Part 8.)