Part 6 — On-prem vLLM serving: capacity math and the Garanti case study

Where we are

From "the prefill is slow" to "the cache can't hold it" LLM

Part 5 explained why a cold prefill is expensive and how the prefix cache and warmups hide it. This part answers the follow-up: on the KKB on-prem box, can the cache physically hold every concurrent call's prompt? When it can't, the latency you measured is not a model-speed problem and not a network problem — it is a capacity problem, and the math tells you exactly which lever fixes it.

The one ratio that governs this part. Everything below reduces to prompts_cacheable : concurrent load. Above 1, the box feels instant (TTFT ~0.2 s). Below 1, individual calls randomly eat 3–4 s cold turns while every aggregate metric still looks green.

Step 29

The KV-cache capacity formulas LLM

What it is. Three formulas decide whether a prompt fits in the prefix cache and survives until the call's next turn. They come straight from the latency-analyzer reference.

latency-analyzer references/serving-config.md:34–43

KV cache total tokens = num_gpu_blocks × block_size
prompts cacheable     = KV total tokens ÷ per-request prompt tokens
prompt % of context   = prompt tokens ÷ max_model_len

The thrash condition. If prompts_cacheable < max_num_seqs, the cache cannot hold every admitted call's prefix. The LRU policy then thrashes: a call's prefix can be evicted before that call's next turn, forcing a cold prefill mid-conversation.

num_gpu_blocks — how many fixed-size KV blocks vLLM carved out of leftover VRAM after loading weights.
block_size — tokens per block (16 on KKB).
max_num_seqs — admission control: how many sequences vLLM will run concurrently. This is the "concurrent load" the cache must keep up with.

Symptom it predicts: when the ratio drops below 1, random calls pay +2.6–4.3 s cold turns that the cumulative hit-rate metric never reveals.
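The three formulas and the thrash condition translate directly into a few lines of Python. This is a minimal sketch: the function names are illustrative and not taken from the latency-analyzer code, and the numbers in the usage lines are made up purely to exercise the arithmetic.

```python
def kv_total_tokens(num_gpu_blocks: int, block_size: int) -> int:
    """KV cache total tokens = num_gpu_blocks x block_size."""
    return num_gpu_blocks * block_size


def prompts_cacheable(kv_tokens: int, prompt_tokens: int) -> float:
    """How many prompts of this size fit in the KV pool at once."""
    return kv_tokens / prompt_tokens


def prompt_share_of_context(prompt_tokens: int, max_model_len: int) -> float:
    """Fraction of the context window consumed by the prompt alone."""
    return prompt_tokens / max_model_len


def will_thrash(kv_tokens: int, prompt_tokens: int, max_num_seqs: int) -> bool:
    """Thrash condition: fewer prompts fit than sequences are admitted."""
    return prompts_cacheable(kv_tokens, prompt_tokens) < max_num_seqs


# Illustrative (made-up) numbers: 1,000 blocks of 16 tokens, 4,000-token prompts.
pool = kv_total_tokens(num_gpu_blocks=1_000, block_size=16)
print(pool, prompts_cacheable(pool, 4_000), will_thrash(pool, 4_000, max_num_seqs=8))
```

Lowering max_num_seqs to 4 in the last line flips the condition to False, because four admitted sequences no longer exceed the four prompts that fit.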

Step 30

KKB live numbers and where the flags live LLM

What it is. The capacity formulas, instantiated with the real KKB serving config confirmed via live /metrics during the case study.

report.md:49–57: num_gpu_blocks = 42,678, block_size = 16, KV total = 682,848 tokens.

Confirmed config. google/gemma-4-31B-it, max_model_len=32,768, TP=2, max_num_seqs=32, gpu_memory_utilization=0.9, prefix caching ON, cache_dtype=auto (bf16). FP8 KV is NOT enabled in any Freya deploy — that is free capacity left on the table.

Where the flags live. freya-onprem kkb/docker-compose.yml:657–689 (the freya-llm service). The GPU layout comment at :643–651 spells out the 4×H100 split: GPU 0+1 = LLM (TP=2), GPU 2 = TTS, GPU 3 = STT+NC. gpu_memory_utilization is the master capacity knob: vLLM claims that VRAM fraction, loads weights (~58.9 GB for Gemma 31B bf16), and everything left becomes KV cache. The generic start script drops it to 0.70 when sharing a card (gpu/llm/gemma-4-31b/start.sh:18–21).
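A rough way to see why gpu_memory_utilization is the master knob is to write the VRAM accounting out. The sketch below is an upper bound only: it assumes 80 GB per H100 (the source says only "4×H100"), ignores activation and CUDA-graph memory, and applies the 0.70 setting to the same two-GPU layout as a what-if. The function name is hypothetical.

```python
def kv_budget_gb(vram_per_gpu_gb: float, num_gpus: int,
                 gpu_memory_utilization: float, weights_gb: float) -> float:
    """Upper bound on KV-cache memory: claimed VRAM minus model weights.

    Ignores activation and CUDA-graph memory, so the real pool is smaller.
    """
    claimed = vram_per_gpu_gb * num_gpus * gpu_memory_utilization
    return claimed - weights_gb


H100_GB = 80.0     # ASSUMPTION: 80 GB cards; not stated in the source
WEIGHTS_GB = 58.9  # Gemma 31B bf16, from the text

at_090 = kv_budget_gb(H100_GB, num_gpus=2, gpu_memory_utilization=0.90,
                      weights_gb=WEIGHTS_GB)
at_070 = kv_budget_gb(H100_GB, num_gpus=2, gpu_memory_utilization=0.70,
                      weights_gb=WEIGHTS_GB)

print(f"KV budget at 0.90: {at_090:.1f} GB")
print(f"KV budget at 0.70: {at_070:.1f} GB ({at_070 / at_090:.0%} of the 0.90 pool)")
```

Because the weights are a fixed cost, a modest drop in the utilization fraction removes a disproportionately large share of the KV pool.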

Capacity table that falls out of the 682,848-token cache

Prompt size	Prompts cacheable	Rough cold prefill	% of 32K context
30K (Anadolu anket)	~22	~5 s	92%
15.5K (Garanti case)	~44	~2.6–4.3 s	47%
10K	~68	~1.7 s	31%
7.5K (halved Garanti prompt)	~88	~1.8–2.2 s cold turn	23%

The structural guarantee at 30K: only ~22 prompts fit but max_num_seqs=32 calls are admitted — eviction is built in under campaign load. The compose file's own inline comment says "reduce to 4-8" (kkb/docker-compose.yml:680–681), not yet applied.
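The table can be recomputed from the live numbers. The sketch below treats the prompt sizes as exact thousands (an assumption, since the table rounds them) and flags any row where fewer prompts fit than max_num_seqs admits.

```python
import math

KV_TOTAL_TOKENS = 42_678 * 16  # num_gpu_blocks x block_size = 682,848
MAX_MODEL_LEN = 32_768
MAX_NUM_SEQS = 32

PROMPT_SIZES = {
    "30K (Anadolu anket)": 30_000,
    "15.5K (Garanti case)": 15_500,
    "10K": 10_000,
    "7.5K": 7_500,
}


def capacity_row(prompt_tokens: int) -> tuple[int, int, bool]:
    """Return (prompts cacheable, % of context, thrashes at MAX_NUM_SEQS)."""
    cacheable = math.floor(KV_TOTAL_TOKENS / prompt_tokens)
    pct_context = round(100 * prompt_tokens / MAX_MODEL_LEN)
    return cacheable, pct_context, cacheable < MAX_NUM_SEQS


for label, tokens in PROMPT_SIZES.items():
    cacheable, pct, thrash = capacity_row(tokens)
    flag = "THRASH" if thrash else "ok"
    print(f"{label:<22} cacheable={cacheable:>3}  context={pct:>2}%  {flag}")
```

Only the 30K row trips the condition. At exactly 7,500 tokens the division gives about 91 rather than ~88; the ~88 figure corresponds to an exact half of the 15.5K prompt (7,750 tokens).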

Try it: KV-cache capacity calculator

Move the knobs and watch the thrash condition flip. The presets snap you to the real KKB config, the same config with FP8 KV enabled, and the halved-prompt Garanti fix.

[Interactive calculator: prompts cacheable vs concurrent load. Inputs: per-request prompt tokens (default 15,500), num_gpu_blocks (42,678), block_size (tokens per block), max_num_seqs (concurrent load admitted), and a KV dtype FP8 toggle (halves bytes/token, doubling the effective token pool). Outputs: KV total tokens (682,848), prompts cacheable, concurrent load, and prompt % of 32K context (47%).]
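The calculator's three presets can be approximated offline. In this sketch the Knobs class and its field names are hypothetical, FP8 KV is modelled simply as doubling the effective token pool (as the toggle label describes), and the halved prompt uses an exact half of 15,500 tokens where the text rounds to 7.5K.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Knobs:
    prompt_tokens: int
    num_gpu_blocks: int = 42_678
    block_size: int = 16
    max_num_seqs: int = 32
    fp8_kv: bool = False  # modelled as doubling the effective token pool

    @property
    def kv_total_tokens(self) -> int:
        pool = self.num_gpu_blocks * self.block_size
        return pool * 2 if self.fp8_kv else pool

    @property
    def prompts_cacheable(self) -> float:
        return self.kv_total_tokens / self.prompt_tokens

    @property
    def thrashes(self) -> bool:
        return self.prompts_cacheable < self.max_num_seqs


kkb = Knobs(prompt_tokens=15_500)
presets = {
    "KKB as deployed": kkb,
    "KKB + FP8 KV": replace(kkb, fp8_kv=True),
    "Halved prompt": replace(kkb, prompt_tokens=15_500 // 2),
}

for name, knobs in presets.items():
    print(f"{name:<16} pool={knobs.kv_total_tokens:>9,} "
          f"cacheable={knobs.prompts_cacheable:5.1f} thrash={knobs.thrashes}")
```

Constructing Knobs(prompt_tokens=30_000) reproduces the built-in eviction case, and adding fp8_kv=True to it clears the condition.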

Step 31

The worked example: call eb4a83f7's eviction curve LLM

What it is. Garanti tr-app call eb4a83f7 (2026-06-05), speaking node borc_sorgulama, 15.5K-token prompt. The single most teachable row in the corpus lives here.

/Users/alpsencer/latency-debug/eb4a83f7-4869-4288-8cf2-14d24a7dc4fc/report.md

Six real speaking turns on call eb4a83f7

Turn	+start	Latency	in_tok	Gap since prev	State
1	0.0 s	2.58 s	14,575	–	COLD
2	7.0 s	0.77 s	14,626	7 s	warm
3	27.8 s	0.96 s	14,749	21 s	warm
4	32.0 s	4.33 s	15,480	4 s	COLD
5	38.6 s	3.56 s	15,542	7 s	COLD
6	85.3 s	0.33 s	15,758	47 s	warm

The lesson. The prefix survived a 21 s gap but was evicted inside a 4 s gap. Eviction is driven by other traffic flushing the LRU cache, not by elapsed time. The prefix re-cached on turn 5 then stayed warm across a 47 s gap, so the competing load was bursty. Turns 4–5 paid back-to-back cold prefills, and (per Step 26) nothing re-warms a node mid-call. The growing in_tok column is conversation history accumulating as a cache-friendly suffix.

Per-call-type medians on the same call: speaking 1.77 s (p90 4.33), routing 0.85 s, extraction 0.52 s, TTS 0.73 s. Routing and extraction are healthy; the speaking LLM is the dominant variable cost.
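The speaking figures can be checked against the six-turn table. The sketch below uses the nearest-rank percentile method, which is an assumption: the report does not say which method it uses, although this one reproduces its p90.

```python
import math
import statistics

# Speaking-turn latencies in seconds, copied from the six-turn table.
SPEAKING_LATENCIES = [2.58, 0.77, 0.96, 4.33, 3.56, 0.33]


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


median = statistics.median(SPEAKING_LATENCIES)
p90 = nearest_rank_percentile(SPEAKING_LATENCIES, 90)
print(f"speaking median={median:.2f} s, p90={p90:.2f} s")
```

With only six samples the p90 is simply the worst turn, so a single cold prefill sets it.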

Attribution caveat (from the report itself). The 4-second-eviction → concurrent-load link is inference from the LRU mechanism plus the bursty pattern. No per-second KV telemetry exists for the call window.

Try it: turn-timeline eviction simulator

The six real eb4a83f7 turns. Drag the concurrent-load slider: as other traffic rises, warm green bars flip to cold red. The once-per-node warmup is a one-shot shield — it covers turn 1, but it never re-fires mid-call.

[Interactive simulator: eviction under concurrent load. A slider sets the concurrent load on the box (other calls competing for the LRU cache); bars show each turn as warm (0.3–1.0 s) or cold (2.6–4.3 s), with the once-per-node warmup shield marked on turn 1.]
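The mechanism behind the simulator can be sketched with a toy LRU cache. This is a deliberate simplification: it treats each prompt as a single cache entry (vLLM evicts at block granularity), and the counts of competing prefixes between turns are hypothetical, chosen to show how bursts alone can reproduce the observed pattern. No clock appears anywhere in the model.

```python
from collections import OrderedDict


class PrefixLRU:
    """Toy LRU cache whose capacity is counted in whole prompts."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict[str, bool] = OrderedDict()

    def touch(self, key: str) -> bool:
        """Return True on a hit; on a miss, insert and evict the LRU entry."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        self._entries[key] = True
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used
        return False


def simulate(capacity: int, others_between_turns: list[int]) -> list[str]:
    """Label each of our turns after N distinct competing prefixes arrive."""
    cache = PrefixLRU(capacity)
    states, next_other = [], 0
    for n_others in others_between_turns:
        for _ in range(n_others):
            cache.touch(f"other-{next_other}")
            next_other += 1
        states.append("warm" if cache.touch("our-call") else "COLD")
    return states


# Hypothetical competing traffic before each of the six turns.
print(simulate(capacity=44, others_between_turns=[0, 10, 30, 50, 60, 5]))
```

Setting every count to zero leaves all turns after the first warm regardless of how long the gaps are, which is the point of the lesson: residency depends on what else arrived, not on elapsed time.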

Try it: eviction-curve explorer

Same real eb4a83f7 latencies, plotted as a curve. Drag the warm-threshold slider (default 1.5 s): every turn above the line is relabelled COLD, every turn below is warm, and the warm↔cold transition marker moves live.

[Interactive chart: per-turn speaking latency vs a warm threshold. A slider sets the warm threshold in seconds (default 1.50); turns above it count as a cold prefill.]
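The relabelling the explorer performs is a one-line threshold test. A minimal sketch over the six real latencies follows; the function names are illustrative.

```python
TURN_LATENCIES = [2.58, 0.77, 0.96, 4.33, 3.56, 0.33]  # seconds, turns 1-6


def classify(latencies: list[float], warm_threshold: float = 1.5) -> list[str]:
    """Label a turn COLD when its latency exceeds the warm threshold."""
    return ["COLD" if s > warm_threshold else "warm" for s in latencies]


def transitions(labels: list[str]) -> list[int]:
    """1-based turn numbers where the state differs from the previous turn."""
    return [i + 1 for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


for threshold in (1.5, 3.0):
    labels = classify(TURN_LATENCIES, threshold)
    print(threshold, labels, "transitions at turns", transitions(labels))
```

At the default 1.5 s the labels match the State column of the table. Raising the threshold to 3.0 s reclassifies turn 1 as warm, which shows how sensitive the cold count is to where the line is drawn.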

Step 32

The two misleading metrics LLM

What it is. Two numbers a field engineer naturally reaches for — both of which actively hide this failure.

Cumulative prefix-cache hit rate hides cold big prompts. KKB read 86.5% during the same window this call ate three cold turns. The counter is dominated by the small routing and extraction prompts, which hit nearly always (serving-config.md:18–20; report.md:53). Track the eviction curve, not the hit rate.
Idle-time /metrics counters prove nothing. num_preemptions, kv_cache_usage_perc, num_requests_running reflect the probe instant; the box read zeros minutes after a call that demonstrably evicted at turn 4 (report.md:75–77). The trace eviction curve is the authoritative record.

What you can't read from /metrics. The tensor-parallel-size, max-num-seqs, and --dtype settings are not exposed there. Get them via docker inspect over SSH (serving-config.md:23–29).

Ask Claude Code: "SSH to the KKB box and run docker inspect freya-llm — pull out tensor-parallel-size, max-num-seqs, --dtype, and --kv-cache-dtype so I can run the Part 6 capacity math."
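Once the docker inspect output is in hand, the flags can be pulled out mechanically. The sketch below assumes the vLLM arguments appear in the container's Args or Config.Cmd arrays; the sample JSON is hypothetical and stands in for real output from the box.

```python
import json

WANTED_FLAGS = ("--tensor-parallel-size", "--max-num-seqs",
                "--dtype", "--kv-cache-dtype")


def extract_flags(argv: list[str], wanted=WANTED_FLAGS) -> dict:
    """Pull flag values from argv; handles '--flag value' and '--flag=value'."""
    found = {flag: None for flag in wanted}
    for i, token in enumerate(argv):
        name, sep, value = token.partition("=")
        if name not in found:
            continue
        if sep:
            found[name] = value
        elif i + 1 < len(argv):
            found[name] = argv[i + 1]
    return found


def flags_from_inspect(inspect_json: str) -> dict:
    """docker inspect prints a JSON array with one object per container."""
    container = json.loads(inspect_json)[0]
    args = container.get("Args") or []
    cmd = container.get("Config", {}).get("Cmd") or []
    return extract_flags(args + cmd)


# Hypothetical sample; real output comes from `docker inspect freya-llm`.
SAMPLE = json.dumps([{
    "Args": ["--tensor-parallel-size", "2", "--max-num-seqs=32",
             "--dtype", "bfloat16"],
    "Config": {"Cmd": None},
}])
print(flags_from_inspect(SAMPLE))
```

A flag that comes back as None was not passed on the command line, which means the server is running with that option's default.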

Step 33

The lever board (ranked, quantified) LLM

What it is. The fixes, ordered by leverage. From serving-config.md:53–108 plus deploy-config grounding.

Levers, highest leverage first

#	Lever	Effect	Cost / risk
1	Shrink the prompt	Highest leverage, no infra change. Helps cold-prefill cost (~0.25 s/1K on the loaded box), cache residency (linear: 15.5K → 7.5K doubles ~44 → ~88), and context headroom (47% → 23%) all at once.	Authorial work. Part 8 is the how.
2	FP8 KV cache `--kv-cache-dtype fp8`	Halves KV bytes/token → ~2× more prompts resident (~44 → ~88). H100-native.	Near-zero quality risk. Enabled nowhere — free capacity.
3	FP8 weights	Gemma 31B 62 → 31 GB; freed VRAM becomes KV (~+30% capacity) and H100 FP8 tensor cores roughly double prefill throughput.	~1–2% quality hit; must be re-validated for TR tool-calling (projection from a Blackwell Colab sweep, not validated on KKB H100).
4	CPU prefix-cache offload (LMCache-style)	Spill evicted prefixes to host RAM; for a 30K prefix, loading from CPU beats recomputing.	More moving parts.
5	Chunked prefill	Interleaves a big prefill with other requests' decode, softening contention spikes.	No deploy sets the flag; vLLM V1 default-on assumed but not verified against `vllm/vllm-openai:v0.20.1`.
6	TP 2 → 4	~2× KV cache.	Consumes the GPUs running STT/TTS — cross-stack tradeoff.
7	`max_num_seqs`	Admission control; lowering toward prompts-cacheable trades queueing for cache stability.	Calls may queue under burst.
8	`max_model_len`	Does NOT size the KV pool (the pool is leftover VRAM). Colab sweep grew 32K → 64K with single-stream TTFS only 5–15% slower and better conc-25 stability.	No capacity gain; `analysis/llm-latency-benchmark/8-colab-fp8-ctx-sweep/findings.md:10–12,64–68`.

Context-ceiling side effect: the same 15.5K prompt that causes latency is 47% of the 32K window (30K is 92%). Bloat also caps conversation length as history grows.
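To compare the capacity levers side by side, the table's approximate multipliers can be applied to the Garanti baseline. This is a sketch, not a measurement: the pool multipliers and the prefill speedup are the rough figures quoted in the table, the shrunk prompt uses an exact half of 15,500 tokens where the text rounds to 7.5K, and the cold-prefill estimate uses the ~0.25 s/1K loaded-box figure.

```python
import math

KV_TOTAL_TOKENS = 682_848
BASE_PROMPT = 15_500
COLD_S_PER_1K = 0.25  # rough cold-prefill cost on the loaded box

# (lever, KV pool multiplier, prompt tokens, prefill speedup).
# All factors are the approximate figures from the lever table.
LEVERS = [
    ("baseline",      1.0, BASE_PROMPT,      1.0),
    ("shrink prompt", 1.0, BASE_PROMPT // 2, 1.0),
    ("FP8 KV cache",  2.0, BASE_PROMPT,      1.0),
    ("FP8 weights",   1.3, BASE_PROMPT,      2.0),
    ("TP 2 -> 4",     2.0, BASE_PROMPT,      1.0),
]


def evaluate(pool_multiplier: float, prompt_tokens: int,
             prefill_speedup: float) -> tuple[int, float]:
    """Return (prompts cacheable, estimated cold-prefill seconds)."""
    cacheable = math.floor(KV_TOTAL_TOKENS * pool_multiplier / prompt_tokens)
    cold_s = COLD_S_PER_1K * prompt_tokens / 1_000 / prefill_speedup
    return cacheable, cold_s


for name, pool_mult, tokens, speedup in LEVERS:
    cacheable, cold_s = evaluate(pool_mult, tokens, speedup)
    print(f"{name:<14} cacheable={cacheable:>3}  cold prefill ~{cold_s:.1f} s")
```

Laid out this way, shrinking the prompt is the only lever in the list that improves both columns without touching the serving stack.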

Checkpoint: a campaign is running 25 concurrent calls against the KKB box; agents use the 30K anket prompt. Without touching the workflow, what single vLLM flag most directly attacks the eviction, and what does the math say?

Answer: --kv-cache-dtype fp8. Halving KV bytes per token doubles the pool's effective token capacity, so prompts-cacheable goes ~22 → ~45, finally above the 25 concurrent calls. That breaks the structural prompts_cacheable < admitted_load thrash condition (serving-config.md:67–73).

It is one flag, H100-native, and currently unused in every deploy. (The bigger long-term lever is still halving the prompt itself — Part 8.)

What dominates here. The ratio prompts_cacheable : concurrent load. Above 1 the box feels instant (TTFT ~0.2 s); below 1, individual calls randomly eat 3–4 s cold turns while every aggregate metric still looks green. Part 7 takes the same turn to the cloud, where prefill is cheap but a different set of traps waits.