Fundamentals · 6 min read

KV Cache

Every token your model has seen is stored in the KV cache. At long contexts, this cache dominates your memory budget — and determines what fits on your GPU.

What is KV cache?

Transformer models use attention to relate every token to every other token. During inference, this means that for each new token generated, the model needs to attend to all previous tokens. Computing this from scratch every time would be extremely expensive.

The KV cache stores the Key and Value projections from attention for all previous tokens. When generating the next token, the model only needs to compute the Query for the new token and look up the cached Keys and Values. This avoids recomputing attention over the entire sequence.

// Without KV cache: recompute everything
Token 1000: attend to tokens 1..999  (compute K,V for all 999 again)
Token 1001: attend to tokens 1..1000 (compute K,V for all 1000 again)

// With KV cache: only compute the new token
Token 1000: compute K,V for token 1000, look up cache for tokens 1..999
Token 1001: compute K,V for token 1001, look up cache for tokens 1..1000
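This caching pattern can be sketched in a few lines (illustrative single-head attention in NumPy; all names here are invented for the example):

```python
import numpy as np

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """Compute the attention output for one new token, reusing cached K and V."""
    q = x_new @ W_q                                      # Query: only for the new token
    cache["K"] = np.vstack([cache["K"], x_new @ W_k])    # append new Key to the cache
    cache["V"] = np.vstack([cache["V"], x_new @ W_v])    # append new Value to the cache
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])       # attend over all cached tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax
    return weights @ cache["V"]

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for _ in range(4):                                       # "generate" 4 tokens
    out = attend_with_cache(rng.normal(size=d), W_q, W_k, W_v, cache)
print(cache["K"].shape)                                  # cache holds K for all 4 tokens: (4, 8)
```

The point is the loop body: each step computes projections for exactly one token, while the cache supplies everything else.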

Why it matters for memory

KV cache memory scales linearly with context length and the number of concurrent requests. The formula is:

KV cache bytes = 2 (K and V) x num_layers x num_kv_heads x head_dim x seq_len x batch x bytes_per_element

For a 30B-class model like Qwen3-coder-30B (illustrative configuration: 32 layers, 40 KV heads, head dimension 128):

Context Length    KV Cache (FP16)    KV Cache (FP8)
8K tokens         ~5 GB              ~2.5 GB
32K tokens        ~20 GB             ~10 GB
128K tokens       ~80 GB             ~40 GB
1M tokens         ~640 GB            ~320 GB
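The formula is easy to check directly. The configuration below (32 layers, 40 KV heads, head dimension 128) is the one that reproduces the table's ~5 GB figure for 8K tokens in FP16:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_element=2):
    # factor of 2: both Keys and Values are stored per token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element

GIB = 1024**3
fp16 = kv_cache_bytes(layers=32, kv_heads=40, head_dim=128, seq_len=8192)
print(f"{fp16 / GIB:.1f} GiB")   # 5.0 GiB
```

Scaling to 128K context (16x the tokens) or dropping to FP8 (bytes_per_element=1) follows linearly from here.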

These numbers are per request. With multiple concurrent requests, multiply accordingly. This is why KV cache management is the central constraint in LLM serving.

FP8 vs FP16 KV cache

Just like model weights, the KV cache can be quantized. FP8 KV cache cuts memory in half with minimal quality impact.

FP16 KV cache (default)

Full precision. Best quality. Uses 2 bytes per element. Fine when memory is abundant and context is short.

FP8 KV cache

Half the memory, near-lossless quality. Supported on NVIDIA Hopper, Ada Lovelace, and Blackwell GPUs. The practical default for any long-context serving. In vLLM, enable with kv_cache_dtype: fp8.
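In code this is a one-argument change (a sketch, assuming an FP8-capable GPU and a vLLM build with FP8 KV cache support; the model name is illustrative):

```python
from vllm import LLM

# FP8 KV cache: half the memory of the default FP16 cache
llm = LLM(model="Qwen/Qwen3-Coder-30B-A3B-Instruct", kv_cache_dtype="fp8")
```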

Always use FP8 KV cache on supported hardware

If your GPU has FP8 support (Blackwell, Hopper, Ada Lovelace), there is almost no reason to use FP16 KV cache. The quality difference is imperceptible and you get 2x the effective context capacity.

TurboQuant: below FP8

TurboQuant pushes KV cache compression even further with 2.5-bit and 3.5-bit quantization. This is experimental but promising for extreme context lengths.

3.5-bit KV cache

~2.3x less memory than FP8. Quality degradation is measurable but often acceptable for tasks like code search and document QA where exact wording matters less.

2.5-bit KV cache

~3.2x less memory than FP8. More aggressive, with noticeable quality impact on reasoning-heavy tasks. Best suited for retrieval and summarization where you need massive context windows.

With TurboQuant at 2.5-bit, a 30B model could theoretically handle 1M+ context on a single GB10 (128 GB). The KV cache for 1M tokens at 2.5-bit is approximately 100 GB, leaving enough headroom for the model weights in FP8.
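That headroom estimate follows directly from the table's FP8 column:

```python
fp8_1m_gb = 320                     # FP8 KV cache at 1M tokens (from the table above)
turbo_gb = fp8_1m_gb * 2.5 / 8.0    # 2.5 bits per element instead of 8
print(turbo_gb)                     # 100.0 GB
```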

Practical implications

Context length vs concurrency

KV cache forces a direct tradeoff: longer context means fewer concurrent requests. A GB10 with 128 GB can serve one request at 128K context or sixteen requests at 8K context (roughly). Plan your deployment based on which matters more.
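The tradeoff arithmetic can be sketched as follows, assuming roughly 80 GB of the GB10's 128 GB remains for KV cache after weights and overhead (a hypothetical budget), with FP16 per-request costs from the table:

```python
def max_concurrent(kv_budget_gb, context_tokens, fp16_gb_per_8k=5.0):
    # KV cache per request grows linearly with context length
    per_request_gb = fp16_gb_per_8k * context_tokens / 8192
    return int(kv_budget_gb // per_request_gb)

print(max_concurrent(80, 8192))     # 16 requests at 8K context
print(max_concurrent(80, 131072))   # 1 request at 128K context
```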

PagedAttention

vLLM uses PagedAttention to manage KV cache like virtual memory. Instead of pre-allocating max context for every request, it allocates cache pages on demand. This dramatically reduces wasted memory when requests have varying context lengths.
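The allocation idea can be illustrated with a toy allocator (the concept only, not vLLM's implementation; vLLM's default block size is 16 tokens):

```python
BLOCK_TOKENS = 16   # tokens per KV cache block

class BlockAllocator:
    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))    # free block ids
        self.tables = {}                         # request id -> (block ids, tokens used)

    def append_token(self, req_id):
        blocks, used = self.tables.get(req_id, ([], 0))
        if used % BLOCK_TOKENS == 0:             # current block full (or first token)
            blocks = blocks + [self.free.pop()]  # allocate one more block on demand
        self.tables[req_id] = (blocks, used + 1)

alloc = BlockAllocator(total_blocks=1024)
for _ in range(40):                              # a 40-token request
    alloc.append_token("req-0")
blocks, used = alloc.tables["req-0"]
print(len(blocks), used)   # 3 40: three blocks cover 40 tokens, not a max-context reservation
```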

Prefix caching

When multiple requests share a common prefix (same system prompt, same document), vLLM can reuse the KV cache for that prefix. This is free performance — enable enable_prefix_caching: true if your workload has shared prefixes.
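The reuse can be sketched as a hash table keyed by token prefixes (a toy model of the idea; vLLM hashes fixed-size token blocks, roughly as below):

```python
from hashlib import sha256

BLOCK = 16      # tokens per cached block
cache = {}      # prefix hash -> stand-in for that block's K/V tensors

def kv_blocks(tokens):
    """Return KV blocks for `tokens`, reusing any cached shared prefix."""
    blocks, hits = [], 0
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        key = sha256(repr(tokens[:end]).encode()).hexdigest()
        if key in cache:
            hits += 1                       # this prefix block was already computed
        else:
            cache[key] = f"kv-{key[:8]}"    # pretend we computed K/V here
        blocks.append(cache[key])
    return blocks, hits

system = list(range(32))                    # a shared 32-token system prompt
_, hits_a = kv_blocks(system + [7, 8] * 8)  # first request: no cache hits
_, hits_b = kv_blocks(system + [9, 9] * 8)  # second request reuses the 2 prefix blocks
print(hits_a, hits_b)                       # 0 2
```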

Related serving cards

See KV cache configurations on real hardware: