Context Length
The model says it supports 128K context. Your GPU disagrees. Understanding the memory budget is the difference between a working deployment and a mysterious OOM crash.
What max_model_len means
In vLLM, max_model_len sets the maximum sequence length the server will accept. This includes both the input prompt and the generated output tokens.
A model might be trained with a 128K context window, but that does not mean your GPU can serve it at 128K. The context window determines what the model can attend to; max_model_len determines what your hardware will actually serve.
```yaml
# Model supports 128K but GPU can only handle 32K
max_model_len: 32768
# vLLM will reject requests longer than this.
# Better to reject than to OOM mid-generation.
```
The memory budget
GPU memory is divided between three components during inference:
Model weights
Fixed. The parameters of the model. For Qwen3-coder-30B in FP8, this is ~30 GB. This is constant regardless of context length or batch size.
KV cache
Variable. Scales with context length and the number of concurrent requests. This is the dominant variable cost. See the KV Cache article for detailed calculations.
Activations + overhead
Variable. Temporary memory for intermediate computations, CUDA kernels, and vLLM's internal structures. Typically 2-5 GB, and often underestimated.
Available for KV cache = Total GPU memory - Weights - Activations - Safety margin
For GB10 (128 GB) with Qwen3-coder FP8: 128 - 30 - 4 - 10 = ~84 GB for KV cache
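The budget above is simple arithmetic, and writing it out makes the moving parts explicit. A minimal sketch using the article's illustrative figures for GB10 and Qwen3-coder FP8 (the numbers are examples from the text, not measurements):

```python
# Illustrative figures from the text: GB10 (128 GB unified memory)
# serving Qwen3-coder-30B in FP8.
total_gb = 128          # total GPU memory
weights_gb = 30         # model weights (FP8)
activations_gb = 4      # activations + CUDA/vLLM overhead
safety_margin_gb = 10   # fragmentation and load spikes

kv_cache_gb = total_gb - weights_gb - activations_gb - safety_margin_gb
print(f"KV cache budget: {kv_cache_gb} GB")  # KV cache budget: 84 GB
```

Swapping in your own weights and overhead numbers gives a first-order estimate before touching any server config.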
The OOM trap
This is the mistake that catches everyone at least once: the model starts fine, serves short requests perfectly, and then OOMs when a long request arrives. Or worse — it OOMs when concurrency spikes.
The trap
vLLM pre-allocates KV cache based on max_model_len and gpu_memory_utilization. If you set max_model_len: 131072 but your GPU cannot support that, vLLM will either fail to start or crash under load.
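As a sketch, the two settings interact like this in a config (the values are examples, not recommendations):

```yaml
# vLLM pre-allocates KV cache for max_model_len inside the
# gpu_memory_utilization budget. If the two are inconsistent
# with your hardware, it fails at startup or OOMs under load.
max_model_len: 32768
gpu_memory_utilization: 0.9
```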
Real example: Qwen3.5-27B AWQ on GB10
We tried running Qwen3.5-27B-A3B-AWQ with max_model_len: 131072 on GB10.
The math looked fine on paper: 15 GB (AWQ weights) + ~40 GB (FP8 KV at 131K) = 55 GB. Plenty of headroom in 128 GB.
In practice, it OOMed. The activations and overhead at 131K context are much larger than at shorter contexts. CUDA fragmentation, vLLM's block allocator overhead, and the attention computation itself all eat into that "headroom."
The fix: max_model_len: 32768. Works perfectly, serves the vast majority of real-world requests.
How to calculate what fits
A practical approach to sizing context length:
Start with available memory
Total GPU memory minus model weights. For GB10 + Qwen3-coder FP8: 128 - 30 = 98 GB available.
Subtract overhead
Reserve 15-20% for activations, CUDA overhead, and safety margin. 98 x 0.8 = ~78 GB for KV cache.
Calculate per-token KV cost
For Qwen3-coder: 2 (K and V) x 64 layers x 40 KV heads x 128 head dim x 1 byte (FP8) = ~655 KB per token. In FP16 it would be ~1.3 MB per token.
Divide to get max context
78 GB / 655 KB per token = ~125K tokens for a single request. For 4 concurrent requests: ~31K each. In practice, set max_model_len conservatively below this.
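The four steps above can be folded into one function. This is a sketch using the article's formula and numbers; the layer count, head count, and head dimension are taken from the text and may not match a given checkpoint, and the rounding differs slightly from the article's back-of-envelope figures:

```python
def max_context_tokens(total_gb, weights_gb, overhead_frac,
                       layers, kv_heads, head_dim, kv_bytes,
                       concurrency=1):
    """Rough upper bound on servable context length per request.

    Per-token KV cost = 2 (K and V) * layers * kv_heads
                        * head_dim * kv_bytes.
    """
    kv_budget = (total_gb - weights_gb) * (1 - overhead_frac) * 1e9
    per_token = 2 * layers * kv_heads * head_dim * kv_bytes
    return int(kv_budget / per_token / concurrency)

# Article's numbers: GB10, Qwen3-coder, FP8 KV cache (1 byte/elem).
print(max_context_tokens(128, 30, 0.20, 64, 40, 128, 1))                 # ~120K
print(max_context_tokens(128, 30, 0.20, 64, 40, 128, 1, concurrency=4))  # ~30K
```

Set max_model_len comfortably below whatever this returns; it is an upper bound, not a target.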
The gpu_memory_utilization setting
vLLM's gpu_memory_utilization controls what fraction of GPU memory vLLM is allowed to use. The default is 0.9 (90%).
```yaml
# Conservative — leaves headroom for other processes
gpu_memory_utilization: 0.85

# Aggressive — maximizes available KV cache
gpu_memory_utilization: 0.95

# On GB10 with UMA, we use 0.95 because there is
# no separate system RAM to worry about
```
UMA changes the math
On discrete GPUs, you need to leave memory for the OS and other processes. On unified memory architectures like GB10, the GPU and CPU share the same memory pool, so you can push gpu_memory_utilization higher. We run 0.95 on GB10 without issues.
Practical guidelines
Start conservative. Set max_model_len to 32K or 64K and increase only after confirming stability under load.
Test under concurrency. A context length that works for 1 request may OOM with 4 concurrent requests.
Use FP8 KV cache. It halves your KV memory cost with negligible quality impact. There is no reason not to use it on supported hardware.
Monitor actual usage. Most real-world requests use far less context than the maximum. If 95% of your requests are under 8K tokens, setting max_model_len: 131072 wastes memory that could serve more concurrent requests.
Consider your use case. Coding agents typically need 16-64K context. RAG pipelines need 4-16K. Chat applications need 4-8K. Only set 128K+ if you have a real need for it.
Related serving cards
See context length configurations benchmarked on real hardware: