Hardware for LLM Inference
The GPU determines everything: what models fit, how fast they run, and how many users you can serve. Here is a practical guide to the hardware landscape.
NVIDIA GB10 (Grace Blackwell)
The GB10 is NVIDIA's desktop AI workstation chip. It pairs a Blackwell GPU with a Grace ARM CPU, connected by a high-bandwidth NVLink — and critically, they share 128 GB of unified memory (UMA).
| Spec | Value |
|---|---|
| GPU Memory | 128 GB (unified) |
| Memory Bandwidth | ~273 GB/s |
| Tensor Cores | Blackwell (FP8 native) |
| Form Factor | Desktop (DGX Spark) |
Why GB10 is interesting for local inference
128 GB of unified memory means you can run 30B+ parameter models that would not fit on any consumer GPU. An RTX 4090 has 24 GB — a Qwen3-coder-30B in FP8 needs ~30 GB for weights alone. GB10 fits the model with 98 GB of headroom for KV cache, activations, and concurrent requests.
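The sizing arithmetic can be sketched in a few lines (the ~10% overhead factor for embeddings, norms, and framework buffers is an assumption, not a measured figure):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.10) -> float:
    """Estimate weight memory: parameter count x bytes per parameter,
    plus a rough overhead factor (assumption) for embeddings and buffers."""
    return params_billion * bytes_per_param * overhead

# 30B parameters in FP8 (1 byte/param) -> roughly 33 GB with overhead
fp8_30b = weight_memory_gb(30, 1.0)

print(f"30B FP8 weights: ~{fp8_30b:.0f} GB")
print(f"GB10 headroom:   ~{128 - fp8_30b:.0f} GB")
print(f"RTX 4090 (24 GB): {'fits' if fp8_30b <= 24 else 'does not fit'}")
```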
GPU comparison for inference
| GPU | VRAM | Bandwidth | Max Model (FP8) | Price |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 1,008 GB/s | ~14B | ~$1,600 |
| A100 80GB | 80 GB | 2,039 GB/s | ~70B | ~$2/hr cloud |
| H100 80GB | 80 GB | 3,350 GB/s | ~70B | ~$3/hr cloud |
| GB10 | 128 GB (UMA) | ~273 GB/s | ~120B | ~$3,000 |
Notice the tradeoff: GB10 has the most memory but the lowest bandwidth. This means it can fit the largest models but generates tokens slower per request. For throughput-sensitive APIs, H100 wins. For local development where you need a big model on your desk, GB10 wins.
UMA vs discrete memory
Traditional GPUs (RTX 4090, A100, H100) have discrete memory — separate RAM chips on the GPU card. The CPU has its own system RAM. Data must be copied between them over PCIe, which is slow.
Discrete memory (RTX 4090, A100, H100)
- GPU has its own dedicated VRAM (24-80 GB)
- High bandwidth within the GPU (1-3.3 TB/s)
- Model must fit entirely in VRAM for good performance
- CPU offloading is possible but ~100x slower for the offloaded layers
Unified memory (GB10, Apple Silicon)
- GPU and CPU share the same memory pool
- Lower peak bandwidth (~273 GB/s on GB10, ~400 GB/s on M4 Max)
- No PCIe bottleneck — the entire memory pool is "VRAM"
- Bigger models fit, but each token takes longer to generate
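To see why offloading hurts, compare the time to stream one layer's weights over each path. The bandwidth figures are approximate vendor peaks, not measurements, and the 1 GB layer size is hypothetical:

```python
# Per-token read time for a layer's weights, by where the weights live.
LAYER_GB = 1.0  # hypothetical 1 GB of weights per layer

paths_gb_per_s = {
    "RTX 4090 VRAM":        1008,
    "H100 HBM3":            3350,
    "GB10 unified memory":   273,
    "PCIe 4.0 x16 offload":   32,  # ~32 GB/s theoretical peak
}

for path, bw in paths_gb_per_s.items():
    ms = LAYER_GB / bw * 1000  # time to stream the layer once, per token
    print(f"{path:22s} {ms:6.2f} ms per 1 GB layer read")
```

Offloaded layers stream at roughly 1/100th of H100 HBM speed, which is where the "~100x slower" rule of thumb comes from.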
The bandwidth tradeoff
An RTX 4090 reads memory at 1,008 GB/s but only has 24 GB. GB10 reads at 273 GB/s but has 128 GB. For a 30B FP8 model, the 4090 cannot run it at all (does not fit), while GB10 generates ~35 tok/s single-stream. Memory capacity wins over bandwidth when the model does not fit otherwise.
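This can be made concrete with the usual memory-bandwidth bound on decode: each generated token must read every active weight once, so single-stream tok/s is capped at bandwidth divided by bytes per token. A sketch (real throughput also pays for KV-cache reads and kernel overhead, so measured numbers like ~35 tok/s sit below these ceilings; the ~3B-active MoE case is an assumption about the model architecture):

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_params_b: float,
                             bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed:
    tok/s <= bandwidth / bytes read per token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Dense 30B in FP8 on GB10: all 30 GB of weights read per token
print(decode_tok_s_upper_bound(273, 30, 1.0))  # ~9 tok/s ceiling

# MoE with ~3B active parameters reads far less per token
print(decode_tok_s_upper_bound(273, 3, 1.0))   # ~91 tok/s ceiling
```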
What gpu_memory_utilization means
In vLLM, gpu_memory_utilization tells vLLM what fraction of GPU memory it is allowed to use. The remaining memory is left for the OS, CUDA runtime, and other processes.
| Setting | Use Case | Risk |
|---|---|---|
| 0.80 | Shared GPU, other processes running | Low — plenty of headroom |
| 0.90 | Default — dedicated inference GPU | Low — standard setting |
| 0.95 | UMA systems (GB10), squeeze max KV cache | Medium — monitor for OOM |
| 0.99 | Absolute maximum | High — fragile, any spike OOMs |
Higher utilization means more KV cache capacity (more concurrent requests or longer contexts), but it also means less headroom for memory spikes. On UMA systems like GB10, 0.95 is workable in practice: even at 95% utilization, the remaining 5% of a 128 GB pool is still ~6 GB for the OS, CUDA runtime, and other processes.
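A rough budget calculation, assuming a hypothetical 4 GB reserve for activations and runtime overhead:

```python
def kv_cache_budget_gb(total_gb: float, utilization: float, weights_gb: float,
                       activation_reserve_gb: float = 4.0) -> float:
    """Memory left for KV cache after the engine claims `utilization` of
    GPU memory and loads the weights. The 4 GB reserve is an assumption."""
    usable = total_gb * utilization
    return usable - weights_gb - activation_reserve_gb

# GB10 with 30 GB of FP8 weights, at two utilization settings
print(kv_cache_budget_gb(128, 0.90, 30))  # -> ~81.2 GB
print(kv_cache_budget_gb(128, 0.95, 30))  # -> ~87.6 GB
```

In vLLM this knob is the `gpu_memory_utilization` argument to `LLM(...)`, or `--gpu-memory-utilization` when launching `vllm serve`.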
What models fit on what GPUs
A practical sizing guide. Remember: you need room for both model weights and KV cache.
RTX 4090 (24 GB)
- 7B models in FP16 (~14 GB weights, ~10 GB for KV cache)
- 14B models in INT4/AWQ (~7 GB weights, ~17 GB for KV cache)
- 30B+ models: do not fit, even with INT4
A100 / H100 (80 GB)
- 30B models in FP8 (~30 GB weights, ~50 GB for KV cache)
- 70B models in INT4/AWQ (~35 GB weights, ~45 GB for KV cache)
- 70B in FP8: tight, limited KV cache, short context only
GB10 (128 GB UMA)
- 30B models in FP8 (~30 GB weights, ~98 GB for KV cache) — generous headroom
- 70B models in FP8 (~70 GB weights, ~58 GB for KV cache) — workable
- 120B+ models in INT4 (~60 GB weights, ~68 GB for KV cache) — fits but slow
This is where GB10 shines: models that do not fit on any 80 GB GPU run comfortably here with room for long contexts.
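The sizing guide above reduces to a simple rule of thumb: weights plus a minimum KV-cache allowance must fit in GPU memory. A sketch, assuming an arbitrary 8 GB KV-cache floor for a usable context window:

```python
def fits(gpu_gb: float, params_b: float, bytes_per_param: float,
         min_kv_cache_gb: float = 8.0) -> bool:
    """Does the model fit with at least `min_kv_cache_gb` left for KV cache?
    The 8 GB floor is an assumption, not a hard requirement."""
    return params_b * bytes_per_param + min_kv_cache_gb <= gpu_gb

for gpu, gb in [("RTX 4090", 24), ("A100/H100", 80), ("GB10", 128)]:
    verdicts = {
        "30B FP8":  fits(gb, 30, 1.0),
        "70B FP8":  fits(gb, 70, 1.0),
        "70B INT4": fits(gb, 70, 0.5),
    }
    print(gpu, verdicts)
```

Note that 70B FP8 on an 80 GB card passes only barely (78 GB of 80), matching the "tight, limited KV cache" caveat above.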
Choosing hardware
Local development agent? GB10. You need a big model (30B+) for quality coding, and you need it on your desk, not in the cloud. The lower bandwidth is fine for single-developer use.
Team API server (1-10 users)? A100 or H100 in the cloud. Higher bandwidth means more tok/s per request and better concurrency. Rent by the hour.
Production API (100+ users)? Multiple H100s with tensor parallelism. This is where bandwidth dominates and you need raw throughput.
Budget-constrained experiments? RTX 4090 with 7-14B models in AWQ. Surprisingly capable for prototyping, and you can buy one for $1,600.
Apple Silicon? M4 Max (128 GB) is similar to GB10 in memory capacity with ~400 GB/s bandwidth. Use MLX or llama.cpp, not vLLM (which requires CUDA).
Related serving cards
See hardware-specific configurations with real benchmark data: