Hardware for LLM Inference
The GPU determines everything: what models fit, how fast they run, and how many users you can serve. Here is a practical guide to the hardware landscape.
NVIDIA GB10 (Grace Blackwell)
The GB10 is NVIDIA's desktop AI workstation chip. It pairs a Blackwell GPU with a Grace ARM CPU, connected by a high-bandwidth NVLink — and critically, they share 128 GB of unified memory (UMA).
| Spec | Value |
|---|---|
| GPU Memory | 128 GB (unified) |
| Memory Bandwidth | ~273 GB/s |
| Tensor Cores | Blackwell (FP8 native) |
| Form Factor | Desktop (DGX Spark) |
Why GB10 is interesting for local inference
128 GB of unified memory means you can run 30B+ parameter models that would not fit on any consumer GPU. An RTX 4090 has 24 GB — a Qwen3-coder-30B in FP8 needs ~30 GB for weights alone. GB10 fits the model with 98 GB of headroom for KV cache, activations, and concurrent requests.
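The sizing arithmetic can be sketched in a few lines (the ~10% overhead factor for embeddings, norms, and framework buffers is an assumption, not a measured figure):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.10) -> float:
    """Estimate weight memory: parameter count x bytes per parameter,
    plus a rough overhead factor (assumption) for embeddings and buffers."""
    return params_billion * bytes_per_param * overhead

# 30B parameters in FP8 (1 byte/param) -> roughly 33 GB with overhead
fp8_30b = weight_memory_gb(30, 1.0)

print(f"30B FP8 weights: ~{fp8_30b:.0f} GB")
print(f"GB10 headroom:   ~{128 - fp8_30b:.0f} GB")
print(f"RTX 4090 (24 GB): {'fits' if fp8_30b <= 24 else 'does not fit'}")
```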
GPU comparison for inference
| GPU | VRAM | Bandwidth | Max Model (FP8) | Price |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 1,008 GB/s | ~14B | ~$1,600 |
| A100 80GB | 80 GB | 2,039 GB/s | ~70B | ~$2/hr cloud |
| H100 80GB | 80 GB | 3,350 GB/s | ~70B | ~$3/hr cloud |
| GB10 | 128 GB (UMA) | ~273 GB/s | ~120B | ~$3,000 |
Notice the tradeoff: GB10 has the most memory but the lowest bandwidth. This means it can fit the largest models but generates tokens slower per request. For throughput-sensitive APIs, H100 wins. For local development where you need a big model on your desk, GB10 wins.
UMA vs discrete memory
Traditional GPUs (RTX 4090, A100, H100) have discrete memory — separate RAM chips on the GPU card. The CPU has its own system RAM. Data must be copied between them over PCIe, which is slow.
Discrete memory (RTX 4090, A100, H100)
- GPU has its own dedicated VRAM (24-80 GB)
- High bandwidth within the GPU (1-3.3 TB/s)
- Model must fit entirely in VRAM for good performance
- CPU offloading is possible but ~100x slower for the offloaded layers
Unified memory (GB10, Apple Silicon)
- GPU and CPU share the same memory pool
- Lower peak bandwidth (~273 GB/s on GB10, ~400 GB/s on M4 Max)
- No PCIe bottleneck — the entire memory pool is "VRAM"
- Bigger models fit, but each token takes longer to generate
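To see why offloading hurts, compare the time to stream one layer's weights over each path. The bandwidth figures are approximate vendor peaks, not measurements, and the 1 GB layer size is hypothetical:

```python
# Per-token read time for a layer's weights, by where the weights live.
LAYER_GB = 1.0  # hypothetical 1 GB of weights per layer

paths_gb_per_s = {
    "RTX 4090 VRAM":        1008,
    "H100 HBM3":            3350,
    "GB10 unified memory":   273,
    "PCIe 4.0 x16 offload":   32,  # ~32 GB/s theoretical peak
}

for path, bw in paths_gb_per_s.items():
    ms = LAYER_GB / bw * 1000  # time to stream the layer once, per token
    print(f"{path:22s} {ms:6.2f} ms per 1 GB layer read")
```

Offloaded layers stream at roughly 1/100th of H100 HBM speed, which is where the "~100x slower" rule of thumb comes from.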
The bandwidth tradeoff
An RTX 4090 reads memory at 1,008 GB/s but only has 24 GB. GB10 reads at 273 GB/s but has 128 GB. For a 30B FP8 model, the 4090 cannot run it at all (does not fit), while GB10 generates ~35 tok/s single-stream. Memory capacity wins over bandwidth when the model does not fit otherwise.
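This can be made concrete with the usual memory-bandwidth bound on decode: each generated token must read every active weight once, so single-stream tok/s is capped at bandwidth divided by bytes per token. A sketch (real throughput also pays for KV-cache reads and kernel overhead, so measured numbers like ~35 tok/s sit below these ceilings; the ~3B-active MoE case is an assumption about the model architecture):

```python
def decode_tok_s_upper_bound(bandwidth_gb_s: float, active_params_b: float,
                             bytes_per_param: float) -> float:
    """Bandwidth-bound ceiling on single-stream decode speed:
    tok/s <= bandwidth / bytes read per token."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# Dense 30B in FP8 on GB10: all 30 GB of weights read per token
print(decode_tok_s_upper_bound(273, 30, 1.0))  # ~9 tok/s ceiling

# MoE with ~3B active parameters reads far less per token
print(decode_tok_s_upper_bound(273, 3, 1.0))   # ~91 tok/s ceiling
```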
What gpu_memory_utilization means
In vLLM, gpu_memory_utilization tells vLLM what fraction of GPU memory it is allowed to use. The remaining memory is left for the OS, CUDA runtime, and other processes.
| Setting | Use Case | Risk |
|---|---|---|
| 0.80 | Shared GPU, other processes running | Low — plenty of headroom |
| 0.90 | Default — dedicated inference GPU | Low — standard setting |
| 0.95 | UMA systems (GB10), squeeze max KV cache | Medium — monitor for OOM |
| 0.99 | Absolute maximum | High — fragile, any spike OOMs |
Higher utilization means more KV cache capacity (more concurrent requests or longer contexts), but it also means less headroom for memory spikes. On UMA systems like GB10, 0.95 is workable in practice: even at 95% utilization, the remaining 5% of a 128 GB pool is still ~6 GB for the OS, CUDA runtime, and other processes.
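A rough budget calculation, assuming a hypothetical 4 GB reserve for activations and runtime overhead:

```python
def kv_cache_budget_gb(total_gb: float, utilization: float, weights_gb: float,
                       activation_reserve_gb: float = 4.0) -> float:
    """Memory left for KV cache after the engine claims `utilization` of
    GPU memory and loads the weights. The 4 GB reserve is an assumption."""
    usable = total_gb * utilization
    return usable - weights_gb - activation_reserve_gb

# GB10 with 30 GB of FP8 weights, at two utilization settings
print(kv_cache_budget_gb(128, 0.90, 30))  # -> ~81.2 GB
print(kv_cache_budget_gb(128, 0.95, 30))  # -> ~87.6 GB
```

In vLLM this knob is the `gpu_memory_utilization` argument to `LLM(...)`, or `--gpu-memory-utilization` when launching `vllm serve`.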
What models fit on what GPUs
A practical sizing guide. Remember: you need room for both model weights and KV cache.
RTX 4090 (24 GB)
- 7B models in FP16 (~14 GB weights, ~10 GB for KV cache)
- 14B models in INT4/AWQ (~7 GB weights, ~17 GB for KV cache)
- 30B+ models: do not fit, even with INT4
A100 / H100 (80 GB)
- 30B models in FP8 (~30 GB weights, ~50 GB for KV cache)
- 70B models in INT4/AWQ (~35 GB weights, ~45 GB for KV cache)
- 70B in FP8: tight, limited KV cache, short context only
GB10 (128 GB UMA)
- 30B models in FP8 (~30 GB weights, ~98 GB for KV cache) — generous headroom
- 70B models in FP8 (~70 GB weights, ~58 GB for KV cache) — workable
- 120B+ models in INT4 (~60 GB weights, ~68 GB for KV cache) — fits but slow
This is where GB10 shines: models that do not fit on any 80 GB GPU run comfortably here with room for long contexts.
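The sizing guide above reduces to a simple rule of thumb: weights plus a minimum KV-cache allowance must fit in GPU memory. A sketch, assuming an arbitrary 8 GB KV-cache floor for a usable context window:

```python
def fits(gpu_gb: float, params_b: float, bytes_per_param: float,
         min_kv_cache_gb: float = 8.0) -> bool:
    """Does the model fit with at least `min_kv_cache_gb` left for KV cache?
    The 8 GB floor is an assumption, not a hard requirement."""
    return params_b * bytes_per_param + min_kv_cache_gb <= gpu_gb

for gpu, gb in [("RTX 4090", 24), ("A100/H100", 80), ("GB10", 128)]:
    verdicts = {
        "30B FP8":  fits(gb, 30, 1.0),
        "70B FP8":  fits(gb, 70, 1.0),
        "70B INT4": fits(gb, 70, 0.5),
    }
    print(gpu, verdicts)
```

Note that 70B FP8 on an 80 GB card passes only barely (78 GB of 80), matching the "tight, limited KV cache" caveat above.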
Choosing hardware
Local development agent? GB10. You need a big model (30B+) for quality coding, and you need it on your desk, not in the cloud. The lower bandwidth is fine for single-developer use.
Team API server (1-10 users)? A100 or H100 in the cloud. Higher bandwidth means more tok/s per request and better concurrency. Rent by the hour.
Production API (100+ users)? Multiple H100s with tensor parallelism. This is where bandwidth dominates and you need raw throughput.
Budget-constrained experiments? RTX 4090 with 7-14B models in AWQ. Surprisingly capable for prototyping, and you can buy one for $1,600.
Apple Silicon? M4 Max (128 GB) is similar to GB10 in memory capacity with ~400 GB/s bandwidth. Use MLX or llama.cpp, not vLLM (which requires CUDA).
Related serving cards
See hardware-specific configurations with real benchmark data: