LLM Serving Fundamentals
Everything you need to know to serve models efficiently. Real numbers from real hardware, not textbook theory.
Quantization
FP32 to INT4 and everything in between. When to use FP8, AWQ, GPTQ, and GGUF — and what each costs you in quality and throughput.
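The footprint difference between these formats is simple arithmetic. A minimal sketch (bits per weight are format properties; real checkpoints add overhead for quantization scales, zero-points, and embeddings, so treat these as lower bounds):

```python
# Rough weight-memory math for common formats (a sketch; real
# checkpoints carry extra bytes for scales and unquantized layers).
BITS_PER_WEIGHT = {
    "FP32": 32, "FP16": 16, "FP8": 8,
    "INT8": 8, "INT4 (AWQ/GPTQ)": 4,
}

def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight footprint in GiB."""
    return params_billions * 1e9 * bits / 8 / 2**30

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"7B @ {fmt:>15}: {weight_gb(7, bits):5.1f} GiB")
```

A 7B model drops from about 26 GiB at FP32 to about 3 GiB at INT4, which is what decides whether it fits a consumer card at all.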
Speculative Decoding
Draft-and-verify acceleration. How Eagle3 gets +40% single-stream throughput — and why it can hurt at high concurrency.
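The draft-and-verify idea can be sketched in a few lines. This is a toy greedy variant, not Eagle3 itself (real implementations verify all proposals in one batched forward pass and use rejection sampling rather than exact-match greedy acceptance):

```python
from typing import Callable, List

def speculative_step(target: Callable[[List[int]], int],
                     draft: Callable[[List[int]], int],
                     ctx: List[int], k: int = 4) -> List[int]:
    """One greedy draft-and-verify step (toy sketch).

    The cheap draft model proposes k tokens; the target model keeps the
    longest agreeing prefix, then contributes one token of its own, so
    at least one token is always emitted per target pass.
    """
    proposals = []
    cur = list(ctx)
    for _ in range(k):
        t = draft(cur)
        proposals.append(t)
        cur.append(t)
    accepted = []
    cur = list(ctx)
    for t in proposals:
        if target(cur) == t:          # target agrees with the draft
            accepted.append(t)
            cur.append(t)
        else:
            break                     # first disagreement ends the run
    accepted.append(target(cur))      # target's own next token
    return accepted

# Toy "models" over integer tokens: target counts up; draft mostly agrees.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) < 6 else 0
print(speculative_step(target, draft, [1, 2, 3]))  # 3 drafts accepted + 1
```

The speedup comes from acceptance rate: when the draft usually agrees, each expensive target pass yields several tokens instead of one. At high concurrency the extra draft work competes with batch throughput, which is why the gain can invert.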
KV Cache
The hidden memory cost of long contexts. FP8 vs FP16 caching, TurboQuant compression, and why your 128K context OOMs.
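The per-token cost follows from the standard KV-cache size formula; the model shape below is a Llama-3-8B-like assumption (32 layers, 8 KV heads via GQA, head_dim 128):

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim
    * bytes per element * tokens, in GiB."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# 128 KiB per token at FP16 for this shape; a full 128K context costs:
for name, b in [("FP16", 2), ("FP8", 1)]:
    print(f"128K ctx @ {name}: {kv_cache_gib(128 * 1024, 32, 8, 128, b):.1f} GiB")
```

For this shape a single 128K-token sequence needs 16 GiB of cache at FP16 and 8 GiB at FP8 — on top of the weights, which is exactly how a 24 GiB card OOMs.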
Context Length
What max_model_len really means. How to calculate memory budgets and avoid the OOM trap that catches everyone.
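The budget calculation can be inverted to find the largest context that actually fits. A sketch under stated assumptions (an ~8B FP16 model at ~15 GiB of weights, FP16 KV cache at 128 KiB/token; real servers also budget for activations and fragmentation, which this ignores):

```python
def max_kv_tokens(vram_gib: float, util: float,
                  weight_gib: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights, within the memory
    fraction the server is allowed to use (sketch; ignores activation
    memory and allocator fragmentation)."""
    budget_bytes = (vram_gib * util - weight_gib) * 2**30
    return int(budget_bytes // kv_bytes_per_token)

# 24 GiB card, 90% usable, ~15 GiB of FP16 weights, 128 KiB/token cache:
print(max_kv_tokens(24, 0.90, 15, 128 * 1024))
```

The result is roughly 54K tokens of total cache — so setting max_model_len to 128K on this card is an OOM waiting for the first long request, even though the weights alone fit comfortably.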
Tool Calling
Function calling for LLMs. vLLM tool parsers, agent loops, structured I/O with CACP, and why some models fail at it.
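The mechanics rest on a tool schema the model is shown and a structured reply the agent loop parses. A minimal sketch in the OpenAI-style format that vLLM's chat API accepts (the get_weather function and the sample reply are made-up examples):

```python
import json

# An OpenAI-style tool definition (the get_weather function is
# hypothetical; only the schema shape matters here).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# A model that handles tool calling replies with JSON arguments instead
# of prose; the agent loop parses them and runs the function.
model_reply = '{"city": "Oslo", "unit": "celsius"}'
args = json.loads(model_reply)
print(args["city"], args["unit"])
```

Models "fail at it" mostly at the parsing seam: malformed JSON, hallucinated parameter names, or arguments wrapped in prose, which is why vLLM needs per-model tool parsers.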
Hardware
GB10, RTX 4090, A100, H100 — what fits where. UMA vs discrete memory, gpu_memory_utilization, and practical sizing.
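A fit check is the same arithmetic applied per card. A sketch assuming an ~8B FP16 model (~15 GiB weights) plus 8 GiB of KV cache; memory capacities are the cards' public specs, and the 0.90 fraction mirrors a typical gpu_memory_utilization setting rather than a hard rule:

```python
# Usable memory after the vLLM-style gpu_memory_utilization reserve,
# vs. a workload needing ~23 GiB (15 GiB weights + 8 GiB KV cache).
CARDS_GIB = {"RTX 4090": 24, "A100 80GB": 80, "H100 80GB": 80,
             "GB10 (UMA)": 128}

def fits(card_gib: float, need_gib: float, util: float = 0.90) -> bool:
    """True if the workload fits in the allowed memory fraction."""
    return card_gib * util >= need_gib

for card, gib in CARDS_GIB.items():
    print(f"{card:>12}: {'fits' if fits(gib, 15 + 8) else 'too small'}")
```

Note the 4090 misses by a hair: 24 GiB of VRAM is only 21.6 GiB usable at 0.90, so a workload that "fits on paper" still OOMs. On UMA parts like GB10 the pool is shared with the CPU, so the usable fraction depends on what else the system is running.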
CACP Protocol
10x token savings for agent communication. Structured I/O that replaces 2,000 tokens of prose with 200 tokens of typed fields.
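The tradeoff is easy to see side by side. This is only an illustration of prose-vs-typed-fields — CACP's actual wire format is not shown here, and every field name below is hypothetical:

```python
import json

# What an agent might say in free-form prose...
prose = ("I looked into the build failure you mentioned. It seems the "
         "tests in the auth module are failing because of a timeout, "
         "and I'm fairly confident that's the root cause.")

# ...vs. the same information as typed fields (hypothetical schema):
typed = {
    "task": "diagnose_build",   # enum, not a paragraph
    "status": "fail",
    "component": "auth",
    "cause": "test_timeout",
    "confidence": 0.8,
}

compact = json.dumps(typed, separators=(",", ":"))
print(f"{len(prose.split())} prose words -> {len(compact)} chars of fields")
```

Enumerated fields are also cheaper to validate and act on than prose: the receiving agent branches on status and cause instead of re-parsing a paragraph with another LLM call.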
Autoresearch
378 iterations overnight. How Karpathy's autoresearch pattern found the optimal serving config — and what we learned about tuning LLMs.