LLM Serving Fundamentals
Everything you need to know to serve models efficiently. Real numbers from real hardware, not textbook theory.
Quantization
FP32 to INT4 and everything in between. When to use FP8, AWQ, GPTQ, and GGUF — and what each costs you in quality and throughput.
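The footprint difference between these formats is simple arithmetic. A minimal sketch (bits per weight are format properties; real checkpoints add overhead for quantization scales, zero-points, and embeddings, so treat these as lower bounds):

```python
# Rough weight-memory math for common formats (a sketch; real
# checkpoints carry extra bytes for scales and unquantized layers).
BITS_PER_WEIGHT = {
    "FP32": 32, "FP16": 16, "FP8": 8,
    "INT8": 8, "INT4 (AWQ/GPTQ)": 4,
}

def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight footprint in GiB."""
    return params_billions * 1e9 * bits / 8 / 2**30

for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"7B @ {fmt:>15}: {weight_gb(7, bits):5.1f} GiB")
```

A 7B model drops from about 26 GiB at FP32 to about 3 GiB at INT4, which is what decides whether it fits a consumer card at all.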
Speculative Decoding
Draft-and-verify acceleration. How Eagle3 gets +40% single-stream throughput — and why it can hurt at high concurrency.
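The draft-and-verify idea can be sketched in a few lines. This is a toy greedy variant, not Eagle3 itself (real implementations verify all proposals in one batched forward pass and use rejection sampling rather than exact-match greedy acceptance):

```python
from typing import Callable, List

def speculative_step(target: Callable[[List[int]], int],
                     draft: Callable[[List[int]], int],
                     ctx: List[int], k: int = 4) -> List[int]:
    """One greedy draft-and-verify step (toy sketch).

    The cheap draft model proposes k tokens; the target model keeps the
    longest agreeing prefix, then contributes one token of its own, so
    at least one token is always emitted per target pass.
    """
    proposals = []
    cur = list(ctx)
    for _ in range(k):
        t = draft(cur)
        proposals.append(t)
        cur.append(t)
    accepted = []
    cur = list(ctx)
    for t in proposals:
        if target(cur) == t:          # target agrees with the draft
            accepted.append(t)
            cur.append(t)
        else:
            break                     # first disagreement ends the run
    accepted.append(target(cur))      # target's own next token
    return accepted

# Toy "models" over integer tokens: target counts up; draft mostly agrees.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) < 6 else 0
print(speculative_step(target, draft, [1, 2, 3]))  # 3 drafts accepted + 1
```

The speedup comes from acceptance rate: when the draft usually agrees, each expensive target pass yields several tokens instead of one. At high concurrency the extra draft work competes with batch throughput, which is why the gain can invert.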
KV Cache
The hidden memory cost of long contexts. FP8 vs FP16 caching, TurboQuant compression, and why your 128K context OOMs.
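The per-token cost follows from the standard KV-cache size formula; the model shape below is a Llama-3-8B-like assumption (32 layers, 8 KV heads via GQA, head_dim 128):

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim
    * bytes per element * tokens, in GiB."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# 128 KiB per token at FP16 for this shape; a full 128K context costs:
for name, b in [("FP16", 2), ("FP8", 1)]:
    print(f"128K ctx @ {name}: {kv_cache_gib(128 * 1024, 32, 8, 128, b):.1f} GiB")
```

For this shape a single 128K-token sequence needs 16 GiB of cache at FP16 and 8 GiB at FP8 — on top of the weights, which is exactly how a 24 GiB card OOMs.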
Context Length
What max_model_len really means. How to calculate memory budgets and avoid the OOM trap that catches everyone.
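The budget calculation can be inverted to find the largest context that actually fits. A sketch under stated assumptions (an ~8B FP16 model at ~15 GiB of weights, FP16 KV cache at 128 KiB/token; real servers also budget for activations and fragmentation, which this ignores):

```python
def max_kv_tokens(vram_gib: float, util: float,
                  weight_gib: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights, within the memory
    fraction the server is allowed to use (sketch; ignores activation
    memory and allocator fragmentation)."""
    budget_bytes = (vram_gib * util - weight_gib) * 2**30
    return int(budget_bytes // kv_bytes_per_token)

# 24 GiB card, 90% usable, ~15 GiB of FP16 weights, 128 KiB/token cache:
print(max_kv_tokens(24, 0.90, 15, 128 * 1024))
```

The result is roughly 54K tokens of total cache — so setting max_model_len to 128K on this card is an OOM waiting for the first long request, even though the weights alone fit comfortably.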
Tool Calling
Function calling for LLMs. vLLM tool parsers, agent loops, structured I/O with CACP, and why some models fail at it.
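The mechanics rest on a tool schema the model is shown and a structured reply the agent loop parses. A minimal sketch in the OpenAI-style format that vLLM's chat API accepts (the get_weather function and the sample reply are made-up examples):

```python
import json

# An OpenAI-style tool definition (the get_weather function is
# hypothetical; only the schema shape matters here).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# A model that handles tool calling replies with JSON arguments instead
# of prose; the agent loop parses them and runs the function.
model_reply = '{"city": "Oslo", "unit": "celsius"}'
args = json.loads(model_reply)
print(args["city"], args["unit"])
```

Models "fail at it" mostly at the parsing seam: malformed JSON, hallucinated parameter names, or arguments wrapped in prose, which is why vLLM needs per-model tool parsers.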
Hardware
GB10, RTX 4090, A100, H100 — what fits where. UMA vs discrete memory, gpu_memory_utilization, and practical sizing.
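A fit check is the same arithmetic applied per card. A sketch assuming an ~8B FP16 model (~15 GiB weights) plus 8 GiB of KV cache; memory capacities are the cards' public specs, and the 0.90 fraction mirrors a typical gpu_memory_utilization setting rather than a hard rule:

```python
# Usable memory after the vLLM-style gpu_memory_utilization reserve,
# vs. a workload needing ~23 GiB (15 GiB weights + 8 GiB KV cache).
CARDS_GIB = {"RTX 4090": 24, "A100 80GB": 80, "H100 80GB": 80,
             "GB10 (UMA)": 128}

def fits(card_gib: float, need_gib: float, util: float = 0.90) -> bool:
    """True if the workload fits in the allowed memory fraction."""
    return card_gib * util >= need_gib

for card, gib in CARDS_GIB.items():
    print(f"{card:>12}: {'fits' if fits(gib, 15 + 8) else 'too small'}")
```

Note the 4090 misses by a hair: 24 GiB of VRAM is only 21.6 GiB usable at 0.90, so a workload that "fits on paper" still OOMs. On UMA parts like GB10 the pool is shared with the CPU, so the usable fraction depends on what else the system is running.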
CACP Protocol
10x token savings for agent communication. Structured I/O that replaces 2,000 tokens of prose with 200 tokens of typed fields.
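The tradeoff is easy to see side by side. This is only an illustration of prose-vs-typed-fields — CACP's actual wire format is not shown here, and every field name below is hypothetical:

```python
import json

# What an agent might say in free-form prose...
prose = ("I looked into the build failure you mentioned. It seems the "
         "tests in the auth module are failing because of a timeout, "
         "and I'm fairly confident that's the root cause.")

# ...vs. the same information as typed fields (hypothetical schema):
typed = {
    "task": "diagnose_build",   # enum, not a paragraph
    "status": "fail",
    "component": "auth",
    "cause": "test_timeout",
    "confidence": 0.8,
}

compact = json.dumps(typed, separators=(",", ":"))
print(f"{len(prose.split())} prose words -> {len(compact)} chars of fields")
```

Enumerated fields are also cheaper to validate and act on than prose: the receiving agent branches on status and cause instead of re-parsing a paragraph with another LLM call.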
Autoresearch
378 iterations overnight. How Karpathy's autoresearch pattern found the optimal serving config — and what we learned about tuning LLMs.