
LLM Serving Fundamentals

Everything you need to know to serve models efficiently. Real numbers from real hardware, not textbook theory.

Quantization (8 min read)

FP32 to INT4 and everything in between. When to use FP8, AWQ, GPTQ, and GGUF — and what each costs you in quality and throughput.
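
In practice this is often one argument away. A minimal sketch with vLLM, assuming a pre-quantized AWQ checkpoint (the model name below is just an example):

```python
from vllm import LLM

# "quantization" selects the AWQ kernel path; the checkpoint itself
# must already be AWQ-quantized. Model name is illustrative.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
out = llm.generate("Quantization trades accuracy for")
print(out[0].outputs[0].text)
```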

Speculative Decoding (7 min read)

Draft-and-verify acceleration. How Eagle3 gets +40% single-stream throughput — and why it can hurt at high concurrency.
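
The core loop is simple enough to sketch. This is a toy simulation, not vLLM's implementation: draft_step and target_accepts stand in for the real models, and the ~70% acceptance rate is illustrative:

```python
import random

def draft_step(ctx):               # stand-in for a small, fast draft model
    return random.choice("abcd")

def target_accepts(ctx, tok):      # stand-in for one verified position of the
    return random.random() < 0.7   # target model; 70% acceptance is illustrative

def generate(ctx, k=4, max_len=32):
    while len(ctx) < max_len:
        # 1. Draft k tokens autoregressively with the cheap model.
        drafts = []
        for _ in range(k):
            drafts.append(draft_step(ctx + "".join(drafts)))
        # 2. Verify all k positions in one target forward pass
        #    (simulated per token here); stop at the first rejection.
        n = 0
        for tok in drafts:
            if not target_accepts(ctx + "".join(drafts[:n]), tok):
                break
            n += 1
        # 3. Keep what was accepted. On rejection, real systems resample
        #    that position from the target's corrected distribution;
        #    we just fall back to a single draft token here.
        ctx += "".join(drafts[:n]) if n else draft_step(ctx)
    return ctx

print(generate("x"))
```

High acceptance means several committed tokens per target pass, hence the single-stream win; at high concurrency the extra draft passes compete with batched target compute, which is where it can hurt.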

KV Cache (6 min read)

The hidden memory cost of long contexts. FP8 vs FP16 caching, TurboQuant compression, and why your 128K context OOMs.
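
The arithmetic is worth doing once. A sketch assuming Llama-3-8B-like dimensions (32 layers, 8 KV heads, head dim 128; your model's numbers will differ):

```python
layers, kv_heads, head_dim = 32, 8, 128          # assumed, Llama-3-8B-like
for name, dtype_bytes in [("FP16", 2), ("FP8", 1)]:
    # K and V each store layers * kv_heads * head_dim values per token.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    total_gib = per_token * 128 * 1024 / 2**30   # a full 128K-token context
    print(f"{name}: {per_token // 1024} KiB/token -> {total_gib:.0f} GiB at 128K")
# FP16: 128 KiB/token -> 16 GiB at 128K
# FP8: 64 KiB/token -> 8 GiB at 128K
```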

Context Length (6 min read)

What max_model_len really means. How to calculate memory budgets and avoid the OOM trap that catches everyone.
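
A back-of-the-envelope version of that budget, with illustrative numbers (weights size and per-token KV cost assume the 8B/FP16 figures from the KV Cache card above):

```python
gpu_gib = 24                  # e.g. an RTX 4090
utilization = 0.90            # vLLM's gpu_memory_utilization
weights_gib = 15              # ~8B params at FP16, illustrative
kv_per_token = 128 * 1024     # bytes/token, FP16 KV cache (see above)

kv_budget_gib = gpu_gib * utilization - weights_gib
max_tokens = int(kv_budget_gib * 2**30 / kv_per_token)
print(f"~{max_tokens:,} tokens of KV cache fit")   # ~54,067 -- nowhere near 128K
```

If max_model_len times your expected concurrency exceeds that number, you are in OOM territory before the first request finishes.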

Tool Calling (7 min read)

Function calling for LLMs. vLLM tool parsers, agent loops, structured I/O with CACP, and why some models fail at it.
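
From the client side it is just the OpenAI tools API. A sketch against vLLM's OpenAI-compatible server (launched with --enable-auto-tool-choice and a --tool-call-parser that matches your model); the tool and model names here are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder; whatever you serve
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)   # None if the model failed to call
```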

Hardware (8 min read)

GB10, RTX 4090, A100, H100 — what fits where. UMA vs discrete memory, gpu_memory_utilization, and practical sizing.
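
The knob that ties sizing together is gpu_memory_utilization. A sketch with vLLM (the model name is a placeholder):

```python
from vllm import LLM

# Reserve 90% of GPU memory for weights + KV cache; the remainder is
# headroom for activations, CUDA graphs, and per-request overhead.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder
    gpu_memory_utilization=0.90,
    max_model_len=32768,   # cap context so the KV cache actually fits
)
```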

CACP Protocol (5 min read)

10x token savings for agent communication. Structured I/O that replaces 2,000 tokens of prose with 200 tokens of typed fields.
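
Purely as an illustration of the idea (the field names below are hypothetical; CACP's actual schema and wire format live in the spec), typed fields replace paragraphs:

```python
from typing import Literal, TypedDict

class ReviewResult(TypedDict):
    """Hypothetical CACP-style message: typed fields, no prose padding."""
    verdict: Literal["pass", "fail"]
    blocking_issues: list[str]
    suggested_fix: str

msg: ReviewResult = {
    "verdict": "fail",
    "blocking_issues": ["KV cache OOMs at 128K context"],
    "suggested_fix": "enable FP8 KV cache or lower max_model_len",
}
```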

Autoresearch (8 min read)

378 iterations overnight. How Karpathy's autoresearch pattern found the optimal serving config — and what we learned about tuning LLMs.
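
The shape of the loop, sketched (the real run's search space, benchmark harness, and 378 iterations are not reproduced here; a toy grid and a dummy metric stand in):

```python
import itertools
import random

search_space = {                      # real vLLM knobs, toy values
    "gpu_memory_utilization": [0.85, 0.90, 0.95],
    "max_num_seqs": [32, 64, 128],
    "kv_cache_dtype": ["auto", "fp8"],
}

def benchmark(cfg):                   # stand-in for a real throughput run
    return random.random()

best = max(
    (dict(zip(search_space, vals))
     for vals in itertools.product(*search_space.values())),
    key=benchmark,
)
print("best config:", best)
```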

ServingCard

Built by Zen Labs. The missing layer between model weights and production inference.

Inspired by HuggingFace Model Cards, TOON format, and Karpathy's autoresearch.

Project

  • GitHub
  • PawBench
  • Spec

Contribute

  • Contributing Guide
  • Report an Issue
  • Publish Your Config — Coming Soon
ServingCard is open source. View on GitHub