From 39 to 540 tok/s.

One config file.

We ran 378 autoresearch iterations (Andrej Karpathy's autoresearch pattern) on Qwen3-coder. Eagle3 speculative decoding. FP8 quantization. NVIDIA GB10. The result: a 1285% throughput increase with no quality loss.

Now it's yours in one command.

NVIDIA GB10 · Qwen3-coder · +1285%

Baseline (FP8): 39 tok/s
Eagle3 (spec=3): 540 tok/s
NVFP4 (262K context): 39 tok/s

378 iterations benchmarked · PawBench verified

378 iterations. One winner.

We started with Qwen3-coder running at 39 tok/s on an NVIDIA GB10. Good, but not great. We knew the hardware could do more.

Baseline (FP8, no speculation): 39 tok/s

+ Eagle3 draft head: 180 tok/s (+362%)

+ 3 draft tokens (sweet spot): 540 tok/s (+1285%)

+ 5 draft tokens (tried, rejected): 310 tok/s (-43%). Acceptance rate too low (see the sketch below).

+ NVFP4 quantization: 39 tok/s (same throughput), but a 262K context window.
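Why did 5 draft tokens lose to 3? A simplified model, our own back-of-the-envelope rather than PawBench's internals: with per-token acceptance probability alpha and k draft tokens, each verification step emits (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation, but also pays for k draft-model passes. Once acceptance drops, the extra drafts stop paying for themselves. A minimal Python sketch, with an assumed relative draft cost:

# Simplified speculative-decoding throughput model (an illustration,
# not the PawBench formula or vLLM's scheduler).

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per target-model verification pass with
    # k draft tokens and i.i.d. acceptance probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def relative_throughput(alpha: float, k: int, draft_cost: float = 0.08) -> float:
    # Tokens per unit of target-model time. draft_cost is the assumed cost
    # of one draft pass relative to one target pass (hardware-dependent).
    step_cost = 1.0 + k * draft_cost  # one verify pass plus k draft passes
    return expected_tokens_per_step(alpha, k) / step_cost

if __name__ == "__main__":
    for alpha in (0.85, 0.60):
        for k in (3, 5):
            print(f"alpha={alpha:.2f}, k={k}: "
                  f"{relative_throughput(alpha, k):.2f}x vs. no speculation")
    # High acceptance favors k=5; once acceptance drops, k=3 wins,
    # the same shape the GB10 run showed.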

We used Karpathy's autoresearch pattern: modify one parameter, benchmark for 5 minutes, keep improvements, discard regressions. 378 iterations overnight. The system prompt alone took 378 tries to find the optimal 6-line version with 100% CACP compliance (CACP is the Compressed Agent Communication Protocol, a structured I/O format for LLM agents: roughly 200 tokens instead of 2,000).
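The loop itself is simple enough to sketch. Below is a minimal, hypothetical version; the candidate names and the benchmark function are illustrative, not our actual harness:

# Minimal autoresearch loop: change one parameter, benchmark briefly,
# keep improvements, discard regressions. Illustrative only.
import copy

def autoresearch(base_config, candidates, benchmark):
    # candidates: list of (name, mutate_fn); benchmark: config -> score.
    best_config = copy.deepcopy(base_config)
    best_score = benchmark(best_config)  # e.g. a ~5-minute PawBench-style run
    for name, mutate in candidates:
        trial = mutate(copy.deepcopy(best_config))  # exactly one change
        score = benchmark(trial)
        if score > best_score:  # keep the improvement
            best_config, best_score = trial, score
            print(f"keep   {name}: {score:.1f}")
        else:  # discard the regression
            print(f"reject {name}: {score:.1f} (best stays {best_score:.1f})")
    return best_config

# Hypothetical candidate mutations (parameter names are illustrative):
candidates = [
    ("eagle3 draft head", lambda c: {**c, "speculative_method": "eagle3"}),
    ("3 draft tokens",    lambda c: {**c, "num_speculative_tokens": 3}),
    ("5 draft tokens",    lambda c: {**c, "num_speculative_tokens": 5}),
]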

The configs that won? They're in the registry. Apply them in one command.

How we benchmark

Every serving card is backed by PawBench — a multi-dimensional benchmark that measures what actually matters for deployment.

Throughput

Single-stream tok/s and parallel saturation curves at N = 1, 2, 4, and 8 concurrent streams.

540 tok/s (Eagle3) vs 39 tok/s (baseline)

Quality

Tool-call correctness, code keyword accuracy, and CACP protocol compliance.

81% quality score, 100% CACP compliance

Efficiency

Useful token ratio — how much of the model's output is actual code vs filler.

73% useful tokens (higher = less waste)

Adaptability

How well the model responds to mid-task context injection and requirement changes.

Tested across independent, steered, and nudged scenarios

PawBench runs in 25 minutes. SWE-bench takes hours and costs $50+. We designed PawBench for the autoresearch loop: fast enough to iterate 100+ times overnight.
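The exact PawBench scoring isn't reproduced here, but the two simplest dimensions, throughput and useful-token ratio, are easy to approximate against any OpenAI-compatible endpoint such as a local vLLM server. A rough sketch, assuming the openai Python client; the useful-token heuristic below is illustrative, not PawBench's:

# Rough approximations of two PawBench-style metrics against an
# OpenAI-compatible endpoint. Illustrative only; not PawBench itself.
import re
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def single_stream_tok_s(model: str, prompt: str) -> float:
    # Decode throughput for one request: completion tokens / wall-clock seconds.
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens / (time.perf_counter() - start)

def useful_token_ratio(reply: str) -> float:
    # Crude proxy: the share of the reply that sits inside fenced code blocks.
    code = "".join(re.findall(r"```.*?```", reply, flags=re.DOTALL))
    return len(code) / max(len(reply), 1)

Parallel saturation is the same throughput measurement repeated with 1, 2, 4, and 8 requests in flight.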

Run PawBench on your hardware

Get running in 60 seconds

1. Find your config

$ servingcard search qwen3-coder

Browse the registry for your model + hardware combo.

2. Get the launch command

$ servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3

Generates the optimized vLLM command with all the right flags.

3. Run it

$ vllm serve qwen3-coder --quantization fp8 --speculative-model ...

540 tok/s. Done.
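If you'd rather stay in Python than launch the server from the shell, roughly the same settings map onto vLLM's offline API. This is a hedged sketch: the speculative-decoding argument names follow recent vLLM releases and change between versions, and the draft-head path is a placeholder, so check your card and your vLLM docs before copying it.

# Sketch of an equivalent setup via vLLM's offline Python API.
# Argument names vary across vLLM versions; paths below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="qwen3-coder",                       # as in the CLI example; use your exact checkpoint
    quantization="fp8",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/eagle3-draft-head",  # placeholder draft head
        "num_speculative_tokens": 3,           # the spec=3 sweet spot from above
    },
)

outputs = llm.generate(
    ["Write a function that parses RFC 3339 timestamps."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)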

Deploy anywhere

Our benchmarks use vLLM, but ServingCard configs will support any inference engine. Community contributions for TGI and Triton are welcome.

Why we built this

Finding optimal serving parameters for your LLM shouldn't require reading 47 Reddit threads. HuggingFace tells you what a model is. We tell you how to serve it — on your specific hardware.

ServingCard is the missing layer between model weights and production inference. A community-driven registry of verified, hardware-specific configurations that anyone can contribute to and apply in one command.

No more guessing --gpu-memory-utilization. No more trial-and-error with speculative decoding tokens. No more copying config snippets from year-old forum posts.

For model tuners

Run PawBench. Publish your results. Build your reputation as a verified optimizer.

For deployers

Find the config for your hardware. Apply it. Ship to production.

For the community

Every config published makes the next person's deployment faster. We all win.