From 39 to 540 tok/s.

One config file.

We ran 378 autoresearch iterations (Andrej Karpathy's autoresearch pattern) on Qwen3-coder. Eagle3 speculative decoding. FP8 quantization. NVIDIA GB10. The result: a 1285% throughput increase with no quality loss.

Now it's yours in one command.

NVIDIA GB10 · Qwen3-coder · +1285%

Baseline (FP8): 39 tok/s
Eagle3 (spec=3): 540 tok/s
NVFP4 (262K context): 39 tok/s

378 iterations benchmarked · PawBench verified

378 iterations. One winner.

We started with Qwen3-coder running at 39 tok/s on an NVIDIA GB10. Good, but not great. We knew the hardware could do more.

Baseline (FP8, no speculation): 39 tok/s

+ Eagle3 draft head: 180 tok/s (+362%)

+ 3 draft tokens (sweet spot): 540 tok/s (+1285%)

+ 5 draft tokens (tried, rejected): 310 tok/s (-43%). Acceptance rate too low (see the sketch below).

+ NVFP4 quantization: 39 tok/s (same throughput), but a 262K context window.
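Why did 5 draft tokens lose to 3? A simplified model, our own back-of-the-envelope rather than PawBench's internals: with per-token acceptance probability alpha and k draft tokens, each verification step emits (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation, but also pays for k draft-model passes. Once acceptance drops, the extra drafts stop paying for themselves. A minimal Python sketch, with an assumed relative draft cost:

# Simplified speculative-decoding throughput model (an illustration,
# not the PawBench formula or vLLM's scheduler).

def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per target-model verification pass with
    # k draft tokens and i.i.d. acceptance probability alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def relative_throughput(alpha: float, k: int, draft_cost: float = 0.08) -> float:
    # Tokens per unit of target-model time. draft_cost is the assumed cost
    # of one draft pass relative to one target pass (hardware-dependent).
    step_cost = 1.0 + k * draft_cost  # one verify pass plus k draft passes
    return expected_tokens_per_step(alpha, k) / step_cost

if __name__ == "__main__":
    for alpha in (0.85, 0.60):
        for k in (3, 5):
            print(f"alpha={alpha:.2f}, k={k}: "
                  f"{relative_throughput(alpha, k):.2f}x vs. no speculation")
    # High acceptance favors k=5; once acceptance drops, k=3 wins,
    # the same shape the GB10 run showed.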

We used Karpathy's autoresearch pattern: modify one parameter, benchmark for 5 minutes, keep improvements, discard regressions. 378 iterations overnight. The system prompt alone took 378 tries to find the optimal 6-line version with 100% CACP compliance (CACP is the Compressed Agent Communication Protocol, a structured I/O format for LLM agents: roughly 200 tokens instead of 2,000).
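The loop itself is simple enough to sketch. Below is a minimal, hypothetical version; the candidate names and the benchmark function are illustrative, not our actual harness:

# Minimal autoresearch loop: change one parameter, benchmark briefly,
# keep improvements, discard regressions. Illustrative only.
import copy

def autoresearch(base_config, candidates, benchmark):
    # candidates: list of (name, mutate_fn); benchmark: config -> score.
    best_config = copy.deepcopy(base_config)
    best_score = benchmark(best_config)  # e.g. a ~5-minute PawBench-style run
    for name, mutate in candidates:
        trial = mutate(copy.deepcopy(best_config))  # exactly one change
        score = benchmark(trial)
        if score > best_score:  # keep the improvement
            best_config, best_score = trial, score
            print(f"keep   {name}: {score:.1f}")
        else:  # discard the regression
            print(f"reject {name}: {score:.1f} (best stays {best_score:.1f})")
    return best_config

# Hypothetical candidate mutations (parameter names are illustrative):
candidates = [
    ("eagle3 draft head", lambda c: {**c, "speculative_method": "eagle3"}),
    ("3 draft tokens",    lambda c: {**c, "num_speculative_tokens": 3}),
    ("5 draft tokens",    lambda c: {**c, "num_speculative_tokens": 5}),
]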

The configs that won? They're in the registry. Apply them in one command.

How we benchmark

Every serving card is backed by PawBench — a multi-dimensional benchmark that measures what actually matters for deployment.

Throughput

Single-stream tok/s and parallel saturation curves at N = 1, 2, 4, and 8 concurrent streams.

540 tok/s (Eagle3) vs 39 tok/s (baseline)

Quality

Tool-call correctness, code keyword accuracy, and CACP protocol compliance.

81% quality score, 100% CACP compliance

Efficiency

Useful token ratio — how much of the model's output is actual code vs filler.

73% useful tokens (higher = less waste)

Adaptability

How well the model responds to mid-task context injection and requirement changes.

Tested across independent, steered, and nudged scenarios

PawBench runs in 25 minutes. SWE-bench takes hours and costs $50+. We designed PawBench for the autoresearch loop: fast enough to iterate 100+ times overnight.
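The exact PawBench scoring isn't reproduced here, but the two simplest dimensions, throughput and useful-token ratio, are easy to approximate against any OpenAI-compatible endpoint such as a local vLLM server. A rough sketch, assuming the openai Python client; the useful-token heuristic below is illustrative, not PawBench's:

# Rough approximations of two PawBench-style metrics against an
# OpenAI-compatible endpoint. Illustrative only; not PawBench itself.
import re
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def single_stream_tok_s(model: str, prompt: str) -> float:
    # Decode throughput for one request: completion tokens / wall-clock seconds.
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    return resp.usage.completion_tokens / (time.perf_counter() - start)

def useful_token_ratio(reply: str) -> float:
    # Crude proxy: the share of the reply that sits inside fenced code blocks.
    code = "".join(re.findall(r"```.*?```", reply, flags=re.DOTALL))
    return len(code) / max(len(reply), 1)

Parallel saturation is the same throughput measurement repeated with 1, 2, 4, and 8 requests in flight.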

Run PawBench on your hardware

Get running in 60 seconds

1. Find your config

$ servingcard search qwen3-coder

Browse the registry for your model + hardware combo.

2. Get the launch command

$ servingcard apply qwen3-coder/gb10-fp8-eagle3-spec3

Generates the optimized vLLM command with all the right flags.

3. Run it

$ vllm serve qwen3-coder --quantization fp8 --speculative-model ...

540 tok/s. Done.
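If you'd rather stay in Python than launch the server from the shell, roughly the same settings map onto vLLM's offline API. This is a hedged sketch: the speculative-decoding argument names follow recent vLLM releases and change between versions, and the draft-head path is a placeholder, so check your card and your vLLM docs before copying it.

# Sketch of an equivalent setup via vLLM's offline Python API.
# Argument names vary across vLLM versions; paths below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="qwen3-coder",                       # as in the CLI example; use your exact checkpoint
    quantization="fp8",
    speculative_config={
        "method": "eagle3",
        "model": "path/to/eagle3-draft-head",  # placeholder draft head
        "num_speculative_tokens": 3,           # the spec=3 sweet spot from above
    },
)

outputs = llm.generate(
    ["Write a function that parses RFC 3339 timestamps."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)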

Deploy anywhere

Our benchmarks use vLLM, but ServingCard configs will support any inference engine. Community contributions for TGI and Triton are welcome.

Why we built this

Finding optimal serving parameters for your LLM shouldn't require reading 47 Reddit threads. HuggingFace tells you what a model is. We tell you how to serve it — on your specific hardware.

ServingCard is the missing layer between model weights and production inference. A community-driven registry of verified, hardware-specific configurations that anyone can contribute to and apply in one command.

No more guessing --gpu-memory-utilization. No more trial-and-error with speculative decoding tokens. No more copying config snippets from year-old forum posts.

For model tuners

Run PawBench. Publish your results. Build your reputation as a verified optimizer.

For deployers

Find the config for your hardware. Apply it. Ship to production.

For the community

Every config published makes the next person's deployment faster. We all win.