378 iterations. One winner.
We started with Qwen3-coder running at 39 tok/s on an NVIDIA GB10. Good, but not great. We knew the hardware could do more.
Baseline: FP8, no speculation
+ Eagle3 draft head
+ 3 draft tokens (sweet spot)
+ 5 draft tokens (tried, rejected: acceptance rate too low)
+ NVFP4 quantization (but 262K context window)
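A rough way to see why 3 draft tokens beat 5: in speculative decoding, the expected number of tokens emitted per verification step with draft length k and per-token acceptance rate a is (1 - a^(k+1)) / (1 - a), so extra draft tokens add draft-model latency faster than they add expected tokens once acceptance drops. A sketch with an assumed acceptance rate (the measured rate isn't given here):

```python
def expected_tokens(a: float, k: int) -> float:
    """Expected tokens per verification step: accepted drafts plus the
    corrective token, for draft length k and acceptance rate a."""
    return (1 - a ** (k + 1)) / (1 - a)

# At an assumed acceptance rate of 0.6, going from 3 to 5 draft tokens
# buys only ~0.2 extra expected tokens per step, while every step still
# pays for two more draft-model forward passes.
for k in (3, 5):
    print(k, round(expected_tokens(0.6, k), 2))
```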
We used Andrej Karpathy's autoresearch pattern: modify one parameter, benchmark for 5 minutes, keep improvements, discard regressions. 378 iterations overnight. The system prompt alone took 378 tries to find the optimal 6-line version with 100% CACP (Compressed Agent Communication Protocol, a structured I/O format for LLM agents: ~200 tokens vs ~2,000) compliance.
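The loop itself is simple. A minimal sketch, assuming a toy search space and a stubbed-out scoring function (all parameter names and numbers here are invented for illustration; the real benchmark step was a 5-minute throughput measurement):

```python
import random

# Hypothetical search space; the real run tuned inference-server settings.
SEARCH_SPACE = {
    "draft_tokens": [1, 2, 3, 4, 5],
    "quantization": ["fp8", "nvfp4"],
    "batch_size": [1, 2, 4, 8],
}

def benchmark(config: dict) -> float:
    """Stub tokens/sec score standing in for the real 5-minute run."""
    score = 39.0  # toy baseline throughput
    score += {1: 2.0, 2: 6.0, 3: 10.0, 4: 7.0, 5: 3.0}[config["draft_tokens"]]
    score += 8.0 if config["quantization"] == "nvfp4" else 0.0
    return score

def autoresearch(iterations: int, seed: int = 0) -> tuple[dict, float]:
    rng = random.Random(seed)
    config = {k: values[0] for k, values in SEARCH_SPACE.items()}
    best = benchmark(config)
    for _ in range(iterations):
        # Modify exactly one parameter per iteration.
        param = rng.choice(list(SEARCH_SPACE))
        candidate = {**config, param: rng.choice(SEARCH_SPACE[param])}
        score = benchmark(candidate)
        if score > best:
            # Keep the improvement; otherwise discard the regression.
            config, best = candidate, score
    return config, best

winning_config, toks_per_sec = autoresearch(378)
print(winning_config, toks_per_sec)
```

The key constraint is the single-parameter mutation: changing one thing at a time keeps every 5-minute benchmark attributable to one cause, which is what makes an unattended overnight run trustworthy.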
The configs that won? They're in the registry. Apply them in one command.
