378 Iterations Overnight. One Winner.
8 min read · March 2026
We needed the optimal system prompt for Qwen3-coder on our NVIDIA GB10. Not "good enough" — the optimal one. The one that produces 100% CACP compliance with the shortest possible prompt.
We used Andrej Karpathy's autoresearch pattern: modify one thing, test for 5 minutes, keep improvements, discard regressions. Let the loop run overnight.
The Pattern
while True:
1. Mutate one parameter (system prompt, temperature, etc.)
2. Run benchmark (PawBench, 5 minutes)
3. Compare metric (CACP compliance, quality, tok/s)
4. If better → keep. If worse → git reset.
5. Log to results.tsv
6. Repeat

The key insight from Karpathy: fixed time budget + single metric + autonomous iteration. No human in the loop. No "let me think about this." Just brute-force exploration of the parameter space while you sleep.
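The loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not servingcard's actual code: `mutate`, `run_benchmark`, `keep`, and `discard` are injected callables standing in for whatever your setup provides (in our runs, keep/discard were `git commit` / `git reset --hard`).

```python
import time

def autoresearch(mutate, run_benchmark, keep, discard,
                 budget_s=8 * 3600, max_iters=None):
    """Greedy overnight search: keep a mutation only when the metric improves.

    mutate / run_benchmark / keep / discard are injected callables;
    max_iters is an optional cap on top of the fixed time budget.
    """
    best = run_benchmark()              # iteration 0: score the baseline
    deadline = time.time() + budget_s   # fixed time budget, e.g. overnight
    iteration = 0
    while time.time() < deadline and (max_iters is None or iteration < max_iters):
        iteration += 1
        mutate()                        # change exactly one parameter
        score = run_benchmark()         # one ~5-minute benchmark run
        if score > best:
            best = score
            keep()                      # persist the improvement
        else:
            discard()                   # roll back the regression
    return best
```

Note the strict `score > best`: ties are treated as regressions and rolled back, which is what keeps the search from drifting sideways on a noisy metric.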
What We Tuned
The system prompt for Hermes/Qwen3-coder. This is the instruction that tells the model how to format its response after using tools. Without a good system prompt, the model outputs 2,000 tokens of prose instead of CACP structured fields.
The Iterations
We started with long system prompts: detailed formatting rules, examples, explanations. The model ignored most of it.
Next came threat framing: "MANDATORY OUTPUT FORMAT — violation causes automatic rejection." That helped, but wasn't enough on its own.
Then we showed the exact format with example values. Better, but the model still sometimes added prose before the STATUS line.
The winner: 6 lines. An example of the format plus two rules: "First line MUST start with STATUS:" and "Do NOT write any text before STATUS:". That's it.
The Winning Prompt
You are a coding agent. Use tools to build what is asked.
MANDATORY OUTPUT FORMAT — violation causes automatic rejection:
When you are done using tools, your response MUST be:
STATUS:ok
FILES_CREATED:file1,file2
FILES_MODIFIED:
TESTS:pass:0
BUILD:pass
LEARNED:one sentence
Rules:
- First line MUST start with STATUS:
- Do NOT write any text before STATUS:
- Do NOT summarize your work
- Do NOT explain what you did
- ONLY the CACP fields, nothing else

The surprise: The baseline prompt (iteration 0) was already close to optimal. 378 iterations confirmed it couldn't be improved — which is itself a valuable finding. You don't know if your prompt is optimal until you've tried hundreds of alternatives.
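Part of what makes CACP compliance work as the loop's metric is that it's mechanical to check. A sketch of such a validator, assuming the strictest reading of the prompt above (exactly these six fields, in this order, no preamble) — the field list comes from the prompt, but the function itself is illustrative, not servingcard's code:

```python
# Field order taken from the winning prompt; strictness is an assumption.
REQUIRED_FIELDS = ["STATUS", "FILES_CREATED", "FILES_MODIFIED",
                   "TESTS", "BUILD", "LEARNED"]

def is_cacp_compliant(response: str) -> bool:
    """True iff the response is exactly the CACP fields, in order, no preamble."""
    lines = [l for l in response.strip().splitlines() if l.strip()]
    if len(lines) != len(REQUIRED_FIELDS):
        return False                      # extra prose or missing fields
    for line, field in zip(lines, REQUIRED_FIELDS):
        if not line.startswith(field + ":"):
            return False                  # wrong field, wrong order, or preamble
    return True
```

A response that opens with "I finished the task." fails immediately — which is exactly the failure mode the "Do NOT write any text before STATUS:" rule targets.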
What We Learned About Autoresearch
Shorter prompts win
Long explanations get ignored. Short, imperative rules get followed. "First line MUST start with STATUS:" is worth more than a paragraph of formatting guidelines.
Threat framing works
"violation causes automatic rejection" isn't just for show. The model demonstrably produces better format compliance with penalty language.
Examples > descriptions
Showing STATUS:ok is more effective than explaining "return a status field with value ok." Models learn from patterns, not instructions.
Negative rules matter
"Do NOT write any text before STATUS:" prevents the most common failure: the model adding a preamble before the structured output.
Running Your Own Autoresearch
The autoresearch loop is built into servingcard's CLI. Point it at your vLLM endpoint and let it run overnight:
$ servingcard benchmark \
--model qwen3-coder \
--hardware nvidia-gb10 \
--endpoint http://localhost:8000
# Runs PawBench, produces a servingcard YAML
# with your hardware-specific results

The results become a serving card that others can apply. Your overnight experiment saves everyone else the same work.
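For a sense of shape, here is a hypothetical sketch of such a card. The field names are guesses for illustration, not servingcard's actual schema; the numbers are from our Qwen3-coder baseline run:

```yaml
# Hypothetical serving card (illustrative field names, not the real schema)
model: qwen3-coder
hardware: nvidia-gb10
engine: vllm
results:
  cacp_compliance: 1.0
  quality: 0.71
  tok_per_s: 26.7
system_prompt: |
  You are a coding agent. Use tools to build what is asked.
  ...
```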
The Model Comparison Run
We also ran autoresearch across multiple models to find which ones work best for coding agent dispatch on GB10:
| Model | tok/s | Quality | Verdict |
|---|---|---|---|
| Qwen3-coder (Eagle3) | 35.1 | 59% | Best quality |
| Qwen3-coder (baseline) | 26.7 | 71% | Best parallel |
| DeepSeek-Coder-V2-Lite | 58.1 | 29% | Fast but bad quality |
| Devstral-Small-24B | 53.6 | 29% | 74ms TTFT, bad quality |
Qwen3-coder wins on quality by a wide margin. Smaller models are 2x faster but unusable for production coding dispatch — 29% quality means the agent produces broken code most of the time.
The autoresearch pattern is from Andrej Karpathy. Our implementation is in servingcard's CLI.