378 Iterations Overnight. One Winner.

8 min read · March 2026

We needed the optimal system prompt for Qwen3-coder on our NVIDIA GB10. Not "good enough" — the optimal one. The one that produces 100% CACP compliance with the shortest possible prompt.

We used Andrej Karpathy's autoresearch pattern: modify one thing, test for 5 minutes, keep improvements, discard regressions. Let the loop run overnight.

The Pattern

while True:
    mutate_one_parameter()      # system prompt, temperature, etc.
    score = run_benchmark()     # PawBench, ~5 minutes
    # compare metric: CACP compliance, quality, tok/s
    if score > best_score:
        best_score = score      # better → keep
    else:
        git_reset()             # worse → discard
    log_result("results.tsv")

The key insight from Karpathy: fixed time budget + single metric + autonomous iteration. No human in the loop. No "let me think about this." Just brute-force exploration of the parameter space while you sleep.
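To make the pattern concrete, here is a minimal runnable version of the loop. The `mutate` and `run_pawbench` functions are hypothetical stand-ins (the real implementation lives in servingcard's CLI, and a real mutation step would edit a git-tracked prompt file and `git reset` on regressions); the scoring here is a toy so the sketch executes:

```python
import csv
import random

def mutate(prompt: str) -> str:
    """Hypothetical mutation: change one thing at a time."""
    rules = ["First line MUST start with STATUS:",
             "Do NOT write any text before STATUS:"]
    return prompt + "\n" + random.choice(rules)

def run_pawbench(prompt: str) -> float:
    """Stand-in for the real 5-minute PawBench run.
    Toy metric (rewards brevity) just to make the sketch executable."""
    return 1.0 / (1 + len(prompt))

def autoresearch(seed_prompt: str, iterations: int = 378) -> str:
    best_prompt, best_score = seed_prompt, run_pawbench(seed_prompt)
    with open("results.tsv", "w", newline="") as f:
        log = csv.writer(f, delimiter="\t")
        log.writerow(["iteration", "score", "kept"])
        for i in range(iterations):
            candidate = mutate(best_prompt)
            score = run_pawbench(candidate)  # fixed time budget per trial
            kept = score > best_score        # single metric decides
            if kept:
                best_prompt, best_score = candidate, score
            # (with a git-tracked prompt you would `git reset` here instead)
            log.writerow([i, f"{score:.4f}", kept])
    return best_prompt

best = autoresearch("You are a coding agent.")
```

No human in the loop: the only decisions are "did the metric improve?" and "log it either way."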

What We Tuned

The system prompt for Hermes/Qwen3-coder. This is the instruction that tells the model how to format its response after using tools. Without a good system prompt, the model outputs 2,000 tokens of prose instead of CACP structured fields.

The Iterations

Iteration 1–50: Verbose instructions · 40% compliance

Long system prompts with detailed formatting rules, examples, and explanations. The model ignored most of it.

Iteration 50–150: Penalty framing · 70% compliance

"MANDATORY OUTPUT FORMAT — violation causes automatic rejection." The threat framing helped but wasn't enough alone.

Iteration 150–300: Example-driven · 85% compliance

Showing the exact format with example values. Better, but the model still sometimes added prose before the STATUS line.

Iteration 300–378: The winner · 100% compliance

6 lines. Example format + two rules: "First line MUST start with STATUS:" and "Do NOT write any text before STATUS:". That's it.

The Winning Prompt

You are a coding agent. Use tools to build what is asked.

MANDATORY OUTPUT FORMAT — violation causes automatic rejection:
When you are done using tools, your response MUST be:
STATUS:ok
FILES_CREATED:file1,file2
FILES_MODIFIED:
TESTS:pass:0
BUILD:pass
LEARNED:one sentence

Rules:
- First line MUST start with STATUS:
- Do NOT write any text before STATUS:
- Do NOT summarize your work
- Do NOT explain what you did
- ONLY the CACP fields, nothing else
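Checking compliance with this format is cheap. A sketch, assuming the six CACP field names shown above (the real PawBench scoring may differ):

```python
CACP_FIELDS = ["STATUS", "FILES_CREATED", "FILES_MODIFIED",
               "TESTS", "BUILD", "LEARNED"]

def is_compliant(response: str) -> bool:
    """True iff the response is only CACP fields, starting with STATUS:."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("STATUS:"):
        return False  # the two rules the winning prompt enforces
    # every remaining line must be one of the known fields
    return all(any(ln.startswith(f + ":") for f in CACP_FIELDS)
               for ln in lines)

is_compliant("STATUS:ok\nFILES_CREATED:a.py\nBUILD:pass")  # → True
is_compliant("I created the files.\nSTATUS:ok")            # → False
```

A check this simple is exactly what makes the autoresearch loop possible: the metric is unambiguous, so no human judgment is needed per iteration.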

The surprise: The baseline prompt (iteration 0) was already close to optimal. 378 iterations confirmed it couldn't be improved — which is itself a valuable finding. You don't know if your prompt is optimal until you've tried hundreds of alternatives.
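With hundreds of rows logged, finding the winner afterwards is a few lines. A sketch, assuming a tab-separated log with `iteration`, `mutation`, and `compliance` columns (the actual columns in results.tsv may differ):

```python
import csv

def best_iteration(path: str = "results.tsv") -> dict:
    """Return the logged row with the highest compliance score."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    return max(rows, key=lambda r: float(r["compliance"]))
```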

What We Learned About Autoresearch

Shorter prompts win

Long explanations get ignored. Short, imperative rules get followed. "First line MUST start with STATUS:" is worth more than a paragraph of formatting guidelines.

Threat framing works

"violation causes automatic rejection" isn't just for show. The model demonstrably produces better format compliance with penalty language.

Examples > descriptions

Showing STATUS:ok is more effective than explaining "return a status field with value ok." Models learn from patterns, not instructions.

Negative rules matter

"Do NOT write any text before STATUS:" prevents the most common failure: the model adding a preamble before the structured output.

Running Your Own Autoresearch

The autoresearch loop is built into servingcard's CLI. Point it at your vLLM endpoint and let it run overnight:

$ servingcard benchmark \
    --model qwen3-coder \
    --hardware nvidia-gb10 \
    --endpoint http://localhost:8000

# Runs PawBench, produces a servingcard YAML
# with your hardware-specific results

The results become a serving card that others can apply. Your overnight experiment saves everyone else the same work.

The Model Comparison Run

We also ran autoresearch across multiple models to find which ones work best for coding agent dispatch on GB10:

Model                    tok/s   Quality   Verdict
Qwen3-coder (Eagle3)     35.1    59%       Best quality
Qwen3-coder (baseline)   26.7    71%       Best parallel
DeepSeek-Coder-V2-Lite   58.1    29%       Fast but bad quality
Devstral-Small-24B       53.6    29%       74ms TTFT, bad quality

Qwen3-coder wins on quality by a wide margin. Smaller models are 2x faster but unusable for production coding dispatch — 29% quality means the agent produces broken code most of the time.


The autoresearch pattern is from Andrej Karpathy. Our implementation is in servingcard's CLI.