Tool Calling
LLMs cannot read files, run commands, or search the web on their own. Tool calling gives them structured access to the outside world — and it is the foundation of every coding agent.
What is tool calling?
Tool calling (also called function calling) is a protocol where the LLM outputs a structured request to invoke a function instead of generating text. The runtime executes the function and feeds the result back to the model.
```text
// 1. Model receives a task
"Read the file src/main.py and fix the bug on line 42"

// 2. Model calls a tool (structured JSON)
{"name": "read_file", "arguments": {"path": "src/main.py"}}

// 3. Runtime executes, returns result
{"content": "def main():\n x = 1 / 0 # bug here\n..."}

// 4. Model calls another tool to fix it
{"name": "write_file", "arguments": {"path": "src/main.py", ...}}
```
This loop — think, call tool, observe result, think again — is the core of every AI coding agent. The model never executes code directly. It issues structured tool calls, and the runtime handles execution.
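The runtime side of this loop can be sketched in a few lines. This is a minimal illustration, not a real agent runtime: the `FILES` store and the tool implementations are hypothetical stand-ins for a real worktree.

```python
import json

# Illustrative in-memory "worktree"; a real runtime would touch the filesystem.
FILES = {"src/main.py": "def main():\n x = 1 / 0 # bug here\n"}

def read_file(path: str) -> str:
    return FILES[path]

def write_file(path: str, content: str) -> str:
    FILES[path] = content
    return "ok"

# Registry mapping tool names to implementations.
TOOLS = {"read_file": read_file, "write_file": write_file}

def dispatch(call_json: str) -> str:
    """Execute one structured tool call and return the result as JSON.

    The returned JSON is what gets fed back to the model as the next message.
    """
    call = json.loads(call_json)
    result = TOOLS[call["name"]](**call["arguments"])
    return json.dumps({"content": result})
```

The model only ever produces strings like `{"name": "read_file", ...}`; everything with side effects happens inside `dispatch`.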
vLLM tool parsers
Different models format tool calls differently. vLLM provides tool parsers that translate each model's output format into a standard OpenAI-compatible tool call structure. Choosing the right parser is critical.
**hermes**

The most widely compatible parser. Works with models fine-tuned on the Hermes function-calling format. Uses XML-like tags to delimit tool calls:

```text
<tool_call>{"name": "read_file", "arguments": {"path": "main.py"}}</tool_call>
```

**qwen3_coder**

Specialized parser for Qwen3-coder models. Handles Qwen's specific tool-calling format, which differs from Hermes in escaping and structure. Use this parser specifically with Qwen3-coder models: using the hermes parser with Qwen3-coder will cause parsing failures on complex tool arguments.

**mistral**

For Mistral and Mixtral models. Uses a different JSON structure with tool call IDs for multi-turn conversations.
```yaml
# serving_card.yaml — tool parser configuration
tool_call_parser: qwen3_coder

# For Hermes-format models:
# tool_call_parser: hermes

# This is set in the vLLM serve command:
#   --tool-call-parser qwen3_coder
```
How agents use tools
A coding agent typically has access to a small set of powerful tools:
| Tool | Purpose | Example |
|---|---|---|
| read_file | Read a file from the worktree | read_file("src/app.py") |
| write_file | Create or overwrite a file | write_file("src/app.py", content) |
| search | Search codebase with grep/ripgrep | search("def authenticate") |
| execute | Run a shell command | execute("pytest tests/") |
| list_files | List directory contents | list_files("src/") |
The agent loops: read code to understand the problem, write fixes, run tests to verify, iterate if tests fail. A typical task takes 10-40 tool calls. The quality of tool calling — how reliably the model formats calls, handles errors, and uses results — determines agent effectiveness.
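Each tool the agent sees is described by a JSON schema. A sketch of one entry in the OpenAI-compatible `tools` array (the format vLLM's OpenAI-compatible server also accepts), plus a small check for the argument-validation step; the description text here is illustrative:

```python
# Schema for the read_file tool from the table above.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the worktree",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}

def missing_required(tool: dict, arguments: dict) -> list:
    """Return any required parameters the model's tool call left out."""
    params = tool["function"]["parameters"]
    return [p for p in params.get("required", []) if p not in arguments]
```

Validating arguments against the schema before execution lets the runtime return a precise error ("missing required parameter: path") instead of a stack trace.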
CACP: structured I/O for agents
Free-form prose wastes tokens. When dispatching a coding task, you do not need a paragraph — you need structured fields. CACP (Compact Agent Communication Protocol) replaces verbose prompts and responses with typed fields.
```text
# CACP Dispatch (input to agent)
TASK: Fix the authentication bug in login handler
CONTEXT: src/auth/login.py:42 raises TypeError on None email
ACCEPTANCE: Login works with None/empty email, tests pass
SCOPE: src/auth/ only, do not touch API layer
VERIFY: pytest tests/auth/
DONE: Commit with descriptive message
```
```text
# CACP Response (output from agent)
STATUS:ok
FILES_MODIFIED:src/auth/login.py
TESTS:pass:14
BUILD:pass
```
Token savings
A typical free-form response uses ~2000 tokens to say "I fixed the bug in login.py, ran the tests, they pass." CACP says the same thing in ~200 tokens. Over hundreds of agent dispatches, this adds up to significant cost and latency savings.
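Because CACP responses are line-oriented `KEY:value` pairs, parsing them is trivial compared to extracting facts from prose. A minimal sketch (the function name is ours, not part of any CACP spec):

```python
def parse_cacp_response(text: str) -> dict:
    """Parse KEY:value lines of a CACP response into a dict (sketch).

    Splits on the first colon only, so values like "pass:14" survive intact.
    """
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields
```

A dispatcher can then branch on `fields["STATUS"]` directly instead of asking another model to interpret a paragraph.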
When tool calling fails
Not all models are good at tool calling. Common failure modes:
Malformed JSON
The model generates invalid JSON in tool arguments — missing quotes, extra commas, unescaped characters. Smaller models and aggressive quantization make this worse.
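A runtime can recover from some of these faults before giving up. A sketch that repairs one common case, trailing commas, and otherwise signals the agent to re-prompt; this is an illustrative heuristic, not a general JSON repairer:

```python
import json
import re

def parse_tool_arguments(raw: str):
    """Parse tool-call arguments, repairing one common fault (sketch).

    Strips trailing commas before '}' or ']'. Returns None when the JSON
    is beyond this simple repair, so the agent can ask the model to retry.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = re.sub(r",\s*([}\]])", r"\1", raw)
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None
```

Returning `None` (rather than raising) keeps the malformed call from crashing the loop; the runtime turns it into an error message the model can act on.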
Hallucinated tools
The model calls a tool that does not exist in its tool list. This wastes a turn and the agent has to recover. More common with models that were not specifically trained on tool calling.
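One cheap mitigation: validate the tool name before dispatch and feed the list of real tools back in the error, so the model can recover in a single turn instead of guessing again. A sketch, with `KNOWN_TOOLS` mirroring the table above:

```python
KNOWN_TOOLS = {"read_file", "write_file", "search", "execute", "list_files"}

def check_tool_call(call: dict):
    """Return an error string to feed back to the model, or None if valid."""
    name = call.get("name")
    if name not in KNOWN_TOOLS:
        # Include the real tool names so the retry is grounded, not a guess.
        return f"Error: unknown tool '{name}'. Available: {sorted(KNOWN_TOOLS)}"
    return None
```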
Wrong parser
Using the hermes parser with a model that outputs Qwen-format tool calls (or vice versa). The parser cannot find the tool call in the output and the agent stalls. Always match the parser to the model.
Tool call in thinking block
Some models emit tool calls inside their chain-of-thought or thinking tags instead of in the designated tool call section, so the parser never sees them. This is a model-level quirk that usually requires prompt engineering to work around.
Choosing a model for tool calling
Qwen3-coder is our top pick for local tool-calling agents. It was specifically trained for agentic coding with tools. For API-based agents, Claude with native tool calling is the gold standard — no parser needed, the API handles structured output directly.
Related serving cards
See tool calling configurations benchmarked with PawBench: