Agents · 7 min read

Tool Calling

LLMs cannot read files, run commands, or search the web on their own. Tool calling gives them structured access to the outside world — and it is the foundation of every coding agent.

What is tool calling?

Tool calling (also called function calling) is a protocol where the LLM outputs a structured request to invoke a function instead of generating text. The runtime executes the function and feeds the result back to the model.

// 1. Model receives a task

"Read the file src/main.py and fix the bug on line 42"

// 2. Model calls a tool (structured JSON)

{"name": "read_file", "arguments": {"path": "src/main.py"}}

// 3. Runtime executes, returns result

{"content": "def main():\n x = 1 / 0 # bug here\n..."}

// 4. Model calls another tool to fix it

{"name": "write_file", "arguments": {"path": "src/main.py", ...}}

This loop — think, call tool, observe result, think again — is the core of every AI coding agent. The model never executes code directly. It issues structured tool calls, and the runtime handles execution.
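The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production runtime: `call_model` and the tool implementations are hypothetical placeholders.

```python
import json

# Hypothetical tool registry: maps tool names to local implementations.
TOOLS = {
    "read_file": lambda path: open(path).read(),
    "write_file": lambda path, content: open(path, "w").write(content),
}

def run_agent(task, call_model, max_turns=40):
    """Loop: ask the model, execute any tool call it returns, feed the result back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)       # returns either text or a tool call dict
        if "name" not in reply:
            return reply["content"]        # plain text answer: the task is done
        fn = TOOLS[reply["name"]]
        result = fn(**reply["arguments"])  # runtime executes; the model never runs code
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    raise RuntimeError("agent exceeded max turns")
```

Note that the model only ever sees serialized results appended to the message history; all execution happens in the runtime.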

vLLM tool parsers

Different models format tool calls differently. vLLM provides tool parsers that translate each model's output format into a standard OpenAI-compatible tool call structure. Choosing the right parser is critical.

hermes

The most widely compatible parser. Works with models fine-tuned on the Hermes function-calling format. Uses XML-like tags to delimit tool calls.

<tool_call>{"name": "read_file", "arguments": {"path": "main.py"}}</tool_call>
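Conceptually, a hermes-style parser just finds the tagged JSON and decodes it. The sketch below uses a regex and is greatly simplified compared to vLLM's real parser, which also handles streaming and partial output:

```python
import json
import re

# Matches JSON objects wrapped in Hermes-style <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes(text):
    """Extract tool calls delimited by <tool_call> tags from model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]
```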

qwen3_coder

Specialized parser for Qwen3-coder models. Handles Qwen's tool-calling format, which differs from Hermes in escaping and structure.

Use this parser specifically with Qwen3-coder models. Using the hermes parser with Qwen3-coder will cause parsing failures on complex tool arguments.

mistral

For Mistral and Mixtral models. Uses a different JSON structure with tool call IDs for multi-turn conversations.

# serving_card.yaml — tool parser configuration
tool_call_parser: qwen3_coder

# For Hermes-format models:
tool_call_parser: hermes

# This is set in the vLLM serve command:
# --tool-call-parser qwen3_coder
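For reference, a full serve invocation might look like the following (the model name is illustrative; `--enable-auto-tool-choice` lets the model decide when to emit tool calls):

```shell
# Illustrative: serve a Qwen3-coder model with tool calling enabled.
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```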

How agents use tools

A coding agent typically has access to a small set of powerful tools:

Tool        Purpose                             Example
read_file   Read a file from the worktree       read_file("src/app.py")
write_file  Create or overwrite a file          write_file("src/app.py", content)
search      Search codebase with grep/ripgrep   search("def authenticate")
execute     Run a shell command                 execute("pytest tests/")
list_files  List directory contents             list_files("src/")

The agent loops: read code to understand the problem, write fixes, run tests to verify, iterate if tests fail. A typical task takes 10-40 tool calls. The quality of tool calling — how reliably the model formats calls, handles errors, and uses results — determines agent effectiveness.
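In the OpenAI-compatible API that vLLM exposes, each of these tools is declared to the model as a JSON Schema. A sketch for `read_file` (descriptions and field names here are illustrative):

```python
# OpenAI-compatible tool declaration: the model sees this schema and is
# expected to emit arguments that validate against it.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the worktree and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Path relative to the repository root",
                },
            },
            "required": ["path"],
        },
    },
}
```

The full tool list is passed on every request, which is one reason agents keep the set small.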

CACP: structured I/O for agents

Free-form prose wastes tokens. When dispatching a coding task, you do not need a paragraph — you need structured fields. CACP (Compact Agent Communication Protocol) replaces verbose prompts and responses with typed fields.

# CACP Dispatch (input to agent)
TASK: Fix the authentication bug in login handler
CONTEXT: src/auth/login.py:42 raises TypeError on None email
ACCEPTANCE: Login works with None/empty email, tests pass
SCOPE: src/auth/ only, do not touch API layer
VERIFY: pytest tests/auth/
DONE: Commit with descriptive message

# CACP Response (output from agent)
STATUS:ok
FILES_MODIFIED:src/auth/login.py
TESTS:pass:14
BUILD:pass
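Because CACP responses are line-oriented KEY:value pairs, parsing them is trivial. A minimal sketch:

```python
def parse_cacp(response: str) -> dict:
    """Split a CACP response into a dict of fields (one KEY:value per line)."""
    fields = {}
    for line in response.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")  # split on first colon only
            fields[key.strip()] = value.strip()
    return fields
```

For example, `parse_cacp("STATUS:ok\nTESTS:pass:14")` yields `{"STATUS": "ok", "TESTS": "pass:14"}`, which a dispatcher can check programmatically instead of parsing prose.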

Token savings

A typical free-form response uses ~2000 tokens to say "I fixed the bug in login.py, ran the tests, they pass." CACP says the same thing in ~200 tokens. Over hundreds of agent dispatches, this adds up to significant cost and latency savings.

When tool calling fails

Not all models are good at tool calling. Common failure modes:

Malformed JSON

The model generates invalid JSON in tool arguments — missing quotes, extra commas, unescaped characters. Smaller models and aggressive quantization make this worse.
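A runtime can defend against this by validating arguments before dispatch and returning the error to the model so it can retry, rather than crashing. A sketch of that pattern:

```python
import json

def safe_parse_arguments(raw: str):
    """Return (args, None) on success, or (None, error_message) for the model to retry."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON in tool arguments: {e}. Re-emit the call as valid JSON."
    if not isinstance(args, dict):
        return None, "Tool arguments must be a JSON object."
    return args, None
```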

Hallucinated tools

The model calls a tool that does not exist in its tool list. This wastes a turn and the agent has to recover. More common with models that were not specifically trained on tool calling.
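The same defensive pattern applies here: check the requested name against the registered tools and feed a corrective message back to the model. A sketch:

```python
def dispatch(call: dict, tools: dict):
    """Execute a tool call, or return a corrective message for unknown tools."""
    name = call.get("name")
    if name not in tools:
        known = ", ".join(sorted(tools))
        return f"Error: unknown tool '{name}'. Available tools: {known}."
    return tools[name](**call.get("arguments", {}))
```

Listing the available tools in the error message helps the model recover on the next turn instead of guessing again.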

Wrong parser

Using the hermes parser with a model that outputs Qwen-format tool calls (or vice versa). The parser cannot find the tool call in the output and the agent stalls. Always match the parser to the model.

Tool call in thinking block

Some models emit tool calls inside their chain-of-thought or thinking tags instead of in the designated tool call section, so the parser misses them. This is a model-level issue that requires prompt engineering or model-specific workarounds.
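One such workaround is a fallback pass: parse normally, and only if no tool call is found, look inside the thinking blocks. Tag names vary by model; `<think>` and Hermes-style `<tool_call>` tags are assumed in this sketch:

```python
import json
import re

TOOL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_with_fallback(text: str):
    """Parse tool calls normally; if none are found, look inside thinking blocks."""
    outside = THINK_RE.sub("", text)           # normal pass: ignore chain-of-thought
    calls = [json.loads(m) for m in TOOL_RE.findall(outside)]
    if calls:
        return calls
    inside = "".join(THINK_RE.findall(text))   # fallback: model put the call in <think>
    return [json.loads(m) for m in TOOL_RE.findall(inside)]
```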

Choosing a model for tool calling

Qwen3-coder is our top pick for local tool-calling agents. It was specifically trained for agentic coding with tools. For API-based agents, Claude with native tool calling is the gold standard — no parser needed, the API handles structured output directly.

Related serving cards

See tool calling configurations benchmarked with PawBench: