Docs: troubleshooting guide for XML/tool-call markup leaking into chat sessions (self-hosted llama.cpp / Qwen3 etc.) #16

@Aaronontheweb

Description

What the user sees

When chatting with a self-hosted Netclaw connected to a llama.cpp / llama-server backend, the assistant occasionally produces messages or tool calls where raw XML-style markup leaks into visible content. Symptoms include:

  • <tool_call>, <function=…>, <parameter=…> tags appearing as plain text in the chat
  • Stray </think> (or <think>) tags appearing in the assistant's reply or inside tool-call arguments
  • A tool call whose arguments JSON value contains the literal text of another tool call concatenated onto the end
  • Tool calls whose argument fields are partially or completely empty (e.g. args={}, {"Path": ""}) when the model clearly intended to populate them
  • The same prompt working fine in one session but producing corrupted output in another session with a longer history

This is almost always a chat-template mismatch on the inference server, not a Netclaw bug. Netclaw faithfully assembles the streaming deltas it receives — if the server emits <tool_call> literal text instead of structured tool-call deltas, that's what Netclaw sees.
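The distinction can be made concrete. llama-server speaks the OpenAI-compatible streaming format, where each SSE chunk carries a `delta` that has either structured `tool_calls` entries or plain `content` text. A minimal Python sketch of the difference a client sees (the delta shapes follow that format; the function name and tag list are illustrative, not a Netclaw API):

```python
import re

# Delimiters that should never reach the client as literal text when the
# server's chat template is parsing tool calls correctly.
LEAK_MARKERS = re.compile(r"</?(?:tool_call|think|function|parameter)\b")

def classify_delta(delta: dict) -> str:
    """Classify one OpenAI-compatible streaming delta.

    Returns "tool_call" for structured tool-call deltas, "leak" when
    tool-call/reasoning markup arrives as literal text, else "text".
    """
    if delta.get("tool_calls"):
        return "tool_call"  # server parsed the markup into structured deltas
    content = delta.get("content") or ""
    if LEAK_MARKERS.search(content):
        return "leak"       # template mismatch: raw markup reached the client
    return "text"
```

A healthy server emits `classify_delta({"tool_calls": [...]}) == "tool_call"`; a misconfigured one streams something like `{"content": "<tool_call><function=read_file>"}`, which classifies as `"leak"` — and that literal text is exactly what Netclaw then assembles.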

Why it happens

Reasoning-capable open-weight models emit tool calls and reasoning blocks with model-specific delimiters. Common patterns:

  • Qwen3 family — <tool_call><function=…><parameter=…>…</parameter></tool_call>, reasoning in <think>…</think>
  • DeepSeek-R1 family — reasoning in <think>…</think>
  • Hermes / Mistral / others — JSON-shaped tool calls

llama-server only knows how to parse these correctly when it's told to use the embedded chat template via --jinja (or an explicit --chat-template-file). Without that, it falls back to a heuristic parser that does not recognize the model's tool-call delimiters and lets the literal markup leak through into streaming output as plain text.

Models known to require --jinja for clean tool calling

This is not exhaustive, but the following families have been reported to exhibit XML / tool-call leakage when run through llama-server without --jinja:

  • Qwen3 / Qwen3.5 / Qwen3-Coder — confirmed; <tool_call> XML and </think> leak into content and tool-call args
  • Qwen2.5-Instruct (with tool calling) — covered explicitly by llama.cpp's function-calling docs as requiring --jinja
  • DeepSeek-R1 distills — reasoning leakage if --reasoning-format is wrong for the consumer

Other reasoning-capable models likely behave similarly. As a rule of thumb, any model whose Hugging Face card describes a tool-call format using XML-style markup or that ships its own chat_template.jinja should be served with --jinja.
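That rule of thumb is mechanical enough to sketch. Assuming you have the model's repo file listing (e.g. via `huggingface_hub.list_repo_files`) and its card text, a hypothetical helper — not part of Netclaw or llama.cpp — could flag models that should be served with --jinja:

```python
def recommend_jinja(repo_files: list[str], card_text: str) -> bool:
    """Rule of thumb: serve with --jinja when the model ships its own
    chat_template.jinja or its card documents XML-style tool-call markup."""
    if any(f.endswith("chat_template.jinja") for f in repo_files):
        return True
    xml_markers = ("<tool_call>", "<function=", "<think>")
    return any(marker in card_text for marker in xml_markers)
```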

Recommended llama-server flags (Qwen3 example)

llama-server \
  --model <path/to/qwen3-gguf> \
  --jinja \
  --reasoning-format deepseek \
  --flash-attn on \
  --ctx-size <N> \
  --parallel <K> \
  --port <P>

  • --jinja — mandatory; uses the GGUF's embedded chat template (knows the model's tool-call delimiters)
  • --reasoning-format deepseek — correct for Qwen3; the model uses the same <think>/</think> delimiters as DeepSeek
  • For some buggy GGUF templates a corrected external template via --chat-template-file <path> is also published by the community (search the model's Hugging Face discussions)

Diagnostic checklist

If a self-hosted Netclaw instance is producing corrupted tool calls or visible XML markup:

  1. Check the inference server's launch arguments for --jinja. If absent and the model is Qwen3 / DeepSeek-R1 / similar, that's almost certainly the issue.
  2. Check the model card on Hugging Face for the recommended llama-server / vLLM / Ollama command line. If it lists --jinja or a custom chat template file, follow it.
  3. If argument fields are empty (e.g. args={}) rather than corrupted with extra text, check the llama.cpp build commit for known parser-related regressions. Check upstream issue/PR history for recent tool-call parser fixes.
  4. Try a higher-precision quantization (Q5_K_XL or Q6_K_XL instead of Q4_*). Tool-call structure is documented as quantization-sensitive — sub-4-bit quants frequently produce malformed tool calls even with the right template.
  5. Confirm the chat template embedded in the GGUF isn't itself broken — community-corrected templates exist for several Qwen3 quants on Hugging Face.
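The checklist order can be expressed as a small triage function. This is a sketch of the decision flow above, nothing more — the function, its parameters, and its return strings are all hypothetical, and the sub-5-bit threshold is a rough reading of the quantization guidance in step 4:

```python
def triage(uses_jinja: bool, args_empty: bool,
           markup_in_content: bool, quant_bits: int) -> str:
    """Map observed symptoms to the most likely cause, in checklist order."""
    if not uses_jinja:
        return "add --jinja (steps 1-2)"          # the overwhelmingly common case
    if args_empty:
        return "check llama.cpp build for parser regressions (step 3)"
    if quant_bits <= 4:
        return "try a higher-precision quant (step 4)"
    if markup_in_content:
        return "inspect/replace the embedded chat template (step 5)"
    return "corruption likely originates elsewhere"
```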

What Netclaw provides to help diagnose this

Netclaw emits diagnostic counters at three layers around every LLM streaming call:

  • SSE layer — what came off the wire from the server (delta counts, suppressed deltas, finish reason)
  • Middleware layer — what the chat-client decorator saw before the actor consumed it
  • Actor layer — the assembled ChatResponse content breakdown (text chars, thinking chars, tool calls, finish reason)

These show up in the per-session log at ~/.netclaw/logs/sessions/<channel>_<thread>/session.log. If counts match across all three layers but a tool call's arguments field is corrupted, the corruption originates upstream of Netclaw — almost always the inference server's chat template.
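The comparison logic is simple: find the first layer boundary where the counters diverge. A minimal sketch, assuming each layer's counters are reduced to a comparable dict (the counter keys and return strings here are hypothetical, not Netclaw's actual log format):

```python
def corruption_layer(sse: dict, middleware: dict, actor: dict) -> str:
    """Locate the first boundary where per-call counters diverge,
    e.g. {"deltas": 120, "tool_calls": 2, "finish": "tool_calls"}."""
    if sse != middleware:
        return "between SSE and middleware"
    if middleware != actor:
        return "between middleware and actor"
    # Counts agree at every layer: the corruption arrived on the wire.
    return "upstream of Netclaw (inference server)"
```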

What this issue tracks

A short troubleshooting article (FAQ entry or docs page) covering:

  • The symptoms above with one or two anonymized example fragments
  • A short list of model families known to require --jinja (or equivalent template flag)
  • The recommended diagnostic flow when a user reports XML leakage
  • A pointer to llama.cpp's function-calling.md and the official Qwen llama.cpp guide

The article shouldn't try to be exhaustive — the goal is to short-circuit the obvious case ("user is on Qwen3 without --jinja") and point the rest at upstream documentation.
