Docs: troubleshooting guide for XML/tool-call markup leaking into chat sessions (self-hosted llama.cpp / Qwen3 etc.) #16

@Aaronontheweb

Description

What the user sees

When chatting with a self-hosted Netclaw connected to a llama.cpp / llama-server backend, the assistant occasionally produces messages or tool calls where raw XML-style markup leaks into visible content. Symptoms include:

  • <tool_call>, <function=…>, <parameter=…> tags appearing as plain text in the chat
  • Stray </think> (or <think>) tags appearing in the assistant's reply or inside tool-call arguments
  • A tool call whose arguments JSON value contains the literal text of another tool call concatenated onto the end
  • Tool calls whose argument fields are partially or completely empty (e.g. args={}, {"Path": ""}) when the model clearly intended to populate them
  • The same prompt working fine in one session but producing corrupted output in another session with a longer history

This is almost always a chat-template mismatch on the inference server, not a Netclaw bug. Netclaw faithfully assembles the streaming deltas it receives — if the server emits <tool_call> literal text instead of structured tool-call deltas, that's what Netclaw sees.
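The distinction can be made concrete. llama-server speaks the OpenAI-compatible streaming format, where each SSE chunk carries a `delta` that has either structured `tool_calls` entries or plain `content` text. A minimal Python sketch of the difference a client sees (the delta shapes follow that format; the function name and tag list are illustrative, not a Netclaw API):

```python
import re

# Delimiters that should never reach the client as literal text when the
# server's chat template is parsing tool calls correctly.
LEAK_MARKERS = re.compile(r"</?(?:tool_call|think|function|parameter)\b")

def classify_delta(delta: dict) -> str:
    """Classify one OpenAI-compatible streaming delta.

    Returns "tool_call" for structured tool-call deltas, "leak" when
    tool-call/reasoning markup arrives as literal text, else "text".
    """
    if delta.get("tool_calls"):
        return "tool_call"  # server parsed the markup into structured deltas
    content = delta.get("content") or ""
    if LEAK_MARKERS.search(content):
        return "leak"       # template mismatch: raw markup reached the client
    return "text"
```

A healthy server emits `classify_delta({"tool_calls": [...]}) == "tool_call"`; a misconfigured one streams something like `{"content": "<tool_call><function=read_file>"}`, which classifies as `"leak"` — and that literal text is exactly what Netclaw then assembles.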

Why it happens

Reasoning-capable open-weight models emit tool calls and reasoning blocks with model-specific delimiters. Common patterns:

  • Qwen3 family — <tool_call><function=…><parameter=…>…</parameter></tool_call>, reasoning in <think>…</think>
  • DeepSeek-R1 family — reasoning in <think>…</think>
  • Hermes / Mistral / others — JSON-shaped tool calls

llama-server only knows how to parse these correctly when it's told to use the embedded chat template via --jinja (or an explicit --chat-template-file). Without that, it falls back to a heuristic parser that does not recognize the model's tool-call delimiters and lets the literal markup leak through into streaming output as plain text.

Models known to require --jinja for clean tool calling

This is not exhaustive, but the following families have been reported to exhibit XML / tool-call leakage when run through llama-server without --jinja:

  • Qwen3 / Qwen3.5 / Qwen3-Coder — confirmed; <tool_call> XML and </think> leak into content and tool-call args
  • Qwen2.5-Instruct (with tool calling) — covered explicitly by llama.cpp's function-calling docs as requiring --jinja
  • DeepSeek-R1 distills — reasoning leakage if --reasoning-format is wrong for the consumer

Other reasoning-capable models likely behave similarly. As a rule of thumb, any model whose Hugging Face card describes a tool-call format using XML-style markup or that ships its own chat_template.jinja should be served with --jinja.
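That rule of thumb is mechanical enough to sketch. Assuming you have the model's repo file listing (e.g. via `huggingface_hub.list_repo_files`) and its card text, a hypothetical helper — not part of Netclaw or llama.cpp — could flag models that should be served with --jinja:

```python
def recommend_jinja(repo_files: list[str], card_text: str) -> bool:
    """Rule of thumb: serve with --jinja when the model ships its own
    chat_template.jinja or its card documents XML-style tool-call markup."""
    if any(f.endswith("chat_template.jinja") for f in repo_files):
        return True
    xml_markers = ("<tool_call>", "<function=", "<think>")
    return any(marker in card_text for marker in xml_markers)
```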

Recommended llama-server flags (Qwen3 example)

llama-server \
  --model <path/to/qwen3-gguf> \
  --jinja \
  --reasoning-format deepseek \
  --flash-attn on \
  --ctx-size <N> \
  --parallel <K> \
  --port <P>

  • --jinja — mandatory; uses the GGUF's embedded chat template (knows the model's tool-call delimiters)
  • --reasoning-format deepseek — correct for Qwen3; the model uses the same <think>/</think> delimiters as DeepSeek
  • For some buggy GGUF templates a corrected external template via --chat-template-file <path> is also published by the community (search the model's Hugging Face discussions)

Diagnostic checklist

If a self-hosted Netclaw instance is producing corrupted tool calls or visible XML markup:

  1. Check the inference server's launch arguments for --jinja. If absent and the model is Qwen3 / DeepSeek-R1 / similar, that's almost certainly the issue.
  2. Check the model card on Hugging Face for the recommended llama-server / vLLM / Ollama command line. If it lists --jinja or a custom chat template file, follow it.
  3. If argument fields are empty (e.g. args={}) rather than corrupted with extra text, check the llama.cpp build commit for known parser-related regressions. Check upstream issue/PR history for recent tool-call parser fixes.
  4. Try a higher-precision quantization (Q5_K_XL or Q6_K_XL instead of Q4_*). Tool-call structure is documented as quantization-sensitive — sub-4-bit quants frequently produce malformed tool calls even with the right template.
  5. Confirm the chat template embedded in the GGUF isn't itself broken — community-corrected templates exist for several Qwen3 quants on Hugging Face.
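The checklist order can be expressed as a small triage function. This is a sketch of the decision flow above, nothing more — the function, its parameters, and its return strings are all hypothetical, and the sub-5-bit threshold is a rough reading of the quantization guidance in step 4:

```python
def triage(uses_jinja: bool, args_empty: bool,
           markup_in_content: bool, quant_bits: int) -> str:
    """Map observed symptoms to the most likely cause, in checklist order."""
    if not uses_jinja:
        return "add --jinja (steps 1-2)"          # the overwhelmingly common case
    if args_empty:
        return "check llama.cpp build for parser regressions (step 3)"
    if quant_bits <= 4:
        return "try a higher-precision quant (step 4)"
    if markup_in_content:
        return "inspect/replace the embedded chat template (step 5)"
    return "corruption likely originates elsewhere"
```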

What Netclaw provides to help diagnose this

Netclaw emits diagnostic counters at three layers around every LLM streaming call:

  • SSE layer — what came off the wire from the server (delta counts, suppressed deltas, finish reason)
  • Middleware layer — what the chat-client decorator saw before the actor consumed it
  • Actor layer — the assembled ChatResponse content breakdown (text chars, thinking chars, tool calls, finish reason)

These show up in the per-session log at ~/.netclaw/logs/sessions/<channel>_<thread>/session.log. If counts match across all three layers but a tool call's arguments field is corrupted, the corruption originates upstream of Netclaw — almost always the inference server's chat template.
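The comparison logic is simple: find the first layer boundary where the counters diverge. A minimal sketch, assuming each layer's counters are reduced to a comparable dict (the counter keys and return strings here are hypothetical, not Netclaw's actual log format):

```python
def corruption_layer(sse: dict, middleware: dict, actor: dict) -> str:
    """Locate the first boundary where per-call counters diverge,
    e.g. {"deltas": 120, "tool_calls": 2, "finish": "tool_calls"}."""
    if sse != middleware:
        return "between SSE and middleware"
    if middleware != actor:
        return "between middleware and actor"
    # Counts agree at every layer: the corruption arrived on the wire.
    return "upstream of Netclaw (inference server)"
```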

What this issue tracks

A short troubleshooting article (FAQ entry or docs page) covering:

  • The symptoms above with one or two anonymized example fragments
  • A short list of model families known to require --jinja (or equivalent template flag)
  • The recommended diagnostic flow when a user reports XML leakage
  • A pointer to llama.cpp's function-calling.md and the official Qwen llama.cpp guide

The article shouldn't try to be exhaustive — the goal is to short-circuit the obvious case ("user is on Qwen3 without --jinja") and point the rest at upstream documentation.
