
investigation: measure tool schema token weight in system prompt #622

@Aaronontheweb

Summary

Every LLM call sends the JSON schema for every available tool in the request. Netclaw currently has ~30-40 tools registered, each with parameter schemas, descriptions, and examples. Measure how many tokens of each turn's input are actually tool definitions vs conversation content. The measurement itself is the deliverable — the decision on what to do with it depends on the numbers.

Why

Tool schemas are stable per-session — they don't change between turns unless the tool set itself changes (progressive disclosure, skill auto-load, etc.). That means they should live inside the cacheable prefix and ride along with the #608 fix for free. But there are two ways they could secretly be costing us:

  1. If the schemas are large enough (say, 5k+ tokens of pure schema), the uncached cost on the FIRST turn of every session is non-trivial. Even with caching, you pay the full cost once per session + any turn where the tool set changes.

  2. If the schemas sit in a position where they break prefix cache stability (e.g., if the tool list re-serializes with different ordering or whitespace each turn), they could be poisoning the cache we just worked so hard to make stable. This failure mode is cheap to detect; see the hash check sketched below.
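
That second failure mode is cheap to rule in or out. A minimal C# sketch, assuming you can tap the exact serialized tools JSON on each outgoing request (the hook point and names here are hypothetical, not Netclaw's actual types):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ToolsPayloadStability
{
    static string? _lastHash;

    // Call once per outgoing request with the exact string that goes on
    // the wire for the tools field. Any mid-session hash change means the
    // serializer is not deterministic and the prefix cache will miss.
    public static void Check(string toolsJson)
    {
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(toolsJson)));

        if (_lastHash is not null && _lastHash != hash)
            Console.Error.WriteLine(
                "tools payload changed between turns: prefix cache poison suspect");

        _lastHash = hash;
    }
}
```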

Method

Use the eval suite + Multi-Turn Cache Evolution table from the post-#608 baseline as a starting point. multi_turn_text_growth has 5 short chit-chat turns — minimal conversation payload, so the input size is almost entirely {persisted prompt + tool schemas + session block}.

From the current post-fix baseline (memory 6b42a0e4-8210-4e55-b9ca-8ff65c527cac):

multi_turn_text_growth  turn=1  input=5380  cached=4707  uncached=673
multi_turn_text_growth  turn=2  input=5038  cached=4864  uncached=174

That's roughly 5000 tokens of "static baseline" on each turn. SOUL.md + AGENTS.md + TOOLING.md account for some of it, tool schemas account for some of it. The question is what fraction.

Concrete steps

  1. Tap the Netclaw-side serialization point where tool schemas get added to the outgoing request. Log or capture the serialized JSON length of the tools field on one representative call (see the sketch after this list).
  2. Compare to the persisted system prompt length from ISystemPromptProvider.GetSystemPrompt().
  3. Compute the ratio: how much of a 5000-token static prefix is tool schemas? If it's >40%, trimming has measurable impact. If it's <10%, it's not worth optimizing.
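
A minimal C# sketch of steps 1-3. ToolDef is a hypothetical stand-in for whatever Netclaw actually serializes into the tools field, and chars/4 is a crude token heuristic; it's fine for a ratio, and the eval suite's real token counts can replace it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for the provider wire format; the real type and
// capture point will differ.
record ToolDef(string Name, string Description, JsonElement Parameters);

static class SchemaWeight
{
    // Crude heuristic: ~4 chars per token for English-ish JSON. Good
    // enough for a ratio, not for billing math.
    const double CharsPerToken = 4.0;

    public static void Report(IReadOnlyList<ToolDef> tools, string systemPrompt)
    {
        // Step 1: serialized length of the tools payload on one call.
        string toolsJson = JsonSerializer.Serialize(tools);
        double toolTokens = toolsJson.Length / CharsPerToken;

        // Step 2: persisted system prompt length, for comparison.
        double promptTokens = systemPrompt.Length / CharsPerToken;

        // Step 3: tool share of the static prefix.
        double share = toolTokens / (toolTokens + promptTokens);

        Console.WriteLine($"tools:  {toolsJson.Length} chars ~ {toolTokens:F0} tokens");
        Console.WriteLine($"prompt: {systemPrompt.Length} chars ~ {promptTokens:F0} tokens");
        Console.WriteLine($"tool share of static prefix: {share:P1}");
    }
}
```

Feed Report the tool list captured at the serialization point and the output of ISystemPromptProvider.GetSystemPrompt(); anything landing between the 10% and 40% thresholds in step 3 is a judgment call.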

Potential follow-up actions (depend on the measurement)

Not filing as separate issues until we have numbers, but candidates are:

  • Trim tool descriptions — if descriptions are verbose and repetitive, condensing them saves the same tokens on every single turn (a per-tool breakdown is sketched after this list).
  • More aggressive progressive disclosure — Netclaw already has search_tools for dynamic discovery. If the token savings are large, we could move more tools behind progressive discovery and let the agent pull them on demand.
  • Tool schema compression — some providers may accept a terser tool format than full JSON Schema; not available everywhere, but worth checking per provider.
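
If trimming does get filed, the first useful artifact is a per-tool breakdown showing where the weight actually sits. A sketch, reusing the hypothetical ToolDef and chars/4 heuristic from the measurement sketch above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

static class PerToolWeight
{
    // Heaviest tools first: verbose descriptions and example-heavy
    // parameter schemas float to the top.
    public static void Report(IReadOnlyList<ToolDef> tools)
    {
        var ordered = tools
            .Select(t => (t.Name, Chars: JsonSerializer.Serialize(t).Length))
            .OrderByDescending(x => x.Chars);

        foreach (var (name, chars) in ordered)
            Console.WriteLine($"{name,-32} {chars,6} chars ~ {chars / 4.0,5:F0} tokens");
    }
}
```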

Out of scope

  • Any code changes based on the measurement — this issue is just the measurement. Decisions on filing follow-up issues will happen after we see real numbers.
  • Non-measurement changes to the tool serialization pipeline.
