
investigation: measure tool schema token weight in system prompt #622

@Aaronontheweb

Summary

Every LLM call sends the JSON schema for every available tool in the request. Netclaw currently has ~30-40 tools registered, each with parameter schemas, descriptions, and examples. Measure how many tokens of each turn's input are actually tool definitions vs conversation content. The measurement itself is the deliverable — the decision on what to do with it depends on the numbers.

Why

Tool schemas are stable per-session — they don't change between turns unless the tool set itself changes (progressive disclosure, skill auto-load, etc.). That means they should live inside the cacheable prefix and ride along with the #608 fix for free. But there are two ways they could secretly be costing us:

  1. If the schemas are large enough (say, 5k+ tokens of pure schema), the uncached cost on the FIRST turn of every session is non-trivial. Even with caching, you pay the full cost once per session + any turn where the tool set changes.

  2. If the schemas sit in a position where they break prefix cache stability (e.g., if the tool list re-serializes with different ordering or whitespace each turn), they could be poisoning the cache we just worked so hard to make stable. This failure mode is cheap to detect; see the hash check sketched below.
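
That second failure mode is cheap to rule in or out. A minimal C# sketch, assuming you can tap the exact serialized tools JSON on each outgoing request (the hook point and names here are hypothetical, not Netclaw's actual types):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ToolsPayloadStability
{
    static string? _lastHash;

    // Call once per outgoing request with the exact string that goes on
    // the wire for the tools field. Any mid-session hash change means the
    // serializer is not deterministic and the prefix cache will miss.
    public static void Check(string toolsJson)
    {
        var hash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(toolsJson)));

        if (_lastHash is not null && _lastHash != hash)
            Console.Error.WriteLine(
                "tools payload changed between turns: prefix cache poison suspect");

        _lastHash = hash;
    }
}
```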

Method

Use the eval suite + Multi-Turn Cache Evolution table from the post-#608 baseline as a starting point. multi_turn_text_growth has 5 short chit-chat turns — minimal conversation payload, so the input size is almost entirely {persisted prompt + tool schemas + session block}.

From the current post-fix baseline (memory 6b42a0e4-8210-4e55-b9ca-8ff65c527cac):

multi_turn_text_growth  turn=1  input=5380  cached=4707  uncached=673
multi_turn_text_growth  turn=2  input=5038  cached=4864  uncached=174

That's roughly 5000 tokens of "static baseline" on each turn. SOUL.md + AGENTS.md + TOOLING.md account for some of it, tool schemas account for some of it. The question is what fraction.

Concrete steps

  1. Tap the Netclaw-side serialization point where tool schemas get added to the outgoing request. Log or capture the serialized JSON length of the tools field on one representative call (see the sketch after this list).
  2. Compare to the persisted system prompt length from ISystemPromptProvider.GetSystemPrompt().
  3. Compute the ratio: how much of a 5000-token static prefix is tool schemas? If it's >40%, trimming has measurable impact. If it's <10%, it's not worth optimizing.
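
A minimal C# sketch of steps 1-3. ToolDef is a hypothetical stand-in for whatever Netclaw actually serializes into the tools field, and chars/4 is a crude token heuristic; it's fine for a ratio, and the eval suite's real token counts can replace it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in for the provider wire format; the real type and
// capture point will differ.
record ToolDef(string Name, string Description, JsonElement Parameters);

static class SchemaWeight
{
    // Crude heuristic: ~4 chars per token for English-ish JSON. Good
    // enough for a ratio, not for billing math.
    const double CharsPerToken = 4.0;

    public static void Report(IReadOnlyList<ToolDef> tools, string systemPrompt)
    {
        // Step 1: serialized length of the tools payload on one call.
        string toolsJson = JsonSerializer.Serialize(tools);
        double toolTokens = toolsJson.Length / CharsPerToken;

        // Step 2: persisted system prompt length, for comparison.
        double promptTokens = systemPrompt.Length / CharsPerToken;

        // Step 3: tool share of the static prefix.
        double share = toolTokens / (toolTokens + promptTokens);

        Console.WriteLine($"tools:  {toolsJson.Length} chars ~ {toolTokens:F0} tokens");
        Console.WriteLine($"prompt: {systemPrompt.Length} chars ~ {promptTokens:F0} tokens");
        Console.WriteLine($"tool share of static prefix: {share:P1}");
    }
}
```

Feed Report the tool list captured at the serialization point and the output of ISystemPromptProvider.GetSystemPrompt(); anything landing between the 10% and 40% thresholds in step 3 is a judgment call.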

Potential follow-up actions (depend on the measurement)

Not filing as separate issues until we have numbers, but candidates are:

  • Trim tool descriptions — if descriptions are verbose and repetitive, condensing them saves the same tokens on every single turn (a per-tool breakdown is sketched after this list).
  • More aggressive progressive disclosure — Netclaw already has search_tools for dynamic discovery. If the token savings are large, we could move more tools behind progressive discovery and let the agent pull them on demand.
  • Tool schema compression — some providers may accept a terser tool format than full JSON Schema; not available everywhere, but worth checking per provider.
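
If trimming does get filed, the first useful artifact is a per-tool breakdown showing where the weight actually sits. A sketch, reusing the hypothetical ToolDef and chars/4 heuristic from the measurement sketch above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

static class PerToolWeight
{
    // Heaviest tools first: verbose descriptions and example-heavy
    // parameter schemas float to the top.
    public static void Report(IReadOnlyList<ToolDef> tools)
    {
        var ordered = tools
            .Select(t => (t.Name, Chars: JsonSerializer.Serialize(t).Length))
            .OrderByDescending(x => x.Chars);

        foreach (var (name, chars) in ordered)
            Console.WriteLine($"{name,-32} {chars,6} chars ~ {chars / 4.0,5:F0} tokens");
    }
}
```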

Out of scope

  • Any code changes based on the measurement — this issue is just the measurement. Decisions on filing follow-up issues will happen after we see real numbers.
  • Non-measurement changes to the tool serialization pipeline.
