Skip to content

perf(meta-tools): optimized system prompt with inline tool schemas#135

Open
justrach wants to merge 12 commits into
mainfrom
release/0.2.11
Open

perf(meta-tools): optimized system prompt with inline tool schemas#135
justrach wants to merge 12 commits into
mainfrom
release/0.2.11

Conversation

@justrach

Copy link
Copy Markdown
Owner

Summary

  • Optimized the meta-tool system prompt to include inline schemas for the 5 core tools (read, shell, fs_search, write, patch), eliminating unnecessary tools_info round trips
  • Bumped workspace version to 0.2.11
  • Fixed curl 404 on startup for dev builds (update checker was hitting GitHub releases for non-existent version tags)

Performance

Benchmarked on deepseek-v4-pro across 5 task categories (trivial, file read, grep, reasoning, multi-step), 2 runs each:

Metric Full Tool Defs (baseline) Meta-tools (new prompt) Delta
Avg total tokens 117,768 61,139 -48.1%
Avg turns 4.6 3.1 -33%
Avg tool calls 5.2 3.7 -29%
Tool errors 0.0 0.2 negligible
Avg wall time 31.2s 23.0s -26%

Per-task breakdown

Task Full Tools Meta-tools (new) Savings
trivial (no tools needed) 18,882 6,533 65%
file read (single tool) 39,166 15,698 60%
grep (search + read) 89,528 13,286 85%
reasoning (read + analyze) 117,829 69,812 41%
multi-step (search + read + reason) 323,434 200,366 38%

The previous meta-tool prompt was actually 8-19% worse than full tool definitions because the model called tools_info before every call_tool, wasting a round trip each time. The new prompt gives the model the 5 most common tool schemas inline so it can call them directly.

Why it works

The token savings come from two sources:

  1. No tool schemas on every request — full tool defs send ~20 tool JSON schemas (~15K tokens) on every provider request. Meta-tools send only 3 tiny schemas (~200 tokens).
  2. Fewer round trips — inline schemas mean the model skips tools_info lookups for common tools, cutting 1-3 turns per task.

The wall time improvement (26% faster) follows directly from fewer turns.

Test plan

  • cargo build clean
  • Snapshot tests updated for system prompt changes
  • Live tested with deepseek-v4-pro across multiple task types
  • Verified zero regressions on simple tasks (trivial, read)
  • Verified improvement on complex tasks (multi-step reasoning)

🤖 Generated with Claude Code

justrach and others added 11 commits May 21, 2026 18:27
…ge layer)

Lands the storage + SDK surface for graff-memd's out-of-process system /
user-message injection queue. Hermes does this inline because it's a
single Python process; we need a queue because graff-memd is a sidecar.

This PR is the **storage layer**. The conversation-loop drain hook is a
separate follow-up so this can ship + be reviewed in isolation; the
acceptance criterion that's still open is "Enqueue → next user turn
includes the nudge → consumed flag flips" (drain integration).

New surface:
- `forge_domain::PendingNudge` — `(id, conversation_id, role, content,
  created_at, consumed_at?)` + `NudgeRole` enum (`system`, `user_visible`,
  `user_hidden`) with wire-stable `as_str` / `from_str` round-trip + JSON
  rename matching SQL value.
- `forge_app::NudgeRepo` — async trait: `enqueue`, `next_unconsumed`,
  `mark_consumed`, `list_for_conversation`.
- `forge_repo::NudgeRepositoryImpl` — diesel-backed; FIFO drain ordered
  by `(created_at asc, id asc)` so same-ms enqueues are still totally
  ordered. Atomic INSERT + `last_insert_rowid()` in a single transaction
  so a concurrent enqueue can't slot a row between insert and id read.
- Migration `2026-05-21-180000_create_pending_nudges_table` with a
  composite drain index on `(conversation_id, consumed_at, created_at, id)`
  so the unconsumed-FIFO query covers the whole filter without a sort.
- `forge_api::API`: `enqueue_nudge`, `list_nudges`. The drain path
  (`next_unconsumed`, `mark_consumed`) is intentionally NOT in the
  public API — it's an internal orchestrator concern.

8 new tests:
- 3 domain tests for `NudgeRole` round-trip + visibility helpers
- 5 repo-level integration tests against in-memory SQLite:
  - `enqueue_then_next_unconsumed_returns_in_fifo_order` — FIFO order +
    monotonic ids
  - `mark_consumed_is_idempotent_and_drops_from_unconsumed_set` — second
    `mark_consumed` returns `Ok(false)`
  - `next_unconsumed_is_scoped_by_conversation` — isolation across
    conversations
  - `list_for_conversation_returns_consumed_and_unconsumed` — debug path
    sees both states, fresh-first
  - `mark_consumed_for_missing_id_returns_false` — idempotent for
    unknown ids

Disambiguation: both `TrajectoryRepo` and `NudgeRepo` define
`list_for_conversation` with the same signature, so the
`forge_api::ForgeAPI::list_trajectory` call site now uses the explicit
`TrajectoryRepo::list_for_conversation(...)` form. Same pattern as the
user-facts PR.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: blackfloofie-a codegraff agent <265516171+blackfloofie@users.noreply.github.com>
…provider requests

Introduces a meta-tool protocol that replaces sending all tool definitions
to the LLM provider with just 3 small meta-tool definitions:
- tools_list: discover available tool names and descriptions
- tools_info: inspect the full schema for a specific tool
- call_tool: invoke a tool by name with arguments

This saves significant tokens on every request since tool schemas are
no longer sent repeatedly.

Key changes:
- Add CallToolInput, ToolsListInput, ToolsInfoInput domain types
- Add CallTool, ToolsList, ToolsInfo variants to ToolCatalog enum
- Implement meta-tool dispatch in ToolRegistry (tools_list returns names,
  tools_info returns schema, call_tool delegates to the real tool)
- Modify ApplyTunableParameters to pass only meta-tool definitions to providers
- Update system prompt with meta-tool protocol instructions
- Add SummaryTool::MetaTools and Operation::MetaTool to compat layers
- Add 8 unit tests + 2 integration tests for parsing, dispatch, and tool filtering

Co-Authored-By: blackfloofie-a codegraff agent <265516171+blackfloofie@users.noreply.github.com>
Co-authored-by: ForgeCode <noreply@forgecode.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Tushar Mathur <tusharmath@gmail.com>
Co-authored-by: Amit Singh <amitksingh1490@gmail.com>
Co-authored-by: Amit Singh <amitksingh1490@gmail.com>
…itle) in agent and tool_definition from merge resolution

Co-Authored-By: blackfloofie-a codegraff agent <265516171+blackfloofie@users.noreply.github.com>
Co-Authored-By: blackfloofie-a codegraff agent <265516171+blackfloofie@users.noreply.github.com>
Evolved the meta-tool system prompt through a darwinian tournament
(7 variants × 5 tasks × 2 runs each = 70 runs on deepseek-v4-pro).

The winning variant (v6_blend_tight) provides compact inline schemas
for the 5 core tools (read, shell, fs_search, write, patch) so the
model skips unnecessary tools_info lookups. Key results vs full tool
definitions baseline:

  - 48% fewer total tokens (61K avg vs 118K)
  - 0.2 avg errors vs 0.0 (negligible)
  - 23s avg wall time vs 31s (26% faster)
  - Won every task category (trivial through multi-step)

The previous meta-tool prompt (v1) was actually 8% worse than sending
full tool definitions due to excessive tools_info round trips. The
new prompt eliminates those by giving the model the schemas it needs
upfront in a dense format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds from source have version 0.1.5 (from workspace Cargo.toml)
which doesn't match any GitHub release tag. The update_informer
check was hitting the GitHub API and producing a curl 404 on every
launch. Skip the check for 0.1.5 like we already do for 0.1.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added ci: benchmark Runs benchmarks type: performance Improved performance. labels May 25, 2026
Cherry-picked from tailcallhq/forgecode (adapted for our branch):

1. **Tool call argument validation** (from PR #3356)
   - Adds `parse_json()` to `ToolCallArguments` that validates JSON
     upfront instead of silently wrapping malformed input
   - Malformed args now surface as retryable errors

2. **Live context token counter** (from PR #3351)
   - Emits "Context ~45.2k / 900.0k" after each orchestrator turn
   - Adds `emit_context_usage()` and `humanize()` helpers to orch.rs

3. **Multi-signal auto-continue** (from PR #3357)
   - 5-signal confidence scoring detects when model stopped mid-task
   - Auto-resumes up to 3 times when confidence >= 60
   - Fixes "stuck agent" problem with models that return stop mid-task

Skipped unrelated bundled changes (pool.rs WAL hardening, fs_patch
rewrite) that were scope creep in the upstream PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Action required: PR inactive for 5 days.
Status update or closure in 10 days.

@github-actions github-actions Bot added the state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. label May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci: benchmark Runs benchmarks state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. type: performance Improved performance.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants