From edd4ec81e3cfc1b402eab1ef7e5912873df32570 Mon Sep 17 00:00:00 2001
From: ProfSynapse
Date: Tue, 19 May 2026 17:26:29 -0400
Subject: [PATCH 1/3] docs: audit eval harness tool schemas vs production

Catalogues field-level drift between tests/eval/fixtures/tools.ts NEXUS_TOOLS and current production getParameterSchema() returns. 8 of 11 fixture entries drift; documents the v5.9.0 contentManager_replace hard-break and several production-side tools missing from the fixture.

Filed under docs/research/ because docs/eval/ is gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/research/eval-harness-schema-audit.md | 212 +++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 docs/research/eval-harness-schema-audit.md

diff --git a/docs/research/eval-harness-schema-audit.md b/docs/research/eval-harness-schema-audit.md
new file mode 100644
index 000000000..2a0ab0e57
--- /dev/null
+++ b/docs/research/eval-harness-schema-audit.md
@@ -0,0 +1,212 @@
+# Eval Harness Tool Schema Audit
+
+Audit of `tests/eval/fixtures/tools.ts` (`NEXUS_TOOLS` array) against current production `getParameterSchema()` returns. Produced under Task #10 of the `fix/eval-harness-cli-schema` branch.
+
+## Method
+
+For each fixture entry in `NEXUS_TOOLS`, located the corresponding production tool class under `src/agents/*/tools/*.ts`, read its `getParameterSchema()` method, and compared required fields, optional fields, types, enums, and descriptions. Production schemas are wrapped by `getMergedSchema()` which merges in `CommonParameters` (workspaceId/sessionId/memory/goal/constraints); for this audit, the toolSchema (pre-merge) is the unit of comparison since the harness fixture does not include the common parameters.
+
+## Summary Table
+
+| Tool name in fixture | Production class | Status | Drifted fields |
+|----------------------|------------------|--------|---------------|
+| `contentManager_read` | `ContentManager` `ReadTool` (read.ts) | match | 0 |
+| `contentManager_write` | `ContentManager` `WriteTool` (write.ts) | drift | 1 (missing `overwrite`) |
+| `contentManager_insert` | `ContentManager` `InsertTool` (insert.ts) | drift | 4 (`position`/`lineNumber` instead of `startLine`, different semantics) |
+| `contentManager_replace` | `ContentManager` `ReplaceTool` (replace.ts) | drift (HARD) | 4 (`search`/`replace` two-field vs `start`/`end`/`content` three-field anchor model) |
+| `storageManager_move` | `StorageManager` `MoveTool` (move.ts) | drift | 2 (`destination` vs `newPath`; missing `overwrite`) |
+| `storageManager_copy` | `StorageManager` `CopyTool` (copy.ts) | drift | 2 (`destination` vs `newPath`; missing `overwrite`) |
+| `storageManager_archive` | `StorageManager` `ArchiveTool` (archive.ts) | match | 0 |
+| `storageManager_createFolder` | `StorageManager` `CreateFolderTool` (createFolder.ts) | match | 0 |
+| `storageManager_list` | `StorageManager` `ListTool` (list.ts) | drift | 2 (`path` listed as required, missing `filter`) |
+| `searchManager_content` | `SearchManager` `SearchContentTool` (searchContent.ts) | drift | 4 (missing `semantic`/`includeContent`/`snippetLength`/`paths`) |
+| `searchManager_directory` | `SearchManager` `SearchDirectoryTool` (searchDirectory.ts) | drift | 5+ (`paths` should be required, missing `fileTypes`/`depth`/`pattern`/`dateRange`/`limit`/`includeContent`; `searchType` lacks enum) |
+
+11 entries audited: 3 match, 8 drift. The HARD drift on `contentManager_replace` is the v5.9.0 schema break called out in CLAUDE.md.
+
+## Per-Drift Detail
+
+### contentManager_write (drift, 1 field)
+
+Production at `src/agents/contentManager/tools/write.ts:174-196`:
+- `path` (string, required)
+- `content` (string, required)
+- `overwrite` (boolean, optional, default: false)
+
+Fixture at `tests/eval/fixtures/tools.ts:33-47`:
+- `path` (string, required)
+- `content` (string, required)
+
+Drift: `overwrite` is missing in fixture. Not breaking (defaults to false), but the harness LLM cannot exercise the overwrite path.
+
+---
+
+### contentManager_insert (drift, semantic redesign)
+
+Production at `src/agents/contentManager/tools/insert.ts:128-149`:
+- `path` (string, required)
+- `content` (string, required)
+- `startLine` (number, required) — line-based: `1` to prepend, `-1` to append, `N` to insert before line N
+
+Fixture at `tests/eval/fixtures/tools.ts:48-64`:
+- `path` (string, required)
+- `content` (string, required)
+- `position` (string, required) — generic "position" string
+- `lineNumber` (number, optional)
+
+Drift: Fixture uses a `position` string + optional `lineNumber` shape that does not exist in production. Production uses a single integer `startLine` with sentinel values (`1`, `-1`, `N`). Fixture would mislead the model into emitting a `position: "append"` shape that production rejects. From CLAUDE.md pin: `append`/`prepend` actions in executePrompts route to `insert` — same single-integer convention.
+
+---
+
+### contentManager_replace (drift, HARD — v5.9.0 break)
+
+Production at `src/agents/contentManager/tools/replace.ts:202-227`:
+- `path` (string, required)
+- `start` (string, required) — content-anchor opening line(s), must be globally unique
+- `end` (string, required) — content-anchor closing line(s), must be after `start`
+- `content` (string, required) — replacement text; empty string deletes the range
+
+Fixture at `tests/eval/fixtures/tools.ts:65-80`:
+- `path` (string, required)
+- `search` (string, required) — text to find
+- `replace` (string, required) — replacement text
+
+Drift: Production switched from search/replace semantics to pattern-anchored range replacement in v5.9.0 (per CLAUDE.md pin). The new model identifies a contiguous range using `start`/`end` line anchors and replaces it with `content`; line numbers are never required. The fixture's `search`/`replace` shape predates this break and bears no field-name overlap with production. This is the most severe drift in the fixture.
+
+Evidence: CLAUDE.md pinned context — "v5.9.0 — Pattern-anchored content replace (PR #206): hard schema break from `{path, oldContent, newContent, startLine, endLine}` to 4-field `{path, start, end, content}` on both `ContentManager.replace` and `executePrompts.replace`."
+
+---
+
+### storageManager_move (drift, 2 fields)
+
+Production at `src/agents/storageManager/tools/move.ts:93-117`:
+- `path` (string, required)
+- `newPath` (string, required)
+- `overwrite` (boolean, optional, default: false)
+
+Fixture at `tests/eval/fixtures/tools.ts:81-95`:
+- `path` (string, required)
+- `destination` (string, required)
+
+Drift: Field name is `newPath` in production, not `destination`. Fixture also omits `overwrite`. A model emitting `{ path, destination }` would have its destination argument silently dropped by production.
+
+---
+
+### storageManager_copy (drift, 2 fields)
+
+Production at `src/agents/storageManager/tools/copy.ts:84-107`:
+- `path` (string, required)
+- `newPath` (string, required)
+- `overwrite` (boolean, optional, default: false)
+
+Fixture at `tests/eval/fixtures/tools.ts:96-110`:
+- `path` (string, required)
+- `destination` (string, required)
+
+Drift: Same `destination` vs `newPath` mismatch as move. Same missing `overwrite`.
+
+---
+
+### storageManager_list (drift, 2 fields)
+
+Production at `src/agents/storageManager/tools/list.ts:167-185`:
+- `path` (string, optional, default: '') — empty string / `/` / `.` is vault root
+- `filter` (string, optional)
+- `required: []` — both fields optional
+
+Fixture at `tests/eval/fixtures/tools.ts:139-152`:
+- `path` (string, required)
+
+Drift: Production has `path` as optional with vault-root default; fixture marks it required. Fixture also missing the `filter` option. A model calling `storageManager_list` with no args is valid in production but rejected by the fixture schema.
+
+---
+
+### searchManager_content (drift, 4 fields)
+
+Production at `src/agents/searchManager/tools/searchContent.ts:472-517`:
+- `query` (string, required)
+- `semantic` (boolean, optional, default: false) — true for vector search
+- `limit` (number, optional, default: 10, min 1, max 50)
+- `includeContent` (boolean, optional, default: true)
+- `snippetLength` (number, optional, default: 200, min 50, max 1000)
+- `paths` (array of string, optional) — folder paths or glob patterns
+
+Fixture at `tests/eval/fixtures/tools.ts:153-167`:
+- `query` (string, required)
+- `limit` (number, optional)
+
+Drift: Fixture is missing `semantic`, `includeContent`, `snippetLength`, and `paths`. The `semantic` flag in particular is significant — production exposes both keyword and AI-powered semantic search through this one tool; the fixture only exposes the keyword path.
+
+---
+
+### searchManager_directory (drift, 5+ fields, plus required-list mismatch)
+
+Production at `src/agents/searchManager/tools/searchDirectory.ts:208-290`:
+- `query` (string, required, minLength 1)
+- `paths` (array of string, required, minItems 1)
+- `searchType` (string enum `'files'|'folders'|'both'`, optional, default: `'both'`)
+- `fileTypes` (array of string, optional)
+- `depth` (number, optional, 1–10)
+- `pattern` (string, optional) — regex filter
+- `dateRange` (object with start/end YYYY-MM-DD, optional)
+- `limit` (number, optional, default: 20, 1–100)
+- `includeContent` (boolean, optional, default: true)
+
+Fixture at `tests/eval/fixtures/tools.ts:168-183`:
+- `query` (string, required)
+- `paths` (array of string, optional) — listed in properties but NOT in required
+- `searchType` (string, optional) — no enum constraint
+
+Drift: (a) `paths` is required in production but listed as optional in the fixture — opposite required-set. (b) `searchType` lacks the `files|folders|both` enum in fixture. (c) Six fields missing from fixture (`fileTypes`, `depth`, `pattern`, `dateRange`, `limit`, `includeContent`).
+
+## Production-side tools NOT in the fixture
+
+The fixture covers `contentManager` (4 of 5 tools), `storageManager` (5 of 6 tools), and `searchManager` (2 of 3 tools). The following production tools have no fixture representation:
+
+| Production tool | Agent | Source |
+|-----------------|-------|--------|
+| `contentManager_setProperty` | ContentManager | `src/agents/contentManager/tools/setProperty.ts` — set frontmatter property, replace/merge modes |
+| `storageManager_open` | StorageManager | `src/agents/storageManager/tools/open.ts` — open file in Obsidian editor |
+| `searchManager_memory` | SearchManager | `src/agents/searchManager/tools/searchMemory.ts` — search memory traces / states / conversations |
+| `memoryManager_*` (full agent) | MemoryManager | createSession, loadSession, createWorkspace, createState, etc. |
+| `canvasManager_*` (full agent) | CanvasManager | read, write, update, list |
+| `taskManager_*` (full agent) | TaskManager | createProject, listProjects, createTask, listTasks, updateTask, moveTask, queryTasks, linkNote |
+| `promptManager_*` (full agent) | PromptManager | listModels, executePrompts, createPrompt, updatePrompt, deletePrompt, listPrompts, getPrompt, generateImage |
+| `ingestManager_*` (full agent) | IngestManager | ingest, listCapabilities |
+| App agents (webTools, composer) | apps/ | openWebpage, capturePagePdf, capturePagePng, captureToMarkdown, extractLinks, compose, listFormats |
+
+The Task #8 eval run showed the LLM calling `searchManager_memory`, `canvasManager_list`, etc. — names that exist in production but were rejected as hallucinations by the harness because they are not in the fixture. After the Task #11 schema swap, this concern goes away: the LLM will only see `getTools`/`useTools` and discover available tools dynamically.
+
+## Fixture-side tools NOT in production
+
+None. Every fixture entry maps to a production tool class. The drifts above are field-level / shape-level mismatches, not phantom tools.
+
+## Implications for Task #11 (schema swap)
+
+The audit confirms the team-lead's framing: the harness fixture has substantial drift across 8 of 11 entries plus 6+ missing production tools, but Task #11's plan is to swap the entire `NEXUS_TOOLS` array for the two-tool surface (`getTools` + `useTools`). After the swap:
+
+- The harness LLM only sees the two-tool MCP shape.
+- The executor parses the `useTools.tool` CLI string via the real `ToolCliNormalizer`.
+- Drifts above stop mattering for the LLM-facing surface.
+- They still matter for the executor: when it parses `content replace --path foo.md --start "..." --end "..." --content "..."` it must route to the production 4-field schema, not the obsolete 3-field one. This audit is the reference for getting that routing right.
+
+## Evidence Index
+
+| Tool | Production file:line |
+|------|----------------------|
+| read | `src/agents/contentManager/tools/read.ts:110-131` |
+| write | `src/agents/contentManager/tools/write.ts:174-196` |
+| insert | `src/agents/contentManager/tools/insert.ts:128-149` |
+| replace | `src/agents/contentManager/tools/replace.ts:202-227` |
+| setProperty | `src/agents/contentManager/tools/setProperty.ts:161-193` |
+| list | `src/agents/storageManager/tools/list.ts:167-185` |
+| move | `src/agents/storageManager/tools/move.ts:93-117` |
+| copy | `src/agents/storageManager/tools/copy.ts:84-107` |
+| archive | `src/agents/storageManager/tools/archive.ts:110-125` |
+| createFolder | `src/agents/storageManager/tools/createFolder.ts:68-83` |
+| open | `src/agents/storageManager/tools/open.ts` |
+| searchContent | `src/agents/searchManager/tools/searchContent.ts:472-517` |
+| searchDirectory | `src/agents/searchManager/tools/searchDirectory.ts:208-290` |
+| searchMemory | `src/agents/searchManager/tools/searchMemory.ts` |
+
+Fixture under audit: `tests/eval/fixtures/tools.ts` lines 16-184 (`NEXUS_TOOLS`). The file also contains `META_TOOLS` (`getTools`/`useTools`, lines 189-234) and `SIMPLE_TOOLS` (weather/time mocks, lines 239-268), neither of which is in scope for this audit.

From 0290631ad3615e11e132b1af9597a4b571bcf3a9 Mon Sep 17 00:00:00 2001
From: ProfSynapse
Date: Tue, 19 May 2026 17:33:59 -0400
Subject: [PATCH 2/3] test(eval-harness): convert 5 nexus-mode scenarios to two-tool meta architecture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The eval harness exposed direct domain tool schemas (NEXUS_TOOLS) to the LLM even though production only ships getTools/useTools. The contradiction inflated failure rates for models that picked the meta tools (correct production behavior) while the harness asserted on direct domain calls.
Convert all toolSet: nexus scenarios — adversarial, basic-tool-call, multi-turn, provider-parity, system-prompt — to toolSet: meta. Each direct domain call becomes a getTools selector turn + useTools CLI-command turn. Inner domain mock responses kept alongside getTools/useTools responses so both LiveToolExecutor (mirror-parse capture) and EvalToolExecutor (inner-unwrap capture) match. After: 0 scenarios use toolSet: nexus; all 12 scenario files use toolSet: meta exclusively. Harness now matches production two-tool MCP contract. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/eval/scenarios/adversarial.eval.yaml | 127 ++++++++++-- .../eval/scenarios/basic-tool-call.eval.yaml | 132 +++++++++++-- tests/eval/scenarios/multi-turn.eval.yaml | 182 ++++++++++++++++-- .../eval/scenarios/provider-parity.eval.yaml | 118 +++++++++++- tests/eval/scenarios/system-prompt.eval.yaml | 122 ++++++++++-- 5 files changed, 627 insertions(+), 54 deletions(-) diff --git a/tests/eval/scenarios/adversarial.eval.yaml b/tests/eval/scenarios/adversarial.eval.yaml index e0cc3a015..0fec4250f 100644 --- a/tests/eval/scenarios/adversarial.eval.yaml +++ b/tests/eval/scenarios/adversarial.eval.yaml @@ -1,12 +1,43 @@ - name: ambiguous-prompt - description: Ambiguous user request — model should search or ask for clarification - toolSet: nexus + description: Ambiguous user request — model should search or ask for clarification (via two-tool meta) + toolSet: meta turns: - userMessage: "Do something with my notes about the project" expectedTools: - - name: searchManager_searchContent + - name: getTools + optional: true + - name: useTools optional: true mockResponses: + getTools: + success: true + result: + tools: + - agent: searchManager + tool: searchContent + description: Search for notes by content keyword + command: "search content" + usage: "search content " + arguments: + - name: query + flag: --query + type: string + required: true + positional: true + examples: + - 'search content 
"query-value"' + useTools: + success: true + result: + results: + - tool: searchManager_searchContent + success: true + result: + results: + - path: notes/project.md + score: 0.88 + snippet: "Project overview and status..." + totalResults: 1 searchManager_searchContent: success: true result: @@ -17,22 +48,48 @@ totalResults: 1 - name: tool-returns-error - description: Tool call returns an error — model should handle gracefully - toolSet: nexus + description: Tool call returns an error — model should handle gracefully (via two-tool meta) + toolSet: meta turns: - userMessage: "Read the file at notes/secret.md" expectedTools: - - name: contentManager_read + - name: getTools + params: + tool: "content" + mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + + - expectedTools: + - name: useTools params: - path: notes/secret.md + tool: "content read" mockResponses: + useTools: + success: false + error: "Permission denied: cannot access protected file" contentManager_read: success: false error: "Permission denied: cannot access protected file" - name: model-refuses-tools description: Model responds with text instead of calling tools — acceptable fallback - toolSet: nexus + toolSet: meta temperature: 0.3 turns: - userMessage: "What do you think about the weather today?" 
@@ -40,13 +97,61 @@ mockResponses: {} - name: large-tool-response - description: Tool returns a large payload — continuation should not break - toolSet: nexus + description: Tool returns a large payload — continuation should not break (via two-tool meta) + toolSet: meta turns: - userMessage: "List all files in the vault root" expectedTools: - - name: storageManager_list + - name: getTools + params: + tool: "storage" mockResponses: + getTools: + success: true + result: + tools: + - agent: storageManager + tool: list + description: List files and folders in a directory + command: "storage list" + usage: "storage list [--path ]" + arguments: + - name: path + flag: --path + type: string + required: false + positional: false + examples: + - 'storage list' + + - expectedTools: + - name: useTools + params: + tool: "storage list" + mockResponses: + useTools: + success: true + result: + results: + - tool: storageManager_list + success: true + result: + path: / + files: + - {name: "note1.md", type: "file"} + - {name: "note2.md", type: "file"} + - {name: "note3.md", type: "file"} + - {name: "note4.md", type: "file"} + - {name: "note5.md", type: "file"} + - {name: "note6.md", type: "file"} + - {name: "note7.md", type: "file"} + - {name: "note8.md", type: "file"} + - {name: "note9.md", type: "file"} + - {name: "note10.md", type: "file"} + folders: + - {name: "archive", type: "folder"} + - {name: "notes", type: "folder"} + - {name: "projects", type: "folder"} storageManager_list: success: true result: diff --git a/tests/eval/scenarios/basic-tool-call.eval.yaml b/tests/eval/scenarios/basic-tool-call.eval.yaml index f12cad3ad..2acf9399c 100644 --- a/tests/eval/scenarios/basic-tool-call.eval.yaml +++ b/tests/eval/scenarios/basic-tool-call.eval.yaml @@ -1,13 +1,48 @@ - name: read-single-note - description: Basic single tool call — read a note by path - toolSet: nexus + description: Basic single tool call via two-tool meta — read a note by path + toolSet: meta turns: - userMessage: 
"Read the file at notes/meeting.md" expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/meeting.md + tool: "content" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + + - expectedTools: + - name: useTools + params: + tool: "content read" + mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Q2 Meeting\n- Roadmap reviewed\n- Budget approved\n- Launch date: June 15" + path: notes/meeting.md + totalLines: 4 + startLine: 1 + endLine: 4 contentManager_read: success: true result: @@ -18,15 +53,52 @@ endLine: 4 - name: write-new-note - description: Basic single tool call — write a new note - toolSet: nexus + description: Basic single tool call via two-tool meta — write a new note + toolSet: meta turns: - userMessage: "Create a note at notes/todo.md with the content 'Buy groceries'" expectedTools: - - name: contentManager_write + - name: getTools params: - path: notes/todo.md + tool: "content" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: write + description: Write content to a file + command: "content write" + usage: "content write " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + - name: content + flag: --content + type: string + required: true + positional: true + examples: + - 'content write "path-value" "content-value"' + + - expectedTools: + - name: useTools + params: + tool: "content write" + mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_write + success: true + result: + path: notes/todo.md + created: true contentManager_write: 
success: true result: @@ -34,15 +106,51 @@ created: true - name: search-notes - description: Basic single tool call — search for notes - toolSet: nexus + description: Basic single tool call via two-tool meta — search for notes + toolSet: meta turns: - userMessage: "Find notes about project roadmap" expectedTools: - - name: searchManager_searchContent + - name: getTools params: - query: project roadmap + tool: "search" mockResponses: + getTools: + success: true + result: + tools: + - agent: searchManager + tool: searchContent + description: Search for notes by content keyword + command: "search content" + usage: "search content " + arguments: + - name: query + flag: --query + type: string + required: true + positional: true + examples: + - 'search content "query-value"' + + - expectedTools: + - name: useTools + params: + tool: "search content" + mockResponses: + useTools: + success: true + result: + results: + - tool: searchManager_searchContent + success: true + result: + results: + - path: notes/roadmap-q2.md + score: 0.95 + snippet: "Q2 roadmap priorities..." 
+ totalResults: 1 + query: project roadmap searchManager_searchContent: success: true result: diff --git a/tests/eval/scenarios/multi-turn.eval.yaml b/tests/eval/scenarios/multi-turn.eval.yaml index a123df3e4..38c92ad5a 100644 --- a/tests/eval/scenarios/multi-turn.eval.yaml +++ b/tests/eval/scenarios/multi-turn.eval.yaml @@ -1,49 +1,174 @@ - name: read-then-write-then-move - description: Read a note, write a summary, move it to archive - toolSet: nexus + description: Read a note, write a summary, move it to archive — via two-tool meta + toolSet: meta allowReorder: true turns: - userMessage: "Read notes/meeting.md, write a summary to notes/summary.md, then move it to archive/" expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/meeting.md + tool: "content" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + - agent: contentManager + tool: write + description: Write content to a file + command: "content write" + usage: "content write " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + - name: content + flag: --content + type: string + required: true + positional: true + examples: + - 'content write "path-value" "content-value"' + + - expectedTools: + - name: useTools + params: + tool: "content read" + mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Q2 Meeting\n- Roadmap reviewed\n- Budget approved\n- Launch date: June 15" contentManager_read: success: true result: content: "# Q2 Meeting\n- Roadmap reviewed\n- Budget approved\n- Launch date: June 15" - expectedTools: - - name: contentManager_write + - name: useTools params: - 
path: notes/summary.md + tool: "content write" mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_write + success: true + result: + path: notes/summary.md contentManager_write: success: true result: path: notes/summary.md - expectedTools: - - name: storageManager_move + - name: getTools params: - path: notes/summary.md - destination: archive/ + tool: "storage" + optional: true + - name: useTools + params: + tool: "storage move" mockResponses: + getTools: + success: true + result: + tools: + - agent: storageManager + tool: move + description: Move a file to a new location + command: "storage move" + usage: "storage move " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + - name: destination + flag: --destination + type: string + required: true + positional: true + examples: + - 'storage move "path-value" "destination-value"' + useTools: + success: true + result: + results: + - tool: storageManager_move + success: true + result: + newPath: archive/summary.md storageManager_move: success: true result: newPath: archive/summary.md - name: search-and-read - description: Search for notes about a topic, then read the top result - toolSet: nexus + description: Search for notes about a topic, then read the top result — via two-tool meta + toolSet: meta allowReorder: true turns: - userMessage: "Find notes about project roadmap and show me the full content of the best match" expectedTools: - - name: searchManager_searchContent + - name: getTools + params: + tool: "search" + mockResponses: + getTools: + success: true + result: + tools: + - agent: searchManager + tool: searchContent + description: Search for notes by content keyword + command: "search content" + usage: "search content " + arguments: + - name: query + flag: --query + type: string + required: true + positional: true + examples: + - 'search content "query-value"' + + - expectedTools: + - name: useTools + params: + tool: "search 
content" mockResponses: + useTools: + success: true + result: + results: + - tool: searchManager_searchContent + success: true + result: + results: + - path: notes/roadmap-q2.md + score: 0.95 searchManager_searchContent: success: true result: @@ -52,10 +177,39 @@ score: 0.95 - expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/roadmap-q2.md + tool: "content" + optional: true + - name: useTools + params: + tool: "content read" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Q2 Roadmap\n1. Mobile launch\n2. Plugin store\n3. Performance" contentManager_read: success: true result: diff --git a/tests/eval/scenarios/provider-parity.eval.yaml b/tests/eval/scenarios/provider-parity.eval.yaml index fc48444a4..d63d20695 100644 --- a/tests/eval/scenarios/provider-parity.eval.yaml +++ b/tests/eval/scenarios/provider-parity.eval.yaml @@ -1,13 +1,50 @@ - name: parity-single-tool-call description: Same basic tool call scenario — should pass across all configured providers - toolSet: nexus + toolSet: meta turns: - userMessage: "Read the file notes/meeting.md" expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/meeting.md + tool: "content" + mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + - name: startLine + flag: --start-line + type: number + required: true + positional: true + 
examples: + - 'content read "path-value" 1' + + - expectedTools: + - name: useTools + params: + tool: "content read" mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Meeting Notes\n- Discussed roadmap\n- Budget approved" + path: notes/meeting.md contentManager_read: success: true result: @@ -16,13 +53,50 @@ - name: parity-search-then-read description: Two-turn flow — search then read — should work across all providers - toolSet: nexus + toolSet: meta allowReorder: true turns: - userMessage: "Search for notes about meetings and read the first result" expectedTools: - - name: searchManager_searchContent + - name: getTools + params: + tool: "search" mockResponses: + getTools: + success: true + result: + tools: + - agent: searchManager + tool: searchContent + description: Search for notes by content keyword + command: "search content" + usage: "search content " + arguments: + - name: query + flag: --query + type: string + required: true + positional: true + examples: + - 'search content "query-value"' + + - expectedTools: + - name: useTools + params: + tool: "search content" + mockResponses: + useTools: + success: true + result: + results: + - tool: searchManager_searchContent + success: true + result: + results: + - path: notes/meeting.md + score: 0.92 + snippet: "Meeting notes..." 
+ totalResults: 1 searchManager_searchContent: success: true result: @@ -33,8 +107,40 @@ totalResults: 1 - expectedTools: - - name: contentManager_read + - name: getTools + params: + tool: "content" + optional: true + - name: useTools + params: + tool: "content read" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Meeting\n- Items discussed" + path: notes/meeting.md contentManager_read: success: true result: diff --git a/tests/eval/scenarios/system-prompt.eval.yaml b/tests/eval/scenarios/system-prompt.eval.yaml index d0893e903..6c25816af 100644 --- a/tests/eval/scenarios/system-prompt.eval.yaml +++ b/tests/eval/scenarios/system-prompt.eval.yaml @@ -1,13 +1,45 @@ - name: default-prompt-routes-correctly - description: Production prompt routes a read request to contentManager_read - toolSet: nexus + description: Production prompt routes a read request via the two-tool meta architecture + toolSet: meta turns: - userMessage: "Read the file notes/meeting.md" expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/meeting.md + tool: "content" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + + - expectedTools: + - name: useTools + params: + tool: "content read" + mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_read + success: 
true + result: + content: "# Meeting Notes\n- Reviewed roadmap" + path: notes/meeting.md contentManager_read: success: true result: @@ -15,15 +47,47 @@ path: notes/meeting.md - name: minimal-prompt-still-uses-tools - description: Production prompt triggers tool usage for a direct request - toolSet: nexus + description: Production prompt triggers two-tool flow for a direct read request + toolSet: meta turns: - userMessage: "Show me what's in notes/daily.md" expectedTools: - - name: contentManager_read + - name: getTools params: - path: notes/daily.md + tool: "content" mockResponses: + getTools: + success: true + result: + tools: + - agent: contentManager + tool: read + description: Read content from a file + command: "content read" + usage: "content read " + arguments: + - name: path + flag: --path + type: string + required: true + positional: true + examples: + - 'content read "path-value" 1' + + - expectedTools: + - name: useTools + params: + tool: "content read" + mockResponses: + useTools: + success: true + result: + results: + - tool: contentManager_read + success: true + result: + content: "# Daily Notes\n- Standup at 10am" + path: notes/daily.md contentManager_read: success: true result: @@ -31,13 +95,49 @@ path: notes/daily.md - name: restrictive-prompt-limits-tools - description: Production prompt routes search request to searchManager - toolSet: nexus + description: Production prompt routes search request via the two-tool meta architecture + toolSet: meta turns: - userMessage: "Search for any notes about the quarterly review" expectedTools: - - name: searchManager_searchContent + - name: getTools + params: + tool: "search" mockResponses: + getTools: + success: true + result: + tools: + - agent: searchManager + tool: searchContent + description: Search for notes by content keyword + command: "search content" + usage: "search content " + arguments: + - name: query + flag: --query + type: string + required: true + positional: true + examples: + - 'search 
content "query-value"' + + - expectedTools: + - name: useTools + params: + tool: "search content" + mockResponses: + useTools: + success: true + result: + results: + - tool: searchManager_searchContent + success: true + result: + results: + - path: notes/q2-review.md + score: 0.91 + totalResults: 1 searchManager_searchContent: success: true result: From 3b294e7de25c78f9a5dfebf41ee461371bc705c3 Mon Sep 17 00:00:00 2001 From: ProfSynapse Date: Tue, 19 May 2026 20:30:20 -0400 Subject: [PATCH 3/3] test(eval): refresh eval harness CLI commands --- docs/research/eval-harness-schema-audit.md | 6 +-- tests/eval/fixtures/tools.ts | 40 +++++++++------ tests/eval/scenarios/adversarial.eval.yaml | 6 +-- .../eval/scenarios/basic-tool-call.eval.yaml | 6 +-- .../scenarios/content-operations.eval.yaml | 32 ++++++------ .../eval/scenarios/debug-multi-turn.eval.yaml | 10 ++-- tests/eval/scenarios/multi-turn.eval.yaml | 14 +++--- .../eval/scenarios/provider-parity.eval.yaml | 6 +-- .../scenarios/search-variations.eval.yaml | 50 +++++++++---------- .../scenarios/storage-operations.eval.yaml | 16 +++--- tests/eval/scenarios/system-prompt.eval.yaml | 6 +-- tests/eval/scenarios/tool-discovery.eval.yaml | 40 +++++++-------- tests/eval/scenarios/vague-prompts.eval.yaml | 34 ++++++------- 13 files changed, 139 insertions(+), 127 deletions(-) diff --git a/docs/research/eval-harness-schema-audit.md b/docs/research/eval-harness-schema-audit.md index 2a0ab0e57..8567a16d7 100644 --- a/docs/research/eval-harness-schema-audit.md +++ b/docs/research/eval-harness-schema-audit.md @@ -175,7 +175,7 @@ The fixture covers `contentManager` (4 of 5 tools), `storageManager` (5 of 6 too | `ingestManager_*` (full agent) | IngestManager | ingest, listCapabilities | | App agents (webTools, composer) | apps/ | openWebpage, capturePagePdf, capturePagePng, captureToMarkdown, extractLinks, compose, listFormats | -The Task #8 eval run showed the LLM calling `searchManager_memory`, `canvasManager_list`, etc. 
— names that exist in production but were rejected as hallucinations by the harness because they are not in the fixture. After the Task #11 schema swap, this concern goes away: the LLM will only see `getTools`/`useTools` and discover available tools dynamically. +The Task #8 eval run showed the LLM calling `searchManager_memory`, `canvasManager_list`, etc. — names that exist in production but were rejected as hallucinations by the harness because they are not in the fixture. The Task #11 schema swap removes those names from the callable tool schema, but the production system prompt can still mention agent/tool catalog entries, so live meta evals must still treat direct `agent_tool` calls as prompt-leak or model-behavior failures rather than assuming they cannot happen. ## Fixture-side tools NOT in production @@ -185,9 +185,9 @@ None. Every fixture entry maps to a production tool class. The drifts above are The audit confirms the team-lead's framing: the harness fixture has substantial drift across 8 of 11 entries plus 6+ missing production tools, but Task #11's plan is to swap the entire `NEXUS_TOOLS` array for the two-tool surface (`getTools` + `useTools`). After the swap: -- The harness LLM only sees the two-tool MCP shape. +- The callable tool schema exposes only the two-tool MCP shape (`getTools` and `useTools`). - The executor parses the `useTools.tool` CLI string via the real `ToolCliNormalizer`. -- Drifts above stop mattering for the LLM-facing surface. +- Drifts above stop mattering for the callable function surface, but stale CLI examples and prompt catalog text can still bias models toward invalid command names. - They still matter for the executor: when it parses `content replace --path foo.md --start "..." --end "..." --content "..."` it must route to the production 4-field schema, not the obsolete 3-field one. This audit is the reference for getting that routing right. 
## Evidence Index diff --git a/tests/eval/fixtures/tools.ts b/tests/eval/fixtures/tools.ts index 2a1b6d8de..b68462ec4 100644 --- a/tests/eval/fixtures/tools.ts +++ b/tests/eval/fixtures/tools.ts @@ -55,10 +55,9 @@ export const NEXUS_TOOLS: Tool[] = [ properties: { path: { type: 'string', description: 'Path to the file to update' }, content: { type: 'string', description: 'Content to insert' }, - position: { type: 'string', description: 'Insertion position' }, - lineNumber: { type: 'number', description: 'Optional line number' }, + startLine: { type: 'number', description: 'Where to insert content: 1 prepends, -1 appends, any other value inserts before that line' }, }, - required: ['path', 'content', 'position'], + required: ['path', 'content', 'startLine'], }, }, }, @@ -71,10 +70,11 @@ export const NEXUS_TOOLS: Tool[] = [ type: 'object', properties: { path: { type: 'string', description: 'Path to the file to update' }, - search: { type: 'string', description: 'Text to find' }, - replace: { type: 'string', description: 'Replacement text' }, + start: { type: 'string', description: 'Opening anchor line or lines copied verbatim from the file' }, + end: { type: 'string', description: 'Closing anchor line or lines copied verbatim from the file' }, + content: { type: 'string', description: 'Replacement text for the anchored range' }, }, - required: ['path', 'search', 'replace'], + required: ['path', 'start', 'end', 'content'], }, }, }, @@ -87,9 +87,10 @@ export const NEXUS_TOOLS: Tool[] = [ type: 'object', properties: { path: { type: 'string', description: 'Current path of the file or folder' }, - destination: { type: 'string', description: 'Destination path' }, + newPath: { type: 'string', description: 'Destination path' }, + overwrite: { type: 'boolean', description: 'Overwrite if destination exists' }, }, - required: ['path', 'destination'], + required: ['path', 'newPath'], }, }, }, @@ -102,9 +103,10 @@ export const NEXUS_TOOLS: Tool[] = [ type: 'object', properties: { 
path: { type: 'string', description: 'Current path of the file or folder' }, - destination: { type: 'string', description: 'Destination path' }, + newPath: { type: 'string', description: 'Destination path' }, + overwrite: { type: 'boolean', description: 'Overwrite if destination exists' }, }, - required: ['path', 'destination'], + required: ['path', 'newPath'], }, }, }, @@ -144,9 +146,10 @@ export const NEXUS_TOOLS: Tool[] = [ parameters: { type: 'object', properties: { - path: { type: 'string', description: 'Path to the directory to list' }, + path: { type: 'string', description: 'Path to the directory to list; omit for vault root' }, + filter: { type: 'string', description: 'Optional glob-style filter' }, }, - required: ['path'], + required: [], }, }, }, @@ -160,6 +163,10 @@ export const NEXUS_TOOLS: Tool[] = [ properties: { query: { type: 'string', description: 'Search query text' }, limit: { type: 'number', description: 'Maximum number of results to return' }, + semantic: { type: 'boolean', description: 'Use semantic vector search when available' }, + includeContent: { type: 'boolean', description: 'Include matched content in results' }, + snippetLength: { type: 'number', description: 'Maximum snippet length' }, + paths: { type: 'array', items: { type: 'string' }, description: 'Folder paths or glob patterns to search within' }, }, required: ['query'], }, @@ -175,9 +182,14 @@ export const NEXUS_TOOLS: Tool[] = [ properties: { query: { type: 'string', description: 'Directory search query text' }, paths: { type: 'array', items: { type: 'string' }, description: 'Paths to search within' }, - searchType: { type: 'string', description: 'Search type filter' }, + searchType: { type: 'string', enum: ['files', 'folders', 'both'], description: 'Search type filter' }, + fileTypes: { type: 'array', items: { type: 'string' }, description: 'File extensions to include' }, + depth: { type: 'number', description: 'Maximum directory depth' }, + pattern: { type: 'string', 
description: 'Optional regex pattern filter' }, + limit: { type: 'number', description: 'Maximum number of results' }, + includeContent: { type: 'boolean', description: 'Include content snippets in results' }, }, - required: ['query'], + required: ['query', 'paths'], }, }, }, diff --git a/tests/eval/scenarios/adversarial.eval.yaml b/tests/eval/scenarios/adversarial.eval.yaml index 0fec4250f..285ad1929 100644 --- a/tests/eval/scenarios/adversarial.eval.yaml +++ b/tests/eval/scenarios/adversarial.eval.yaml @@ -14,7 +14,7 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes by content keyword command: "search content" usage: "search content " @@ -30,7 +30,7 @@ success: true result: results: - - tool: searchManager_searchContent + - tool: searchManager_content success: true result: results: @@ -38,7 +38,7 @@ score: 0.88 snippet: "Project overview and status..." totalResults: 1 - searchManager_searchContent: + searchManager_content: success: true result: results: diff --git a/tests/eval/scenarios/basic-tool-call.eval.yaml b/tests/eval/scenarios/basic-tool-call.eval.yaml index 2acf9399c..74bfda02d 100644 --- a/tests/eval/scenarios/basic-tool-call.eval.yaml +++ b/tests/eval/scenarios/basic-tool-call.eval.yaml @@ -120,7 +120,7 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes by content keyword command: "search content" usage: "search content " @@ -142,7 +142,7 @@ success: true result: results: - - tool: searchManager_searchContent + - tool: searchManager_content success: true result: results: @@ -151,7 +151,7 @@ snippet: "Q2 roadmap priorities..." 
totalResults: 1 query: project roadmap - searchManager_searchContent: + searchManager_content: success: true result: results: diff --git a/tests/eval/scenarios/content-operations.eval.yaml b/tests/eval/scenarios/content-operations.eval.yaml index e02730d18..374b606c9 100644 --- a/tests/eval/scenarios/content-operations.eval.yaml +++ b/tests/eval/scenarios/content-operations.eval.yaml @@ -136,7 +136,7 @@ tool: insert description: Insert content at a specific position command: "content insert" - usage: "content insert [--line-number ]" + usage: "content insert " arguments: - name: path flag: --path @@ -148,18 +148,13 @@ type: string required: true positional: true - - name: position - flag: --position - type: string + - name: startLine + flag: --start-line + type: number required: true positional: true - - name: lineNumber - flag: --line-number - type: number - required: false - positional: false examples: - - 'content insert "path-value" "content-value" "position-value"' + - 'content insert "path-value" "content-value" 1' - expectedTools: - name: useTools @@ -198,25 +193,30 @@ tool: replace description: Replace or delete content in a file command: "content replace" - usage: "content replace " + usage: "content replace " arguments: - name: path flag: --path type: string required: true positional: true - - name: search - flag: --search + - name: start + flag: --start type: string required: true positional: true - - name: replace - flag: --replace + - name: end + flag: --end + type: string + required: true + positional: true + - name: content + flag: --content type: string required: true positional: true examples: - - 'content replace "path-value" "search-value" "replace-value"' + - 'content replace "path-value" "start-value" "end-value" "content-value"' - expectedTools: - name: useTools diff --git a/tests/eval/scenarios/debug-multi-turn.eval.yaml b/tests/eval/scenarios/debug-multi-turn.eval.yaml index 14a1c2dc5..0456cf208 100644 --- 
a/tests/eval/scenarios/debug-multi-turn.eval.yaml +++ b/tests/eval/scenarios/debug-multi-turn.eval.yaml @@ -55,10 +55,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -66,7 +66,7 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools @@ -76,7 +76,7 @@ result: results: - agent: searchManager - tool: searchContent + tool: content success: true data: results: diff --git a/tests/eval/scenarios/multi-turn.eval.yaml b/tests/eval/scenarios/multi-turn.eval.yaml index 38c92ad5a..8d266fc14 100644 --- a/tests/eval/scenarios/multi-turn.eval.yaml +++ b/tests/eval/scenarios/multi-turn.eval.yaml @@ -98,20 +98,20 @@ tool: move description: Move a file to a new location command: "storage move" - usage: "storage move " + usage: "storage move " arguments: - name: path flag: --path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 'storage move "path-value" "destination-value"' + - 'storage move "path-value" "newPath-value"' useTools: success: true result: @@ -141,7 +141,7 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes by content keyword command: "search content" usage: "search content " @@ -163,13 +163,13 @@ success: true result: results: - - tool: searchManager_searchContent + - tool: searchManager_content success: true result: results: - path: notes/roadmap-q2.md score: 0.95 - searchManager_searchContent: + searchManager_content: success: true result: results: diff --git a/tests/eval/scenarios/provider-parity.eval.yaml 
b/tests/eval/scenarios/provider-parity.eval.yaml index d63d20695..bec9f794b 100644 --- a/tests/eval/scenarios/provider-parity.eval.yaml +++ b/tests/eval/scenarios/provider-parity.eval.yaml @@ -67,7 +67,7 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes by content keyword command: "search content" usage: "search content " @@ -89,7 +89,7 @@ success: true result: results: - - tool: searchManager_searchContent + - tool: searchManager_content success: true result: results: @@ -97,7 +97,7 @@ score: 0.92 snippet: "Meeting notes..." totalResults: 1 - searchManager_searchContent: + searchManager_content: success: true result: results: diff --git a/tests/eval/scenarios/search-variations.eval.yaml b/tests/eval/scenarios/search-variations.eval.yaml index 20fb0df4b..d66641f5b 100644 --- a/tests/eval/scenarios/search-variations.eval.yaml +++ b/tests/eval/scenarios/search-variations.eval.yaml @@ -2,7 +2,7 @@ # for different flavors of "find/search/list" user requests. 
- name: search-by-content-keyword - description: User asks to find notes about a topic — should use searchManager_searchContent + description: User asks to find notes about a topic — should use searchManager_content toolSet: meta turns: - userMessage: "Find all my notes about machine learning" @@ -16,10 +16,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content [--limit ]" + command: "search content" + usage: "search content [--limit ]" arguments: - name: query flag: --query @@ -32,18 +32,18 @@ required: false positional: false examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "search search-content" + tool: "search content" mockResponses: useTools: success: true result: results: - - tool: searchContent + - tool: content success: true data: results: @@ -70,7 +70,7 @@ tool: list description: List files and folders in a directory command: "storage list" - usage: "storage list " + usage: "storage list [--path ]" arguments: - name: path flag: --path @@ -114,10 +114,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -125,7 +125,7 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - agent: contentManager tool: read description: Read content from a file @@ -148,13 +148,13 @@ - expectedTools: - name: useTools params: - tool: "search search-content" + tool: "search content" mockResponses: useTools: success: true result: results: - - tool: searchContent + - tool: content success: true data: results: @@ -176,7 +176,7 @@ 
content: "# Q2 Budget\n\nTotal: $1.2M\n- Engineering: $800K\n- Marketing: $400K" - name: directory-search-not-content-search - description: User asks to find a specific file by name — should use searchManager_searchDirectory not searchContent + description: User asks to find a specific file by name — should use searchManager_directory not searchContent toolSet: meta turns: - userMessage: "Where is the file called meeting-notes.md?" @@ -190,10 +190,10 @@ result: tools: - agent: searchManager - tool: searchDirectory + tool: directory description: Search for files and folders by name pattern - command: "search search-directory" - usage: "search search-directory " + command: "search directory" + usage: "search directory " arguments: - name: query flag: --query @@ -201,12 +201,12 @@ required: true positional: true examples: - - 'search search-directory "query-value"' + - 'search directory "query-value"' - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -214,18 +214,18 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "search search-directory" + tool: "search directory" mockResponses: useTools: success: true result: results: - - tool: searchDirectory + - tool: directory success: true data: results: diff --git a/tests/eval/scenarios/storage-operations.eval.yaml b/tests/eval/scenarios/storage-operations.eval.yaml index ee6d370ed..586b179ac 100644 --- a/tests/eval/scenarios/storage-operations.eval.yaml +++ b/tests/eval/scenarios/storage-operations.eval.yaml @@ -18,20 +18,20 @@ tool: move description: Move a file or folder command: "storage move" - usage: "storage move " + usage: "storage move " arguments: - name: path flag: 
--path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 'storage move "path-value" "destination-value"' + - 'storage move "path-value" "newPath-value"' - expectedTools: - name: useTools @@ -103,20 +103,20 @@ tool: copy description: Copy a file or folder command: "storage copy" - usage: "storage copy " + usage: "storage copy " arguments: - name: path flag: --path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 'storage copy "path-value" "destination-value"' + - 'storage copy "path-value" "newPath-value"' - expectedTools: - name: useTools diff --git a/tests/eval/scenarios/system-prompt.eval.yaml b/tests/eval/scenarios/system-prompt.eval.yaml index 6c25816af..2241def78 100644 --- a/tests/eval/scenarios/system-prompt.eval.yaml +++ b/tests/eval/scenarios/system-prompt.eval.yaml @@ -109,7 +109,7 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes by content keyword command: "search content" usage: "search content " @@ -131,14 +131,14 @@ success: true result: results: - - tool: searchManager_searchContent + - tool: searchManager_content success: true result: results: - path: notes/q2-review.md score: 0.91 totalResults: 1 - searchManager_searchContent: + searchManager_content: success: true result: results: diff --git a/tests/eval/scenarios/tool-discovery.eval.yaml b/tests/eval/scenarios/tool-discovery.eval.yaml index be57fd864..df147df74 100644 --- a/tests/eval/scenarios/tool-discovery.eval.yaml +++ b/tests/eval/scenarios/tool-discovery.eval.yaml @@ -60,10 +60,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search 
search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -71,18 +71,18 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "search search-content" + tool: "search content" mockResponses: useTools: success: true result: results: - - tool: searchContent + - tool: content success: true data: results: @@ -153,20 +153,20 @@ tool: copy description: Copy a file or folder command: "storage copy" - usage: "storage copy " + usage: "storage copy " arguments: - name: path flag: --path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 'storage copy "path-value" "destination-value"' + - 'storage copy "path-value" "newPath-value"' - expectedTools: - name: useTools @@ -198,10 +198,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes - command: "search search-content" - usage: "search search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -209,18 +209,18 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "search search-content" + tool: "search content" mockResponses: useTools: success: true result: results: - - tool: searchContent + - tool: content success: true data: results: @@ -286,20 +286,20 @@ tool: move description: Move a file or folder command: "storage move" - usage: "storage move " + usage: "storage move " arguments: - name: path flag: --path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 
'storage move "path-value" "destination-value"' + - 'storage move "path-value" "newPath-value"' - expectedTools: - name: useTools diff --git a/tests/eval/scenarios/vague-prompts.eval.yaml b/tests/eval/scenarios/vague-prompts.eval.yaml index fb218b545..b49d78c3b 100644 --- a/tests/eval/scenarios/vague-prompts.eval.yaml +++ b/tests/eval/scenarios/vague-prompts.eval.yaml @@ -16,10 +16,10 @@ result: tools: - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content " + command: "search content" + usage: "search content " arguments: - name: query flag: --query @@ -27,18 +27,18 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "search search-content" + tool: "search content" mockResponses: useTools: success: true result: results: - - tool: searchContent + - tool: content success: true data: results: @@ -109,7 +109,7 @@ tool: list description: List files and folders command: "storage list" - usage: "storage list " + usage: "storage list [--path ]" arguments: - name: path flag: --path @@ -122,20 +122,20 @@ tool: move description: Move a file or folder command: "storage move" - usage: "storage move " + usage: "storage move " arguments: - name: path flag: --path type: string required: true positional: true - - name: destination - flag: --destination + - name: newPath + flag: --new-path type: string required: true positional: true examples: - - 'storage move "path-value" "destination-value"' + - 'storage move "path-value" "newPath-value"' - expectedTools: - name: useTools @@ -204,10 +204,10 @@ examples: - 'content read "path-value" 1' - agent: searchManager - tool: searchContent + tool: content description: Search for notes containing specific content - command: "search search-content" - usage: "search search-content " + command: 
"search content" + usage: "search content " arguments: - name: query flag: --query @@ -215,12 +215,12 @@ required: true positional: true examples: - - 'search search-content "query-value"' + - 'search content "query-value"' - expectedTools: - name: useTools params: - tool: "content read, search search-content" + tool: "content read, search content" mockResponses: useTools: success: true @@ -230,7 +230,7 @@ success: true data: content: "# Todo\n- Fix login bug\n- Update docs" - - tool: searchContent + - tool: content success: true data: results: