
fix(evals): capture tool calls in eval runner and improve canhelp evals#134

Merged
marc0olo merged 1 commit into main from fix/eval-runner-tool-call-capture
Mar 31, 2026


@marc0olo
Member

Summary

  • Eval runner now uses --output-format stream-json --verbose to capture tool calls during execution, giving the judge visibility into which scripts were actually run (not just the final text output)
  • Parses allowed-tools from skill frontmatter and passes them to claude --allowedTools, so skills requiring Bash scripts (like canhelp) can execute them during evals
  • Replaces well-known canisters (ICP Ledger, NNS Governance) with obscure ones (Neutrinite) in canhelp evals to prevent Claude answering from training data
  • Uses an OpenChat SNS canister (r2pvs-tyaaa-aaaar-ajcwq-cai) with wasm but no candid:service metadata for the missing metadata eval
  • Fixes local canister eval to match the skill's mainnet-only behavior
  • Removes redundant "Large interface summarization" eval
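
As a rough illustration of the tool-call capture described above, here is a minimal Python sketch. It is not the runner's actual code. It assumes that `stream-json` output is newline-delimited JSON in which assistant events carry a `message.content` list containing `tool_use` blocks with `name` and `input` fields; the function names `extract_tool_calls` and `format_tool_calls` are hypothetical.

```python
import json


def extract_tool_calls(stream_output: str) -> list[dict]:
    """Collect tool_use blocks from newline-delimited JSON events.

    Assumed event shape (not confirmed by this PR):
    {"type": "assistant", "message": {"content": [{"type": "tool_use", ...}]}}
    """
    calls = []
    for line in stream_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        content = event.get("message", {}).get("content", [])
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(
                    {"name": block.get("name"), "input": block.get("input") or {}}
                )
    return calls


def format_tool_calls(calls: list[dict]) -> str:
    """Render the calls as a short summary that a judge prompt could include."""
    lines = [f"Tool calls: {len(calls)}"]
    for call in calls:
        tool_input = call["input"]
        # Prefer the most informative field; fall back to the raw input.
        detail = (
            tool_input.get("command")
            or tool_input.get("file_path")
            or json.dumps(tool_input)
        )
        lines.append(f"  → {call['name']}: {detail}")
    return "\n".join(lines)
```

Handing the judge the formatted list alongside the final text lets it verify claims such as "ran fetch-candid.sh" against what was actually executed.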

The eval runner now uses stream-json to capture tool calls during
execution, giving the judge visibility into which scripts were actually
run. Also parses allowed-tools from skill frontmatter so skills that
require Bash scripts (like canhelp) can execute them during evals.
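
The frontmatter handling could look roughly like the sketch below. It is an illustration under stated assumptions, not the runner's implementation: it assumes the skill file starts with a `---` delimited frontmatter block and that `allowed-tools` is a single-line, comma-separated value. The helpers `parse_allowed_tools` and `build_claude_command` are hypothetical names.

```python
def parse_allowed_tools(skill_md: str) -> list[str]:
    """Read a single-line `allowed-tools` entry from the frontmatter.

    Assumes a comma-separated value; YAML list syntax is not handled here.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore the document body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "allowed-tools":
            return [tool.strip() for tool in value.split(",") if tool.strip()]
    return []


def build_claude_command(prompt: str, allowed_tools: list[str]) -> list[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    cmd = ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"]
    if allowed_tools:
        cmd += ["--allowedTools", ",".join(allowed_tools)]
    return cmd
```

Building the command as an argument list, for example for `subprocess.run`, avoids shell-escaping problems when the prompt contains quotes.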

Canhelp eval improvements:
- Use obscure canisters (Neutrinite) instead of well-known ones
  (ICP Ledger, NNS Governance) to prevent Claude answering from
  training data instead of running the scripts
- Use a canister with wasm but no candid:service metadata (OpenChat
  SNS canister r2pvs-tyaaa-aaaar-ajcwq-cai) for the missing metadata
  eval instead of one with no wasm installed
- Fix local canister eval to match skill behavior (mainnet-only
  guidance) instead of expecting a fetch attempt
- Remove redundant Large interface summarization eval that duplicated
  Lookup by name and Output format evals
@marc0olo marc0olo requested review from a team and JoshDFN as code owners March 30, 2026 10:21
@marc0olo marc0olo requested a review from gregorydemay March 30, 2026 10:21
@github-actions

Skill Validation Report

No skill files were changed in this PR — validation skipped.

@marc0olo
Member Author

@gregorydemay I am curious if we would want to "enforce" skill usage even for well-known canisters. when I ran the evals locally, it turned out that for the ICP Ledger and NNS Governance, the existing training data was preferred over running the script.

for that reason I switched to other canisters for now. if we want to always make sure to run the script, we need to tweak the skill a bit I guess and add an additional eval just for that.

@gregorydemay
Collaborator

> @gregorydemay I am curious if we would want to "enforce" skill usage even for well-known canisters. when I ran the evals locally, it turned out that for the ICP Ledger and NNS Governance, the existing training data was preferred over running the script.
>
> for that reason I switched to other canisters for now. if we want to always make sure to run the script, we need to tweak the skill a bit I guess and add an additional eval just for that.

Thanks for asking. In general I would avoid forcing "its hand" as much as possible and let it figure things out, so that seems like the right call to me.

@@ -5,10 +5,10 @@
"output_evals": [
{
"name": "Lookup by canister ID",
Collaborator

I'm getting this locally. Is the goal to have 100% ✅?

━━━ Lookup by canister ID ━━━

  Running WITH skill...
  Tool calls: 3
    → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: ./scripts/fetch-candid.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: which dfx && dfx canister --network ic metadata f54if-eqaaa-aaaaq-aacea-cai candid:service 2>/dev/null || echo "dfx not available or failed"
  Running WITHOUT skill...
  Judging WITH skill...
  Judging WITHOUT skill...

  WITH skill: 2/6 passed
    ✅ Runs resolve-canister-id.sh with the provided principal
    ✅ Runs fetch-candid.sh with the canister ID
    ❌ Reads the downloaded .did file
       → No Read tool call or any file-reading operation on a .did file appears in the tool calls; the assistant gave up after the fetch scripts failed.
    ❌ Groups methods into Query and Update sections
       → The output contains no method grouping; the assistant only reported failure and provided external links.
    ❌ Sorts methods alphabetically within each group
       → No methods were listed at all, so no alphabetical sorting was performed or shown.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No custom types were listed; the assistant could not retrieve the Candid interface and provided no type information.

  WITHOUT skill: 0/6 passed
    ❌ Runs resolve-canister-id.sh with the provided principal
       → The output shows a timeout error (ETIMEDOUT), indicating no script was executed successfully.
    ❌ Runs fetch-candid.sh with the canister ID
       → The process timed out before any scripts could be run.
    ❌ Reads the downloaded .did file
       → No .did file was read; the process failed with a timeout error.
    ❌ Groups methods into Query and Update sections
       → No output was produced to group methods; the process timed out.
    ❌ Sorts methods alphabetically within each group
       → No output was produced to sort methods; the process timed out.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No output was produced to list types; the process timed out.

━━━ Summary ━━━

  Output evals:
    Lookup by canister ID: WITH 2/6 | WITHOUT 0/6

Member Author

ideally yes, but the evaluations are non-deterministic. when I ran the evals yesterday, they were mostly green.

just now I ran them again with this result:

━━━ Lookup by canister ID ━━━

  Running WITH skill...
  Tool calls: 3
    → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: ./scripts/fetch-candid.sh f54if-eqaaa-aaaaq-aacea-cai
    → Read: /tmp/candid_f54if-eqaaa-aaaaq-aacea-cai.did
  Running WITHOUT skill...
  Judging WITH skill...
  Judging WITHOUT skill...

  WITH skill: 5/6 passed
    ✅ Runs resolve-canister-id.sh with the provided principal
    ✅ Runs fetch-candid.sh with the canister ID
    ✅ Reads the downloaded .did file
    ✅ Groups methods into Query and Update sections
    ❌ Sorts methods alphabetically within each group
       → Methods within Query and Update sections are arranged by sub-topic (e.g., Token Metadata, Balances, History) rather than alphabetically; for example `archives` appears after `get_data_certificate` and `icrc1_balance_of` appears after `is_ledger_ready`.
    ✅ Lists key custom types (records, variants) defined in the interface

  WITHOUT skill: 0/6 passed
    ❌ Runs resolve-canister-id.sh with the provided principal
       → The assistant did not run any shell script; it only suggested manual dfx commands for the user to run themselves.
    ❌ Runs fetch-candid.sh with the canister ID
       → No fetch-candid.sh script was executed; the assistant only provided links and dfx commands as suggestions.
    ❌ Reads the downloaded .did file
       → No .did file was downloaded or read at any point in the response.
    ❌ Groups methods into Query and Update sections
       → No canister methods were retrieved or categorized; the response contains no method listings whatsoever.
    ❌ Sorts methods alphabetically within each group
       → No methods were listed, so no alphabetical sorting was performed.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No custom types were extracted or listed; the Candid interface was never fetched or parsed.

━━━ Summary ━━━

  Output evals:
    Lookup by canister ID: WITH 5/6 | WITHOUT 0/6

the one check that was always red was "Sorts methods alphabetically within each group", but I don't think that matters too much. it is independent of this change anyway.
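
Because results vary between runs, one way to separate flaky checks from consistently failing ones is to tally per-check pass rates over several runs. The sketch below is a hypothetical helper, not part of the eval runner. It uses the WITH-skill results of the two runs pasted in this thread as input.

```python
from collections import defaultdict


def pass_rates(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Return the fraction of runs in which each check passed."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for run in runs:
        for check, passed in run.items():
            totals[check] += 1
            passes[check] += int(passed)
    return {check: passes[check] / totals[check] for check in totals}


CHECKS = [
    "Runs resolve-canister-id.sh with the provided principal",
    "Runs fetch-candid.sh with the canister ID",
    "Reads the downloaded .did file",
    "Groups methods into Query and Update sections",
    "Sorts methods alphabetically within each group",
    "Lists key custom types (records, variants) defined in the interface",
]

# WITH-skill outcomes from the two runs shown in this thread (2/6 and 5/6).
run_a = dict(zip(CHECKS, [True, True, False, False, False, False]))
run_b = dict(zip(CHECKS, [True, True, True, True, False, True]))

for check, rate in pass_rates([run_a, run_b]).items():
    print(f"{rate:.0%}  {check}")
```

Across these two runs, the two script checks pass every time and the alphabetical-sorting check never does, which matches the observation that it is the one consistently red item.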

Member Author

in your case I find it interesting that a dfx metadata lookup is triggered, which seems strange 🤔

Collaborator

> that dfx metadata lookup is triggered which seems strange

Good call! It was the first time I ran this eval in my devenv (previously I had tried locally) to be sure of a clean slate, but I forgot that icp was not installed 🙈. Once it was installed, I got the same results as you did.

@marc0olo marc0olo merged commit 4e147d3 into main Mar 31, 2026
6 checks passed
@marc0olo marc0olo deleted the fix/eval-runner-tool-call-capture branch March 31, 2026 11:31
3 participants