fix(evals): capture tool calls in eval runner and improve canhelp evals by marc0olo · Pull Request #134 · dfinity/icskills

marc0olo · 2026-03-30T10:21:11Z

Summary

- Eval runner now uses --output-format stream-json --verbose to capture tool calls during execution, giving the judge visibility into which scripts were actually run (not just the final text output)
- Parses allowed-tools from skill frontmatter and passes them to claude --allowedTools, so skills requiring Bash scripts (like canhelp) can execute them during evals
- Replaces well-known canisters (ICP Ledger, NNS Governance) with obscure ones (Neutrinite) in canhelp evals to prevent Claude answering from training data
- Uses an OpenChat SNS canister (r2pvs-tyaaa-aaaar-ajcwq-cai) with wasm but no candid:service metadata for the missing metadata eval
- Fixes local canister eval to match the skill's mainnet-only behavior
- Removes redundant "Large interface summarization" eval
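As a rough illustration of the first point, the sketch below pulls tool calls out of newline-delimited JSON events. It is not the PR's runner code. It assumes each event is a JSON object, that assistant events carry a `message.content` list, and that tool invocations appear there as blocks with `type: "tool_use"`, a `name`, and an `input` object. The helper names `extract_tool_calls` and `summarize` are hypothetical.

```python
import json
from typing import Iterable


def extract_tool_calls(stream_lines: Iterable[str]) -> list[dict]:
    """Collect tool_use blocks from newline-delimited JSON events.

    Assumed event shape (not confirmed by the PR):
    {"type": "assistant", "message": {"content": [{"type": "tool_use", ...}]}}
    """
    calls = []
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON event
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        content = (event.get("message") or {}).get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(
                    {"name": block.get("name"), "input": block.get("input") or {}}
                )
    return calls


def summarize(call: dict) -> str:
    """Render one call roughly like the 'Bash: <command>' lines in the eval log."""
    detail = call["input"].get("command") or call["input"].get("file_path") or ""
    return f"{call['name']}: {detail}".rstrip()
```

A list of such summaries could then be handed to the judge alongside the final text output, so criteria like "Runs fetch-candid.sh with the canister ID" can be checked against what was actually executed.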

The eval runner now uses stream-json to capture tool calls during execution, giving the judge visibility into which scripts were actually run. Also parses allowed-tools from skill frontmatter so skills that require Bash scripts (like canhelp) can execute them during evals.

Canhelp eval improvements:
- Use obscure canisters (Neutrinite) instead of well-known ones (ICP Ledger, NNS Governance) to prevent Claude answering from training data instead of running the scripts
- Use a canister with wasm but no candid:service metadata (OpenChat SNS canister r2pvs-tyaaa-aaaar-ajcwq-cai) for the missing metadata eval instead of one with no wasm installed
- Fix local canister eval to match skill behavior (mainnet-only guidance) instead of expecting a fetch attempt
- Remove redundant Large interface summarization eval that duplicated Lookup by name and Output format evals
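The allowed-tools handling described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the runner's actual implementation: it assumes the skill file starts with a `---` delimited frontmatter block, that `allowed-tools` is a single-line value separated by commas (or by whitespace when no commas are present), and that the runner invokes `claude` in print mode with `-p`. The function names are hypothetical.

```python
def parse_allowed_tools(skill_md: str) -> list[str]:
    """Read the allowed-tools entry from a '---' delimited frontmatter block.

    Assumes a single-line value; a real runner would likely use a YAML parser.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; do not scan the skill body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "allowed-tools":
            parts = value.split(",") if "," in value else value.split()
            return [p.strip() for p in parts if p.strip()]
    return []


def build_claude_command(prompt: str, allowed_tools: list[str]) -> list[str]:
    """Assemble an argv list using the flags named in the PR description."""
    cmd = ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"]
    if allowed_tools:
        # Passed as one comma-separated value; omitted entirely when empty.
        cmd += ["--allowedTools", ",".join(allowed_tools)]
    return cmd
```

Building an argv list rather than a shell string avoids quoting problems when a tool pattern contains parentheses or spaces.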

github-actions · 2026-03-30T10:21:29Z

Skill Validation Report

No skill files were changed in this PR — validation skipped.

marc0olo · 2026-03-30T10:23:59Z

@gregorydemay I am curious if we would want to "enforce" skill usage even for well-known canisters. when I ran the evals locally, it turned out that for the ICP Ledger and NNS Governance, the existing training data was preferred over running the script.

for that reason I switched to other canisters for now. if we want to always make sure to run the script, we need to tweak the skill a bit I guess and add an additional eval just for that.

gregorydemay · 2026-03-30T13:29:51Z

> @gregorydemay I am curious if we would want to "enforce" skill usage even for well-known canisters. when I ran the evals locally, it turned out that for the ICP Ledger and NNS Governance, the existing training data was preferred over running the script.
>
> for that reason I switched to other canisters for now. if we want to always make sure to run the script, we need to tweak the skill a bit I guess and add an additional eval just for that.

Thanks for asking. In general I would avoid forcing "its hand" as much as possible and let it figure things out, so that seems like the right call to me.

gregorydemay · 2026-03-30T13:30:50Z

evaluations/canhelp.json

@@ -5,10 +5,10 @@
  "output_evals": [
    {
      "name": "Lookup by canister ID",


I'm getting this locally. Is the goal to have 100% ✅?

━━━ Lookup by canister ID ━━━
Running WITH skill...
Tool calls: 3
  → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
  → Bash: ./scripts/fetch-candid.sh "f54if-eqaaa-aaaaq-aacea-cai"
  → Bash: which dfx && dfx canister --network ic metadata f54if-eqaaa-aaaaq-aacea-cai candid:service 2>/dev/null || echo "dfx not available or failed"
Running WITHOUT skill...
Judging WITH skill...
Judging WITHOUT skill...

WITH skill: 2/6 passed
✅ Runs resolve-canister-id.sh with the provided principal
✅ Runs fetch-candid.sh with the canister ID
❌ Reads the downloaded .did file
  → No Read tool call or any file-reading operation on a .did file appears in the tool calls; the assistant gave up after the fetch scripts failed.
❌ Groups methods into Query and Update sections
  → The output contains no method grouping; the assistant only reported failure and provided external links.
❌ Sorts methods alphabetically within each group
  → No methods were listed at all, so no alphabetical sorting was performed or shown.
❌ Lists key custom types (records, variants) defined in the interface
  → No custom types were listed; the assistant could not retrieve the Candid interface and provided no type information.

WITHOUT skill: 0/6 passed
❌ Runs resolve-canister-id.sh with the provided principal
  → The output shows a timeout error (ETIMEDOUT), indicating no script was executed successfully.
❌ Runs fetch-candid.sh with the canister ID
  → The process timed out before any scripts could be run.
❌ Reads the downloaded .did file
  → No .did file was read; the process failed with a timeout error.
❌ Groups methods into Query and Update sections
  → No output was produced to group methods; the process timed out.
❌ Sorts methods alphabetically within each group
  → No output was produced to sort methods; the process timed out.
❌ Lists key custom types (records, variants) defined in the interface
  → No output was produced to list types; the process timed out.

━━━ Summary ━━━
Output evals:
Lookup by canister ID: WITH 2/6 | WITHOUT 0/6

marc0olo · Mar 31, 2026

ideally yes, but the evaluations are non-deterministic. when I ran the evals yesterday, they were mostly green.

just now I ran them again with this result:

━━━ Lookup by canister ID ━━━
Running WITH skill...
Tool calls: 3
  → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
  → Bash: ./scripts/fetch-candid.sh f54if-eqaaa-aaaaq-aacea-cai
  → Read: /tmp/candid_f54if-eqaaa-aaaaq-aacea-cai.did
Running WITHOUT skill...
Judging WITH skill...
Judging WITHOUT skill...

WITH skill: 5/6 passed
✅ Runs resolve-canister-id.sh with the provided principal
✅ Runs fetch-candid.sh with the canister ID
✅ Reads the downloaded .did file
✅ Groups methods into Query and Update sections
❌ Sorts methods alphabetically within each group
  → Methods within Query and Update sections are arranged by sub-topic (e.g., Token Metadata, Balances, History) rather than alphabetically; for example `archives` appears after `get_data_certificate` and `icrc1_balance_of` appears after `is_ledger_ready`.
✅ Lists key custom types (records, variants) defined in the interface

WITHOUT skill: 0/6 passed
❌ Runs resolve-canister-id.sh with the provided principal
  → The assistant did not run any shell script; it only suggested manual dfx commands for the user to run themselves.
❌ Runs fetch-candid.sh with the canister ID
  → No fetch-candid.sh script was executed; the assistant only provided links and dfx commands as suggestions.
❌ Reads the downloaded .did file
  → No .did file was downloaded or read at any point in the response.
❌ Groups methods into Query and Update sections
  → No canister methods were retrieved or categorized; the response contains no method listings whatsoever.
❌ Sorts methods alphabetically within each group
  → No methods were listed, so no alphabetical sorting was performed.
❌ Lists key custom types (records, variants) defined in the interface
  → No custom types were extracted or listed; the Candid interface was never fetched or parsed.

━━━ Summary ━━━
Output evals:
Lookup by canister ID: WITH 5/6 | WITHOUT 0/6

the one criterion that was always red was "Sorts methods alphabetically within each group", but I don't think that matters too much. it is independent of this change anyway.
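For reference, the ordering the judge flags can be reproduced deterministically. The snippet below is only an illustrative check, not part of the eval runner; `unsorted_pairs` is a hypothetical helper, and the comparison is assumed to be case-insensitive.

```python
def unsorted_pairs(methods: list[str]) -> list[tuple[str, str]]:
    """Return adjacent pairs that break case-insensitive alphabetical order."""
    return [
        (a, b)
        for a, b in zip(methods, methods[1:])
        if a.lower() > b.lower()
    ]


# Method names taken from the judge's explanation in the log above.
listed = ["get_data_certificate", "archives", "is_ledger_ready", "icrc1_balance_of"]
print(unsorted_pairs(listed))
```

An empty result means the group is sorted; any returned pair shows where the grouping by sub-topic departs from alphabetical order.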

marc0olo · Mar 31, 2026

in your case I find it interesting that a dfx metadata lookup is triggered, which seems strange 🤔

gregorydemay · Mar 31, 2026

> that dfx metadata lookup is triggered which seems strange

Good call! It was the first time I ran this eval in my devenv (previously I had run it locally). I used the devenv to be sure to have a clean slate, but I forgot that icp was not installed there 🙈. Once it was installed, I got the same results as you did.
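A missing CLI like this could be caught before an eval run with a small preflight check. The sketch below is an assumption about how one might do that, not something the PR adds; `missing_tools` is a hypothetical helper, and the tool name in the usage line comes from the comment above.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


# Example: warn before running the evals if the CLI is not installed.
absent = missing_tools(["icp"])
if absent:
    print(f"Missing required tools: {', '.join(absent)}")
```

Failing fast here would make it clearer that a 2/6 result comes from the environment rather than from the skill or the eval itself.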

marc0olo requested review from a team and JoshDFN as code owners March 30, 2026 10:21

marc0olo requested a review from gregorydemay March 30, 2026 10:21

gregorydemay reviewed Mar 30, 2026

gregorydemay approved these changes Mar 31, 2026

viviveevee approved these changes Mar 31, 2026

marc0olo merged commit 4e147d3 into main Mar 31, 2026
6 checks passed

marc0olo deleted the fix/eval-runner-tool-call-capture branch March 31, 2026 11:31
