fix(evals): capture tool calls in eval runner and improve canhelp evals #134
The eval runner now uses stream-json to capture tool calls during execution, giving the judge visibility into which scripts were actually run. It also parses allowed-tools from skill frontmatter so that skills that require Bash scripts (like canhelp) can execute them during evals.

Canhelp eval improvements:
- Use obscure canisters (Neutrinite) instead of well-known ones (ICP Ledger, NNS Governance) to prevent Claude from answering from training data instead of running the scripts
- Use a canister with wasm but no candid:service metadata (OpenChat SNS canister r2pvs-tyaaa-aaaar-ajcwq-cai) for the missing metadata eval instead of one with no wasm installed
- Fix the local canister eval to match skill behavior (mainnet-only guidance) instead of expecting a fetch attempt
- Remove the redundant "Large interface summarization" eval, which duplicated the "Lookup by name" and "Output format" evals
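As a rough illustration of the tool-call capture described above, the sketch below collects tool calls from newline-delimited JSON output. It is not the runner's actual code. The event shape (an "assistant" event whose message.content list holds "tool_use" blocks with a name and an input) is an assumption about the stream-json format, and the function names extract_tool_calls and format_tool_calls are hypothetical.

```python
import json


def extract_tool_calls(stream_output: str) -> list[dict]:
    """Collect tool_use blocks from newline-delimited JSON events.

    Assumed event shape (not confirmed by this PR):
    {"type": "assistant", "message": {"content": [{"type": "tool_use", ...}]}}
    """
    calls = []
    for line in stream_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message") or {}
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(
                    {"name": block.get("name"), "input": block.get("input") or {}}
                )
    return calls


def format_tool_calls(calls: list[dict]) -> str:
    """Render the calls in the same style as the eval log lines."""
    lines = [f"Tool calls: {len(calls)}"]
    for call in calls:
        tool_input = call["input"]
        # Bash inputs are assumed to carry "command", Read inputs "file_path".
        detail = (
            tool_input.get("command")
            or tool_input.get("file_path")
            or json.dumps(tool_input)
        )
        lines.append(f"→ {call['name']}: {detail}")
    return "\n".join(lines)
```

Handing the formatted list to the judge alongside the final text lets it check expectations such as "Runs fetch-candid.sh with the canister ID" against what was actually executed.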
Skill Validation Report
No skill files were changed in this PR; validation skipped.
@gregorydemay I am curious whether we would want to "enforce" skill usage even for well-known canisters. When I ran the evals locally, it turned out that for the ICP Ledger and NNS Governance, the existing training data was preferred over running the script. For that reason I switched to other canisters for now. If we want to make sure the script always runs, I guess we need to tweak the skill a bit and add an additional eval just for that.
Thanks for asking. In general I would avoid, as much as possible, trying to force "its hand" and instead let it figure things out, so that seems like the right call to me.
@@ -5,10 +5,10 @@
"output_evals": [
{
"name": "Lookup by canister ID",
I'm getting this locally. Is the goal to have 100% ✅?
━━━ Lookup by canister ID ━━━
Running WITH skill...
Tool calls: 3
→ Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
→ Bash: ./scripts/fetch-candid.sh "f54if-eqaaa-aaaaq-aacea-cai"
→ Bash: which dfx && dfx canister --network ic metadata f54if-eqaaa-aaaaq-aacea-cai candid:service 2>/dev/null || echo "dfx not available or failed"
Running WITHOUT skill...
Judging WITH skill...
Judging WITHOUT skill...
WITH skill: 2/6 passed
✅ Runs resolve-canister-id.sh with the provided principal
✅ Runs fetch-candid.sh with the canister ID
❌ Reads the downloaded .did file
→ No Read tool call or any file-reading operation on a .did file appears in the tool calls; the assistant gave up after the fetch scripts failed.
❌ Groups methods into Query and Update sections
→ The output contains no method grouping; the assistant only reported failure and provided external links.
❌ Sorts methods alphabetically within each group
→ No methods were listed at all, so no alphabetical sorting was performed or shown.
❌ Lists key custom types (records, variants) defined in the interface
→ No custom types were listed; the assistant could not retrieve the Candid interface and provided no type information.
WITHOUT skill: 0/6 passed
❌ Runs resolve-canister-id.sh with the provided principal
→ The output shows a timeout error (ETIMEDOUT), indicating no script was executed successfully.
❌ Runs fetch-candid.sh with the canister ID
→ The process timed out before any scripts could be run.
❌ Reads the downloaded .did file
→ No .did file was read; the process failed with a timeout error.
❌ Groups methods into Query and Update sections
→ No output was produced to group methods; the process timed out.
❌ Sorts methods alphabetically within each group
→ No output was produced to sort methods; the process timed out.
❌ Lists key custom types (records, variants) defined in the interface
→ No output was produced to list types; the process timed out.
━━━ Summary ━━━
Output evals:
Lookup by canister ID: WITH 2/6 | WITHOUT 0/6
Ideally yes, but the evaluations are non-deterministic. When I ran the evals yesterday, they were mostly green.
Just now I ran them again with this result:
━━━ Lookup by canister ID ━━━
Running WITH skill...
Tool calls: 3
→ Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
→ Bash: ./scripts/fetch-candid.sh f54if-eqaaa-aaaaq-aacea-cai
→ Read: /tmp/candid_f54if-eqaaa-aaaaq-aacea-cai.did
Running WITHOUT skill...
Judging WITH skill...
Judging WITHOUT skill...
WITH skill: 5/6 passed
✅ Runs resolve-canister-id.sh with the provided principal
✅ Runs fetch-candid.sh with the canister ID
✅ Reads the downloaded .did file
✅ Groups methods into Query and Update sections
❌ Sorts methods alphabetically within each group
→ Methods within Query and Update sections are arranged by sub-topic (e.g., Token Metadata, Balances, History) rather than alphabetically; for example `archives` appears after `get_data_certificate` and `icrc1_balance_of` appears after `is_ledger_ready`.
✅ Lists key custom types (records, variants) defined in the interface
WITHOUT skill: 0/6 passed
❌ Runs resolve-canister-id.sh with the provided principal
→ The assistant did not run any shell script; it only suggested manual dfx commands for the user to run themselves.
❌ Runs fetch-candid.sh with the canister ID
→ No fetch-candid.sh script was executed; the assistant only provided links and dfx commands as suggestions.
❌ Reads the downloaded .did file
→ No .did file was downloaded or read at any point in the response.
❌ Groups methods into Query and Update sections
→ No canister methods were retrieved or categorized; the response contains no method listings whatsoever.
❌ Sorts methods alphabetically within each group
→ No methods were listed, so no alphabetical sorting was performed.
❌ Lists key custom types (records, variants) defined in the interface
→ No custom types were extracted or listed; the Candid interface was never fetched or parsed.
━━━ Summary ━━━
Output evals:
Lookup by canister ID: WITH 5/6 | WITHOUT 0/6
The one eval that was always red was "Sorts methods alphabetically within each group", but I don't think that matters too much. In any case, it is independent of this change.
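Because the results vary from run to run, a single summary line says little on its own. The sketch below averages pass rates over several runs, using only the summary-line format visible in the logs above. The helper name mean_pass_rates is hypothetical and this is not part of the eval runner.

```python
import re
from collections import defaultdict

# Matches summary lines such as:
#   Lookup by canister ID: WITH 2/6 | WITHOUT 0/6
SUMMARY_RE = re.compile(
    r"^(?P<name>.+?): WITH (?P<wp>\d+)/(?P<wt>\d+) \| WITHOUT (?P<op>\d+)/(?P<ot>\d+)$"
)


def mean_pass_rates(summary_lines: list[str]) -> dict[str, dict[str, float]]:
    """Average the WITH and WITHOUT pass rates per eval across runs."""
    rates = defaultdict(lambda: {"with": [], "without": []})
    for line in summary_lines:
        match = SUMMARY_RE.match(line.strip())
        if not match:
            continue  # skip headers and anything that is not a summary line
        with_total = int(match["wt"])
        without_total = int(match["ot"])
        if with_total == 0 or without_total == 0:
            continue  # avoid dividing by zero on malformed input
        rates[match["name"]]["with"].append(int(match["wp"]) / with_total)
        rates[match["name"]]["without"].append(int(match["op"]) / without_total)
    return {
        name: {kind: sum(values) / len(values) for kind, values in per_eval.items()}
        for name, per_eval in rates.items()
    }


# The two runs posted in this thread.
runs = [
    "Lookup by canister ID: WITH 2/6 | WITHOUT 0/6",
    "Lookup by canister ID: WITH 5/6 | WITHOUT 0/6",
]
print(mean_pass_rates(runs))
```

Note that the 2/6 run came from an environment missing a required tool, so in practice such runs would probably be excluded before averaging.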
In your case, I find it interesting that the dfx metadata lookup is triggered, which seems strange 🤔
that dfx metadata lookup is triggered which seems strange
Good call! It's because this was the first time I tried this eval in my devenv (before, I had tried locally) to be sure to have a clean slate, but I forgot that icp was not installed 🙈. Once it was installed, I got the same results as you did.
Summary
- Uses --output-format stream-json --verbose to capture tool calls during execution, giving the judge visibility into which scripts were actually run (not just the final text output)
- Parses allowed-tools from skill frontmatter and passes them to claude --allowedTools, so skills requiring Bash scripts (like canhelp) can execute them during evals
- Uses an OpenChat SNS canister (r2pvs-tyaaa-aaaar-ajcwq-cai) with wasm but no candid:service metadata for the missing metadata eval
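The allowed-tools handling summarized above could look roughly like the sketch below. It rests on several assumptions: the frontmatter is delimited by "---" lines, allowed-tools is a single comma-separated line, and the tool list is passed to --allowedTools as one comma-joined argument. The function names and the sample frontmatter are hypothetical, and the real runner may do this differently.

```python
def split_outside_parens(value: str) -> list[str]:
    """Split on commas that are not inside parentheses, e.g. Bash(a,b)."""
    tools, depth, current = [], 0, ""
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            tools.append(current.strip())
            current = ""
        else:
            current += ch
    tools.append(current.strip())
    return [tool for tool in tools if tool]


def parse_allowed_tools(skill_markdown: str) -> list[str]:
    """Read the allowed-tools entry from a skill file's frontmatter.

    Assumes a single-line, comma-separated value; YAML lists are not handled.
    """
    lines = skill_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "allowed-tools":
            return split_outside_parens(value)
    return []


def build_claude_args(prompt: str, allowed_tools: list[str]) -> list[str]:
    """Assemble the command line; the exact argument layout is an assumption."""
    args = ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"]
    if allowed_tools:
        args += ["--allowedTools", ",".join(allowed_tools)]
    return args


# Hypothetical skill file, for illustration only.
skill = """---
name: canhelp
allowed-tools: Bash(./scripts/*), Read
---
Skill body goes here.
"""
print(build_claude_args("Describe canister X", parse_allowed_tools(skill)))
```

Returning an argument list rather than a shell string avoids quoting problems when the list is handed to a process-spawning API.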