Merged
32 changes: 10 additions & 22 deletions evaluations/canhelp.json
@@ -5,10 +5,10 @@
"output_evals": [
{
"name": "Lookup by canister ID",
Collaborator:
I'm getting this locally; is the goal to have 100% ✅?

━━━ Lookup by canister ID ━━━

  Running WITH skill...
  Tool calls: 3
    → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: ./scripts/fetch-candid.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: which dfx && dfx canister --network ic metadata f54if-eqaaa-aaaaq-aacea-cai candid:service 2>/dev/null || echo "dfx not available or failed"
  Running WITHOUT skill...
  Judging WITH skill...
  Judging WITHOUT skill...

  WITH skill: 2/6 passed
    ✅ Runs resolve-canister-id.sh with the provided principal
    ✅ Runs fetch-candid.sh with the canister ID
    ❌ Reads the downloaded .did file
       → No Read tool call or any file-reading operation on a .did file appears in the tool calls; the assistant gave up after the fetch scripts failed.
    ❌ Groups methods into Query and Update sections
       → The output contains no method grouping; the assistant only reported failure and provided external links.
    ❌ Sorts methods alphabetically within each group
       → No methods were listed at all, so no alphabetical sorting was performed or shown.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No custom types were listed; the assistant could not retrieve the Candid interface and provided no type information.

  WITHOUT skill: 0/6 passed
    ❌ Runs resolve-canister-id.sh with the provided principal
       → The output shows a timeout error (ETIMEDOUT), indicating no script was executed successfully.
    ❌ Runs fetch-candid.sh with the canister ID
       → The process timed out before any scripts could be run.
    ❌ Reads the downloaded .did file
       → No .did file was read; the process failed with a timeout error.
    ❌ Groups methods into Query and Update sections
       → No output was produced to group methods; the process timed out.
    ❌ Sorts methods alphabetically within each group
       → No output was produced to sort methods; the process timed out.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No output was produced to list types; the process timed out.

━━━ Summary ━━━

  Output evals:
    Lookup by canister ID: WITH 2/6 | WITHOUT 0/6

Member Author:

Ideally yes, but the evaluations are non-deterministic. When I ran the evals yesterday, they were mostly green.

Just now I ran them again, with this result:

━━━ Lookup by canister ID ━━━

  Running WITH skill...
  Tool calls: 3
    → Bash: ./scripts/resolve-canister-id.sh "f54if-eqaaa-aaaaq-aacea-cai"
    → Bash: ./scripts/fetch-candid.sh f54if-eqaaa-aaaaq-aacea-cai
    → Read: /tmp/candid_f54if-eqaaa-aaaaq-aacea-cai.did
  Running WITHOUT skill...
  Judging WITH skill...
  Judging WITHOUT skill...

  WITH skill: 5/6 passed
    ✅ Runs resolve-canister-id.sh with the provided principal
    ✅ Runs fetch-candid.sh with the canister ID
    ✅ Reads the downloaded .did file
    ✅ Groups methods into Query and Update sections
    ❌ Sorts methods alphabetically within each group
       → Methods within Query and Update sections are arranged by sub-topic (e.g., Token Metadata, Balances, History) rather than alphabetically; for example `archives` appears after `get_data_certificate` and `icrc1_balance_of` appears after `is_ledger_ready`.
    ✅ Lists key custom types (records, variants) defined in the interface

  WITHOUT skill: 0/6 passed
    ❌ Runs resolve-canister-id.sh with the provided principal
       → The assistant did not run any shell script; it only suggested manual dfx commands for the user to run themselves.
    ❌ Runs fetch-candid.sh with the canister ID
       → No fetch-candid.sh script was executed; the assistant only provided links and dfx commands as suggestions.
    ❌ Reads the downloaded .did file
       → No .did file was downloaded or read at any point in the response.
    ❌ Groups methods into Query and Update sections
       → No canister methods were retrieved or categorized; the response contains no method listings whatsoever.
    ❌ Sorts methods alphabetically within each group
       → No methods were listed, so no alphabetical sorting was performed.
    ❌ Lists key custom types (records, variants) defined in the interface
       → No custom types were extracted or listed; the Candid interface was never fetched or parsed.

━━━ Summary ━━━

  Output evals:
    Lookup by canister ID: WITH 5/6 | WITHOUT 0/6

The one eval that was consistently red was "Sorts methods alphabetically within each group", but I don't think that matters too much; it is independent of this change anyway.
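One cheap way to smooth over this kind of judge non-determinism (a sketch only, not part of this PR; `majorityVote` is a hypothetical helper and `runs` stands in for repeated per-behavior verdicts) is to run each case several times and take a strict majority per behavior:

```javascript
// Sketch: combine several noisy judge runs into one verdict per behavior.
// Each run is an array of booleans, one entry per expected behavior.
function majorityVote(runs) {
  const behaviors = runs[0].length;
  const verdicts = [];
  for (let i = 0; i < behaviors; i++) {
    const passes = runs.filter((r) => r[i]).length;
    verdicts.push(passes * 2 > runs.length); // strict majority; ties fail
  }
  return verdicts;
}

// Three noisy runs over the same six behaviors:
const runs = [
  [true, true, false, true, false, true],
  [true, true, true, true, false, true],
  [true, true, true, false, false, true],
];
console.log(majorityVote(runs)); // → [ true, true, true, true, false, true ]
```

This keeps a flaky behavior like the alphabetical-sort check from flipping the whole eval between runs, at the cost of N× the API calls.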

Member Author:

In your case I find it interesting that the dfx metadata lookup is triggered, which seems strange 🤔

Collaborator:

> that dfx metadata lookup is triggered which seems strange

Good call! It's because this was the first time I tried this eval in my devenv (before trying it locally), to be sure I had a clean slate, but I forgot that icp was not installed 🙈. Once installed, I got the same results as you did.
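For reference, the dfx fallback the model reached for maps to a fixed argv. A minimal sketch of building it (the helper name is hypothetical; the command itself is the one from the tool-call log above, and the argv could be handed to `execFileSync("dfx", …)` the same way `evaluate-skills.js` shells out to the claude CLI):

```javascript
// Sketch: argv for fetching a mainnet canister's candid:service wasm metadata
// via dfx, matching the command seen in the eval's tool-call log.
function dfxMetadataArgs(canisterId) {
  return ["canister", "--network", "ic", "metadata", canisterId, "candid:service"];
}

console.log(dfxMetadataArgs("f54if-eqaaa-aaaaq-aacea-cai").join(" "));
// → canister --network ic metadata f54if-eqaaa-aaaaq-aacea-cai candid:service
```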

"prompt": "What can canister ryjl3-tyaaa-aaaaa-aaaba-cai do?",
"prompt": "What can canister f54if-eqaaa-aaaaq-aacea-cai do?",
"expected_behaviors": [
"Runs resolve-canister-id.sh with the provided principal",
"Runs fetch-candid.sh with the resolved canister ID",
"Runs fetch-candid.sh with the canister ID",
"Reads the downloaded .did file",
"Groups methods into Query and Update sections",
"Sorts methods alphabetically within each group",
@@ -17,7 +17,7 @@
},
{
"name": "Lookup by human-readable name",
"prompt": "Show me the interface for the NNS governance canister",
"prompt": "Show me the interface for the Neutrinite Governance canister",
"expected_behaviors": [
"Runs resolve-canister-id.sh with the name (not a hardcoded canister ID)",
"Displays the resolved canister ID before proceeding",
@@ -35,21 +35,9 @@
"Does NOT arbitrarily pick one and proceed without asking"
]
},
- {
-   "name": "Large interface summarization",
-   "prompt": "Show me what the NNS governance canister can do",
-   "expected_behaviors": [
-     "Resolves 'NNS governance' to a canister ID",
-     "Fetches and reads the Candid interface",
-     "Groups methods by query vs update",
-     "Sorts methods alphabetically within each group",
-     "Shows method signatures with argument types and return types",
-     "Lists key types like Neuron, ProposalInfo, ManageNeuronRequest"
-   ]
- },
{
"name": "Output format structure",
"prompt": "Tell me about canister rrkah-fqaaa-aaaaa-aaaaq-cai",
"prompt": "Tell me about canister eqsml-lyaaa-aaaaq-aacdq-cai",
"expected_behaviors": [
"Starts the summary with 'Canister <canister-id>'",
"Has a Query methods section",
@@ -60,20 +48,20 @@
]
},
{
"name": "Empty Candid interface",
"prompt": "What methods does canister uf6dk-hyaaa-aaaaq-qaaaq-cai expose?",
"name": "Missing Candid metadata",
"prompt": "What methods does canister r2pvs-tyaaa-aaaar-ajcwq-cai expose?",
"expected_behaviors": [
"Runs fetch-candid.sh with the canister ID",
"If the Candid interface is empty, explains that the canister is not exposing its Candid interface in the wasm metadata",
"Explains that the canister does not expose its Candid interface in the wasm metadata",
"Does NOT fabricate or hallucinate methods"
]
},
{
"name": "Adversarial: local canister",
"prompt": "Show me the interface for my local canister bkyz2-fmaaa-aaaaa-qaaaq-cai",
"expected_behaviors": [
"Attempts to resolve and fetch the canister",
"If the fetch fails, suggests verifying the canister ID and that icp is installed",
"Recognizes that this is a local canister or that the skill is mainnet-only",
"Suggests reading the local .did file from the project directory instead",
"Does NOT hallucinate a Candid interface"
]
},
@@ -113,4 +101,4 @@
"How do I test my canister locally?"
]
}
}
}
108 changes: 90 additions & 18 deletions scripts/evaluate-skills.js
@@ -26,7 +26,7 @@
import { readFileSync, writeFileSync, mkdirSync } from "fs";
import { execFileSync } from "child_process";
import { join } from "path";
- import { readAllSkills } from "./lib/parse-skill.js";
+ import { readAllSkills, parseFrontmatter } from "./lib/parse-skill.js";

const ROOT = new URL("..", import.meta.url).pathname.replace(/\/$/, "");

@@ -51,6 +51,8 @@ const listEvals = args.includes("--list");
// ---------------------------------------------------------------------------
const skillDir = join(ROOT, "skills", skillName);
const skillContent = readFileSync(join(skillDir, "SKILL.md"), "utf-8");
const skillMeta = parseFrontmatter(skillContent);
const skillAllowedTools = skillMeta?.["allowed-tools"] || "Read";
const evalsFile = join(ROOT, "evaluations", `${skillName}.json`);
const evals = JSON.parse(readFileSync(evalsFile, "utf-8"));

@@ -76,56 +78,113 @@ if (evalFilter) {
// ---------------------------------------------------------------------------

/**
- * Run a prompt through claude CLI and return the output text.
+ * Run a prompt through claude CLI and return the output text (and optionally tool calls).
* @param {string} prompt - The user prompt
* @param {string|null} systemPrompt - Optional system prompt (skill content)
* @param {object} [options] - Optional settings
* @param {string} [options.cwd] - Working directory (defaults to /tmp)
- * @param {boolean} [options.allowRead] - Allow the Read tool so the agent can fetch reference files
+ * @param {string} [options.allowedTools] - Comma-separated tool patterns to allow (e.g. from allowed-tools frontmatter)
+ * @param {boolean} [options.captureToolCalls] - If true, use stream-json to capture tool calls
+ * @returns {string|{text: string, toolCalls: string[]}} - Plain text or object with tool calls
*/
function runClaude(prompt, systemPrompt, options = {}) {
// Use execFileSync with input option to avoid shell expansion issues.
// Shell expansion of $VAR and $(...) in skill content (e.g., $ICP_WASM_OUTPUT_PATH)
// would corrupt the system prompt when passed via "$(cat ...)".
const useStreamJson = options.captureToolCalls;
const args = ["-p", "--model", "sonnet"];
if (useStreamJson) {
args.push("--output-format", "stream-json", "--verbose");
}
if (systemPrompt) {
args.push("--system-prompt", systemPrompt);
}
- if (options.allowRead) {
-   args.push("--allowedTools", "Read");
+ if (options.allowedTools) {
+   args.push("--allowedTools", options.allowedTools);
  }

const cwd = options.cwd || "/tmp";
let raw;
try {
- return execFileSync("claude", args, {
+ raw = execFileSync("claude", args, {
input: prompt,
encoding: "utf-8",
- maxBuffer: 1024 * 1024,
+ maxBuffer: 5 * 1024 * 1024,
timeout: 120_000,
cwd,
}).trim();
} catch (e) {
- return `[ERROR] ${e.message}`;
+ const errText = `[ERROR] ${e.message}`;
+ return useStreamJson ? { text: errText, toolCalls: [] } : errText;
}

if (!useStreamJson) return raw;

// Parse stream-json lines to extract tool calls and final result
const toolCalls = [];
let resultText = "";
for (const line of raw.split("\n")) {
let msg;
try { msg = JSON.parse(line); } catch { continue; }

if (msg.type === "assistant" && msg.message?.content) {
for (const block of msg.message.content) {
if (block.type === "tool_use") {
const input = block.input || {};
const summary = block.name === "Bash"
? `Bash: ${input.command || ""}`
: block.name === "Read"
? `Read: ${input.file_path || ""}`
: `${block.name}: ${JSON.stringify(input).slice(0, 200)}`;
toolCalls.push(summary);
}
}
}
if (msg.type === "result") {
resultText = msg.result || "";
}
}

return { text: resultText, toolCalls };
}

- /** Ask claude to judge an output against expected behaviors. */
+ /**
+  * Ask claude to judge an output against expected behaviors.
+  * @param {object} evalCase - The eval case with prompt and expected_behaviors
+  * @param {string|{text: string, toolCalls: string[]}} output - Plain text or structured output
+  * @param {string} label - Label for logging
+  */
function judge(evalCase, output, label) {
const behaviors = evalCase.expected_behaviors
.map((b, i) => `${i + 1}. ${b}`)
.join("\n");

// Build the output section, including tool calls if available
const isStructured = typeof output === "object" && output.toolCalls;
let outputSection;
if (isStructured && output.toolCalls.length > 0) {
const toolList = output.toolCalls.map((t, i) => `${i + 1}. ${t}`).join("\n");
outputSection = `<tool_calls>
The assistant made the following tool calls during execution:
${toolList}
</tool_calls>

<output>
${output.text}
</output>`;
} else {
outputSection = `<output>
${isStructured ? output.text : output}
</output>`;
}

const judgePrompt = `You are an evaluation judge. A coding assistant was given this task:

<task>
${evalCase.prompt}
</task>

The assistant produced this output:

- <output>
- ${output}
- </output>
+ ${outputSection}

Score each expected behavior as PASS or FAIL. Be strict — the behavior must be clearly present, not just vaguely implied. Return ONLY a JSON array of objects with "behavior", "pass" (boolean), and "reason" (one sentence).

@@ -231,14 +290,23 @@
for (const evalCase of outputCases) {
console.log(`━━━ ${evalCase.name} ━━━\n`);

- // Run WITH skill — from the skill directory with Read access so the
- // agent can fetch reference files on demand, matching real usage.
+ // Run WITH skill — from the skill directory with tools declared in the
+ // skill's allowed-tools frontmatter, matching real usage. Use stream-json
+ // to capture tool calls so the judge can verify script execution.
console.log(" Running WITH skill...");
const withOutput = runClaude(evalCase.prompt, skillContent, {
cwd: skillDir,
- allowRead: true,
+ allowedTools: skillAllowedTools,
+ captureToolCalls: true,
});

if (withOutput.toolCalls?.length > 0) {
console.log(` Tool calls: ${withOutput.toolCalls.length}`);
for (const tc of withOutput.toolCalls) {
console.log(` → ${tc}`);
}
}

// Run WITHOUT skill (baseline) — no tools, no skill context
let withoutOutput = null;
if (!skipBaseline) {
@@ -277,9 +345,13 @@
}
}

// Store text output (not the full structured object) in results
const withOutputText = typeof withOutput === "object" ? withOutput.text : withOutput;
const withToolCalls = typeof withOutput === "object" ? withOutput.toolCalls : [];

allResults.output_evals.push({
name: evalCase.name,
- with_skill: { output: withOutput, judgment: withJudgment },
+ with_skill: { output: withOutputText, tool_calls: withToolCalls, judgment: withJudgment },
without_skill: withoutOutput
? { output: withoutOutput, judgment: withoutJudgment }
: null,
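The stream-json handling added to runClaude can be exercised in isolation. A minimal sketch of the same parsing loop over synthetic event lines (the event shapes are assumed from this diff, not taken from CLI documentation):

```javascript
// Sketch: extract tool-call summaries and the final result text from
// newline-delimited stream-json output, mirroring the loop added in this PR.
function parseStream(raw) {
  const toolCalls = [];
  let resultText = "";
  for (const line of raw.split("\n")) {
    let msg;
    try { msg = JSON.parse(line); } catch { continue; } // skip non-JSON lines

    if (msg.type === "assistant" && msg.message?.content) {
      for (const block of msg.message.content) {
        if (block.type === "tool_use") {
          const input = block.input || {};
          toolCalls.push(`${block.name}: ${input.command || input.file_path || ""}`);
        }
      }
    }
    if (msg.type === "result") resultText = msg.result || "";
  }
  return { text: resultText, toolCalls };
}

// Synthetic lines in the shape the parser expects:
const sample = [
  JSON.stringify({ type: "assistant", message: { content: [
    { type: "tool_use", name: "Bash",
      input: { command: "./scripts/fetch-candid.sh f54if-eqaaa-aaaaq-aacea-cai" } },
  ] } }),
  JSON.stringify({ type: "result", result: "Canister summary text" }),
].join("\n");

console.log(parseStream(sample).toolCalls.length); // → 1
```

Feeding the judge these summaries alongside the final text is what lets the "Runs resolve-canister-id.sh" and "Runs fetch-candid.sh" behaviors be verified directly instead of inferred from prose.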