CertyPro/certy-gh600-evaluation-debugging-lab

Certy GH-600 - Evaluation, Debugging and Iteration Lab

A public, fully runnable and tested TypeScript training lab for GH-600 domain 4.0: Evaluation, Debugging and Iteration.

Most agent courses skip evaluation entirely. This one does not. Everything here is deterministic - no network, no API keys, no LLM calls - so the same dataset always produces the same score, and the lab runs identically on your machine and in CI.

What is inside

evals/
  datasets/      issue-triage, code-review and safety cases as JSONL (input + expected)
  rubrics/       the matching rubric for each dataset
  runners/       run-eval.ts (deterministic classifier + scorer) and score-result.ts
  reports/       a sample eval report
traces/          passing, failing and tool-call agent traces (JSON)
debugging/       trace-reader.ts, failure-analysis.md, iteration-log.md
tests/           Node test runner suites for the scorer, trace reader and eval runner
labs/            eight hands-on labs
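
The layout above says each dataset is JSONL carrying an input and an expected value. As a minimal sketch of what reading such a file could involve, assuming each line is a JSON object with string fields named `input` and `expected` (the exact field names, the `EvalCase` type and `parseJsonl` are illustrative assumptions, not taken from the repository):

```typescript
// Hypothetical case shape: the README only states "input + expected".
interface EvalCase {
  input: string;
  expected: string;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): EvalCase[] {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  return lines.map((line, index) => {
    const record = JSON.parse(line) as Partial<Record<keyof EvalCase, unknown>> | null;
    if (
      record === null ||
      typeof record.input !== "string" ||
      typeof record.expected !== "string"
    ) {
      // Fail loudly so a malformed case cannot silently skew a score.
      throw new Error(`Case ${index + 1} needs string "input" and "expected" fields`);
    }
    return { input: record.input, expected: record.expected };
  });
}

// Made-up sample data for illustration only.
const sample = [
  '{"input": "App crashes on login", "expected": "bug"}',
  '{"input": "Please add dark mode", "expected": "feature"}',
].join("\n");

const cases = parseJsonl(sample);
console.log(`${cases.length} cases loaded`);
```

Validating every line up front keeps the run deterministic: a broken dataset stops the run instead of producing a misleading score.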

Requirements

  • Node.js 22 or newer (the CI pins Node 22).
  • npm.

Install

npm install --no-fund --no-audit

Run the evaluation

npm run eval

This reads each JSONL dataset, runs a deterministic rule-based classifier over every case, and scores each prediction against the expected label. It prints a per-case PASS/FAIL breakdown and the mean score compared with the threshold, then exits 0. See evals/reports/sample-eval-report.md for example output.
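
The classify, score and threshold steps can be sketched roughly as follows. This is not the code in evals/runners/run-eval.ts: the keyword rules, the labels, the `0.7` threshold and all names here (`classify`, `runEval`, `EvalReport`) are assumptions chosen for illustration.

```typescript
interface EvalCase {
  input: string;
  expected: string;
}

interface CaseResult extends EvalCase {
  predicted: string;
  pass: boolean;
}

interface EvalReport {
  results: CaseResult[];
  mean: number;
  meetsThreshold: boolean;
}

// Hypothetical rule-based classifier: fixed keyword rules, no randomness,
// so the same input always yields the same label.
function classify(input: string): string {
  const text = input.toLowerCase();
  if (/(crash|error|fail)/.test(text)) return "bug";
  if (/(add|support|please)/.test(text)) return "feature";
  return "question";
}

// Score each case as 1 (match) or 0 (mismatch) and compare the mean to a threshold.
function runEval(cases: EvalCase[], threshold: number): EvalReport {
  const results = cases.map((c) => {
    const predicted = classify(c.input);
    return { ...c, predicted, pass: predicted === c.expected };
  });
  const passes = results.filter((r) => r.pass).length;
  const mean = results.length === 0 ? 0 : passes / results.length;
  return { results, mean, meetsThreshold: mean >= threshold };
}

// Made-up cases; the last one is misclassified because "adding" contains "add".
const report = runEval(
  [
    { input: "App crashes on login", expected: "bug" },
    { input: "Please add dark mode", expected: "feature" },
    { input: "How do I reset my password?", expected: "question" },
    { input: "Docs are unclear about adding plugins", expected: "question" },
  ],
  0.7, // assumed threshold
);

for (const r of report.results) {
  console.log(`${r.pass ? "PASS" : "FAIL"}  expected=${r.expected} predicted=${r.predicted}`);
}
console.log(`mean=${report.mean.toFixed(2)} meetsThreshold=${report.meetsThreshold}`);
```

The deliberately misclassified fourth case shows why a per-case breakdown matters: the mean alone would hide which rule produced the wrong label.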

Run a single dataset:

node --import tsx evals/runners/run-eval.ts safety

Run the tests

npm test

The suite verifies that the scorer returns correct pass/fail/score values, that the trace reader finds the first failing step in traces/failing-trace.json, and that the eval runner's classifiers score known cases correctly.
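
To illustrate the trace reader check, here is a minimal sketch of locating the first failing step in a trace. The trace shape (`steps`, `status`, `error`) and the function name `findFirstFailingStep` are assumptions; the actual schema of traces/failing-trace.json and the API of debugging/trace-reader.ts may differ.

```typescript
// Hypothetical trace schema for illustration only.
interface TraceStep {
  step: number;
  action: string;
  status: "ok" | "error";
  error?: string;
}

interface Trace {
  id: string;
  steps: TraceStep[];
}

// Return the earliest step whose status is "error", or undefined if all passed.
function findFirstFailingStep(trace: Trace): TraceStep | undefined {
  return trace.steps.find((s) => s.status === "error");
}

// Made-up trace: step 2 fails, and step 3 fails as a downstream consequence.
const failingTrace: Trace = {
  id: "example-failing-trace",
  steps: [
    { step: 1, action: "read_issue", status: "ok" },
    { step: 2, action: "call_tool:search_code", status: "error", error: "tool timeout" },
    { step: 3, action: "write_summary", status: "error", error: "missing search results" },
  ],
};

const firstFailure = findFirstFailingStep(failingTrace);
console.log(
  firstFailure
    ? `first failure at step ${firstFailure.step}: ${firstFailure.error ?? "unknown error"}`
    : "no failing step",
);
```

Reporting the first failure rather than the last is useful because later errors are often consequences of the earliest one.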

Type check

npm run typecheck

Continuous integration

.github/workflows/agent-evals.yml runs install, npm test and npm run typecheck on every push and pull request to main. Dependabot keeps npm and GitHub Actions current.

Start here

Work through labs/README.md: eight labs that take you from authoring a dataset to writing an evidence-backed iteration report.

Licence

MIT. See LICENSE.

About

Evaluate and debug agents: datasets, rubrics, a runnable eval runner, traces. GH-600. By Certy.
