A public, fully runnable and tested TypeScript training lab for GH-600 domain 4.0: Evaluation, Debugging and Iteration.
Most agent courses skip evaluation entirely. This one does not. Everything here is deterministic - no network, no API keys, no LLM calls - so the same dataset always produces the same score, and the lab runs identically on your machine and in CI.
evals/
  datasets/   issue-triage, code-review and safety cases as JSONL (input + expected)
  rubrics/    the matching rubric for each dataset
  runners/    run-eval.ts (deterministic classifier + scorer) and score-result.ts
  reports/    a sample eval report
traces/       passing, failing and tool-call agent traces (JSON)
debugging/    trace-reader.ts, failure-analysis.md, iteration-log.md
tests/        Node test runner suites for the scorer, trace reader and eval runner
labs/         eight hands-on labs
- Node.js 22 or newer (the CI pins Node 22).
- npm.
Install dependencies:

    npm install --no-fund --no-audit

Run every eval:

    npm run eval

This reads each JSONL dataset, runs a deterministic rule-based classifier over every
case, scores the prediction against the expected label, prints a per-case PASS/FAIL
breakdown plus a mean score against the threshold, and exits 0. See
evals/reports/sample-eval-report.md for example output.
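
The flow described above (parse JSONL, classify with fixed rules, score against the expected label, compare the mean to a threshold) can be sketched as follows. This is a minimal illustration, not the lab's actual run-eval.ts: the field names, labels, keyword rules, 0.8 threshold and the helpers `parseJsonl`, `classify` and `runEval` are all assumptions made for the example.

```typescript
// Hypothetical case shape: the lab's real JSONL fields may differ.
interface EvalCase {
  input: string;
  expected: string;
}

interface CaseResult extends EvalCase {
  predicted: string;
  pass: boolean;
  score: number;
}

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): EvalCase[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as EvalCase);
}

// Illustrative keyword rules (assumed). The first matching rule wins,
// so the same input always yields the same label.
const RULES: Array<{ pattern: RegExp; label: string }> = [
  { pattern: /crash|error|exception/i, label: "bug" },
  { pattern: /add|support/i, label: "feature" },
];

function classify(input: string): string {
  for (const rule of RULES) {
    if (rule.pattern.test(input)) return rule.label;
  }
  return "question"; // fallback label when no rule matches
}

// Score each case as 1 (exact label match) or 0, then average.
function runEval(cases: EvalCase[], threshold: number) {
  const results: CaseResult[] = cases.map((c) => {
    const predicted = classify(c.input);
    const pass = predicted === c.expected;
    return { ...c, predicted, pass, score: pass ? 1 : 0 };
  });
  const total = results.reduce((sum, r) => sum + r.score, 0);
  const mean = results.length === 0 ? 0 : total / results.length;
  return { results, mean, meetsThreshold: mean >= threshold };
}

// Inline sample data stands in for a dataset file.
const sample = [
  '{"input":"App crashes on startup","expected":"bug"}',
  '{"input":"Please add dark mode","expected":"feature"}',
  '{"input":"How do I configure the proxy?","expected":"question"}',
  '{"input":"Saving silently does nothing","expected":"bug"}',
].join("\n");

const report = runEval(parseJsonl(sample), 0.8);
for (const r of report.results) {
  console.log(`${r.pass ? "PASS" : "FAIL"}  ${r.input} -> ${r.predicted}`);
}
console.log(`mean=${report.mean.toFixed(2)} threshold=0.80`);
```

The last sample case contains no rule keyword, so it falls through to the fallback label and fails. That kind of miss is what a per-case breakdown makes visible.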
Run a single dataset:
    node --import tsx evals/runners/run-eval.ts safety

Run the tests:

    npm test

The suite verifies that the scorer returns correct pass/fail/score values, that the
trace reader finds the first failing step in traces/failing-trace.json, and that the
eval runner's classifiers score known cases correctly.
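
To make "finds the first failing step" concrete, here is a small sketch of that lookup. The trace schema (`id`, `steps`, `status`, `detail`) and the function name `findFirstFailingStep` are hypothetical. The real trace-reader.ts and failing-trace.json may be structured differently.

```typescript
// Assumed trace schema, for illustration only.
interface TraceStep {
  name: string;
  status: "ok" | "error";
  detail?: string;
}

interface Trace {
  id: string;
  steps: TraceStep[];
}

// Return the index and contents of the first step whose status is "error",
// or null when every step succeeded.
function findFirstFailingStep(
  trace: Trace,
): { index: number; step: TraceStep } | null {
  for (const [index, step] of trace.steps.entries()) {
    if (step.status === "error") return { index, step };
  }
  return null;
}

// Inline JSON stands in for a trace file on disk.
const failingTrace = JSON.parse(`{
  "id": "example-failing",
  "steps": [
    { "name": "plan", "status": "ok" },
    { "name": "call-tool", "status": "ok" },
    { "name": "parse-tool-output", "status": "error", "detail": "invalid JSON" },
    { "name": "respond", "status": "error", "detail": "no parsed result" }
  ]
}`) as Trace;

const firstFailure = findFirstFailingStep(failingTrace);
if (firstFailure) {
  console.log(
    `first failure at step ${firstFailure.index}: ${firstFailure.step.name}`,
  );
} else {
  console.log("no failing steps");
}
```

Stopping at the first error matters because later failures, such as the `respond` step here, are often consequences of the earlier one.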
Type-check the code:

    npm run typecheck

.github/workflows/agent-evals.yml runs install, npm test and npm run typecheck on
every push and pull request to main. Dependabot keeps the npm and GitHub Actions dependencies current.
Work through labs/README.md: eight labs that take you from authoring a dataset to
writing an evidence-backed iteration report.
- Certy: https://certy.pro
- CertyPro on GitHub: https://github.com/CertyPro
- Course content: https://github.com/CertyPro/certy-gh600-course-content
MIT. See LICENSE.