Add bakery Claude skill with inspect_ai evals #585
Open
bschwedler wants to merge 10 commits into
Encodes critical bakery invariants and common workflows for use across the posit-dev image repos. Covers the template editing rule, uv invocation, cross-repo change protocol, CI merge flag forwarding (the UID collision mitigation), and reference tables for CLI flags and template variables. Symlinked to ~/.claude/skills/bakery/ for local use during iteration; will move to posit-dev-skills marketplace when stable.
Running `bakery update files` without filter flags re-renders every version, which is almost never correct. The skill now requires filter flags and defaults to scoping renders to the most recent version. Also adds guidance for the systematic-change case: render one version to validate, then apply to the template and re-render.
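The scoping rule above can be sketched as a small guard in Python. This is an illustration only: `update_files_cmd` is a hypothetical helper, not part of bakery; the filter flag names are the ones the skill lists (`--image-name`, `--image-version`, `--image-os`, `--image-variant`); and passing each value as a separate argument after its flag is an assumption about the CLI.

```python
def update_files_cmd(image_name=None, image_version=None,
                     image_os=None, image_variant=None):
    """Build an argv list for a scoped `uv run bakery update files` call.

    Hypothetical helper: raises ValueError when no filter is supplied,
    because an unscoped run would re-render every version.
    """
    filters = [
        ("--image-name", image_name),
        ("--image-version", image_version),
        ("--image-os", image_os),
        ("--image-variant", image_variant),
    ]
    # Keep only the filters the caller actually set.
    selected = [(flag, value) for flag, value in filters if value]
    if not selected:
        raise ValueError(
            "refusing to run `bakery update files` without filter flags"
        )

    argv = ["uv", "run", "bakery", "update", "files"]
    for flag, value in selected:
        # Assumption: each flag takes its value as the next argument.
        argv += [flag, value]
    return argv
```

The image name and version passed in should come from `bakery.yaml`, not from guesswork; any values used to exercise this helper are placeholders.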
Six test cases covering the critical invariants:
- Never edit rendered files (template editing rule)
- Always use uv run bakery
- Always scope bakery update files with filter flags
- Read sibling repo CLAUDE.md/bakery.yaml before cross-repo changes
- Forward filter flags to bakery ci merge
- Correct version creation flow

Adapted from the doc-reviewer eval pattern with a behavior_scorer that checks expected behaviors (YES/NO) and forbidden antipatterns (ABSENT/PRESENT) using a grader model.
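One plausible way to combine the grader's verdicts into a pass/fail result is sketched below in plain Python. The actual behavior_scorer in this PR is not shown here, so the function name `score_behaviors`, the list-of-strings input, and the all-or-nothing pass rule are assumptions made for illustration.

```python
def score_behaviors(expected_verdicts, forbidden_verdicts):
    """Combine grader verdicts into a single result.

    expected_verdicts: "YES"/"NO" per expected behavior.
    forbidden_verdicts: "ABSENT"/"PRESENT" per forbidden antipattern.
    Assumed rule: pass only if every expected behavior is YES and
    every antipattern is ABSENT.
    """
    for verdict in expected_verdicts:
        if verdict not in ("YES", "NO"):
            raise ValueError(f"unexpected verdict for expected behavior: {verdict!r}")
    for verdict in forbidden_verdicts:
        if verdict not in ("ABSENT", "PRESENT"):
            raise ValueError(f"unexpected verdict for antipattern: {verdict!r}")

    # Record indices so a failing sample shows which checks tripped.
    missed = [i for i, v in enumerate(expected_verdicts) if v == "NO"]
    violated = [i for i, v in enumerate(forbidden_verdicts) if v == "PRESENT"]
    return {
        "passed": not missed and not violated,
        "missed": missed,
        "violated": violated,
    }
```

Returning the failing indices alongside the boolean makes a non-deterministic eval easier to triage, since a flaky case can be traced to a specific behavior.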
- Matrix versions are excluded by default: --matrix-versions must be set explicitly for connect-content/workbench-session or builds silently produce zero targets
- --dev-stream is silently ignored (warning only) without --dev-versions include/only, so the two must always be paired
- --plan only works with --strategy bake (the default) and errors with --strategy build

Also adds bakery get tags as a lightweight preview command alongside bakery build --plan, and clarifies --push --no-load as the CI pattern.
Based on PR #565 (dev-spec-dispatch):
- --dev-stream is deprecated; replace with --dev-channel
- Add --dev-spec / BAKERY_DEV_SPEC for CI dispatch builds that must pin an exact dev version (overrides CDN discovery for the channel)
- Note that channel conflict between --dev-spec and --dev-channel raises an error
- Forward --dev-spec alongside --dev-channel in the ci merge invariant
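The flag-pairing rules from these commits can be expressed as a small pre-flight check. The sketch below is a hypothetical Python linter over an argument list, not something bakery ships; it only understands the `--flag value` form (not `--flag=value`), and the message wording is invented.

```python
def lint_build_args(argv):
    """Return a list of problems found in a bakery build argument list."""

    def value_of(flag):
        # Value following `flag`, or None if the flag is absent or last.
        if flag in argv:
            i = argv.index(flag)
            if i + 1 < len(argv):
                return argv[i + 1]
        return None

    problems = []
    if "--dev-stream" in argv:
        problems.append("--dev-stream is deprecated; use --dev-channel")
    if "--dev-channel" in argv and value_of("--dev-versions") not in ("include", "only"):
        problems.append("--dev-channel is ignored without --dev-versions include/only")
    if "--plan" in argv and value_of("--strategy") == "build":
        problems.append("--plan only works with --strategy bake")
    return problems
```

A check like this turns the "silently ignored" cases into visible failures before a build starts. The channel value in any test call is a placeholder, since valid channel names are not listed in this PR.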
Four new cases:
- 7: --dev-channel is used, not deprecated --dev-stream
- 8: --dev-channel silently ignored without --dev-versions
- 9: --dev-spec / BAKERY_DEV_SPEC for dispatch pinning
- 10: --matrix-versions excluded by default (silent zero targets)

Also updates case 5 (ci-merge-flag-forwarding) to expect --dev-channel and BAKERY_DEV_SPEC to be forwarded to the merge step, not just --dev-versions.
CLI findings:
- bakery remove is irreversible (no dry-run, no confirmation)
- bakery create version marks new version as latest by default, unmarking all others
- bakery update version --clean deletes files before re-rendering (destructive default); use bakery update files for safe re-renders
- bakery dgoss run replaces deprecated bakery run dgoss

Workflow findings:
- --dev-versions in clean must match build or the wrong images are cleaned
- --temp-registry in merge must exactly match the build value
- clean.yml callers must guard with github.ref == 'refs/heads/main' to avoid fork PR failures

Also extends invariant 6 (flag forwarding) to cover the clean stage and temp registry consistency.
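The consistency requirement across build, merge, and clean can be sketched as a comparison of per-stage flag values. This is an illustrative Python check under assumptions: the `MUST_MATCH_BUILD` table and the dict-per-stage input are hypothetical, and the table lists only the flags named in these commits.

```python
# Flags whose values in a later stage must equal the build stage's values.
# Hypothetical table assembled from the findings above.
MUST_MATCH_BUILD = {
    "merge": ("--dev-versions", "--dev-channel", "--temp-registry"),
    "clean": ("--dev-versions",),
}


def forwarding_mismatches(stages):
    """Return (stage, flag) pairs whose value differs from the build stage.

    `stages` maps a stage name ("build", "merge", "clean") to a dict of
    flag -> value. A flag missing from one stage but set in build counts
    as a mismatch, because it was not forwarded.
    """
    build = stages["build"]
    mismatches = []
    for stage, flags in MUST_MATCH_BUILD.items():
        stage_flags = stages.get(stage, {})
        for flag in flags:
            if stage_flags.get(flag) != build.get(flag):
                mismatches.append((stage, flag))
    return mismatches
```

Treating a missing flag as a mismatch matches the failure mode described above: a dropped `--dev-versions` in clean targets the wrong images rather than raising an error.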
Add invariant #3 to the bakery skill directing the model to read bakery.yaml before constructing filter-flag values. The skill already encoded how to use --image-name/--image-version/--image-os/--image-variant but never told the model where to get valid values — leaving it to guess. Add three inspect_ai eval cases (IDs 11–13) that test whether a model correctly reads bakery.yaml to discover image names, version names, and OS name strings rather than inventing or inferring them from context. Renumber the downstream invariants (#4–#11 → #5–#13) to accommodate the new entry.
Runs inspect_ai evals against the bakery skill whenever files under .claude/skills/bakery/ change. Path-filtered so it only fires on skill changes, not on every PR. Results are uploaded as artifacts. Not wired into the required CI gate — evals are a signal, not a hard block, given their cost and non-determinism.
ianpittwood
approved these changes
Jun 9, 2026
ianpittwood
left a comment
Contributor
Seems like a great idea! Hopefully this helps fix some of the woes with multi-repo edits.
Adds a Claude Code skill for the bakery CLI and an inspect_ai eval harness to validate it.
What's included
- `.claude/skills/bakery/SKILL.md`: A skill that encodes 13 critical invariants for working in Posit container image repos: never editing rendered files, always using `uv run`, reading `bakery.yaml` before constructing filter flags, forwarding flags consistently to `ci merge`, and more. Available in all sibling repos that have `images-shared` as an additional working directory.
- `.claude/skills/bakery/evals/`: An inspect_ai eval harness with 13 dataset entries covering the skill's key invariants. Uses a behavior scorer that checks expected behaviors and forbidden behaviors against a grader model's assessment.
- `.github/workflows/bakery-skill-eval.yml`: CI workflow that runs the evals when skill files change and uploads results as artifacts. Path-filtered to `.claude/skills/bakery/**` so it only fires on skill changes. Not wired into the required CI gate; evals are a signal, not a hard block.