#67 ci: detect watcher_client drift vs committed OpenAPI snapshot (Layer B) #68

Merged: gregoryfoster merged 4 commits into main from 67-detect-client-drift on Jun 24, 2026

@gregoryfoster

What

Layer B of #67: a hermetic, PR-blocking CI gate that catches generated-client drift, the failure mode behind #66 (Watcher dropped default_content_type → KeyError on every parse → all dashboard Watcher actions broke, found only in prod).

The client-drift job regenerates watcher_client from a committed OpenAPI snapshot and fails on any diff vs the committed generated/ tree. Stale client → red build instead of a prod incident.

How

  • clients/watcher-python/watcher-openapi.json — committed contract-of-record snapshot (pretty-printed, order-preserving — sort_keys would reshape, not just reformat, the generated tree).
  • scripts/check_client_drift.py — regenerates from the snapshot into a temp dir inside the SDK tree (so ruff resolves the same config regen.sh uses — formatting outside the tree falls back to line-length 88 and produces spurious import-wrap diffs), then diffs vs committed. --write is non-destructive remediation; subprocess failures surface as DriftCheckError with captured output; bounded timeout. Exit: 0 no drift · 1 drift · 2 regen error.
  • clients/watcher-python/scripts/regen.sh — now writes the snapshot and regenerates the tree from it, in lockstep, so running it after a legitimate Watcher change refreshes both and leaves the gate a no-op.
  • .github/workflows/ci.yml: client-drift job (push/PR to main).
  • Tests: diff_trees comparator + main exit codes + _run error/timeout surfacing (15 tests).
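
To make the "regenerate into a temp dir inside the SDK tree, then diff" step concrete, here is a minimal sketch. It is not the code from scripts/check_client_drift.py: the signature of diff_trees, the helper name staging_dir_inside, and the ".drift-" prefix are assumptions made for illustration, and the actual generator and ruff invocations are left out.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def staging_dir_inside(sdk_root: Path) -> Iterator[Path]:
    """Yield a temp dir created *inside* sdk_root and remove it on exit.

    Hypothetical helper: keeping the staging dir under the SDK tree means a
    formatter that discovers its config by walking up from the file finds the
    same config the committed tree was formatted with.
    """
    with tempfile.TemporaryDirectory(dir=sdk_root, prefix=".drift-") as tmp:
        yield Path(tmp)


def diff_trees(committed: Path, regenerated: Path) -> list[str]:
    """Return sorted relative paths that differ or exist on only one side.

    Assumed signature; the real comparator may report more detail.
    """

    def files(root: Path) -> set[str]:
        return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

    left, right = files(committed), files(regenerated)
    drifted = left ^ right  # present on exactly one side
    for rel in left & right:
        # Byte comparison: any formatting difference counts as drift.
        if (committed / rel).read_bytes() != (regenerated / rel).read_bytes():
            drifted.add(rel)
    return sorted(drifted)
```

An empty list from diff_trees corresponds to the "no drift" outcome; any entry would make the gate fail.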

The committed generated/ tree is unchanged — the gate is green against pristine HEAD.

Scope / non-goals

  • Consistency only, not currency. This proves generated/ == regen-from-snapshot. It does not detect drift of the snapshot itself vs live Watcher — that needs a scheduled live-compare (Layer C, follow-up).
  • archiver-client (Layer A) deferred. Its committed SDK is genuinely stale (missing /api/v1/domains v4.1+ models) and regen.sh doesn't prune /dashboard/* routes, so a clean regen isn't reproducible today — a real public-surface SDK refresh, filed separately.
  • Layer D (runtime canary on the existing WatcherResponseError) — follow-up.

Plan: docs/plans/2026-06-23-detect-generated-client-drift.md. Reviewed (3 CR rounds, converged).

Closes #67 partially (Layer B); follow-ups tracked separately.

🤖 Generated with Claude Code

gregoryfoster and others added 4 commits June 23, 2026 22:55
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…yer B)

Adds a hermetic, PR-blocking gate that turns a stale generated client (the
#66 failure mode) into a red build:

- clients/watcher-python/watcher-openapi.json — committed contract-of-record
  snapshot of Watcher's OpenAPI (order-preserving pretty-print; generation-
  identical to the live spec).
- scripts/check_client_drift.py — regenerates the client from the snapshot
  into a temp dir *inside the SDK tree* (so ruff resolves the same config as
  regen.sh) and diffs vs the committed generated/ tree. --write regenerates
  in place as drift remediation. Extensible client registry.
- tests/scripts/test_check_client_drift.py — TDD coverage of diff_trees.
- .github/workflows/ci.yml — client-drift job (push/PR to main).

Gate is green against the pristine HEAD tree (no generated-tree change). This
is consistency-only (snapshot vs tree); detecting drift of the snapshot itself
vs live Watcher needs a scheduled live-compare (Layer C, follow-up). The
archiver-client direction (Layer A) is deferred — its committed SDK is stale
(missing /api/v1/domains models) and regen.sh doesn't prune /dashboard routes;
filed separately. See docs/plans/2026-06-23-detect-generated-client-drift.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R round 1)

- regen.sh now writes watcher-openapi.json (canonical, order-preserving) AND
  regenerates the tree FROM the snapshot, keeping both in lockstep; running it
  after a legitimate Watcher change refreshes both and leaves the gate a no-op
  (CR finding 1/1b).
- check_client_drift.py: surface subprocess failures via DriftCheckError with
  captured stdout/stderr (no bare traceback); 300s timeout per step; --write is
  now non-destructive (regen into in-tree staging, swap on success); reworded
  the drift hint to distinguish hand-edit drift (--write) from upstream change
  (regen.sh) (CR findings 2, 3).
- Reworded docstring: mirrors regen.sh's generate+format invocation against the
  snapshot, not live (CR finding 6).
- Tests: cover main() exit codes (0 clean / 1 drift / 2 subprocess error) and
  --write dispatch (CR finding 4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Extract _regenerate_in_tree context manager; check_client/write_client share
  it (CR finding 10).
- Document exit codes (0/1/2) in the module Usage block (CR finding 9).
- Add _run tests: clean exit, failure surfaces captured stderr + exit code,
  timeout raises DriftCheckError (CR finding 8).
- AGENTS.md: distinguish hand-edit fix (--write) from upstream-change fix
  (regen.sh) (CR finding 7).
- regen.sh: canonicalize via 'uv run --no-project python' for toolchain
  consistency (CR finding 11).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
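
The error handling described in the two review-round commits above (subprocess failures surfaced as DriftCheckError with captured output, a 300s timeout per step, exit codes 0/1/2) could look roughly like the sketch below. Only DriftCheckError, the 300s figure, and the exit codes come from the commit messages; run_step and exit_code_for are hypothetical names, and the real _run and main may be structured differently.

```python
import subprocess
import sys
from typing import Callable


class DriftCheckError(RuntimeError):
    """Raised when a regeneration step fails or times out."""


def run_step(cmd: list[str], timeout: float = 300.0) -> str:
    """Run one step; on failure raise DriftCheckError carrying captured output."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise DriftCheckError(f"{cmd[0]} timed out after {timeout}s") from exc
    if proc.returncode != 0:
        raise DriftCheckError(
            f"{cmd[0]} exited {proc.returncode}\n"
            f"stdout:\n{proc.stdout}\nstderr:\n{proc.stderr}"
        )
    return proc.stdout


def exit_code_for(check: Callable[[], list[str]]) -> int:
    """Map a check result to the exit codes: 0 clean, 1 drift, 2 regen error."""
    try:
        drifted = check()
    except DriftCheckError as exc:
        # Print the captured output instead of letting a bare traceback escape.
        print(exc, file=sys.stderr)
        return 2
    return 1 if drifted else 0
```

Keeping regen errors on a separate exit code lets CI logs distinguish "the client is stale" from "the generator itself broke".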
@gregoryfoster gregoryfoster merged commit fad2e13 into main Jun 24, 2026
4 checks passed
@gregoryfoster gregoryfoster deleted the 67-detect-client-drift branch June 24, 2026 14:42
gregoryfoster added a commit that referenced this pull request Jun 24, 2026
The #68 client-drift gate is consistency-only (generated/ == regen-from-snapshot);
it cannot see the committed watcher-openapi.json snapshot going stale vs the live
Watcher service — the actual #66 failure mode. Close that gap with an on-VM
scheduled detector, not the GH Action the issue proposed: there is no public
Watcher URL for runners, and CannObserv/watcher has no CI and no committed spec
for a repository_dispatch push. On-VM localhost reaches Watcher with no
public-URL / hairpin / uptime coupling and reuses regen.sh as-is.

- scripts/check_watcher_live_drift.py: stdlib detector — fetch live /openapi.json,
  canonicalize as regen.sh does (snapshot is its fixed point), byte-compare.
  Exit 0 no drift / 1 drift (prints SPEC_SHA256) / 3 unreachable (skip).
- scripts/watcher_live_drift_pr.sh: on drift, regen snapshot + tree in an
  isolated worktree off origin/main and open a PR. Branch keyed on live spec SHA
  (one PR per upstream shape). watcher_client is outside the changelog trigger.
- deploy/watcher-live-drift.{service,timer} + deploy/README.md: daily oneshot,
  Persistent=true; install via systemctl enable --now (manual sudo step).
- tests/scripts/test_check_watcher_live_drift.py: canonicalize fixed-point parity
  + drift + exit-code orchestration.
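
As an illustration of the detector's "canonicalize, then byte-compare" logic, here is a stdlib-only sketch. The canonical form shown (order-preserving, indent=2, trailing newline), hashing the canonical bytes for SPEC_SHA256, and the names canonicalize and check_live are assumptions; the real script mirrors whatever regen.sh does, which this commit message does not spell out.

```python
import hashlib
import json
from typing import Callable


def canonicalize(raw: bytes) -> bytes:
    """Pretty-print JSON without sorting keys (assumed canonical form).

    json.loads keeps object key order, so re-canonicalizing already
    canonical bytes returns them unchanged (a fixed point).
    """
    text = json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def check_live(fetch: Callable[[], bytes], snapshot: bytes) -> int:
    """Return 0 for no drift, 1 for drift, 3 when the service is unreachable."""
    try:
        live = canonicalize(fetch())
    except OSError:
        # urllib.error.URLError and socket errors are OSError subclasses.
        return 3
    if live == snapshot:
        return 0
    print(f"SPEC_SHA256={hashlib.sha256(live).hexdigest()}")
    return 1
```

In real use, fetch would wrap something like urllib.request.urlopen against the local Watcher /openapi.json endpoint; it is passed in as a callable here so the logic can be tested without a network.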

Validated on the VM: no-drift -> exit 0; faked-drift --dry-run exercised
detect -> worktree -> regen -> correct no-op -> clean teardown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Detect generated API-client drift in CI (watcher_client / archiver-client vs upstream OpenAPI)

1 participant