fix(agent): cert SAN identity binding + invalid-length regression hardening#286
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes the agent-name length regression caused by x509 CN’s 64-byte cap by moving per-agent identity to URI SANs, and hardens the trust model by grounding identity in Docker-attested facts (peer IP → container → labels) with SAN cross-checking at the gRPC boundary.
Changes:
- Move per-agent identity to
urn:clawker:agent:<AgentFullName>URI SAN and pin cert CN toconsts.ContainerClawkerd; introduce typedauth.ProjectSlug/auth.AgentNameand ordering helpers. - Make
IdentityInterceptorthe universal identity gate (includingRegister) using peer-IP→Docker label resolution + constant-time SAN/label comparison; pass resolved container identity via context to handlers. - Adopt goose migrations for the agent registry DB and adjust registry/dialer/listing flows to handle snapshot errors and thumbprint-only registry matching.
Reviewed changes
Copilot reviewed 46 out of 47 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| NOTICE | Adds newly introduced dependency licenses. |
| internal/workspace/strategy.go | Switches volume purpose strings to typed constants. |
| internal/workspace/snapshot.go | Switches workspace volume purpose to typed constant. |
| internal/docker/names.go | Adds typed volume purpose constants + list for tests. |
| internal/docker/names_test.go | Adds headroom + random-name validity regression tests. |
| internal/docker/env.go | Updates comments to reflect AgentFullName usage. |
| internal/controlplane/server.go | Propagates Snapshot() ([]Entry, error) and returns codes.Internal on failure. |
| internal/controlplane/server_test.go | Updates tests for typed identity values and new Snapshot signature. |
| internal/controlplane/CLAUDE.md | Updates controlplane startup/identity gate documentation. |
| internal/controlplane/agent/start.go | Adds PeerLookup dep validation; adds session-cancel subscriber; handles Snapshot errors. |
| internal/controlplane/agent/start_test.go | Updates tests for new deps + Snapshot error return; adds session-cancel subscriber coverage. |
| internal/controlplane/agent/registry.go | Converts identity fields to typed values; changes Snapshot to return error; removes CN-based Lookup. |
| internal/controlplane/agent/registry_test.go | Updates tests for error-returning invariants and typed identity. |
| internal/controlplane/agent/registry_sqlite.go | Switches to goose-managed embedded migrations; adds malformed-row sentinel handling and typed re-validation. |
| internal/controlplane/agent/registry_sqlite_test.go | Updates tests for goose migrations, Snapshot errors, and typed identity fields. |
| internal/controlplane/agent/registry_mock_test.go | Regenerates mock for new Registry interface (Snapshot returns error; removes Lookup). |
| internal/controlplane/agent/register_handler.go | Consumes middleware-resolved container identity; verifies request/cert against resolved truth; rewrites malformed rows. |
| internal/controlplane/agent/peer_lookup.go | Introduces ContainerByPeerIP interface + ResolvedContainer + sentinels. |
| internal/controlplane/agent/peer_lookup_test.go | Adds unit tests for peer-IP→container resolution and label validation behavior. |
| internal/controlplane/agent/peer_lookup_moby.go | Implements Docker-backed peer lookup with label filtering and error aggregation. |
| internal/controlplane/agent/migrations/00001_init.sql | Adds goose v1 baseline schema (includes legacy canonical_cn column). |
| internal/controlplane/agent/migrations/00002_drop_canonical_cn.sql | Drops canonical_cn as identity moves to SAN + labels. |
| internal/controlplane/agent/identity_interceptor.go | Reworks interceptor into universal 3-stage gate (CN pin, peer lookup, SAN/label compare). |
| internal/controlplane/agent/identity_interceptor_test.go | Adds tests for each gate stage + stream-context propagation. |
| internal/controlplane/agent/handler.go | Adds peer-leaf extraction helpers and context plumbing for resolved container identity. |
| internal/controlplane/agent/events_session.go | Renames peer identity fields to SAN-based PeerAgentFullName. |
| internal/controlplane/agent/dispatch_register_test.go | Updates tests for typed identity fields and SAN-based peer identity naming. |
| internal/controlplane/agent/dialer.go | Stores per-dial cancel funcs for synchronous teardown; registry classification becomes thumbprint-only; SAN captured for diagnostics. |
| internal/controlplane/agent/dialer_test.go | Updates dialer tests for SAN identity capture and thumbprint-only registry outcomes. |
| internal/controlplane/agent/CLAUDE.md | Updates agent package docs for new trust model, resolver, and registry behavior. |
| internal/consts/consts.go | Adds ContainerClawkerd CN pin constant. |
| internal/cmd/container/shared/containerfs.go | Uses typed volume purpose constants. |
| internal/cmd/container/shared/container_start.go | Updates comments to reflect SAN-based identity. |
| internal/cmd/container/shared/container_create.go | Uses typed volume purpose constants for cleanup tracking. |
| internal/cmd/container/shared/CLAUDE.md | Updates docs for AgentFullName-in-SAN and CN pin. |
| internal/cmd/container/shared/agent_bootstrap.go | Updates docs to reflect AgentFullName SAN composition. |
| internal/cmd/container/shared/agent_bootstrap_test.go | Updates tests: CN pin + AgentFullName SAN presence. |
| internal/auth/identity.go | Increases short-name cap; adds MaxShortNameLen() and Less methods for typed identities. |
| internal/auth/identity_test.go | Updates tests for AgentFullName terminology and random-name length regression. |
| internal/auth/CLAUDE.md | Updates auth docs for SAN-based identity and CN pin. |
| internal/auth/agent_cert.go | Adds Agent SAN scheme + tri-state SAN parsing; CN set to ContainerClawkerd; embeds both agent/container URI SANs. |
| internal/auth/agent_cert_test.go | Updates tests for CN pin + AgentFullNameFromCert tri-state behavior. |
| go.mod | Adds goose + related deps; bumps several indirect dependency versions. |
| go.sum | Records module checksum updates for new/updated dependencies. |
| cmd/clawker-cp/main.go | Wires peer lookup into IdentityInterceptor and agent.Start; updates Register handler construction. |
| .claude/docs/KEY-CONCEPTS.md | Updates repository docs to reflect new identity/registry model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
schmitthub
added a commit
that referenced
this pull request
May 13, 2026
Addresses critical + important findings from the post-punch-list review of PR #286. CP serve path hardened, identity gate constructor follows the (*T, error) pattern, trust-anchor resolver fails closed on ambiguous peer IPs. - C1: IdentityInterceptor signature changes from panic-on-nil to (unary, stream, error). cmd/clawker-cp/main.go logs event=agent_identity_unavailable on error and degrades the AgentService listener while CP/firewall/admin stay up. Panic at this seam (post-eBPF-load) would have stranded pinned eBPF programs with no supervisor — exactly the silent-failure shape root CLAUDE.md forbids. - C2: internal/consts/consts.go — three doc-comment fixes to drop stale "canonical CN" / "pre-computed registry column" references that contradict the post-migration-00002 model. - C3: TestSQLiteRegistry_Snapshot_ReturnsErrOnDBFailure and TestAdminServer_ListAgents_SnapshotError_ReturnsCodesInternal — pin the eviction-cascade-prevention surface: Snapshot must surface a non-nil error, ListAgents must map it to codes.Internal, never silently return an empty result. - I1: Registry.Add godoc — panics→error wording matching the punch-list refactor. - I2: AgentFullNameFromCert, ContainerIDFromCert, sanTailFromCert — three duplicate godoc blocks deleted (godoc/IDE were picking up stale (string, bool) descriptions on the SAN parsers). - I3: dialer.classifyRegistry — ErrMalformedEntry now classifies as outcomeRegistryMiss so the Register handshake drives the evict+rewrite path. Treating it as NotQueried would have left malformed rows stranded forever (every reconnect publishes AgentUntrusted, no recovery trigger). - I4+I5: peer_lookup_moby.LookupByIP — exhaustively walks every purpose=agent candidate (no first-match-wins) and returns the new ErrAmbiguousPeerIP sentinel when two containers advertise the same clawker-net IP. Fail-closed for the transient-stale-endpoint race window. Daemon-error wrap now only escalates when NO candidate inspected cleanly — preserves the peer_lookup_no_match audit signal operators rely on. IdentityInterceptor Stage 2b gains a dedicated case → event=agent_identity_ambiguous_peer_ip. Tests: 116 packages green (in-container, e2e+whail skipped per project convention). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 46 out of 47 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (5)
internal/controlplane/agent/registry_test.go:1
- This comment block is duplicated and internally contradictory (first describes panic semantics, then error-return semantics). Please collapse it into a single, accurate description (Add returns an error) so future readers/tests don’t get misled about the intended contract.
internal/controlplane/agent/start.go:1 StartDeps.PeerLookupis made required and validated, butStart()doesn’t appear to use it anywhere in this diff (beyond the nil-check). IfStart()doesn’t actually depend on peer lookup yet, consider (mandatory) either wiring it into the components started here or (optional) deferring the requirement until it’s used to avoid forcing callers/tests to pass an unused dependency.
internal/controlplane/agent/start.go:1StartDeps.PeerLookupis made required and validated, butStart()doesn’t appear to use it anywhere in this diff (beyond the nil-check). IfStart()doesn’t actually depend on peer lookup yet, consider (mandatory) either wiring it into the components started here or (optional) deferring the requirement until it’s used to avoid forcing callers/tests to pass an unused dependency.
internal/docker/names.go:1VolumePurposesis exported as a mutable slice, which can be modified by any caller/package at runtime and unintentionally weaken the headroom test’s coverage assumptions (and any logic that iterates it). Consider exposing this list via a function that returns a copy, or keeping the backing list unexported and providing an accessor, so the canonical set cannot be mutated.
internal/controlplane/agent/dispatch_register_test.go:1- The helper parameter is still named
peerCN, but it now representsPeerAgentFullName(SAN-sourced AgentFullName), not CN. Renaming the parameter (e.g.,peerAgentFullName) would make the test intent clearer and prevent confusion now that CN is a fixed literal.
package agent
| } | ||
| } | ||
|
|
||
| func TestContainerName(t *testing.T) { |
Random container names (docker.GenerateRandomName worst case ~27 chars) exceeded the 24-char shortNameMax cap that existed so the composed CN "clawker.<project>.<agent>" stayed under x509's 64-byte limit. Surfaces as "agent bootstrap: invalid agent name" from CreateContainer (broke TestCreateContainer_NoPostInit). Mint per-agent leaves with Subject.CommonName = consts.ContainerClawkerd (deterministic binary identity) and put the canonical agent identity in a urn:clawker:agent:<canonical> URI SAN. CP-side IdentityInterceptor gains a universal CN pin to ContainerClawkerd (runs for every RPC including Register); registry Lookup keys off the SAN-sourced AgentFullName instead of Subject.CommonName. Register handler keeps its SAN-canonical-vs-request cross-check. Cap raised 24 -> 50; locked with a cartesian-product regression test over the GenerateRandomName word lists. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace hand-rolled applySchema + schemaMissingCanonicalCN drift detection with goose-managed migrations embedded via embed.FS. 00001_init.sql captures the current schema verbatim — no behavioral change for fresh DBs or DBs already at the current shape. Pre-goose DBs (have `agents` table but no `goose_db_version` sibling) get their `agents` table dropped via a one-shot pre-flight; goose then applies 00001 fresh. Alpha policy stands — registry state is regenerable from the next clawker run's Register handshake. Migrations live co-located at internal/controlplane/agent/migrations/ and ship inside the CP binary via embed.FS so the schema definitions are reachable at runtime inside the CP container without host-side asset bind mounts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Maps a live mTLS peer IP to the purpose=agent container owning that clawker-net endpoint and reads project + agent_name straight off the Docker labels. Lays the trust anchor for the universal identity middleware: labels are the authoritative source, cert claims become inputs to verify against them rather than the basis of identity lookup. ResolvedContainer carries typed Project/AgentName (auth.ProjectSlug / auth.AgentName) so a zero-value instance cannot pass an empty-vs-empty identity compare downstream. Resolver validates labels at the seam and returns ErrInvalidAgentLabels on malformed input — distinct from ErrNoContainerForPeerIP so callers can tell daemon-state corruption from a clean no-match. Per-container inspect failures log + continue iteration; the wrapped daemon error propagates only when no candidate matched AND any inspect erred, preserving the "we couldn't tell" vs "no agent owns this IP" distinction. Read-only docker calls are bounded by peerLookupTimeout under context.Background() so a per-RPC cancel cannot turn into a spurious reject. Wired through agent.StartDeps as a required field with a nil-check mirroring Registry/DockerLister/Bus/Dialer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ddleware
Trust anchor is now the kernel-attested peer IP, not cert claims. The
cert's urn:clawker:agent: URI SAN becomes a CLAIM verified against
Docker labels resolved from the peer's IP — never the basis of lookup.
The middleware runs three stages on every RPC including Register
(no opt-out):
1. CN pin: leaf.Subject.CommonName == consts.ContainerClawkerd
2a. Cert must carry an agent URI SAN (explicit reject for diagnostic
fidelity + Docker-roundtrip short-circuit)
2b. Peer IP -> Docker -> labels via ContainerByPeerIP
3. Cert SAN AgentFullName constant-time-compared against the
label-derived AgentFullName
Resolved (containerID, project, agent_name) is attached to ctx via
WithResolvedContainer; downstream handlers read it via
ResolvedContainerFromContext. Register reads label-derived identity
instead of the request's claim.
Key changes:
- IdentityInterceptor signature: (peerLookup ContainerByPeerIP, log)
drops the Registry + optedOut params and the IdentityOptedOutMethods
helper. Registry-row lookup is bookkeeping, not auth.
- WithEntry / EntryFromContext deleted; WithResolvedContainer /
ResolvedContainerFromContext replace them. The write side is
permissive on empty ContainerID (no panic on serving path per the
CP no-panic security contract); the read side returns ok=false for
both absent and zero-value cases, preventing silent identity vacuum.
- peerIdentity gains PeerAddr (the trust anchor); drops the unused
Raw field; peerIdentityFromContext delegates to peerLeafFromContext.
- remoteAddrToNetip moves from register_handler.go to handler.go so
both the interceptor and Register share the same extraction path
(Task 3 will collapse peerLeafAndIP).
- Five distinct event= log fields per reject path for operator
triage; uniform "registration rejected" wire envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Register handler used to call ContainerInspector + cross-check labels + peer-IP itself. With IdentityInterceptor as universal trust middleware (peer-IP -> Docker -> labels -> cert-SAN cross-check), the handler now reads the resolved (containerID, project, agentName) from ctx via ResolvedContainerFromContext and focuses on what only it can do: capture the live thumbprint, verify the cert's container_id SAN against the resolved truth, verify the request body against the resolved truth, and write the row using label-derived (authoritative) values. Drops: - ContainerInspector interface + NewMobyContainerInspector + mobyContainerInspector + inspectTimeout const + fakeContainerInspector test fake. The middleware owns Docker resolution; the handler is Docker-free. - peerLeafAndIP -> handler now uses peerLeafFromContext (handler.go). - labelsMatchRequest, peerIPMatchesContainer -> middleware's concern. - subtle.ConstantTimeCompare for non-secret values (ProjectSlug, AgentName, ContainerID are daemon-attested labels / public docker IDs). Plain != is equally secure here; thumbprint compare remains direct [32]byte equality. Adds: - agent_register_no_resolved_container -> codes.Internal (wiring bug, not identity verdict; operator can distinguish "rejected" from "broken"). - agent_register_peer_cert_missing_post_resolve -> defense-in-depth PermissionDenied if ctx is tampered between interceptor and handler. - agent_register_request_label_mismatch -> request body lies about identity even though cert/labels agree. - agent_register_container_id_mismatch -> cert container SAN claim disagrees with peer-IP-resolved truth. Tests collapsed into table-driven groups (RequestValidation, CtxGates, IdentityCrossChecks, RegistryBranches) for ~80 line reduction. internal/controlplane/agent/CLAUDE.md updated: Identity contract rewritten around peer-IP trust anchor; Register flow step 5 reworded; opt-out subsection replaced with universal three-stage gate; Files + Test seam tables refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Registry.Lookup(thumbprint, cn) gate combined two checks — channel binding via thumbprint and identity binding via canonical CN — into one sentinel-returning call. After Tasks 1-3 the identity binding moved upstream into IdentityInterceptor (peer-IP → Docker labels → cert SAN cross-check), so Registry no longer needs to carry the CN half. Registry surface shrinks: Lookup removed; Add no longer validates project/agent_name strings (the wire boundary already does, via auth.NewProjectSlug/NewAgentName). memEntry wrapper and the canonical-CN pre-compute deleted; entries map is now [Thumbprint]Entry directly. Dialer.classifyRegistry collapses to a thumbprint compare against the row keyed by container_id — Match / Miss / ThumbprintMismatch / NotQueried only. outcomeRegistryCNMismatch and the dead computeCNPinMatch + canonicalCNFromStrings helpers go away. sqlite still has the canonical_cn column (NOT NULL) so Add writes "" as a placeholder until Task 5 of the SAN refactor drops the column via goose migration 00002. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canonical_cn pre-compute column has had no readers since Task 4 of the SAN refactor; writes have gone to empty string in the interim to keep the NOT-NULL-no-default constraint valid. AgentFullName now lives in a urn:clawker:agent: URI SAN and trust derives from the kernel-attested peer IP via Docker labels, not from cached strings. Test seeds a DB at goose-version-1 with a populated canonical_cn row rather than running on a fresh DB. A fresh-DB test could not distinguish "00002 dropped the column" from "the column was never there"; seeding pre-002 state pins the actual migration effect and verifies row-level data preservation across DROP COLUMN. Also fix applySchema partial-apply observability: previously the applied-migration log lines were emitted only on the success branch, so a stuck DB hid the last successful version behind the wrapped error. Iterate results before returning the error so triage is possible from log surface alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweeps "canonical" terminology out of identity naming per the SAN-refactor initiative. Identity now consistently uses AgentFullName for the clawker.<project>.<agent> string; "canonical" only remains in generic English (canonical container ID, canonical aud claim) and in the intentionally-kept outcomeRegistryCNMismatch / UntrustedReasonCNMismatch enum values. Closes the Task 4 silent-failure-hunter MEDIUM by typing agent.Entry's AgentName and Project fields as auth.AgentName / auth.ProjectSlug so a future writer cannot bypass validation. sqlite scanEntry re-runs auth.NewAgentName / auth.NewProjectSlug on read; Snapshot logs and skips malformed rows with event=agentregistry_row_skipped plus a per-snapshot summary line carrying the count. Drops the server.go ListAgents re-sort — Snapshot's interface contract already guarantees (Project, AgentName) ordering. Renames: - auth.CanonicalAgentCN → auth.AgentFullName - auth.AgentCanonicalFromCert → auth.AgentFullNameFromCert - canonicalPrefix → agentFullNamePrefix Deferred (out of rename scope): - validateEntry panics on CP serve path (FIXME comment landed) - LookupByContainerID malformed-row DOS path against re-registration - auth.AgentName.Less / ProjectSlug.Less typed comparators - CLAUDE.md doc sweep (Task 8) - Test-file comment scrub (Task 7) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 7 of the SAN-refactor initiative. - Scrub "canonical" residue in identity-context test comments + identifiers after the Task 6 rename; generic-English "canonical" usages preserved. - Add TestCapturePeer_DistinctCNAndSAN — production has CN=clawker-clawkerd but PeerAgentFullName must come from the urn:clawker:agent: SAN. Existing TestCapturePeer_ValidChain collapses CN==SAN so it would pass a regression that read CN. - Add TestContainerName_HeadroomForMaxFields — composes max-length project + max-length agent + every production purpose suffix and asserts Docker's 128-byte resource-name budget holds. - Expose auth.MaxShortNameLen() so the headroom test consumes the source of truth instead of hardcoding 50; catches both shrink and bump regressions of shortNameMax. - Collapse signLeafCNSAN helper into signLeaf (now takes cn + sanFullName). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 8 of the SAN-refactor initiative. Sweeps the post-rename and post-Lookup-removal residue across user-facing package docs and the project-wide concept index so a future reader reaching for these files sees the post-Task-7 reality: - auth/CLAUDE.md: agent_cert.go row references AgentFullName / AgentFullNameFromCert; drops "pre-stored canonical_cn column" sentence (column dropped by migration 00002 in Task 5). - controlplane/agent/CLAUDE.md: Identity contract section documents the dropped-column reality plus the typed Entry fields (auth.ProjectSlug / auth.AgentName) and the on-read scanEntry re-validation + skipped-row events from Task 6. Register flow step 1 uses urn:clawker:agent:<full-name> wording. Imports section lists AgentFullName / AgentFullNameFromCert. - cmd/container/shared/CLAUDE.md: rewrites the agent-bootstrap CN framing — CN is the deterministic ContainerClawkerd literal; AgentFullName rides in the URI SAN. - controlplane/CLAUDE.md: startup step 8 rewritten around the three-stage IdentityInterceptor gate (CN pin → peer-IP→Docker →labels → cert SAN ↔ label compare) with the ContainerByPeerIP dep and WithResolvedContainer ctx attachment. Constructor signature documented. - .claude/docs/KEY-CONCEPTS.md: agent.Registry row rewritten around the current surface (Add / LookupByContainerID / Evict / Snapshot); agent.IdentityInterceptor row documents the universal three-stage gate; WithEntry/EntryFromContext row replaced with WithResolvedContainer/ResolvedContainerFromContext; CanonicalAgentCN row replaced with AgentFullName; MintAgentCert signature corrected to include containerID; shared.AgentBootstrap row drops the removed RegisterAgentInRegistry function reference. Memory updated: Task 8 → complete. Two new items added to Deferred list from the final silent-failure-hunter sweep (Snapshot query- failure swallow; migration 00002 Down-path empty-CN backfill). Verification: make test GOFLAGS="-trimpath -buildvcs=false" → 4643 tests pass / 8 legitimate skips. go build clean. grep canonical on internal/auth, internal/controlplane, internal/cmd/container, .claude/docs returns only generic-English usages (pattern descriptions, ColorScheme). Initiative complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses every item in the invalid-length-pr-review-tasks tracker: - AgentFullNameFromCert now returns (string, error) with ErrAgentSANMissing / ErrAgentSANMalformed sentinels so the IdentityInterceptor can emit distinct structured-log events for missing vs malformed agent URI SANs (wire envelope stays uniform PermissionDenied). - docker.VolumePurposes + typed const suffixes so the headroom test in names_test.go auto-covers any new purpose; inline call sites migrated to the typed names. - New subscribeSessionCancel listens for container die/stop/kill/ oom/destroy docker events (source of truth) and calls Dialer.CancelDial, tearing down the doomed CP→clawkerd Session synchronously instead of waiting for the next reconnect to classify outcomeContainerGone. Per-container cancel handles replace the dialing dedup set. - validateEntry returns error instead of panicking; both Add impls propagate; tests assert error returns rather than panics. Keeps the CP serve path clear of panic() per root CLAUDE.md security contract. - ErrMalformedEntry sentinel wrapped by scanEntry; Register evicts malformed rows and re-writes using middleware-resolved identity rather than refusing the handshake; covered by two new tests. - ProjectSlug.Less / AgentName.Less on the typed identity values; sortEntries helper shared by both registry impls. - Registry.Snapshot() now returns ([]Entry, error); reapOrphans aborts on snapshot failure (no longer treats empty as authoritative); ListAgents surfaces as codes.Internal. - Migration 00002 is one-way; the Down half (which pretended to preserve trust state across rollback) is removed. make test: 4663 pass, 8 skipped.
Addresses critical + important findings from the post-punch-list review of PR #286. CP serve path hardened, identity gate constructor follows the (*T, error) pattern, trust-anchor resolver fails closed on ambiguous peer IPs. - C1: IdentityInterceptor signature changes from panic-on-nil to (unary, stream, error). cmd/clawker-cp/main.go logs event=agent_identity_unavailable on error and degrades the AgentService listener while CP/firewall/admin stay up. Panic at this seam (post-eBPF-load) would have stranded pinned eBPF programs with no supervisor — exactly the silent-failure shape root CLAUDE.md forbids. - C2: internal/consts/consts.go — three doc-comment fixes to drop stale "canonical CN" / "pre-computed registry column" references that contradict the post-migration-00002 model. - C3: TestSQLiteRegistry_Snapshot_ReturnsErrOnDBFailure and TestAdminServer_ListAgents_SnapshotError_ReturnsCodesInternal — pin the eviction-cascade-prevention surface: Snapshot must surface a non-nil error, ListAgents must map it to codes.Internal, never silently return an empty result. - I1: Registry.Add godoc — panics→error wording matching the punch-list refactor. - I2: AgentFullNameFromCert, ContainerIDFromCert, sanTailFromCert — three duplicate godoc blocks deleted (godoc/IDE were picking up stale (string, bool) descriptions on the SAN parsers). - I3: dialer.classifyRegistry — ErrMalformedEntry now classifies as outcomeRegistryMiss so the Register handshake drives the evict+rewrite path. Treating it as NotQueried would have left malformed rows stranded forever (every reconnect publishes AgentUntrusted, no recovery trigger). - I4+I5: peer_lookup_moby.LookupByIP — exhaustively walks every purpose=agent candidate (no first-match-wins) and returns the new ErrAmbiguousPeerIP sentinel when two containers advertise the same clawker-net IP. Fail-closed for the transient-stale-endpoint race window. Daemon-error wrap now only escalates when NO candidate inspected cleanly — preserves the peer_lookup_no_match audit signal operators rely on. IdentityInterceptor Stage 2b gains a dedicated case → event=agent_identity_ambiguous_peer_ip. Tests: 116 packages green (in-container, e2e+whail skipped per project convention). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md drift: drop stale "Session teardown on evict" deferred-limitation
(shipped via Dialer.CancelDial), document 4th Start step + PeerLookup dep,
add ErrAmbiguousPeerIP to sentinel list, note goose migrations, add
ErrAgentSAN* + MaxShortNameLen() to auth surface.
Comments: drop removed peer-CN match references from dialer doc + agents
field; rewrite register_handler "cert+CN pin" to SAN/labels middleware
gate; add ErrAmbiguousPeerIP to peer_lookup interface; move sanState
block out of BuildContainerSAN's doc.
Error handling: rename main.go err → identityErr to scope IdentityInterceptor
degrade; omit `rows` field on countRows-fail path (Str/Int under one key
breaks OpenSearch ingest); migrate ContainerIDFromCert to (string, error)
tri-state with ErrContainerSAN{Missing,Malformed} sentinels mirroring
AgentFullNameFromCert — distinct structured-log events for operator
triage, uniform PermissionDenied on the wire.
Tests: add ErrAmbiguousPeerIP row to interceptor sentinel table,
TestMintAgentCert_MaxLengthRoundTrip (pins x509 64-byte CN limit against
MaxShortNameLen() inputs), TestContainerIDFromCert_TriState. Delete
TestNamePrefix (tautology) and TestGenerateRandomName format/uniqueness
loop (subsumed by cartesian-walk TestGenerateRandomName_AlwaysValidAsAgentName).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add internal/cmdutil/slugify.go: centralized input normalization (lowercase, whitespace collapse, control-strip, leading/trailing dash trim). New slugify_test.go covers the FSM branches. - Reduce auth.NewAgentName / NewProjectSlug to compile-time type wrappers; remove shortNameRE, MaxShortNameLen, and the docker 128-byte cap in ValidateResourceName. Downstream consumers (Docker create, x509 URI SAN, IdentityInterceptor) enforce their own constraints at op time. - Drop internal/cmd/project/init/init.go's validateProjectName helper and the parallel charset/length test tables; coverage now lives centrally in slugify_test.go. - Wire clawker.yaml::project.name override through project.Manager and document the resolution hierarchy in docs/configuration.mdx + the gen-docs template. - Update CLAUDE.md files (auth, cmd/project) and load-bearing godocs (agent_cert.go) to stop referring to deleted APIs and to describe the new typed-wrapper-only contract honestly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves 14 Copilot review threads: - identity_interceptor: resolve closure now returns the original ctx on all five PermissionDenied paths instead of nil, hardening against future logging/metrics that might nil-deref on the CP serve path (root CLAUDE.md treats CP panic as a security incident). - agent/CLAUDE.md: step #3 rewritten to describe Session cancel as a docker-event subscriber (die/stop/kill/oom/destroy), independent of the evict subscriber. Both fire off the same docker bus; neither drives the other. - agent_cert: sanTailFromCert comment updated for tri-state ContainerIDFromCert (matches AgentFullNameFromCert). - registry_test: removed stale duplicate "Add panics" doc paragraph. - docker/names_test: deleted TestGenerateRandomName_AlwaysValidAsAgentName (and the now-unused fmt + auth imports). NewAgentName only rejects empty per the package's intentional design. - project/manager + cmd/project/register + gen-docs tmpl + regen mdx: rewrote stale `clawker.yaml::project.name` references to `clawker.yaml::name` (root is Project; name is top-level). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled changes that reframe the 2-segment naming case as a first-class concept rather than a degraded/error form: 1. Terminology: "unscoped" / "empty project" → "global-scope agent" across docs, comments, log messages, and Godoc. The empty ProjectSlug is a legitimate signal (no project namespace), not an absence — every caller now reads as such. 2. Rename ErrInvalidAgentLabels → ErrInvalidAgentLabel (singular), clarifying that a missing dev.clawker.project label is NOT an error — it's the global-scope signal. Only the dev.clawker.agent label being missing/malformed trips the sentinel. Updates the identity interceptor's structured-log event name and all test references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two stale doc claims in registry.go: - Entry.AgentName/Project: previous comment said auth.NewAgentName rejects dots/"clawker." prefix/charset; today it only rejects empty. Restate the typed-field rationale as compile-time discipline and point at the real enforcement layers (Docker create, x509 SAN encoding, IdentityInterceptor) plus the upstream cmdutil.ProjectSlugify normalization. - Add/validateEntry: previous comment said Register maps the error to codes.InvalidArgument; the handler maps to codes.Internal. Every field validateEntry checks is server-derived, so a failure is a CP wiring bug, not bad client input — Internal is correct. User-typed identity strings are validated upstream at the wire boundary and map to InvalidArgument there. Doc-only; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d098996 to
7980098
Compare
schmitthub
added a commit
that referenced
this pull request
May 19, 2026
resolveHostID now parses CLAWKER_HOST_UID/GID via ParseUint(_,10,32) so over-uint32 inputs become a structured "malformed" fallback instead of silently wrapping when userStage casts to uint32 (closes CodeQL #285, #286). HostUID()/HostGID() and HostIDResolution.Value return uint32, so the PipeStage assignment is a total identity at the call site. Diagnostic event renamed host_uid_unavailable -> host_id_unavailable; the env field on the record (CLAWKER_HOST_UID or CLAWKER_HOST_GID) already disambiguates UID vs GID for the operator. Also collapses the duplicated logHostIdentity doc paragraph and qualifies the bare HostUID() reference in workspace/setup.go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
schmitthub
added a commit
that referenced
this pull request
May 19, 2026
resolveHostID now parses CLAWKER_HOST_UID/GID via ParseUint(_,10,32) so over-uint32 inputs become a structured "malformed" fallback instead of silently wrapping when userStage casts to uint32 (closes CodeQL #285, #286). HostUID()/HostGID() and HostIDResolution.Value return uint32, so the PipeStage assignment is a total identity at the call site. Diagnostic event renamed host_uid_unavailable -> host_id_unavailable; the env field on the record (CLAWKER_HOST_UID or CLAWKER_HOST_GID) already disambiguates UID vs GID for the operator. Also collapses the duplicated logHostIdentity doc paragraph and qualifies the bare HostUID() reference in workspace/setup.go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
schmitthub
added a commit
that referenced
this pull request
May 19, 2026
…ic (#291) * fix(consts): host-derive ContainerUID, add HostUID for CP PR #269's ~/.claude/projects bind mount silently failed to persist auto-memory + session jsonls when the host UID was not 1001 — the agent image's claude user was hardcoded at UID 1001 (Dockerfile.tmpl useradd) so bind-mount writes from inside the container hit EACCES on virtually every Linux host (typical UID is 1000). macOS unaffected (virtiofs translates). Two scoped accessors in internal/consts/, distinguished by where they read: - consts.ContainerUID / ContainerGID: refactored from const 1001 to vars initialized from os.Getuid() / os.Getgid(). Rejects 0 (sudo invocation) and -1 (Windows); falls back to fallbackContainerUID (1001). CLI-side. All existing callers (bundler tar, containerfs tar, docker volume copy, deprecated cfg delegates) transparently get the host UID with zero call-site changes. - consts.HostUID / HostGID: new env-fed vars in internal/consts/controlplane.go. Read CLAWKER_HOST_UID / CLAWKER_HOST_GID via resolveHostUID with the same > 0 guard; fallback pinned to fallbackContainerUID (NOT ContainerUID, which inside the CP container resolves to 0 — CP runs as root for BPF caps, so a missing env would have silently dropped userStage to root). Invalid env emits event=host_uid_invalid on stderr so operators get a signal in docker logs <cp>. The CLI populates the env vars on the CP container at boot via BuildCPContainerConfig (cpboot/cp_container.go), using consts.ContainerUID / ContainerGID (host invoker's UID in CLI process). The single CP-side caller — userStage in internal/controlplane/agent/init.go — switches to HostUID / HostGID so post-init shell stages drop to the same UID baked into the agent image. The workspace UID-mismatch warning at setup.go is deleted (now dead by construction). Image content hash is host-UID-specific via the existing Dockerfile rendering, so multi-user hosts naturally get separate cached images — documented in internal/bundler/CLAUDE.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(consts): UID accessors, structured CP host-id diagnostic, dead Warnings field Follow-up to the host-derive ContainerUID commit on this branch. Addresses PR-review findings without changing behavior: - consts.{Container,Host}{UID,GID} converted from exported mutable vars to func() int accessors backed by unexported package-private state. Removes the stomp-able exported-mutable-var pattern; all callers updated. - resolveHostID is now side-effect-free, returning (int, HostIDResolution). cmd/clawker-cp/main.go adds logHostIdentity which emits event=host_uid_unavailable via the project zerolog surface (rotating CP logfile) when EnvHostUID/GID came through unset or invalid. The earlier package-init fmt.Fprintf to stderr is removed. - internal/workspace setup.go: SetupMountsResult.Warnings field + the dead mountWarnings slice are removed; iterating caller and warnings-empty asserts go with them. - TestConstantAccessors no longer hardcodes 1001; UID/GID asserts dropped as circular. - TestCPContainer_HostUIDGIDEnv_Emitted anchored on os.Getuid() with the same fallback rule. - TestLogHostIdentity parses zerolog JSON and asserts level/event/ env/reason/fallback structurally, not substring grep. - test/e2e/bind_mount_uid_test.go: new Linux-only e2e exercising the full chain end-to-end (skips on darwin — virtiofs would false-pass). - Misleading init-ordering comment on HostUID deleted. - Comments trimmed across consts, controlplane, cp_container, agent/init — drop WHAT, keep WHY. - Postmortem memory: file:line refs scrubbed; follow-ups 2 + 3 marked DONE. - CLAUDE.md docs updated to the func-call form. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(consts): uid_t-typed host id accessors, neutral diagnostic event resolveHostID now parses CLAWKER_HOST_UID/GID via ParseUint(_,10,32) so over-uint32 inputs become a structured "malformed" fallback instead of silently wrapping when userStage casts to uint32 (closes CodeQL #285, #286). HostUID()/HostGID() and HostIDResolution.Value return uint32, so the PipeStage assignment is a total identity at the call site. Diagnostic event renamed host_uid_unavailable -> host_id_unavailable; the env field on the record (CLAWKER_HOST_UID or CLAWKER_HOST_GID) already disambiguates UID vs GID for the operator. Also collapses the duplicated logHostIdentity doc paragraph and qualifies the bare HostUID() reference in workspace/setup.go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(consts,bundler): Linux-only host UID/GID, idempotent groupadd Host-UID/GID derivation is only meaningful on Linux hosts, where the container's numeric IDs land directly on bind-mount files. Docker Desktop on macOS (virtiofs / gRPC FUSE) masks UID/GID at the share boundary — any container UID is presented on the host as the host user, so baking the host UID into the image gains nothing and forces groupadd to claim a GID that often collides with a base-image group (macOS GID 20 = staff vs Debian dialout). resolveProcessID now returns fallbackContainerUID/GID on non-Linux hosts, restoring the pre-fix 1001 path on macOS where virtiofs already handles ownership translation. The Dockerfile useradd block is also made idempotent so that an edge-case Linux host whose UID/GID happens to overlap a base-image group (low GIDs assigned to system groups) still builds: groupadd at the host GID falls back to an auto-assigned GID for ${USERNAME}, preserving UPG. The bind-mount writability contract holds on UID match alone — GID match is best-effort. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(serena): postmortem reflects HEAD, soft-closed pending alpha Append Phase 9 (uid_t-typed accessors + neutral host_id_unavailable event, ce98cbe), Phase 10 (Linux-only resolveProcessID + idempotent groupadd, 1cc857c), and Phase 11 (soft-close hand-off). Top-of-file status banner marks the work complete pending alpha-release verification on a Linux host — the Linux-only TestBindMountUID_E2E and the Phase 7 host-side manual sequence are the only gates remaining before hard-close. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test,docs): mirror Linux-only host-ID fallback in test + CLAUDE.md TestCPContainer_HostUIDGIDEnv_Emitted computed wantUID/wantGID directly from os.Getuid()/os.Getgid() with only a `<= 0` fallback, diverging from consts.resolveProcessID's `runtime.GOOS != "linux" => 1001` short-circuit added in 1cc857c. On macOS where os.Getuid() is typically 501, the test would assert CLAWKER_HOST_UID=501 while the production resolver emits 1001 — hard fail on every macOS run. Mirror the production fallback exactly: branch on runtime.GOOS, then apply the `> 0` Linux refinement. internal/config/CLAUDE.md and internal/bundler/CLAUDE.md both described ContainerUID/GID as Linux-style host-derived without qualifying the Linux-only gate. Updated wording to call out the non-Linux unconditional-fallback path with the virtiofs / groupadd collision rationale so a future reader doesn't try to "fix" the gate as if it were a bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(plugin): drop UID-mismatch known-issue, add monitoring ref, bump 0.10.0 clawker-support - known-issues.md: removed the "~/.claude/projects bind mount Linux UID mismatch" entry; fix landed (host-derived ContainerUID on Linux via 1cc857c). Workaround block obsolete. - monitoring.md (new): OTel + OpenSearch + Prometheus stack overview, Clawker analytics workspace navigation, telemetry env var contract (build-time bake vs runtime enable), troubleshooting for empty indices / failed bootstrap / Prometheus targets / cross-container diagnostics. Routed from SKILL.md step 12 and troubleshooting.md. - SKILL.md / troubleshooting.md: route monitoring questions to the new reference; old "punt to docs.clawker.dev" step kept for other features (worktrees, etc.). - plugin.json: 0.9.1 -> 0.10.0 (minor — new reference file, user-visible structural change, plus the cd84d2e / 1cc857c Dockerfile.tmpl idempotent groupadd block from the UID branch). docs.clawker.dev (driveby from fix/uid PR) - container-internals.mdx: drop literal "UID 1001" from the claude user description; the Privilege Model table now says "claude (host-derived on Linux, 1001 elsewhere)". The literal was wrong on every non-1001 Linux host post-fix. - threat-model.mdx, security.mdx: same hedge — describe the user as unprivileged + UID baked at build time with Linux-host-derived semantics, instead of asserting a stale literal value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(consts): drop os.Unsetenv from TestResolveHostID resolveHostID reads env via os.Getenv, which returns "" for both unset and set-to-"". The "unset" and "empty" cases exercised the same code path; "unset" used os.Unsetenv which doesn't restore a prior value at test end — a theoretical cross-test leak if the probe env were set in the parent shell. Drop the "unset" case + envSet branching; all cases now drive the env through t.Setenv, which restores the prior value via t.Cleanup automatically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the underlying regression that surfaced as agent-name length overflow ("invalid length") on long randomly-generated agent names, and the cascade of follow-ups that the PR review identified.
The root cause: the per-agent
AgentFullName(clawker.<project>.<agent>) used to live in the x509Subject.CommonName, which is hard-capped at 64 bytes. A longdocker.GenerateRandomNameoutput ("adventurous-rosalind-franklin-…") pushed past that cap andMintAgentCertfailed at signing time. MovingAgentFullNameto aurn:clawker:agent:URI SAN removes the cap, but it also changes the trust model — the cert's claim is no longer load-bearing identity, so the CP-side gate has to ground identity in Docker labels (peer-IP → container → labels) and verify the cert SAN against that.Done across this branch:
MintAgentCertembedsurn:clawker:agent:<full-name>(the per-agent identity) andurn:clawker:container:<id>(channel binding). CN is now the deterministicconsts.ContainerClawkerdliteral. Typedauth.ProjectSlug/auth.AgentNameenforce the form/charset/length contract at every wire boundary.IdentityInterceptorbecomes the universal trust gate. Runs on every AgentService RPC (Register included). Three stages: (1) CN pin toconsts.ContainerClawkerd, (2) peer-IP → Docker →dev.clawker.{project,agent}labels viaContainerByPeerIP, (3) constant-time compare cert SAN AgentFullName against the label-derived AgentFullName. Resolved container attached to ctx viaWithResolvedContainerfor handlers.urn:clawker:container:SAN and the RPC body against the daemon-attested truth; captures the cert thumbprint at the gate that persists it; writes the row using label-derived (authoritative) values.Registry.Lookupis gone; the dialer classifies by thumbprint vs row (outcomeRegistryMatch/outcomeRegistryMiss/outcomeRegistryThumbprintMismatch) and letsIdentityInterceptorcatch SAN-label drift on the agent's next inbound RPC. Connection stays open in all cases — asymmetric trust contract (CP must always be reachable for containment).00001is the current schema baseline;00002drops the now-unusedcanonical_cnpre-compute column. Both one-way — registry state regenerates from Docker ground truth on next CP boot.AgentFullNameFromCert→(string, error)withErrAgentSANMissing/ErrAgentSANMalformedsentinels; interceptor emits distinctevent=per shape (wire envelope unchanged).docker.VolumePurposes+ typed const suffixes — headroom test auto-covers any new purpose againstauth.MaxShortNameLen().subscribeSessionCancellistens to container die/stop/kill/oom/destroy and callsDialer.CancelDial, tearing down the doomed CP→clawkerd Session synchronously. Independent of registry evict; docker is the source of truth.validateEntryreturns error (no longerpanic()); bothRegistry.Addimpls propagate. Keeps the CP serve path clear of panics per root CLAUDE.md.ErrMalformedEntrysentinel wrapped inscanEntry; Register evicts the malformed row and re-writes using middleware-resolved identity rather than refusing the handshake.ProjectSlug.Less/AgentName.Lesson the typed identity values; sharedsortEntrieshelper consolidates the duplicated Snapshot sort.Registry.Snapshot()→([]Entry, error)so a sqlite query failure can't be silently treated as "no agents" (would evict every row on the next reap pass).ListAgentssurfacescodes.Internal.Test plan
make test— 4663 pass, 8 skipped (the usual integration / OS-specific skips).docker.VolumePurposes.clawker runwith a randomly-generated long agent name now succeeds (the original regression). Verify viaclawker controlplane agentsthat the row lands with the typed identity.docker stopan agent, confirm CP→clawkerd Session tears down synchronously rather than on next reconnect attempt.🤖 Generated with Claude Code