feat: c1z sanitizer v0.1 — library + CLI #875
Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that copies a .c1z file through connectorstore.Reader/Writer, transforming identifiers, names, free text, emails, and timestamps under a per-c1z HMAC-SHA256 secret while preserving graph topology, cardinalities, and annotation structure.

Cross-references stay coherent because every transform is deterministic within a c1z: the same input id always maps to the same sanitized id, so GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord.resource, and ResourceRecord.parent_resource_id resolve in the output. Different per-c1z secrets keep distinct c1zs uncorrelatable.

Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types are dropped by default with a log line naming the type URL; an operator flag flips behavior to pass-through.

Timestamps use anchor-and-shift: a single delta is computed from the newest source sync timestamp and applied uniformly so relative deltas survive. AssetRecord.data is replaced with a content-type-matched placeholder while the asset ref chain is preserved.
| Secret: secret, | ||
| TimestampAnchor: anchor, | ||
| DropUnknownAnnotations: !*allowUnknown, | ||
| } |
🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.
| if err != nil { | ||
| // Asset referenced from an annotation but missing from | ||
| // the asset table. Skip — we don't fabricate placeholder | ||
| // rows because the cross-reference invariant treats it | ||
| // as a known dangling pointer in the source. | ||
| s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err)) | ||
| continue | ||
| } | ||
| if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) { | ||
| return fmt.Errorf("drain source asset %s: %w", srcID, err) |
🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.
General PR Review: feat: c1z sanitizer v0.1 — library + CLI
Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0

Review Summary
The new commits address both findings from the previous review.

Security Issues
None found.

Correctness Issues
None found.

Suggestions
None.
The CLI deferred dst.Close and threw the error away. For a write-mode C1File the Close call is where sqlite-zstd finalizes and compresses the output; losing that error meant a corrupt sanitized .c1z could be reported as success. Explicit Close on the success path with a guarded defer for the early-return paths.

The sanitizer's hot-loop id() built a fresh hmac.Hash on every call, redoing the SHA-256 key schedule each time. Stash one hmac.Hash on the sanitizer and Reset() between calls; the run is single-threaded, so no locking is needed. SanitizeID stays as the allocation-y reference for external callers (sanitizeEmail, tests).

copyAssets drained the GetAsset reader with io.Copy but never asked whether the underlying type was an io.Closer; at least one impl is *os.File-backed and was leaking a fd per asset. drainAndClose does the type assertion and the drain in one spot.

Drop a dead range loop in the xref-integrity test.
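The HMAC reuse described above might look roughly like this. `idMapper`, `newIDMapper`, and `freshID` are hypothetical names, and hex-encoding the full digest is an assumption; the commit does not say how the sanitizer formats its output ids.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"hash"
)

// idMapper holds one keyed HMAC and reuses it across calls.
// Not safe for concurrent use; the sketch assumes a single-threaded run.
type idMapper struct {
	mac hash.Hash
}

func newIDMapper(secret []byte) *idMapper {
	return &idMapper{mac: hmac.New(sha256.New, secret)}
}

// id resets the HMAC to its keyed initial state and hashes src.
func (m *idMapper) id(src string) string {
	m.mac.Reset()
	m.mac.Write([]byte(src)) // hash.Hash.Write never returns an error
	return hex.EncodeToString(m.mac.Sum(nil))
}

// freshID is the allocation-heavy reference: a new HMAC per call.
func freshID(secret []byte, src string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(src))
	return hex.EncodeToString(mac.Sum(nil))
}
```

Both paths must produce identical output for the same secret and input; otherwise callers that use the reference function would disagree with the hot loop and break cross-references.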
Summary
- `pkg/c1zsanitize` package + `cmd/baton-c1z-sanitize` CLI that transform a `.c1z` into an identity-stripped copy via `connectorstore.Reader/Writer`. Per-c1z HMAC-SHA256 secret drives every transform.
- `Grant.principal`, `Grant.entitlement`, `Grant.sources` map keys, `Entitlement.resource`, `Resource.parent_resource_id`) maps through the same `sanitize_id`.
- `UserTrait`, `GroupTrait`, `AppTrait`, `RoleTrait`, `SecretTrait`, `LicenseProfileTrait`, `ScopeBindingTrait`. Unknown annotations are dropped by default with a log line; `-allow-unknown-annotations` flips to pass-through.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports `c1.storage.v3`; it works entirely through `connectorstore.Reader/Writer` and the connector-v2 wire types, as the investigation prescribed.

Output format choice
v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867–#872) is all still open on `main`, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items
- `resource_type_id` with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
- `PutAsset` silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
- `StartNewSync` mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory `srcSyncID → dstSyncID` map maintained for the call. The v2 `connectorstore.Writer` interface doesn't expose `SetSyncID`, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
- `-max-sync-runs` flag yet (investigation §7 question 6). Add when needed.
- `-secret-file`, or the CLI generates one and writes it next to `-out` with mode 0600. Archive or shred; the sanitizer doesn't choose.

Out of scope for v0.1 (per §6.3)
Test plan
- `go vet ./...`, `gofmt -l`, `golangci-lint run` clean on new code
- `go test ./pkg/c1zsanitize/` passes: all unit + invariant tests green
- `go test ./...` passes: no existing tests broken
- `sanitize_id(id)` appears exactly N times in dst
- `Grant.principal` / `Grant.entitlement` / `Entitlement.resource` / `Resource.parent_resource_id` all resolve in dst
- `DropUnknownAnnotations=true`
- `baton-c1z-sanitize -in src -out dst`, assert exit 0 and dst exists