Skip to content

feat: c1z sanitizer v0.1 — library + CLI#875

Open
btipling wants to merge 2 commits into
mainfrom
bt/c1zsanitize-v0.1
Open

feat: c1z sanitizer v0.1 — library + CLI#875
btipling wants to merge 2 commits into
mainfrom
bt/c1zsanitize-v0.1

Conversation

@btipling
Copy link
Copy Markdown
Contributor

Summary

  • New pkg/c1zsanitize package + cmd/baton-c1z-sanitize CLI that transform a .c1z into an identity-stripped copy via connectorstore.Reader/Writer. Per-c1z HMAC-SHA256 secret drives every transform.
  • Cross-references stay coherent because the transform is deterministic within a c1z — every place an id appears (Grant.principal, Grant.entitlement, Grant.sources map keys, Entitlement.resource, Resource.parent_resource_id) maps through the same sanitize_id.
  • Annotation dispatch is a whitelist of Any type URLs. v0.1 handles UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, ScopeBindingTrait. Unknown annotations are dropped by default with a log line; -allow-unknown-annotations flips to pass-through.
  • Timestamps use anchor-and-shift (newest source timestamp → anchor, single Δ applied uniformly) so relative deltas survive. Asset payloads get a content-type-matched placeholder while the ref chain is preserved.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports c1.storage.v3; it works entirely through connectorstore.Reader/Writer and the connector-v2 wire types as the investigation prescribed.

Output format choice

v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867#872) is all still open on main, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items

  • Resource type IDs preserved. §2.3.1 of the investigation flagged that some connectors might construct resource_type_id with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
  • Asset placeholder content. Image content types get a 1x1 PNG; everything else gets a single zero byte. PutAsset silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
  • Sync ID minting. The destination Writer's StartNewSync mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory srcSyncID → dstSyncID map maintained for the call. The v2 connectorstore.Writer interface doesn't expose SetSyncID, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
  • All sync runs copied. No -max-sync-runs flag yet (investigation §7 question 6). Add when needed.
  • Reversibility. Operator-supplied secret via -secret-file, or the CLI generates one and writes it next to -out with mode 0600. Archive or shred — the sanitizer doesn't choose.
  • Role scope conditions. §3.7's expression strings are HMACed wholesale. Lossy by design (see §7 question 3). Connector-side structured alternatives are the v0.2 path.

Out of scope for v0.1 (per §6.3)

  • S3 fetch / upload, Union Station / Temporal integration, content-store registration, opt-in tracking, encryption at rest, reversibility tooling, selective re-sanitization.

Test plan

  • go vet ./..., gofmt -l, golangci-lint run clean on new code
  • go test ./pkg/c1zsanitize/ passes — all unit + invariant tests green
  • go test ./... passes — no existing tests broken
  • Hand-crafted fixture exercises every annotation handler in v0.1's whitelist (UserTrait/GroupTrait/AppTrait/RoleTrait/SecretTrait/LicenseProfileTrait/ScopeBindingTrait)
  • Cross-reference integrity invariant: for every source id that appears N times, sanitize_id(id) appears exactly N times in dst
  • Graph integrity invariants: Grant.principal/Grant.entitlement/Entitlement.resource/Resource.parent_resource_id all resolve in dst
  • Unknown annotation type dropped by default with DropUnknownAnnotations=true
  • CLI end-to-end smoke test: build a small fixture, run baton-c1z-sanitize -in src -out dst, assert exit 0 and dst exists

Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that
copies a .c1z file through connectorstore.Reader/Writer, transforming
identifiers, names, free text, emails, and timestamps under a per-c1z
HMAC-SHA256 secret while preserving graph topology, cardinalities, and
annotation structure.

Cross-references stay coherent because every transform is deterministic
within a c1z: the same input id always maps to the same sanitized id, so
GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord
.resource, and ResourceRecord.parent_resource_id resolve in the output.
Different per-c1z secrets keep distinct c1zs uncorrelatable.

Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships
handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait,
LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types
are dropped by default with a log line naming the type URL; an
operator flag flips behavior to pass-through.

Timestamps use anchor-and-shift: a single delta is computed from the
newest source sync timestamp and applied uniformly so relative deltas
survive. AssetRecord.data is replaced with a content-type-matched
placeholder while the asset ref chain is preserved.
Secret: secret,
TimestampAnchor: anchor,
DropUnknownAnnotations: !*allowUnknown,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.

Comment thread pkg/c1zsanitize/assets.go
Comment on lines +99 to +108
if err != nil {
// Asset referenced from an annotation but missing from
// the asset table. Skip — we don't fabricate placeholder
// rows because the cross-reference invariant treats it
// as a known dangling pointer in the source.
s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err))
continue
}
if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) {
return fmt.Errorf("drain source asset %s: %w", srcID, err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 25, 2026

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since 4c26a7f
View review run

Review Summary

The new commits address both findings from the previous review. The defer dst.Close(ctx) correctness issue in cmd/baton-c1z-sanitize/main.go is fixed with a dstClosed bool guard pattern that explicitly checks the close error on the success path while keeping the defer as a safety net. The asset reader leak suggestion in pkg/c1zsanitize/assets.go is addressed with a new drainAndClose helper that checks for io.Closer and defers closing. Additionally, an HMAC reuse optimization was added to the sanitizer.id() hot path to avoid re-creating the SHA-256 key schedule per call, and a dead loop in the test was cleaned up. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see review comments.

The CLI deferred dst.Close and threw the error away. For a write-mode
C1File the Close call is where sqlite-zstd finalizes and compresses
the output; losing that error meant a corrupt sanitized .c1z could
be reported as success. Explicit Close on the success path with a
guarded defer for the early-return paths.

The sanitizer's hot-loop id() built a fresh hmac.Hash on every call,
redoing the SHA-256 key schedule each time. Stash one hmac.Hash on
the sanitizer and Reset() between calls — single-threaded run, no
locking needed. SanitizeID stays as the allocation-y reference for
external callers (sanitizeEmail, tests).

copyAssets drained the GetAsset reader with io.Copy but never asked
whether the underlying type was an io.Closer; at least one impl is
*os.File-backed and was leaking a fd per asset. drainAndClose does
the type assertion and the drain in one spot.

Drop a dead range loop in the xref-integrity test.
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found.

@btipling btipling marked this pull request as ready for review May 25, 2026 09:05
@btipling btipling requested a review from a team May 25, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant