feat: c1z sanitizer v0.1 — library + CLI by btipling · Pull Request #875 · ConductorOne/baton-sdk

btipling · 2026-05-25T05:11:16Z

Summary

New pkg/c1zsanitize package + cmd/baton-c1z-sanitize CLI that transform a .c1z into an identity-stripped copy via connectorstore.Reader/Writer. Per-c1z HMAC-SHA256 secret drives every transform.
Cross-references stay coherent because the transform is deterministic within a c1z — every place an id appears (Grant.principal, Grant.entitlement, Grant.sources map keys, Entitlement.resource, Resource.parent_resource_id) maps through the same sanitize_id.
Annotation dispatch is a whitelist of Any type URLs. v0.1 handles UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, ScopeBindingTrait. Unknown annotations are dropped by default with a log line; -allow-unknown-annotations flips to pass-through.
Timestamps use anchor-and-shift (newest source timestamp → anchor, single Δ applied uniformly) so relative deltas survive. Asset payloads get a content-type-matched placeholder while the ref chain is preserved.

Implementation follows the design in §6.2 of the investigation document. The sanitizer code never imports c1.storage.v3; it works entirely through connectorstore.Reader/Writer and the connector-v2 wire types as the investigation prescribed.

Output format choice

v2 (sqlite-zstd). The investigation's §7 question 5 punted on v2 vs v3 with the proviso "v0.1 should write v3 by default if PRs #870/#871/#872 have landed; otherwise v2." At the time of this PR, the storage-engine-v4 stack (#867–#872) is all still open on main, so v0.1 writes v2 and v0.2 swaps to v3 once the writer adapter ships.

Open questions / choices for ambiguous items

Resource type IDs preserved. §2.3.1 of the investigation flagged that some connectors might construct resource_type_id with tenant data. v0.1 preserves them; the §7 question 1 audit hasn't run yet.
Asset placeholder content. Image content types get a 1x1 PNG; everything else gets a single zero byte. PutAsset silently drops empty data, so the single byte is the minimum that keeps the cross-reference alive. Document as known-lossy.
Sync ID minting. The destination Writer's StartNewSync mints a fresh KSUID rather than accepting a deterministic transform of the source sync id. Parent linkage is preserved via an in-memory srcSyncID → dstSyncID map maintained for the call. The v2 connectorstore.Writer interface doesn't expose SetSyncID, so the deterministic-KSUID approach from the investigation §6.4 is deferred.
All sync runs copied. No -max-sync-runs flag yet (investigation §7 question 6). Add when needed.
Reversibility. Operator-supplied secret via -secret-file, or the CLI generates one and writes it next to -out with mode 0600. Archive or shred — the sanitizer doesn't choose.
Role scope conditions. §3.7's expression strings are HMACed wholesale. Lossy by design (see §7 question 3). Connector-side structured alternatives are the v0.2 path.

Out of scope for v0.1 (per §6.3)

S3 fetch / upload, Union Station / Temporal integration, content-store registration, opt-in tracking, encryption at rest, reversibility tooling, selective re-sanitization.

Test plan

go vet ./..., gofmt -l, golangci-lint run clean on new code
go test ./pkg/c1zsanitize/ passes — all unit + invariant tests green
go test ./... passes — no existing tests broken
Hand-crafted fixture exercises every annotation handler in v0.1's whitelist (UserTrait/GroupTrait/AppTrait/RoleTrait/SecretTrait/LicenseProfileTrait/ScopeBindingTrait)
Cross-reference integrity invariant: for every source id that appears N times, sanitize_id(id) appears exactly N times in dst
Graph integrity invariants: Grant.principal/Grant.entitlement/Entitlement.resource/Resource.parent_resource_id all resolve in dst
Unknown annotation type dropped by default with DropUnknownAnnotations=true
CLI end-to-end smoke test: build a small fixture, run baton-c1z-sanitize -in src -out dst, assert exit 0 and dst exists

Adds a new pkg/c1zsanitize package and cmd/baton-c1z-sanitize CLI that copies a .c1z file through connectorstore.Reader/Writer, transforming identifiers, names, free text, emails, and timestamps under a per-c1z HMAC-SHA256 secret while preserving graph topology, cardinalities, and annotation structure. Cross-references stay coherent because every transform is deterministic within a c1z: the same input id always maps to the same sanitized id, so GrantRecord.principal / .entitlement / .sources keys, EntitlementRecord .resource, and ResourceRecord.parent_resource_id resolve in the output. Different per-c1z secrets keep distinct c1zs uncorrelatable. Annotation dispatch is a whitelist keyed by Any type URL. v0.1 ships handlers for UserTrait, GroupTrait, AppTrait, RoleTrait, SecretTrait, LicenseProfileTrait, and ScopeBindingTrait. Unknown annotation types are dropped by default with a log line naming the type URL; an operator flag flips behavior to pass-through. Timestamps use anchor-and-shift: a single delta is computed from the newest source sync timestamp and applied uniformly so relative deltas survive. AssetRecord.data is replaced with a content-type-matched placeholder while the asset ref chain is preserved.

github-actions · 2026-05-25T05:13:52Z

+		Secret:                 secret,
+		TimestampAnchor:        anchor,
+		DropUnknownAnnotations: !*allowUnknown,
+	}


🟠 Bug: dst.Close(ctx) is deferred so its error is silently dropped. For a write-mode C1File that must flush and compress the sqlite-zstd output, a failed close means the CLI reports success but the output .c1z is corrupt. Explicitly close dst on the success path and check the error, keeping the defer only as a safety net for early-return error paths.

github-actions · 2026-05-25T05:13:55Z

+		if err != nil {
+			// Asset referenced from an annotation but missing from
+			// the asset table. Skip — we don't fabricate placeholder
+			// rows because the cross-reference invariant treats it
+			// as a known dangling pointer in the source.
+			s.log.Debug("c1zsanitize: asset ref not found in source", zap.String("asset_id", srcID), zap.Error(err))
+			continue
+		}
+		if _, err := io.Copy(io.Discard, r); err != nil && !errors.Is(err, io.EOF) {
+			return fmt.Errorf("drain source asset %s: %w", srcID, err)


🟡 Suggestion: The reader r from GetAsset is drained but never closed. If the concrete implementation returns an io.ReadCloser behind the io.Reader interface, this leaks the underlying resource. Consider adding a defer-close with an io.Closer type assertion after the nil-error check.

github-actions · 2026-05-25T05:14:16Z

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Blocking Issues: 0 | Suggestions: 0 | Threads Resolved: 0
Review mode: incremental since 4c26a7f
View review run

Review Summary

The new commits address both findings from the previous review. The defer dst.Close(ctx) correctness issue in cmd/baton-c1z-sanitize/main.go is fixed with a dstClosed bool guard pattern that explicitly checks the close error on the success path while keeping the defer as a safety net. The asset reader leak suggestion in pkg/c1zsanitize/assets.go is addressed with a new drainAndClose helper that checks for io.Closer and defers closing. Additionally, an HMAC reuse optimization was added to the sanitizer.id() hot path to avoid re-creating the SHA-256 key schedule per call, and a dead loop in the test was cleaned up. No new issues found.

Security Issues

None found.

Correctness Issues

None found.

Suggestions

None.

github-actions

Blocking issues found — see review comments.

The CLI deferred dst.Close and threw the error away. For a write-mode C1File the Close call is where sqlite-zstd finalizes and compresses the output; losing that error meant a corrupt sanitized .c1z could be reported as success. Explicit Close on the success path with a guarded defer for the early-return paths. The sanitizer's hot-loop id() built a fresh hmac.Hash on every call, redoing the SHA-256 key schedule each time. Stash one hmac.Hash on the sanitizer and Reset() between calls — single-threaded run, no locking needed. SanitizeID stays as the allocation-y reference for external callers (sanitizeEmail, tests). copyAssets drained the GetAsset reader with io.Copy but never asked whether the underlying type was an io.Closer; at least one impl is *os.File-backed and was leaking a fd per asset. drainAndClose does the type assertion and the drain in one spot. Drop a dead range loop in the xref-integrity test.

github-actions

No blocking issues found.

github-actions Bot reviewed May 25, 2026

View reviewed changes

github-actions Bot requested changes May 25, 2026

View reviewed changes

github-actions Bot reviewed May 25, 2026

View reviewed changes

btipling marked this pull request as ready for review May 25, 2026 09:05

btipling requested a review from a team May 25, 2026 09:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: c1z sanitizer v0.1 — library + CLI#875

feat: c1z sanitizer v0.1 — library + CLI#875
btipling wants to merge 2 commits into
mainfrom
bt/c1zsanitize-v0.1

btipling commented May 25, 2026

Uh oh!

github-actions Bot May 25, 2026

Uh oh!

github-actions Bot May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

btipling commented May 25, 2026

Summary

Output format choice

Open questions / choices for ambiguous items

Out of scope for v0.1 (per §6.3)

Test plan

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

General PR Review: feat: c1z sanitizer v0.1 — library + CLI

Review Summary

Security Issues

Correctness Issues

Suggestions

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 25, 2026 •

edited

Loading