Skip to content

catalog: repair Role records left in pre-v81 wire shape (v82→v83) [draft]#36511

Draft
bosconi wants to merge 1 commit into
MaterializeInc:mainfrom
bosconi:jc/catalog-auto-provision-source-repair
Draft

catalog: repair Role records left in pre-v81 wire shape (v82→v83) [draft]#36511
bosconi wants to merge 1 commit into
MaterializeInc:mainfrom
bosconi:jc/catalog-auto-provision-source-repair

Conversation

@bosconi
Copy link
Copy Markdown
Member

@bosconi bosconi commented May 11, 2026

⚠️ Hypothesis disputed — this PR may not address the actual bug.

The premise of this fix is that affected envs have Role records in v80 wire shape on disk (without an auto_provision_source key), and that subsequent in-band writes produce byte-mismatched retractions that fail to consolidate. Data from the reported env contradicts the first half of that:

select * from mz_internal.mz_catalog_raw where data->>'kind' = 'Role';

shows the auto_provision_source key is present in every Role record's JSON (most as null, a few as "Frontegg"). That means the v80→v81 migration ran with is_cloud == true and produced canonical v81-shape bytes for every Role — there is no v80-shape data to repair via the mechanism this PR targets.

A separate, confirmed catalog problem is in play for that env: negative accumulations were observed in the catalog persist data. Root cause is unknown and may or may not involve byte-shape drift; if it does, the mismatch is somewhere other than the auto_provision_source field.

Pausing this PR pending better diagnosis. Keeping the diff open for design discussion of the two reusable pieces — the byte-precise migration entry point (run_versioned_upgrade_raw) and the skip_serializing_if annotation on auto_provision_source — which may have value on their own merits.


Summary

Draft for discussion. Potential fix for the v80→v81 catalog migration issue.

What's in the diff

  • CATALOG_VERSION 82 → 83. v83 is structurally identical to v82 except that RoleAttributes::auto_provision_source has #[serde(default, skip_serializing_if = "Option::is_none")]. This makes the JSON round-trip byte-stable in both directions (None ⇔ no-field; Some(X) ⇔ "X"). objects_v82.rs is unchanged — it remains the frozen "from" version the migration reads.

  • New run_versioned_upgrade_raw helper in upgrade.rs. Byte-precise counterpart to the existing run_versioned_upgrade. Migrations using this entry point receive each record's raw StateUpdateKindJson alongside its V1 deserialized form, and return raw (StateUpdateKindJson, Diff) updates that are appended verbatim. This is what enables byte-precise retraction of stale wire shapes — the standard MigrationAction::Update path re-serializes old through V1::Serialize, which cannot reproduce the v80 shape from a deserialized v82 value. All existing migrations remain on the typed path.

  • v82→v83 migration (v82_to_v83.rs). For every Role:

    • field present and Some(X) on disk → preserve X
    • field absent on disk (v80 shape) → apply the email-regex heuristic from the original v80→v81 migration (Frontegg if name matches .+@.+\..+, else None), unconditional — no is_cloud gate
    • field present and explicitly null → preserve None

    If the canonical v83 bytes differ from on-disk bytes, the migration emits a retraction of the exact on-disk bytes plus an insertion of canonical v83 bytes. Records already in canonical shape are skipped. Non-Role records are untouched.

What's intentionally NOT in this diff

  • No modification to objects_v82.rs. That would change an already-shipped wire format.
  • No change to MigrationAction. The new raw-bytes pathway is a parallel entry point.
  • No change to the audit-log CreateRoleV1 struct (which also has an auto_provision_source field). Audit events are append-only — the byte-stability problem doesn't apply.
  • No framework hardening test (chain byte-stability across all upgrades from MIN_CATALOG_VERSION). That was a separate proposal in the investigation; happy to file as a follow-up if there's appetite.

Background: the bug

The v80→v81 migration (src/catalog/src/durable/upgrade/v80_to_v81.rs:36-60) was guarded by an is_cloud heuristic that returned false for any catalog where, at the moment the migration committed in writable mode:

  • mz_system was not a Managed cluster, OR
  • mz_system.replication_factor was 0, OR
  • enable_password_auth was \"on\"

When the heuristic returned false, the migration emitted Vec::new() and left every existing Role record's on-disk JSON in v80 wire shape (without an auto_provision_source key). On v81+ the value is still readable (serde fills None for the missing Option<AutoProvisionSource> field), so the env doesn't crash. But any in-band write that retracts a Role record — ALTER ROLE, membership changes, etc. — re-serializes the deserialized value through the v81 type's Serialize, which emits \"auto_provision_source\": null. Those retraction bytes are not byte-equal to the v80-shape bytes on disk, so differential_dataflow consolidation cannot cancel them; the original +1 survives as a ghost record alongside whatever else accumulates.

In at least one environment this manifested as: most pre-04-02 user roles have auto_provision_source: None (the field-absent v80 shape, deserialized as None), only roles created post-v81 have Frontegg.

Open questions for review

  1. The new run_versioned_upgrade_raw is a one-off for this fix. Worth keeping for future shape-drift migrations, or should it be inlined into v82_to_v83 and removed once this migration is no longer needed?
  2. The email-regex heuristic is reused from v80_to_v81. Should the migration also classify by some other signal (e.g., audit-log lookup for the role's first-seen authenticator_kind), or is the same email-regex sufficient?
  3. No protection against the same class of bug recurring. Framework hardening proposals (chain byte-stability test; lint against Vec::new() returns from migrations that don't declare json_compatible! for shape-changed types; replace is_cloud heuristics with an authoritative deployment-topology signal) are intentionally out of scope here. Worth filing as follow-ups?
  4. The fix doesn't address the existing audit log. Pre-04-02 role creations have audit-log entries that lack auto_provision_source in the CreateRoleV1 payload (because the field was added to the audit struct at that point too). Audit events are append-only, so we can't repair them. Acceptable, or should the migration write synthetic audit events for the canonicalized roles?

Test plan

Unit tests in v82_to_v83.rs cover:

  • v80-shape Role with email-named subject → gets Frontegg, retraction byte-matches on-disk
  • v80-shape Role with non-email name → stays None, retraction byte-matches on-disk
  • v82-shape Role with Frontegg → no rewrite (already canonical)
  • v82-shape Role with explicit null → rewritten to no-field (preserve None, do NOT apply email regex)
  • Non-Role records → untouched

Larger end-to-end test plan (not landed in this PR):

  • Construct a synthetic v80 catalog with one named-replica mz_system (Unmanaged) plus a mix of email-named and non-email Role records. Run upgrade chain v80→v81→v82→v83. Verify that after each step, every Role record's bytes are canonical for that version, and that subsequent ALTER ROLE writes don't accumulate ghost records.
  • Run cargo test --package mz-catalog --lib durable::upgrade::tests::generate_missing_encodings -- --ignored to generate objects_v83.txt snapshot.
  • Full cargo test -p mz-catalog --lib.
  • Run platform-checks 0dt-upgrade scenario starting from v26.22.x with mz_system pre-ALTERed to Unmanaged.

🤖 Generated with Claude Code

The v80→v81 migration (src/catalog/src/durable/upgrade/v80_to_v81.rs) was
guarded by an `is_cloud` heuristic that returned false for any catalog
where, at the moment the migration committed in writable mode, mz_system
was not a Managed cluster with replication_factor > 0, or where
enable_password_auth was "on". Affected catalogs ran the migration with
the heuristic returning false; the migration emitted Vec::new() and left
every existing Role record's on-disk JSON in v80 wire shape — without an
`auto_provision_source` key.

On v81+ the value is still readable (serde fills None for the missing
Option<AutoProvisionSource> field), so the env doesn't crash. But any
in-band write that retracts a Role record — ALTER ROLE, membership
changes, etc. — re-serializes the deserialized value through the v81
type's Serialize, which emits `"auto_provision_source": null`. Those
retraction bytes are not byte-equal to the v80-shape bytes on disk, so
consolidation cannot cancel them; the original `+1` survives as a ghost
record alongside whatever else accumulates.

This change:

* Adds CATALOG_VERSION v83. v83 is structurally identical to v82 except
  that RoleAttributes::auto_provision_source has #[serde(default,
  skip_serializing_if = "Option::is_none")], making the JSON round-trip
  byte-stable in both directions (None ⇔ no-field; Some(X) ⇔ "X").

* Adds run_versioned_upgrade_raw — a byte-precise counterpart to the
  existing run_versioned_upgrade. Migrations using this entry point
  receive each record's raw StateUpdateKindJson alongside its V1
  deserialized form, and return raw (StateUpdateKindJson, Diff) updates
  that are appended verbatim. This is what makes byte-precise retraction
  of stale wire shapes possible — the standard MigrationAction::Update
  path re-serializes `old` through V1::Serialize, which cannot reproduce
  the v80 shape from a deserialized v82 value.

* Adds the v82→v83 migration. For every Role record, the migration
  inspects the raw JSON to determine whether `auto_provision_source` is
  present as a key, then decides the target value:

    - field present, Some(X) on disk → preserve X
    - field absent on disk (v80 shape) → apply the email-regex heuristic
      from the original v80→v81 migration (Frontegg if the role name
      matches `.+@.+\..+`, else None) — unconditionally, with no
      is_cloud gate
    - field present and explicitly null → preserve None

  If the canonical v83 bytes differ from the on-disk bytes, the
  migration emits a retraction of the exact on-disk bytes and an
  insertion of the canonical v83 bytes. Records already in canonical v83
  shape are skipped. Non-Role records are untouched.

Tests in v82_to_v83.rs cover the four code paths plus "non-Role records
untouched".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant