catalog: add v82->v83 migration to repair v80-form-drift Role rows#36524
Merged
DAlperin merged 3 commits intoMay 12, 2026
Merged
Conversation
The v80->v81 catalog migration was supposed to backfill auto_provision_source on every Role, retracting the v80 byte form and inserting a v81 form with the new field set. Its is_cloud heuristic at v80_to_v81.rs:43-55 included an "mz_system cluster must be ClusterVariant::Managed" predicate; on envs where the variant did not match at upgrade time, the heuristic returned false and the migration silently no-opped. The version bump committed anyway, but every Role row kept its v80 byte form (no auto_provision_source key). Any subsequent role-touching DDL on v26.18+ — ALTER ROLE of any kind, role membership changes, DROP ROLE — reads the row, parses it (missing field becomes None), then writes a retract+insert pair through current protos that *do* include the auto_provision_source key. The retraction bytes diverge from the stored row; consolidation does not merge them; the shard ends up holding three rows per affected role: - stale +1 in v80 byte form (no auto_provision_source key) - dangling -1 in v82 byte form (retraction that missed) - live +1 in v82 byte form (post-mutation state, absent for DROP ROLE) apply() does not fire because it keys by parsed (RoleKey, RoleValue) and the dangling -1 parses-equal to the stale +1. PersistPeek::do_peek fires on the dangling -1 for any SELECT against mz_internal.mz_catalog_raw, and run_versioned_upgrade soft_asserts "snapshot is consolidated" on the next catalog version bump — which is why this surfaces for the first time on the v26.22 -> v26.23 upgrade (CATALOG_VERSION 81 -> 82, the first version bump since v26.18). This migration scans the consolidated snapshot for Role -1 rows whose parsed value equals at least one +1 sibling for the same RoleKey (the structural fingerprint of the bug). For each match it writes a +1 of the dangling bytes (cancels the -1) plus a -1 of every parsed-equal stale +1 row (completes the retraction the original DDL intended). At most one *other* +1 with a different parsed value is permitted; multiple distinct live candidates is ambiguous and bails. The predicate is structural, not symptomatic. We do not care which field the live +1 differs from the dangling -1 in; the cause is byte-form drift on the retraction side, not anything specific to the mutated attribute. Dangling -1 rows that do not fit the fingerprint (no parsed-equal sibling, multiple distinct live values, non-Role kinds, |diff| > 1) are logged at WARN and left for human triage. Bumps CATALOG_VERSION to 83. objects_v83.rs is byte-identical to v82 (no schema changes; the migration is byte-level repair only). The underlying serialization round-trip drift remains tracked at database-issues#7179.
Contributor
|
Migration makes sense to me and passes the regression test in #36519. Would decrease some of the "byte" terminology and verbosity in the comments |
aljoscha
approved these changes
May 12, 2026
| //! # Background | ||
| //! | ||
| //! The catalog persist shard requires that every `(key_bytes, ts)` tuple has | ||
| //! `Diff::ONE` after consolidation. `PersistPeek::do_peek` enforces this |
Contributor
There was a problem hiding this comment.
this persist peek or peek business is more a detail of how we investigated, I'd leave that out of the comment here
| #[mz_ore::test] | ||
| fn production_shape_alter_login_is_repaired() { | ||
| // The shape we actually pulled out of the affected env for | ||
| // jan@materialize.com (User:8): a stale v80-form `+1` lacking the |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The v80->v81 catalog migration was supposed to backfill auto_provision_source on every Role, retracting the v80 byte form and inserting a v81 form with the new field set. Its is_cloud heuristic at v80_to_v81.rs:43-55 included an "mz_system cluster must be ClusterVariant::Managed" predicate; on envs where the variant did not match at upgrade time, the heuristic returned false and the migration silently no-opped. The version bump committed anyway, but every Role row kept its v80 byte form (no auto_provision_source key).
Any subsequent role-touching DDL on v26.18+ — ALTER ROLE of any kind, role membership changes, DROP ROLE — reads the row, parses it (missing field becomes None), then writes a retract+insert pair through current protos that do include the auto_provision_source key. The retraction bytes diverge from the stored row; consolidation does not merge them; the shard ends up holding three rows per affected role:
apply() does not fire because it keys by parsed (RoleKey, RoleValue) and the dangling -1 parses-equal to the stale +1. PersistPeek::do_peek fires on the dangling -1 for any SELECT against mz_internal.mz_catalog_raw, and run_versioned_upgrade soft_asserts "snapshot is consolidated" on the next catalog version bump — which is why this surfaces for the first time on the v26.22 -> v26.23 upgrade (CATALOG_VERSION 81 -> 82, the first version bump since v26.18).
This migration scans the consolidated snapshot for Role -1 rows whose parsed value equals at least one +1 sibling for the same RoleKey (the structural fingerprint of the bug). For each match it writes a +1 of the dangling bytes (cancels the -1) plus a -1 of every parsed-equal stale +1 row (completes the retraction the original DDL intended). At most one other +1 with a different parsed value is permitted; multiple distinct live candidates is ambiguous and bails.
The predicate is structural, not symptomatic. We do not care which field the live +1 differs from the dangling -1 in; the cause is byte-form drift on the retraction side, not anything specific to the mutated attribute. Dangling -1 rows that do not fit the fingerprint (no parsed-equal sibling, multiple distinct live values, non-Role kinds, |diff| > 1) are logged at WARN and left for human triage.
Bumps CATALOG_VERSION to 83. objects_v83.rs is byte-identical to v82 (no schema changes; the migration is byte-level repair only).
The underlying serialization round-trip drift remains tracked at database-issues#7179.
Remove these sections if your commit already has a good description!
Motivation
Why does this change exist? Link to a GitHub issue, design doc, Slack
thread, or explain the problem in a sentence or two. A reviewer who has
no context should understand why after reading this section.
If this implements or addresses an existing issue, it's enough to link to that:
Closes
Fixes
etc.
Description
What does this PR actually do? Focus on the approach and any non-obvious
decisions. The diff shows the code --- use this space to explain what the
diff can't tell a reviewer.
Verification
How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.