Skip to content

catalog: add v82->v83 migration to repair v80-form-drift Role rows#36524

Merged
DAlperin merged 3 commits into
MaterializeInc:mainfrom
DAlperin:dov/catalog-v83-role-drift-repair
May 12, 2026
Merged

catalog: add v82->v83 migration to repair v80-form-drift Role rows#36524
DAlperin merged 3 commits into
MaterializeInc:mainfrom
DAlperin:dov/catalog-v83-role-drift-repair

Conversation

@DAlperin
Copy link
Copy Markdown
Member

The v80->v81 catalog migration was supposed to backfill auto_provision_source on every Role, retracting the v80 byte form and inserting a v81 form with the new field set. Its is_cloud heuristic at v80_to_v81.rs:43-55 included an "mz_system cluster must be ClusterVariant::Managed" predicate; on envs where the variant did not match at upgrade time, the heuristic returned false and the migration silently no-opped. The version bump committed anyway, but every Role row kept its v80 byte form (no auto_provision_source key).

Any subsequent role-touching DDL on v26.18+ — ALTER ROLE of any kind, role membership changes, DROP ROLE — reads the row, parses it (missing field becomes None), then writes a retract+insert pair through current protos that do include the auto_provision_source key. The retraction bytes diverge from the stored row; consolidation does not merge them; the shard ends up holding three rows per affected role:

  • stale +1 in v80 byte form (no auto_provision_source key)
  • dangling -1 in v82 byte form (retraction that missed)
  • live +1 in v82 byte form (post-mutation state, absent for DROP ROLE)

apply() does not fire because it keys by parsed (RoleKey, RoleValue) and the dangling -1 parses-equal to the stale +1. PersistPeek::do_peek fires on the dangling -1 for any SELECT against mz_internal.mz_catalog_raw, and run_versioned_upgrade soft_asserts "snapshot is consolidated" on the next catalog version bump — which is why this surfaces for the first time on the v26.22 -> v26.23 upgrade (CATALOG_VERSION 81 -> 82, the first version bump since v26.18).

This migration scans the consolidated snapshot for Role -1 rows whose parsed value equals at least one +1 sibling for the same RoleKey (the structural fingerprint of the bug). For each match it writes a +1 of the dangling bytes (cancels the -1) plus a -1 of every parsed-equal stale +1 row (completes the retraction the original DDL intended). At most one other +1 with a different parsed value is permitted; multiple distinct live candidates is ambiguous and bails.

The predicate is structural, not symptomatic. We do not care which field the live +1 differs from the dangling -1 in; the cause is byte-form drift on the retraction side, not anything specific to the mutated attribute. Dangling -1 rows that do not fit the fingerprint (no parsed-equal sibling, multiple distinct live values, non-Role kinds, |diff| > 1) are logged at WARN and left for human triage.

Bumps CATALOG_VERSION to 83. objects_v83.rs is byte-identical to v82 (no schema changes; the migration is byte-level repair only).

The underlying serialization round-trip drift remains tracked at database-issues#7179.

Remove these sections if your commit already has a good description!

Motivation

Why does this change exist? Link to a GitHub issue, design doc, Slack
thread, or explain the problem in a sentence or two. A reviewer who has
no context should understand why after reading this section.

If this implements or addresses an existing issue, it's enough to link to that:
Closes
Fixes
etc.

Description

What does this PR actually do? Focus on the approach and any non-obvious
decisions. The diff shows the code --- use this space to explain what the
diff can't tell a reviewer.

Verification

How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.

The v80->v81 catalog migration was supposed to backfill auto_provision_source
on every Role, retracting the v80 byte form and inserting a v81 form with the
new field set. Its is_cloud heuristic at v80_to_v81.rs:43-55 included an
"mz_system cluster must be ClusterVariant::Managed" predicate; on envs where
the variant did not match at upgrade time, the heuristic returned false and
the migration silently no-opped. The version bump committed anyway, but every
Role row kept its v80 byte form (no auto_provision_source key).

Any subsequent role-touching DDL on v26.18+ — ALTER ROLE of any kind, role
membership changes, DROP ROLE — reads the row, parses it (missing field
becomes None), then writes a retract+insert pair through current protos that
*do* include the auto_provision_source key. The retraction bytes diverge from
the stored row; consolidation does not merge them; the shard ends up holding
three rows per affected role:

  - stale +1 in v80 byte form (no auto_provision_source key)
  - dangling -1 in v82 byte form (retraction that missed)
  - live +1 in v82 byte form (post-mutation state, absent for DROP ROLE)

apply() does not fire because it keys by parsed (RoleKey, RoleValue) and the
dangling -1 parses-equal to the stale +1. PersistPeek::do_peek fires on the
dangling -1 for any SELECT against mz_internal.mz_catalog_raw, and
run_versioned_upgrade soft_asserts "snapshot is consolidated" on the next
catalog version bump — which is why this surfaces for the first time on the
v26.22 -> v26.23 upgrade (CATALOG_VERSION 81 -> 82, the first version bump
since v26.18).

This migration scans the consolidated snapshot for Role -1 rows whose parsed
value equals at least one +1 sibling for the same RoleKey (the structural
fingerprint of the bug). For each match it writes a +1 of the dangling bytes
(cancels the -1) plus a -1 of every parsed-equal stale +1 row (completes the
retraction the original DDL intended). At most one *other* +1 with a
different parsed value is permitted; multiple distinct live candidates is
ambiguous and bails.

The predicate is structural, not symptomatic. We do not care which field the
live +1 differs from the dangling -1 in; the cause is byte-form drift on the
retraction side, not anything specific to the mutated attribute. Dangling -1
rows that do not fit the fingerprint (no parsed-equal sibling, multiple
distinct live values, non-Role kinds, |diff| > 1) are logged at WARN and
left for human triage.

Bumps CATALOG_VERSION to 83. objects_v83.rs is byte-identical to v82 (no
schema changes; the migration is byte-level repair only).

The underlying serialization round-trip drift remains tracked at
database-issues#7179.
@SangJunBak
Copy link
Copy Markdown
Contributor

Migration makes sense to me and passes the regression test in #36519. Would decrease some of the "byte" terminology and verbosity in the comments

Copy link
Copy Markdown
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

//! # Background
//!
//! The catalog persist shard requires that every `(key_bytes, ts)` tuple has
//! `Diff::ONE` after consolidation. `PersistPeek::do_peek` enforces this
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this persist peek or peek business is more a detail of how we investigated, I'd leave that out of the comment here

#[mz_ore::test]
fn production_shape_alter_login_is_repaired() {
// The shape we actually pulled out of the affected env for
// jan@materialize.com (User:8): a stale v80-form `+1` lacking the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🪦

@DAlperin DAlperin marked this pull request as ready for review May 12, 2026 15:21
@DAlperin DAlperin requested a review from a team as a code owner May 12, 2026 15:21
@DAlperin DAlperin enabled auto-merge (squash) May 12, 2026 15:55
@DAlperin DAlperin merged commit a2c7f70 into MaterializeInc:main May 12, 2026
116 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants