Skip to content

Fix negative multiplicities caused by v80->v81 catalog migration#36519

Open
SangJunBak wants to merge 2 commits into
MaterializeInc:mainfrom
SangJunBak:neg-mult
Open

Fix negative multiplicities caused by v80->v81 catalog migration#36519
SangJunBak wants to merge 2 commits into
MaterializeInc:mainfrom
SangJunBak:neg-mult

Conversation

@SangJunBak
Copy link
Copy Markdown
Contributor

@SangJunBak SangJunBak commented May 11, 2026

Rebased off and dependent on #36524

Motivation

Creates a regression test for fixing negative multiplicities caused by caused by v80->v81 catalog migration seen in incident-1007

Description

  • Adds a v82 to v83 migration that fixes forward and clears out any negative multiplicities.

Verification

How do you know this change is correct? Describe new or existing automated
tests, or manual steps you took.

@SangJunBak SangJunBak force-pushed the neg-mult branch 4 times, most recently from c159418 to f2b0fd4 Compare May 12, 2026 06:13
@SangJunBak SangJunBak changed the title [DNM] Reproduce negative multiplicities for catalog migration Fix negative multiplicities caused by v80->v81 catalog migration May 12, 2026
@SangJunBak SangJunBak force-pushed the neg-mult branch 3 times, most recently from 2cf9fb1 to 40be5ee Compare May 12, 2026 06:39
@SangJunBak SangJunBak marked this pull request as ready for review May 12, 2026 06:40
@SangJunBak SangJunBak requested review from a team as code owners May 12, 2026 06:40
Copy link
Copy Markdown
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nightly triggered, I'll take a closer look at the code now: https://buildkite.com/materialize/nightly/builds/16399

Edit: Fixed the miri test failures.

Comment thread ci/test/pipeline.template.yml Outdated
agents:
queue: hetzner-x86-64-4cpu-8gb

- id: catalog-migration-incident-1007
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade tests can't run in Test pipeline, they are way too slow. Should be moved to Nightly. I would also prefer if this was caught in platform-checks additionally, not just with those specific versions, but with any version in the future exhibiting similar bugs.
I'll try add it to platform-checks.

Comment on lines +13 to +25
Catalog versions:
v26.17.x -> catalog 80 (pre-bug)
v26.18.0 -> catalog 81, introduces the buggy `v80_to_v81::upgrade` whose
`is_cloud` heuristic silently no-ops on self-managed envs.
Role rows stay in their v80 byte form
(no `auto_provision_source` key) while the catalog version
advances. Any subsequent ALTER reads the v80 form, parses it,
and writes a retract+insert pair re-serialized through the
current proto — which always includes `auto_provision_source:
null` — leaving a dangling `-1`.
current -> catalog 83, this PR's `v82_to_v83::upgrade` scans for the
structural signature of the drift and emits compensating
updates to retire the dangling `-1` and the stale `+1`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we still need this test at all then, we could add a scenario to platform checks that does these versions too one we add the scenario that catches it by altering the role and then selecting from the catalog.

c.testdrive(dedent("""
> SELECT EXISTS (SELECT 1 FROM mz_roles WHERE name = 'user1');
false
"""))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire file smelly pretty LLMy, I think we shouldn't have one mzcompose.py test per bug/incident we run into, but use the existing frameworks when appropriate, so that we find similarly shaped bugs more generally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point!

@def-
Copy link
Copy Markdown
Contributor

def- commented May 12, 2026

I pushed platform checks extensions that would have caught this:

source: /home/deen/git/materialize/misc/python/materialize/checks/all_checks/password_auth.py:114
2:1: error: executing postgres query: db error: ERROR: Invalid data in source, saw retractions (1) for row that does not exist: [Map({"key": Map({"id": Map({"User": Numeric(<#>)})}), "kind": String(<"XXXX">), "value": Map({"attributes": Map({"auto_provision_source": JsonNull, "inherit": True, "login": True, "superuser": JsonNull}), "membership": Map({"map": List([])}), "name": String(<"XXXX#">), "oid": Numeric(<#####>), "vars": Map({"entries": List([])})})})]
     |
   1 |
   2 | $ postgres-execute c ... [rest of line truncated for security]
     | ^

Test run on this PR now: https://buildkite.com/materialize/nightly/builds/16406

Edit: Removed it for now, I'll do it in a separate PR so that we can merge this test already.

Comment thread ci/test/pipeline.template.yml Outdated
agents:
queue: hetzner-x86-64-4cpu-8gb

- id: catalog-migration-incident-1007
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade tests can't run in Test pipeline, they are way too slow. Should be moved to Nightly. I would also prefer if this was caught in platform-checks additionally, not just with those specific versions, but with any version in the future exhibiting similar bugs.
I'll try add it to platform-checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants