Skip to content

PCSM-349: Add E2E movePrimary replay coverage#246

Closed
chupe wants to merge 1 commit into
PCSM-348from
PCSM-349
Closed

PCSM-349: Add E2E movePrimary replay coverage#246
chupe wants to merge 1 commit into
PCSM-348from
PCSM-349

Conversation

@chupe

@chupe chupe commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Problem

The movePrimary replay path (PCSM-342/346/347/348) has Go unit coverage but no end-to-end test. The only movePrimary E2E reference was a skipped stub in tests/test_collections_sharded.py that checked only that the primary shard changed, with no source/target parity.

Solution

Add tests/test_move_primary_replay_sharded.py exercising movePrimary -> phantom create/drop DDL -> replay skip -> source/target parity on a sharded cluster:

  • test_move_primary_steady_state_parity (4a): movePrimary in steady-state apply, then post-move insert/update/delete (exercises the worker-side UUID-resolved write-target routing from PCSM-348: Resolve DML namespaces from catalog at worker execution #245), then full parity. Non-slow.
  • test_move_primary_concurrent_writes (4b): writes overlapping the in-flight movePrimary are not lost or duplicated. Slow.
  • test_move_primary_mixed_sharded_unsharded (4c): a moved database with both sharded and unsharded collections stays in parity. Slow.
  • test_move_primary_replay_skip_no_dup_no_loss (4d): two movePrimary cycles, replay skip yields no duplicates and no loss. Slow.

Removes the skipped test_move_primary stub (and its unused OperationFailure import) that this file supersedes.

Only 4a is collected in CI (non-slow), so the movePrimary regression is guarded on every tested version pair including the pre-8 bug surface. 4b/4c/4d are --runslow local hardening; no workflow or Makefile changes.

Testing

  • All four scenarios pass end to end against a local sharded MongoDB 7.0 cluster (--runslow: 4 passed; non-slow: 4a passed, 4b/4c/4d skipped).
  • py_compile and ruff clean.

Notes

Part of PCSM-249. Stacks on #245 (worker-side UUID namespace resolution); retargets to its base as the stack merges.

Live testing surfaced that MongoDB rejects writes to the moving namespace during the movePrimary critical section with code 319 (MovePrimaryInProgress). 4b tolerates an in-window 319 rejection as proof the write overlapped the move, and abandons that _id, so no-duplicate/no-loss still holds.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid E2E coverage for the movePrimary replay path. Scope discipline is good: only the steady-state 4a test runs in CI, slow scenarios stay opt-in. Helpers are clean and the design rationale (no runtime OperationFailure skip on multi-shard deployments) is well-documented.

The main concerns are concentrated in test_move_primary_concurrent_writes (4b), where the writer-thread shutdown is best-effort: writer.join(timeout=5) can time out silently, and committed is then read without the lock to compute final_count. If the writer is still alive, that read races with ongoing mutations and the subsequent count_documents == final_count assertions can flake. See inline.

Runner-up notes not worth burning an inline slot:

  • mover_started = True is set before mover.start(); if start() ever raised, the finally would join an un-started thread. Low risk in practice.
  • _pick_target_shard's assert entry is not None will mask a config-discovery problem with a generic AssertionError rather than skip/fail with context. Minor.

Comment thread tests/test_move_primary_replay_sharded.py
Comment thread tests/test_move_primary_replay_sharded.py
Comment thread tests/test_move_primary_replay_sharded.py
chupe added a commit that referenced this pull request Jun 19, 2026
This reverts the unintended PR #246 patch from 94446bf. PCSM-349 follow-up work stays out of this session.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@chupe

chupe commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Closing in favor of the consolidated PR: #222

@chupe chupe closed this Jun 22, 2026
@chupe

chupe commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant