Conversation
There was a problem hiding this comment.
Solid E2E coverage for the movePrimary replay path. Scope discipline is good: only the steady-state 4a test runs in CI, slow scenarios stay opt-in. Helpers are clean and the design rationale (no runtime OperationFailure skip on multi-shard deployments) is well-documented.
The main concerns are concentrated in test_move_primary_concurrent_writes (4b), where the writer-thread shutdown is best-effort: writer.join(timeout=5) can time out silently, and committed is then read without the lock to compute final_count. If the writer is still alive, that read races with ongoing mutations and the subsequent count_documents == final_count assertions can flake. See inline.
Runner-up notes not worth burning an inline slot:
mover_started = Trueis set beforemover.start(); ifstart()ever raised, thefinallywouldjoinan un-started thread. Low risk in practice._pick_target_shard'sassert entry is not Nonewill mask a config-discovery problem with a generic AssertionError rather than skip/fail with context. Minor.
This reverts the unintended PR #246 patch from 94446bf. PCSM-349 follow-up work stays out of this session. Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
|
Closing in favor of the consolidated PR: #222 |
Problem
The movePrimary replay path (PCSM-342/346/347/348) has Go unit coverage but no end-to-end test. The only movePrimary E2E reference was a skipped stub in
tests/test_collections_sharded.pythat checked only that the primary shard changed, with no source/target parity.Solution
Add
tests/test_move_primary_replay_sharded.pyexercising movePrimary -> phantom create/drop DDL -> replay skip -> source/target parity on a sharded cluster:test_move_primary_steady_state_parity(4a): movePrimary in steady-state apply, then post-move insert/update/delete (exercises the worker-side UUID-resolved write-target routing from PCSM-348: Resolve DML namespaces from catalog at worker execution #245), then full parity. Non-slow.test_move_primary_concurrent_writes(4b): writes overlapping the in-flight movePrimary are not lost or duplicated. Slow.test_move_primary_mixed_sharded_unsharded(4c): a moved database with both sharded and unsharded collections stays in parity. Slow.test_move_primary_replay_skip_no_dup_no_loss(4d): two movePrimary cycles, replay skip yields no duplicates and no loss. Slow.Removes the skipped
test_move_primarystub (and its unusedOperationFailureimport) that this file supersedes.Only 4a is collected in CI (non-slow), so the movePrimary regression is guarded on every tested version pair including the pre-8 bug surface. 4b/4c/4d are
--runslowlocal hardening; no workflow or Makefile changes.Testing
--runslow: 4 passed; non-slow: 4a passed, 4b/4c/4d skipped).py_compileandruffclean.Notes
Part of PCSM-249. Stacks on #245 (worker-side UUID namespace resolution); retargets to its base as the stack merges.
Live testing surfaced that MongoDB rejects writes to the moving namespace during the movePrimary critical section with code 319 (MovePrimaryInProgress). 4b tolerates an in-window 319 rejection as proof the write overlapped the move, and abandons that
_id, so no-duplicate/no-loss still holds.