Fix COC consistency: scrub stale jobTracking entries on connect#26423

Draft
anthony-murphy-agent wants to merge 26 commits into microsoft:main from anthony-murphy-agent:fix/coc-consistency

Conversation

@anthony-murphy-agent
Contributor

Summary

  • Fix ConsensusOrderedCollection data divergence between live clients. When a new client loads from snapshot + saved ops, it processes historical acquire ops (adding items to jobTracking) but never receives the removeMember quorum events for clients that left before it joined. Items get permanently stuck in jobTracking, diverging from long-lived clients that correctly returned those items to the queue.
  • Add scrubJobTrackingByQuorum() in onConnect() to clean up stale entries after all saved ops are processed and the quorum is current.
  • Add ConsensusOrderedCollection to quorumDependentDdsTypes for frozen container validation (frozen containers never connect, so the scrub can't run).
  • Seeds fixed: 37, 60. Seeds remaining in skip: 56, 63, 180 (queue ordering divergence from quorum-event replay timing).

Depends on #26422 and #26381.

Test plan

  • Full 200-seed stress test suite passes (196 passing, 4 pending, 0 failing)
  • No regressions on previously-passing seeds
  • COC DDS-level fuzz tests still pass

🤖 Generated with Claude Code

anthony-murphy-agent and others added 26 commits on February 6, 2026 at 12:07
…d ops

- Add storeHandle field to createDataStore and createChannel ops that
  stores the handle in root, making objects collaboratively reachable
- Increase detached phase from 20 to 100 ops (20 creation + 80 DDS)
- Increase generator ops from 100 to 200

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select DDS op targets by picking a channel type first across all
datastores globally, then picking a channel of that type. This
reduces directory dominance (80% → 19%) and distributes ops
more evenly across all 10 DDS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…then channel

Split the detached creation phase into two sub-phases: datastores first
(ops 0-9), then channels (ops 10-19). This ensures multiple datastores
exist before channels are created, so the global type-first selection
distributes channels across datastores instead of concentrating them
in datastore-0 (64% → 26%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…nimization comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ory state tracking

Introduces ContainerStateTracker to maintain channel type metadata in
memory, avoiding repeated queries to the system under test. Moves
global type-first channel selection into the tracker class and adds
datastoreTag to LocalServerStressState for cleaner reducer access.

Also skips seed 54 (pre-existing ConsensusOrderedCollection bug) and
updates storeHandle comments per review feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Fix misleading comment in mixinClientSelection: clarify that only type
  metadata lookup is in-memory, not channel/datastore discovery
- Update SelectedClientSpec.channelTag type to "root" | `channel-${number}`
  to accurately reflect possible values and remove unsafe cast
- Rename datastoreCreationPhaseOps/channelCreationPhaseOps to
  datastoreCreationPhaseEndOp/channelCreationPhaseEndOp to clarify
  they are end thresholds, not counts
- Format test results JSON file with biome

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Add per-client caching of resolved IChannel and StressDataObject instances
in ContainerStateTracker. Channel type selection is now fully in-memory
using an inverse type index, and resolved channels are cached on first
access per client. The reducer also uses the cache via resolveChannel()
instead of calling getContainerObjects()/getChannels() directly.

Also removes accidentally committed test results file (54.json).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
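The inverse type index described above can be sketched as follows. This is an illustrative reconstruction, not the real `ContainerStateTracker` API: the class name `ChannelTypeIndex` and its methods are assumptions, chosen to show how a type-to-tags map lets type-first selection run fully in memory.

```typescript
// Hypothetical sketch of an inverse type index: map DDS type -> channel tags,
// so selecting a channel type first (then a channel of that type) needs no
// queries against the system under test. Names are illustrative.
class ChannelTypeIndex {
	private readonly tagsByType = new Map<string, string[]>();

	public registerChannel(channelTag: string, type: string): void {
		const tags = this.tagsByType.get(type) ?? [];
		tags.push(channelTag);
		this.tagsByType.set(type, tags);
	}

	/** All DDS types that currently have at least one registered channel. */
	public get availableTypes(): string[] {
		return [...this.tagsByType.keys()];
	}

	/** Channels of a given type, in registration order. */
	public channelsOfType(type: string): readonly string[] {
		return this.tagsByType.get(type) ?? [];
	}
}
```

A generator can then pick uniformly from `availableTypes` before picking a tag from `channelsOfType(type)`, which is what evens out the op distribution across DDS types.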
- Remove channelNameMap SharedMap from StressDataObject; channel tracking
  is now fully handled by ContainerStateTracker
- Replace getChannels() with getChannel(name) for single-channel resolution
- Use state tracker for channel names in ddsOperations.ts
- Simplify channelTag typing to `channel-${number}` (ignore root)
- Pass stateTracker to validateConsistency for channel enumeration
- Skip seeds 45, 155 (pre-existing ConsensusOrderedCollection bugs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…from DefaultStressDataObject

Replace the Fluid SharedMap used for cross-client object discovery with
in-memory tracking in ContainerStateTracker. This eliminates the circular
dependency where the test harness relied on the system under test (Fluid
DDSes) for its own bookkeeping.

Key changes:
- ContainerStateTracker now stores handle absolute paths and resolves
  objects via resolveHandle instead of getContainerObjects()
- Added registerBlob, resolveContainerObject, resolveAllContainerObjects,
  and getAllHandles methods to ContainerStateTracker
- Gutted DefaultStressDataObject: removed containerObjectMap, locally
  created objects tracking, pending registrations, and staging mode
  deferral logic
- Validation uses intersection-based channel comparison, fixing 7
  previously-failing seeds where frozen containers couldn't resolve all
  objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… into StressDataObject

DefaultStressDataObject only added staging mode and an alias after
gutting the containerObjectMap. Move staging mode methods and the
"default" alias onto StressDataObject, delete the subclass, and
simplify createRuntimeFactory to use a single factory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…hes from state tracker

Remove resolvedChannelCache and resolvedObjectCache. Resolving existing
Fluid objects is fast; caching only helps avoid redundant calls for
objects that already exist. The complexity isn't worth the ~1 minute
speedup across 200 seeds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…results

Address PR feedback:
- Type channelTag as `channel-${number}` in registerChannel
- Type resolveContainerObject tag as `datastore-${number}` | `blob-${number}`
- Type objectPaths map key and ResolvedContainerObject.tag accordingly
- Add src/test/results/ to .gitignore to prevent committing failure files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ressor and clusterSize

These comments from the original codebase explain important design decisions
about the idCompressor instanceof check and the cluster size override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… StressDataObject

- Replace containerRuntimeForTest with resolveByAbsolutePath method
  that encapsulates resolveHandle logic with local caching
- Rename alias to defaultInstanceAlias per review feedback
- Remove unused imports from ContainerStateTracker

The state tracker no longer needs to know about container runtime
internals; resolution is now delegated to StressDataObject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… handling

- Replace intersection-based channel validation with strict key-set equality
  so real divergence (missing channels) is caught as an error
- Add .catch() to blob upload promise chain to prevent unhandled rejections
- Eagerly cache created datastores in resolvedObjectCache on the creating client
- Add cacheResolvedObject() method to StressDataObject for pre-populating cache
- Group skip seeds by failure reason using sub-arrays
- Fix stale comment about per-client IChannel caching
- Document staging mode phantom entry tradeoff in state tracker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Resolve merge conflicts in 4 files, preserving PR's state tracker
approach while incorporating upstream changes (storeHandle, naming
conventions, dataStoreOperations.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Cache newly created datastores on state.client.entryPoint (not
  state.datastore) so ContainerStateTracker resolution hits the cache.
- Widen resolveByAbsolutePath typing to not imply StressDataObject
  is always present (blobs don't have it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the constant renaming (use upstream's datastoreCreationPhaseEnd
/ channelCreationPhaseEnd) and remove the eager cache logic from the
createDataStore reducer -- resolveByAbsolutePath already caches on
first resolution, making eager caching redundant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Per review feedback, don't gitignore the test results directory so
accidentally committed result files remain visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Convert the per-instance resolvedObjectCache to a static class member
shared across all StressDataObject instances. Remove the public
cacheResolvedObject method; add a static clearCache() method that the
harness can call between test runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
The static resolvedObjectCache was incorrect because different clients
have different container runtimes, and handles resolved in one client's
runtime are not valid for another. Converting to per-instance caching
naturally scopes the cache to each client's StressDataObject entrypoint,
since resolveByAbsolutePath is always called on client.entryPoint.

No clearCache() needed -- each test seed creates fresh containers with
fresh StressDataObject instances, so the cache resets naturally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
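The per-instance cache scoping described above can be sketched as below. This is a minimal illustration, not the actual `StressDataObject`: the constructor injection of a resolver function is an assumption made to keep the sketch self-contained.

```typescript
// Hypothetical sketch of per-instance caching in resolveByAbsolutePath.
// Because each client calls this on its own entry-point object, the cache is
// naturally scoped to that client's container runtime, and handles never
// leak across runtimes (the bug in the earlier static-cache version).
class StressDataObject {
	// Per-instance, so entries reset naturally with each fresh container.
	private readonly resolvedObjectCache = new Map<string, unknown>();

	public constructor(
		// Stand-in for the runtime's handle resolution; illustrative only.
		private readonly resolveHandle: (absolutePath: string) => unknown,
	) {}

	public resolveByAbsolutePath(absolutePath: string): unknown {
		let resolved = this.resolvedObjectCache.get(absolutePath);
		if (resolved === undefined) {
			resolved = this.resolveHandle(absolutePath);
			this.resolvedObjectCache.set(absolutePath, resolved);
		}
		return resolved;
	}
}
```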
… remove stale comment

- Use factory.attributes.type instead of factory.type in registerDatastore for
  consistency with ddsModelMap keying
- Replace manual Fisher-Yates shuffle with random.shuffle and simplify
  selectChannelForOperation to shuffle all channel types in a single loop,
  eliminating the separate fallback path
- Remove incorrect staging mode comment (channels/datastores created in staging
  mode remain valid on the creating client even after discard)
- Update skip list for seeds affected by shuffle order change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
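The single-loop selection described above can be sketched like this. The function signature is an assumption (the real `selectChannelForOperation` lives inside the harness and takes its state differently); the point is that shuffling all types once subsumes the fallback path.

```typescript
// Hypothetical sketch: shuffle every channel type once, then walk the
// shuffled list and take the first type that has an eligible channel. A type
// with no channels is simply skipped, so no separate fallback is needed.
function selectChannelForOperation(
	channelTypes: readonly string[],
	channelsOfType: (type: string) => readonly string[],
	shuffle: <T>(items: T[]) => T[], // stand-in for random.shuffle
): string | undefined {
	for (const type of shuffle([...channelTypes])) {
		const channels = channelsOfType(type);
		if (channels.length > 0) {
			return channels[0]; // a real generator would pick randomly here
		}
	}
	return undefined; // no channels exist yet
}
```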
…ve redundant .StressDataObject access

- Rename ResolvedContainerObject.stressDataObject field to datastore
- Access getChannel() directly on StressDataObject instance instead of
  going through the .StressDataObject member (which is only needed for
  anonymous object discovery)
- Update all usages in containerStateTracker.ts, ddsOperations.ts, and
  dataStoreOperations.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the changes below line 225 in ddsOperations.ts as the original
validation logic (assert size equality + iterate aMap.keys()) is
sufficient and doesn't need to change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
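The validation logic being reverted to (assert size equality, then iterate `aMap.keys()`) can be sketched as below; the function name and error messages are illustrative, not the harness's actual code.

```typescript
// Hypothetical sketch of the original map consistency check: equal sizes
// plus every key of aMap present in bMap implies the key sets are identical,
// so one direction of iteration suffices.
function validateMapConsistency<V>(
	aMap: ReadonlyMap<string, V>,
	bMap: ReadonlyMap<string, V>,
): void {
	if (aMap.size !== bMap.size) {
		throw new Error(`size mismatch: ${aMap.size} !== ${bMap.size}`);
	}
	for (const key of aMap.keys()) {
		if (!bMap.has(key)) {
			throw new Error(`key missing from second map: ${key}`);
		}
	}
}
```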
…kip list

Frozen containers don't process quorum events (e.g., removeMember), so DDSes
like TaskManager whose in-memory state depends on quorum membership will
inevitably diverge from a live client. TaskManager's applyStashedOp is also
intentionally a no-op, meaning the frozen container isn't expected to reproduce
the live client's task queue state.

Changes:
- Add optional `skipChannelTypes` parameter to `validateConsistencyOfAllDDS`
- Define `quorumDependentDdsTypes` set containing the TaskManager type
- Pass this set when validating frozen containers in the removeClient reducer
- Remove 7 seeds from skip list that now pass (Directory: 1, 14, 45, 155, 184;
  COC: 54; TaskManager: 137)
- Reclassify seed 63 from Directory to COC category
- Keep seed 46 (genuine TaskManager live-client consistency bug)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
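The `skipChannelTypes` parameter described above can be sketched as follows. This is an assumption-laden illustration: the `ChannelInfo` shape, the `"task-manager"` type string, and the per-channel `validate` callback are all placeholders, not the real harness or TaskManager type identifiers.

```typescript
// Hypothetical sketch of validation with a skip set for quorum-dependent
// DDS types, which frozen containers cannot reproduce because they never
// process quorum events such as removeMember.
interface ChannelInfo {
	type: string;
	validate: () => void; // throws on divergence
}

// Placeholder type id; the real set holds the TaskManager factory type.
const quorumDependentDdsTypes = new Set<string>(["task-manager"]);

function validateConsistencyOfAllDDS(
	channels: readonly ChannelInfo[],
	skipChannelTypes?: ReadonlySet<string>,
): void {
	for (const channel of channels) {
		if (skipChannelTypes?.has(channel.type)) {
			continue; // expected divergence for frozen containers; skip
		}
		channel.validate();
	}
}
```

The removeClient reducer would pass `quorumDependentDdsTypes` only when validating a frozen container, so live-client validation still covers every DDS type.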
When a new client loads from snapshot + saved ops, it processes historical
acquire ops that add items to jobTracking. However, it never receives the
removeMember quorum events for clients that left before it joined. This causes
items to be permanently stuck in jobTracking on the new client, while long-lived
clients correctly returned those items to the queue via their removeMember
handlers — producing data divergence between live clients.

Fix: override onConnect() in ConsensusOrderedCollection to scrub jobTracking
entries for clients no longer in the quorum. This runs after all saved ops are
processed and the quorum is current, ensuring stale entries are cleaned up.

Also add ConsensusOrderedCollection to the quorumDependentDdsTypes set in the
stress test harness, since frozen containers (which never connect) can't
perform this scrub.

Seeds fixed: 37, 60 (data count divergence)
Seeds remaining: 56, 63, 180 (queue ordering divergence — the scrub adds items
back at the end of the queue, whereas the original removeMember event happened
at a different point during op processing, producing different item ordering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
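The on-connect scrub described above can be sketched as below. This is a hedged reconstruction, not ConsensusOrderedCollection's actual internals: the `jobTracking` shape, the `scrubJobTrackingByQuorum` name, and the `returnToQueue` callback are assumptions for illustration.

```typescript
// Hypothetical sketch: after onConnect(), the quorum is current and all
// saved ops have been processed, so any jobTracking entry whose acquiring
// client is no longer a quorum member belongs to a client that left before
// we joined -- its removeMember event was never delivered to us.
interface TrackedItem<T> {
	clientId: string; // client that acquired the item
	value: T;
}

function scrubJobTrackingByQuorum<T>(
	jobTracking: Map<string, TrackedItem<T>>,
	quorumClientIds: ReadonlySet<string>,
	returnToQueue: (value: T) => void, // illustrative release hook
): void {
	// Copy entries so deletion during iteration is safe.
	for (const [itemId, tracked] of [...jobTracking]) {
		if (!quorumClientIds.has(tracked.clientId)) {
			jobTracking.delete(itemId);
			returnToQueue(tracked.value);
		}
	}
}
```

Note the limitation the remaining skipped seeds expose: this scrub appends released items at the end of the queue, whereas a long-lived client released them at the moment the removeMember event arrived, so item ordering can still differ even though the item sets now agree.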
