Fix COC consistency: scrub stale jobTracking entries on connect#26423

Draft
anthony-murphy-agent wants to merge 26 commits into microsoft:main from anthony-murphy-agent:fix/coc-consistency

Conversation

@anthony-murphy-agent
Contributor

Summary

  • Fix ConsensusOrderedCollection data divergence between live clients. When a new client loads from snapshot + saved ops, it processes historical acquire ops (adding items to jobTracking) but never receives the removeMember quorum events for clients that left before it joined. Items get permanently stuck in jobTracking, diverging from long-lived clients that correctly returned those items to the queue.
  • Add scrubJobTrackingByQuorum() in onConnect() to clean up stale entries after all saved ops are processed and the quorum is current.
  • Add ConsensusOrderedCollection to quorumDependentDdsTypes for frozen container validation (frozen containers never connect, so the scrub can't run).
  • Seeds fixed: 37, 60. Seeds remaining in skip: 56, 63, 180 (queue ordering divergence from quorum-event replay timing).

Depends on #26422 and #26381.

Test plan

  • Full 200-seed stress test suite passes (196 passing, 4 pending, 0 failing)
  • No regressions on previously-passing seeds
  • COC DDS-level fuzz tests still pass

🤖 Generated with Claude Code

anthony-murphy-agent and others added 26 commits on February 6, 2026 at 12:07
…d ops

- Add storeHandle field to createDataStore and createChannel ops that
  stores the handle in root, making objects collaboratively reachable
- Increase detached phase from 20 to 100 ops (20 creation + 80 DDS)
- Increase generator ops from 100 to 200

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select DDS op targets by picking a channel type first across all
datastores globally, then picking a channel of that type. This
reduces directory dominance (80% → 19%) and distributes ops
more evenly across all 10 DDS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…then channel

Split the detached creation phase into two sub-phases: datastores first
(ops 0-9), then channels (ops 10-19). This ensures multiple datastores
exist before channels are created, so the global type-first selection
distributes channels across datastores instead of concentrating them
in datastore-0 (64% → 26%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…nimization comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ory state tracking

Introduces ContainerStateTracker to maintain channel type metadata in
memory, avoiding repeated queries to the system under test. Moves
global type-first channel selection into the tracker class and adds
datastoreTag to LocalServerStressState for cleaner reducer access.

Also skips seed 54 (pre-existing ConsensusOrderedCollection bug) and
updates storeHandle comments per review feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Fix misleading comment in mixinClientSelection: clarify that only type
  metadata lookup is in-memory, not channel/datastore discovery
- Update SelectedClientSpec.channelTag type to "root" | `channel-${number}`
  to accurately reflect possible values and remove unsafe cast
- Rename datastoreCreationPhaseOps/channelCreationPhaseOps to
  datastoreCreationPhaseEndOp/channelCreationPhaseEndOp to clarify
  they are end thresholds, not counts
- Format test results JSON file with biome

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Add per-client caching of resolved IChannel and StressDataObject instances
in ContainerStateTracker. Channel type selection is now fully in-memory
using an inverse type index, and resolved channels are cached on first
access per client. The reducer also uses the cache via resolveChannel()
instead of calling getContainerObjects()/getChannels() directly.

Also removes accidentally committed test results file (54.json).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
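The inverse type index described above can be sketched as follows. This is an illustrative reconstruction, not the real `ContainerStateTracker` API: the class name `ChannelTypeIndex` and its methods are assumptions, chosen to show how a type-to-tags map lets type-first selection run fully in memory.

```typescript
// Hypothetical sketch of an inverse type index: map DDS type -> channel tags,
// so selecting a channel type first (then a channel of that type) needs no
// queries against the system under test. Names are illustrative.
class ChannelTypeIndex {
	private readonly tagsByType = new Map<string, string[]>();

	public registerChannel(channelTag: string, type: string): void {
		const tags = this.tagsByType.get(type) ?? [];
		tags.push(channelTag);
		this.tagsByType.set(type, tags);
	}

	/** All DDS types that currently have at least one registered channel. */
	public get availableTypes(): string[] {
		return [...this.tagsByType.keys()];
	}

	/** Channels of a given type, in registration order. */
	public channelsOfType(type: string): readonly string[] {
		return this.tagsByType.get(type) ?? [];
	}
}
```

A generator can then pick uniformly from `availableTypes` before picking a tag from `channelsOfType(type)`, which is what evens out the op distribution across DDS types.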
- Remove channelNameMap SharedMap from StressDataObject; channel tracking
  is now fully handled by ContainerStateTracker
- Replace getChannels() with getChannel(name) for single-channel resolution
- Use state tracker for channel names in ddsOperations.ts
- Simplify channelTag typing to `channel-${number}` (ignore root)
- Pass stateTracker to validateConsistency for channel enumeration
- Skip seeds 45, 155 (pre-existing ConsensusOrderedCollection bugs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…from DefaultStressDataObject

Replace the Fluid SharedMap used for cross-client object discovery with
in-memory tracking in ContainerStateTracker. This eliminates the circular
dependency where the test harness relied on the system under test (Fluid
DDSes) for its own bookkeeping.

Key changes:
- ContainerStateTracker now stores handle absolute paths and resolves
  objects via resolveHandle instead of getContainerObjects()
- Added registerBlob, resolveContainerObject, resolveAllContainerObjects,
  and getAllHandles methods to ContainerStateTracker
- Gutted DefaultStressDataObject: removed containerObjectMap, locally
  created objects tracking, pending registrations, and staging mode
  deferral logic
- Validation uses intersection-based channel comparison, fixing 7
  previously-failing seeds where frozen containers couldn't resolve all
  objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… into StressDataObject

DefaultStressDataObject only added staging mode and an alias after
gutting the containerObjectMap. Move staging mode methods and the
"default" alias onto StressDataObject, delete the subclass, and
simplify createRuntimeFactory to use a single factory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…hes from state tracker

Remove resolvedChannelCache and resolvedObjectCache. Resolving existing
Fluid objects is fast; caching only helps avoid redundant calls for
objects that already exist. The complexity isn't worth the ~1 minute
speedup across 200 seeds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…results

Address PR feedback:
- Type channelTag as `channel-${number}` in registerChannel
- Type resolveContainerObject tag as `datastore-${number}` | `blob-${number}`
- Type objectPaths map key and ResolvedContainerObject.tag accordingly
- Add src/test/results/ to .gitignore to prevent committing failure files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ressor and clusterSize

These comments from the original codebase explain important design decisions
about the idCompressor instanceof check and the cluster size override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… StressDataObject

- Replace containerRuntimeForTest with resolveByAbsolutePath method
  that encapsulates resolveHandle logic with local caching
- Rename alias to defaultInstanceAlias per review feedback
- Remove unused imports from ContainerStateTracker

The state tracker no longer needs to know about container runtime
internals; resolution is now delegated to StressDataObject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… handling

- Replace intersection-based channel validation with strict key-set equality
  so real divergence (missing channels) is caught as an error
- Add .catch() to blob upload promise chain to prevent unhandled rejections
- Eagerly cache created datastores in resolvedObjectCache on the creating client
- Add cacheResolvedObject() method to StressDataObject for pre-populating cache
- Group skip seeds by failure reason using sub-arrays
- Fix stale comment about per-client IChannel caching
- Document staging mode phantom entry tradeoff in state tracker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Resolve merge conflicts in 4 files, preserving PR's state tracker
approach while incorporating upstream changes (storeHandle, naming
conventions, dataStoreOperations.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Cache newly created datastores on state.client.entryPoint (not
  state.datastore) so ContainerStateTracker resolution hits the cache.
- Widen resolveByAbsolutePath typing to not imply StressDataObject
  is always present (blobs don't have it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the constant renaming (use upstream's datastoreCreationPhaseEnd
/ channelCreationPhaseEnd) and remove the eager cache logic from the
createDataStore reducer -- resolveByAbsolutePath already caches on
first resolution, making eager caching redundant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Per review feedback, don't gitignore the test results directory so
accidentally committed result files remain visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Convert the per-instance resolvedObjectCache to a static class member
shared across all StressDataObject instances. Remove the public
cacheResolvedObject method; add a static clearCache() method that the
harness can call between test runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
The static resolvedObjectCache was incorrect because different clients
have different container runtimes, and handles resolved in one client's
runtime are not valid for another. Converting to per-instance caching
naturally scopes the cache to each client's StressDataObject entrypoint,
since resolveByAbsolutePath is always called on client.entryPoint.

No clearCache() needed -- each test seed creates fresh containers with
fresh StressDataObject instances, so the cache resets naturally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
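The per-instance cache scoping described above can be sketched as below. This is a minimal illustration, not the actual `StressDataObject`: the constructor injection of a resolver function is an assumption made to keep the sketch self-contained.

```typescript
// Hypothetical sketch of per-instance caching in resolveByAbsolutePath.
// Because each client calls this on its own entry-point object, the cache is
// naturally scoped to that client's container runtime, and handles never
// leak across runtimes (the bug in the earlier static-cache version).
class StressDataObject {
	// Per-instance, so entries reset naturally with each fresh container.
	private readonly resolvedObjectCache = new Map<string, unknown>();

	public constructor(
		// Stand-in for the runtime's handle resolution; illustrative only.
		private readonly resolveHandle: (absolutePath: string) => unknown,
	) {}

	public resolveByAbsolutePath(absolutePath: string): unknown {
		let resolved = this.resolvedObjectCache.get(absolutePath);
		if (resolved === undefined) {
			resolved = this.resolveHandle(absolutePath);
			this.resolvedObjectCache.set(absolutePath, resolved);
		}
		return resolved;
	}
}
```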
… remove stale comment

- Use factory.attributes.type instead of factory.type in registerDatastore for
  consistency with ddsModelMap keying
- Replace manual Fisher-Yates shuffle with random.shuffle and simplify
  selectChannelForOperation to shuffle all channel types in a single loop,
  eliminating the separate fallback path
- Remove incorrect staging mode comment (channels/datastores created in staging
  mode remain valid on the creating client even after discard)
- Update skip list for seeds affected by shuffle order change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
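The single-loop selection described above can be sketched like this. The function signature is an assumption (the real `selectChannelForOperation` lives inside the harness and takes its state differently); the point is that shuffling all types once subsumes the fallback path.

```typescript
// Hypothetical sketch: shuffle every channel type once, then walk the
// shuffled list and take the first type that has an eligible channel. A type
// with no channels is simply skipped, so no separate fallback is needed.
function selectChannelForOperation(
	channelTypes: readonly string[],
	channelsOfType: (type: string) => readonly string[],
	shuffle: <T>(items: T[]) => T[], // stand-in for random.shuffle
): string | undefined {
	for (const type of shuffle([...channelTypes])) {
		const channels = channelsOfType(type);
		if (channels.length > 0) {
			return channels[0]; // a real generator would pick randomly here
		}
	}
	return undefined; // no channels exist yet
}
```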
…ve redundant .StressDataObject access

- Rename ResolvedContainerObject.stressDataObject field to datastore
- Access getChannel() directly on StressDataObject instance instead of
  going through the .StressDataObject member (which is only needed for
  anonymous object discovery)
- Update all usages in containerStateTracker.ts, ddsOperations.ts, and
  dataStoreOperations.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the changes below line 225 in ddsOperations.ts as the original
validation logic (assert size equality + iterate aMap.keys()) is
sufficient and doesn't need to change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
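The validation logic being reverted to (assert size equality, then iterate `aMap.keys()`) can be sketched as below; the function name and error messages are illustrative, not the harness's actual code.

```typescript
// Hypothetical sketch of the original map consistency check: equal sizes
// plus every key of aMap present in bMap implies the key sets are identical,
// so one direction of iteration suffices.
function validateMapConsistency<V>(
	aMap: ReadonlyMap<string, V>,
	bMap: ReadonlyMap<string, V>,
): void {
	if (aMap.size !== bMap.size) {
		throw new Error(`size mismatch: ${aMap.size} !== ${bMap.size}`);
	}
	for (const key of aMap.keys()) {
		if (!bMap.has(key)) {
			throw new Error(`key missing from second map: ${key}`);
		}
	}
}
```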
…kip list

Frozen containers don't process quorum events (e.g., removeMember), so DDSes
like TaskManager whose in-memory state depends on quorum membership will
inevitably diverge from a live client. TaskManager's applyStashedOp is also
intentionally a no-op, meaning the frozen container isn't expected to reproduce
the live client's task queue state.

Changes:
- Add optional `skipChannelTypes` parameter to `validateConsistencyOfAllDDS`
- Define `quorumDependentDdsTypes` set containing the TaskManager type
- Pass this set when validating frozen containers in the removeClient reducer
- Remove 7 seeds from skip list that now pass (Directory: 1, 14, 45, 155, 184;
  COC: 54; TaskManager: 137)
- Reclassify seed 63 from Directory to COC category
- Keep seed 46 (genuine TaskManager live-client consistency bug)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
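The `skipChannelTypes` parameter described above can be sketched as follows. This is an assumption-laden illustration: the `ChannelInfo` shape, the `"task-manager"` type string, and the per-channel `validate` callback are all placeholders, not the real harness or TaskManager type identifiers.

```typescript
// Hypothetical sketch of validation with a skip set for quorum-dependent
// DDS types, which frozen containers cannot reproduce because they never
// process quorum events such as removeMember.
interface ChannelInfo {
	type: string;
	validate: () => void; // throws on divergence
}

// Placeholder type id; the real set holds the TaskManager factory type.
const quorumDependentDdsTypes = new Set<string>(["task-manager"]);

function validateConsistencyOfAllDDS(
	channels: readonly ChannelInfo[],
	skipChannelTypes?: ReadonlySet<string>,
): void {
	for (const channel of channels) {
		if (skipChannelTypes?.has(channel.type)) {
			continue; // expected divergence for frozen containers; skip
		}
		channel.validate();
	}
}
```

The removeClient reducer would pass `quorumDependentDdsTypes` only when validating a frozen container, so live-client validation still covers every DDS type.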
When a new client loads from snapshot + saved ops, it processes historical
acquire ops that add items to jobTracking. However, it never receives the
removeMember quorum events for clients that left before it joined. This causes
items to be permanently stuck in jobTracking on the new client, while long-lived
clients correctly returned those items to the queue via their removeMember
handlers — producing data divergence between live clients.

Fix: override onConnect() in ConsensusOrderedCollection to scrub jobTracking
entries for clients no longer in the quorum. This runs after all saved ops are
processed and the quorum is current, ensuring stale entries are cleaned up.

Also add ConsensusOrderedCollection to the quorumDependentDdsTypes set in the
stress test harness, since frozen containers (which never connect) can't
perform this scrub.

Seeds fixed: 37, 60 (data count divergence)
Seeds remaining: 56, 63, 180 (queue ordering divergence — the scrub adds items
back at the end of the queue, whereas the original removeMember event happened
at a different point during op processing, producing different item ordering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
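The on-connect scrub described above can be sketched as below. This is a hedged reconstruction, not ConsensusOrderedCollection's actual internals: the `jobTracking` shape, the `scrubJobTrackingByQuorum` name, and the `returnToQueue` callback are assumptions for illustration.

```typescript
// Hypothetical sketch: after onConnect(), the quorum is current and all
// saved ops have been processed, so any jobTracking entry whose acquiring
// client is no longer a quorum member belongs to a client that left before
// we joined -- its removeMember event was never delivered to us.
interface TrackedItem<T> {
	clientId: string; // client that acquired the item
	value: T;
}

function scrubJobTrackingByQuorum<T>(
	jobTracking: Map<string, TrackedItem<T>>,
	quorumClientIds: ReadonlySet<string>,
	returnToQueue: (value: T) => void, // illustrative release hook
): void {
	// Copy entries so deletion during iteration is safe.
	for (const [itemId, tracked] of [...jobTracking]) {
		if (!quorumClientIds.has(tracked.clientId)) {
			jobTracking.delete(itemId);
			returnToQueue(tracked.value);
		}
	}
}
```

Note the limitation the remaining skipped seeds expose: this scrub appends released items at the end of the queue, whereas a long-lived client released them at the moment the removeMember event arrived, so item ordering can still differ even though the item sets now agree.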
