Skip to content

Fix TaskManager frozen container validation in stress tests#26422

Draft
anthony-murphy-agent wants to merge 25 commits intomicrosoft:mainfrom
anthony-murphy-agent:fix/taskmanager-validation
Draft

Fix TaskManager frozen container validation in stress tests#26422
anthony-murphy-agent wants to merge 25 commits intomicrosoft:mainfrom
anthony-murphy-agent:fix/taskmanager-validation

Conversation

@anthony-murphy-agent
Copy link
Contributor

Summary

  • Skip quorum-dependent DDSs (TaskManager) when validating frozen containers against live clients in the removeClient reducer. Frozen containers don't process quorum events (removeMember), so TaskManager's taskQueues (which depend on quorum membership) will inevitably diverge.
  • Add optional skipChannelTypes parameter to validateConsistencyOfAllDDS to support this.
  • Remove 7 seeds from skip list that now pass: Directory (1, 14, 45, 155, 184), COC (54), TaskManager (137). Reclassify seed 63 from Directory to COC.

Depends on #26381.

Test plan

  • Full 200-seed stress test suite passes (194 passing, 6 pending, 0 failing)
  • No regressions on previously-passing seeds

🤖 Generated with Claude Code

anthony-murphy-agent and others added 24 commits February 6, 2026 12:07
…d ops

- Add storeHandle field to createDataStore and createChannel ops that
  stores the handle in root, making objects collaboratively reachable
- Increase detached phase from 20 to 100 ops (20 creation + 80 DDS)
- Increase generator ops from 100 to 200

Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select DDS op targets by picking a channel type first across all
datastores globally, then picking a channel of that type. This
eliminates directory dominance (80% → 19%) and distributes ops
more evenly across all 10 DDS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…then channel

Split the detached creation phase into two sub-phases: datastores first
(ops 0-9), then channels (ops 10-19). This ensures multiple datastores
exist before channels are created, so the global type-first selection
distributes channels across datastores instead of concentrating them
in datastore-0 (64% → 26%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…nimization comment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ory state tracking

Introduces ContainerStateTracker to maintain channel type metadata in
memory, avoiding repeated queries to the system under test. Moves
global type-first channel selection into the tracker class and adds
datastoreTag to LocalServerStressState for cleaner reducer access.

Also skips seed 54 (pre-existing ConsensusOrderedCollection bug) and
updates storeHandle comments per review feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Fix misleading comment in mixinClientSelection: clarify that only type
  metadata lookup is in-memory, not channel/datastore discovery
- Update SelectedClientSpec.channelTag type to "root" | `channel-${number}`
  to accurately reflect possible values and remove unsafe cast
- Rename datastoreCreationPhaseOps/channelCreationPhaseOps to
  datastoreCreationPhaseEndOp/channelCreationPhaseEndOp to clarify
  they are end thresholds, not counts
- Format test results JSON file with biome

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Add per-client caching of resolved IChannel and StressDataObject instances
in ContainerStateTracker. Channel type selection is now fully in-memory
using an inverse type index, and resolved channels are cached on first
access per client. The reducer also uses the cache via resolveChannel()
instead of calling getContainerObjects()/getChannels() directly.

Also removes accidentally committed test results file (54.json).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Remove channelNameMap SharedMap from StressDataObject; channel tracking
  is now fully handled by ContainerStateTracker
- Replace getChannels() with getChannel(name) for single-channel resolution
- Use state tracker for channel names in ddsOperations.ts
- Simplify channelTag typing to `channel-${number}` (ignore root)
- Pass stateTracker to validateConsistency for channel enumeration
- Skip seeds 45, 155 (pre-existing ConsensusOrderedCollection bugs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…from DefaultStressDataObject

Replace the Fluid SharedMap used for cross-client object discovery with
in-memory tracking in ContainerStateTracker. This eliminates the circular
dependency where the test harness relied on the system under test (Fluid
DDSes) for its own bookkeeping.

Key changes:
- ContainerStateTracker now stores handle absolute paths and resolves
  objects via resolveHandle instead of getContainerObjects()
- Added registerBlob, resolveContainerObject, resolveAllContainerObjects,
  and getAllHandles methods to ContainerStateTracker
- Gutted DefaultStressDataObject: removed containerObjectMap, locally
  created objects tracking, pending registrations, and staging mode
  deferral logic
- Validation uses intersection-based channel comparison, fixing 7
  previously-failing seeds where frozen containers couldn't resolve all
  objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… into StressDataObject

DefaultStressDataObject only added staging mode and an alias after
gutting the containerObjectMap. Move staging mode methods and the
"default" alias onto StressDataObject, delete the subclass, and
simplify createRuntimeFactory to use a single factory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…hes from state tracker

Remove resolvedChannelCache and resolvedObjectCache. Resolving existing
Fluid objects is fast; caching only helps avoid redundant calls for
objects that already exist. The complexity isn't worth the ~1 minute
speedup across 200 seeds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…results

Address PR feedback:
- Type channelTag as `channel-${number}` in registerChannel
- Type resolveContainerObject tag as `datastore-${number}` | `blob-${number}`
- Type objectPaths map key and ResolvedContainerObject.tag accordingly
- Add src/test/results/ to .gitignore to prevent committing failure files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ressor and clusterSize

These comments from the original codebase explain important design decisions
about the idCompressor instanceof check and the cluster size override.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… StressDataObject

- Replace containerRuntimeForTest with resolveByAbsolutePath method
  that encapsulates resolveHandle logic with local caching
- Rename alias to defaultInstanceAlias per review feedback
- Remove unused imports from ContainerStateTracker

The state tracker no longer needs to know about container runtime
internals; resolution is now delegated to StressDataObject.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… handling

- Replace intersection-based channel validation with strict key-set equality
  so real divergence (missing channels) is caught as an error
- Add .catch() to blob upload promise chain to prevent unhandled rejections
- Eagerly cache created datastores in resolvedObjectCache on the creating client
- Add cacheResolvedObject() method to StressDataObject for pre-populating cache
- Group skip seeds by failure reason using sub-arrays
- Fix stale comment about per-client IChannel caching
- Document staging mode phantom entry tradeoff in state tracker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Resolve merge conflicts in 4 files, preserving PR's state tracker
approach while incorporating upstream changes (storeHandle, naming
conventions, dataStoreOperations.ts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Cache newly created datastores on state.client.entryPoint (not
  state.datastore) so ContainerStateTracker resolution hits the cache.
- Widen resolveByAbsolutePath typing to not imply StressDataObject
  is always present (blobs don't have it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the constant renaming (use upstream's datastoreCreationPhaseEnd
/ channelCreationPhaseEnd) and remove the eager cache logic from the
createDataStore reducer -- resolveByAbsolutePath already caches on
first resolution, making eager caching redundant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Per review feedback, don't gitignore the test results directory so
accidentally committed result files remain visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Convert the per-instance resolvedObjectCache to a static class member
shared across all StressDataObject instances. Remove the public
cacheResolvedObject method; add a static clearCache() method that the
harness can call between test runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
The static resolvedObjectCache was incorrect because different clients
have different container runtimes, and handles resolved in one client's
runtime are not valid for another. Converting to per-instance caching
naturally scopes the cache to each client's StressDataObject entrypoint,
since resolveByAbsolutePath is always called on client.entryPoint.

No clearCache() needed -- each test seed creates fresh containers with
fresh StressDataObject instances, so the cache resets naturally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… remove stale comment

- Use factory.attributes.type instead of factory.type in registerDatastore for
  consistency with ddsModelMap keying
- Replace manual Fisher-Yates shuffle with random.shuffle and simplify
  selectChannelForOperation to shuffle all channel types in a single loop,
  eliminating the separate fallback path
- Remove incorrect staging mode comment (channels/datastores created in staging
  mode remain valid on the creating client even after discard)
- Update skip list for seeds affected by shuffle order change

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ve redundant .StressDataObject access

- Rename ResolvedContainerObject.stressDataObject field to datastore
- Access getChannel() directly on StressDataObject instance instead of
  going through the .StressDataObject member (which is only needed for
  anonymous object discovery)
- Update all usages in containerStateTracker.ts, ddsOperations.ts, and
  dataStoreOperations.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the changes below line 225 in ddsOperations.ts as the original
validation logic (assert size equality + iterate aMap.keys()) is
sufficient and doesn't need to change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
import { makeUnreachableCodePathProxy } from "./utils.js";

/**
* DDS types whose in-memory state depends on quorum membership events (e.g.,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be true. however, the quorum can change before a dds is loaded

// Stashed ops (from applyStashedOp) aren't tracked in latestPendingOps.
// Just resubmit them with the same sentinel metadata.
if (localOpMetadata === stashedOpMetadata) {
this.submitLocalMessage(content, stashedOpMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to resubmit stashed ops. so just return

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the local-server stress test harness to correctly validate “frozen containers” against live clients by skipping quorum-dependent DDS types (notably TaskManager) during consistency checks, and refactors stress-test handle/channel tracking to use an in-memory ContainerStateTracker.

Changes:

  • Add skipChannelTypes support to validateConsistencyOfAllDDS and use it for frozen-container validation (skip TaskManager).
  • Replace SharedMap-based container object/channel discovery with a new in-memory ContainerStateTracker + per-entrypoint resolve caching.
  • Fix/adjust DDS behavior for frozen containers: TaskManager stashed-op handling and ConsensusOrderedCollection quorum scrubbing.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
packages/test/local-server-stress-tests/src/test/localServerStress.spec.ts Updates validation callback signature and modifies the seed skip list.
packages/test/local-server-stress-tests/src/stressDataObject.ts Removes DefaultStressDataObject/SharedMap registry; adds resolve-by-absolute-path caching and simplifies blob/channel APIs.
packages/test/local-server-stress-tests/src/localServerStressHarness.ts Integrates ContainerStateTracker, adds frozen-container skip types, and threads tracker into validation + channel selection.
packages/test/local-server-stress-tests/src/ddsOperations.ts Extends validateConsistencyOfAllDDS to accept a tracker and optional channel-type skip set; switches handle discovery to tracker.
packages/test/local-server-stress-tests/src/dataStoreOperations.ts Uses tracker-based datastore enumeration for dirty-state checks.
packages/test/local-server-stress-tests/src/containerStateTracker.ts New in-memory registry for datastores/blobs/channels, plus type-first channel selection and resolution helpers.
packages/test/local-server-stress-tests/src/baseModel.ts Registers created datastores/channels/blobs with the tracker during reducer operations.
packages/dds/task-manager/src/taskManager.ts Changes stashed-op behavior to resubmit with sentinel metadata and adjusts pending-op tracking accordingly.
packages/dds/ordered-collection/src/consensusOrderedCollection.ts Scrubs stale jobTracking entries based on quorum during load/connect and handles acquire-after-leave ordering.
packages/dds/ordered-collection/api-report/ordered-collection.legacy.beta.api.md Updates API report to reflect newly surfaced onConnect() method.

Comment on lines 43 to 46
// Pre-existing DDS bugs (not introduced by this PR):
skip: [
...[10, 63, 125, 180], // ConsensusOrderedCollection live-client ordering divergence
],
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip list here doesn’t match the PR description. The description says several previously-skipped seeds were removed (and implies only the remaining known-bad seeds are skipped), but this file now skips seeds 10, 63, 125, and 180. Please either update the PR description/test plan to reflect the new skip set, or adjust the skip list to match the intended behavior.

Copilot uses AI. Check for mistakes.
// where the snapshot includes an acquire for a client whose ClientLeave was
// sequenced before the acquire (so the snapshot's quorum excludes the client,
// but jobTracking still has the entry). The onConnect scrub handles entries
// added during savedOps replay, but this catch entries already in the snapshot
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo/grammar in this comment: "but this catch entries" should be "but this catches entries".

Suggested change
// added during savedOps replay, but this catch entries already in the snapshot
// added during savedOps replay, but this catches entries already in the snapshot

Copilot uses AI. Check for mistakes.
}
}
// Raise events after all state changes, matching removeClient pattern.
added.map((value) => this.emit("add", value, false /* newlyAdded */));
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.map(...) is being used only for side effects. Using forEach/a for...of loop would avoid allocating an unused array and makes the intent clearer.

Suggested change
added.map((value) => this.emit("add", value, false /* newlyAdded */));
added.forEach((value) => this.emit("add", value, false /* newlyAdded */));

Copilot uses AI. Check for mistakes.
Comment on lines 904 to 909
protected applyStashedOp(content: unknown): void {
// We don't apply any stashed ops since during the rehydration process. Since we lose any assigned tasks
// during rehydration we cannot be assigned any tasks. Additionally, without the in-memory state of the
// previous dds, we also cannot re-volunteer based on a previous subscribeToTask() call. Since we are
// unable to be assigned to any tasks, there is no reason to process abandon/complete ops either.
assertIsTaskManagerOperation(content);
// Resubmit the op so it can be matched on ack. Use stashedOpMetadata so the
// opWatcher handlers skip latestPendingOps tracking for these resubmitted ops.
this.submitLocalMessage(content, stashedOpMetadata);
}
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes TaskManager’s stashed-op behavior from “ignore” to “resubmit with sentinel localOpMetadata”, and also alters opWatcher/latestPendingOps handling based on that sentinel. There don’t appear to be existing unit tests covering stashed-op rehydration paths in this package; please add coverage to ensure applyStashedOp/resubmit/rollback don’t mutate latestPendingOps and don’t assert when acks arrive for stashed ops.

Copilot uses AI. Check for mistakes.
Implement applyStashedOp to resubmit stashed ops with a sentinel
metadata value (-1), allowing opWatcher handlers to skip
latestPendingOps tracking for rehydrated ops. Handle stashed ops
in reSubmitCore and rollback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants