Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes. #48421
Draft
jeet1995 wants to merge 12 commits into Azure:main
…riter accounts
Enable proactive write hedging for Per-Partition Automatic Failover (PPAF) on single-writer
Cosmos DB accounts. When a write to the primary region is slow or failing, the SDK now hedges
the write to a read region — reducing time-to-recovery from 60-120s (retry-based) to the
hedging threshold (~1s with default config).
## Problem
In PPAF-enabled single-writer accounts, when a partition fails over, the SDK waits for error
signals (503, 408, 410) which can take 60-120s before marking a region as failed for that
partition via the retry-based path in GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.
## Solution
Plug the existing availability strategy (hedging) machinery into the write path for PPAF:
1. **Speculation gating** (RxDocumentClientImpl.getApplicableRegionsForSpeculation):
- Relax the canUseMultipleWriteLocations() gate for PPAF single-writer accounts
- Relax the isIdempotentWriteRetriesEnabled gate (PPAF provides partition-level consistency)
- Use ALL account-level read regions (getAvailableReadRoutingContexts) as hedge candidates,
not just preferred regions — PPAF failover can target any read region
2. **Routing** (tryAddPartitionLevelLocationOverride + CrossRegionAvailabilityContext):
- Add ppafWriteHedgeTargetRegion field to CrossRegionAvailabilityContextForRxDocumentServiceRequest
- In tryAddPartitionLevelLocationOverride: when ppafWriteHedgeTargetRegion is set, create the
ConcurrentHashMap entry via computeIfAbsent and route via hedgeFailoverInfo.getCurrent()
- This is synchronous and deterministic: the ConcurrentHashMap is updated in the same request pipeline
- Thread safety: uses getCurrent() from the computeIfAbsent result (not raw hedgeTarget)
to avoid routing to a region the concurrent retry path may have marked as failed
3. **Default E2E policy** (evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites):
- Mirrors the read defaults exactly — symmetric hedging behavior for reads and writes
- Only applied to point write operations (batch excluded via isPointOperation gate)
- DIRECT: timeout=networkRequestTimeout+1s, threshold=min(timeout/2, 1s), step=500ms
- GATEWAY: timeout=min(6s, httpTimeout), threshold=min(timeout/2, 1s), step=500ms
4. **Safety lever** (Configs.isWriteAvailabilityStrategyEnabledWithPpaf):
- System property COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true)
- Allows opt-out without code changes if regression is observed
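The default E2E policy arithmetic in item 3 can be sketched as below. This is a minimal illustration that assumes only the formulas quoted in this commit message (timeout, threshold, and a fixed 500ms step); the class and method names are hypothetical and are not the SDK's evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites.

```java
import java.time.Duration;

// Hypothetical helper that mirrors the formulas listed above; not SDK code.
final class PpafWriteHedgePolicySketch {
    final Duration timeout;
    final Duration threshold;
    final Duration thresholdStep;

    private PpafWriteHedgePolicySketch(Duration timeout) {
        this.timeout = timeout;
        // threshold = min(timeout / 2, 1s)
        this.threshold = min(timeout.dividedBy(2), Duration.ofSeconds(1));
        // step is a fixed 500ms in this commit message
        this.thresholdStep = Duration.ofMillis(500);
    }

    // DIRECT: timeout = networkRequestTimeout + 1s
    static PpafWriteHedgePolicySketch forDirect(Duration networkRequestTimeout) {
        return new PpafWriteHedgePolicySketch(networkRequestTimeout.plusSeconds(1));
    }

    // GATEWAY: timeout = min(6s, httpTimeout)
    static PpafWriteHedgePolicySketch forGateway(Duration httpTimeout) {
        return new PpafWriteHedgePolicySketch(min(Duration.ofSeconds(6), httpTimeout));
    }

    private static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }
}
```

With a 5s networkRequestTimeout in DIRECT mode, the sketch yields a 6s timeout and a 1s threshold, which matches the "~1s with default config" recovery time quoted at the top of this message only under the assumption that 5s is the configured value.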
## Files changed (6)
- Configs.java: Write availability strategy PPAF config flag
- RxDocumentClientImpl.java: Speculation gating, region resolution, write E2E policy
- CrossRegionAvailabilityContextForRxDocumentServiceRequest.java: ppafWriteHedgeTargetRegion field
- ClientRetryPolicy.java: Honor ppafWriteHedgeTargetRegion in tryAddPartitionLevelLocationOverride
- GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java: Hedge target handling
in tryAddPartitionLevelLocationOverride with computeIfAbsent + getCurrent()
- PerPartitionAutomaticFailoverE2ETests.java: 26 new test cases
## Test coverage
| Op | DIRECT (mocked transport) | GATEWAY (mocked HttpClient) |
|---------|--------------------------|----------------------------|
| Create | 410/21005 + 503/21008 | delayed write region |
| Replace | 410/21005 | delayed write region |
| Upsert | 410/21005 | delayed write region |
| Delete | 410/21005 | delayed write region |
| Patch | 410/21005 | delayed write region |
Additional tests:
- Opt-out via COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false
- Batch bypass verification (batch uses retry-based PPAF, not hedging)
- Explicit ConcurrentHashMap verification: after hedge success, asserts the PPAF manager's
partitionKeyRangeToFailoverInfo entry points to a region != the failed write region
All assertions are exact match: 2 regions before failover, 1 region after failover.
165 tests total (existing + new), 0 regressions, 0 modifications to existing test logic.
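The opt-out covered by the tests above depends on a default-true system property. A minimal sketch of such a flag reader follows; the property name comes from this commit message, while the class name and the parsing rules (unset or blank means enabled) are assumptions and may differ from what Configs.isWriteAvailabilityStrategyEnabledWithPpaf actually does.

```java
// Hypothetical flag reader; the real Configs implementation may parse differently.
final class PpafWriteHedgeFlagSketch {
    static final String PROPERTY = "COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";

    // Returns true unless the property is explicitly set to something other than "true".
    static boolean isEnabled() {
        String raw = System.getProperty(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return true; // assumed default: feature enabled
        }
        return Boolean.parseBoolean(raw.trim());
    }
}
```

Under these assumptions, starting the JVM with `-DCOSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false` would disable the write hedging path without a code change.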
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ible error codes
Add 34 new test configurations to write availability strategy hedging tests covering all error codes from the base PPAF E2E test suite:
DIRECT mode:
- 503/21008 (SERVICE_UNAVAILABLE) for Replace, Upsert, Delete, Patch
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops
GATEWAY mode:
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops
- 408/GATEWAY_ENDPOINT_READ_TIMEOUT (network error) for all 5 write ops
- 503/GATEWAY_ENDPOINT_UNAVAILABLE (network error) for all 5 write ops
Parameterize gateway test method to accept error codes instead of hardcoding 503.
Extend setupHttpClientToThrowCosmosException to support combined delay + network error mode for gateway-specific fault types.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…abilityStrategyForPPAF
validateChangeFeedProcessing called stop() with subscribe() (fire-and-forget), then returned. The caller immediately starts a full fidelity CFP on the same lease container. With the CI optimization (PR Azure#48259) making validateChangeFeedProcessing return faster, the async stop has not released the leases yet, causing the next start() to hang and time out.
Fix: change stop() from subscribe() to block() so leases are fully released before returning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf for clarity (both in wrapPointOperationWithAvailabilityStrategy and getApplicableRegionsForSpeculation)
- Scope HashMap allocation: only create when PPAF write availability strategy is applicable, use Collections.emptyMap() otherwise to avoid per-request allocation for reads
- Update comments to use 'availability strategy' terminology
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf
- Rename ppafWriteHedgeTargetRegion to writeRegionRoutingContextForPpafAvailabilityStrategy
- Move field to top of CrossRegionAvailabilityContext class
- Replace conchashmap with ConcurrentHashMap in comments
- Scope HashMap allocation: Collections.emptyMap() for reads
- Fix unsafe computeIfAbsent: do not create failover override until hedge succeeds
- Verify read/write e2e policies produce identical values (confirmed)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Investigation:
| Scenario | nonIdempotentWriteRetryPolicy=TRUE | nonIdempotentWriteRetryPolicy=FALSE |
|---|---|---|
| Tracking ID generated | ✅ UUID per request | ❌ None |
| Hedging fires (PPAF) | ✅ | ✅ (bypasses idempotent check) |
| Primary + hedge both commit | 409 on duplicate → read doc → verify _trackingId → return 201 | Both writes persist, no dedup |
| ClientRetryPolicy 503 retry | Cross-region retry allowed | PPAF override: always retries (line 601-603) |
| Cancellation safety | Safe: _trackingId catches late commits | Unsafe: no dedup mechanism |
Code Path
- `Mono.firstWithValue(monoList)` fires primary + hedged requests in parallel
- First response wins; slower subscription is cancelled
- But cancellation is not guaranteed to prevent the write from committing: the request may already be in-flight at the backend
- Without tracking IDs, if both writes commit:
- Create: duplicate document (or 409 conflict with no way to detect it's "our" duplicate)
- Replace/Patch: last-write-wins, but client only sees the first response
- Upsert: duplicate or overwrite depending on timing
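The race described above can be reproduced with a small simulation. The sketch below swaps Reactor's `Mono.firstWithValue` for the JDK's `CompletableFuture.anyOf` so it runs without extra dependencies; the class, the in-memory `committed` list standing in for the backend, and the delays are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;

// Hypothetical simulation of a hedged write; not SDK code.
final class HedgedWriteRaceSketch {
    // Stand-in for the backend: every write that finishes is recorded here.
    final List<String> committed = new CopyOnWriteArrayList<>();

    // One daemon thread per request so both requests start immediately.
    private final Executor executor = task -> {
        Thread t = new Thread(task);
        t.setDaemon(true);
        t.start();
    };

    // Simulates a write that commits in the given region after the given delay.
    CompletableFuture<String> write(String region, long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            committed.add(region);
            return region;
        }, executor);
    }

    // Races primary against hedge and returns the region that answered first.
    String hedgedWrite(long primaryDelayMillis, long hedgeDelayMillis) {
        CompletableFuture<String> primary = write("primary", primaryDelayMillis);
        CompletableFuture<String> hedge = write("hedge", hedgeDelayMillis);
        String winner = (String) CompletableFuture.anyOf(primary, hedge).join();
        // Cancelling only completes the futures; the simulated request that is
        // already running is not stopped and still commits.
        primary.cancel(true);
        hedge.cancel(true);
        return winner;
    }

    // Polls until the expected number of commits is visible or the timeout elapses.
    int awaitCommits(int expected, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (committed.size() < expected && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return committed.size();
    }
}
```

Calling `hedgedWrite(300, 50)` returns the hedge as the winner, yet `awaitCommits(2, 5000)` later observes both writes in `committed`. That is the situation in which a deduplication mechanism such as tracking IDs becomes necessary.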
Recommendation
When applyAvailabilityStrategyForWritesForPpaf=true and isIdempotentWriteRetriesEnabled=false, the PPAF write availability strategy should ensure tracking IDs are generated for write operations. Options:
1. In `evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites`: return a policy config that also sets `nonIdempotentWriteRetriesEnabled=true` + `useTrackingIds=true` on the request options
2. In `getApplicableRegionsForSpeculation`: when the PPAF bypass fires (line 7827), also set tracking IDs on the request
3. In `wrapPointOperationWithAvailabilityStrategy`: when `applyAvailabilityStrategyForWritesForPpaf=true`, inject tracking ID generation into the callback
Option 3 is cleanest as it's scoped to the hedging path only.
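To illustrate why tracking IDs make a duplicate commit harmless, here is a minimal in-memory sketch of the "409 on duplicate → verify _trackingId → return 201" flow from the table in this comment. The store, class, and method names are hypothetical; the SDK's real create path and its tracking ID handling are not shown here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of create-with-tracking-id deduplication; not SDK code.
final class TrackingIdDedupSketch {
    // Simulated container: document id -> tracking id stored with the document.
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Simulated backend create: 201 when inserted, 409 when the id already exists.
    int backendCreate(String id, String trackingId) {
        return store.putIfAbsent(id, trackingId) == null ? 201 : 409;
    }

    // Client-side create: a 409 counts as success when the stored document
    // carries our own tracking id, meaning an earlier attempt already committed.
    int createWithTrackingId(String id, String trackingId) {
        int status = backendCreate(id, trackingId);
        if (status == 409 && trackingId.equals(store.get(id))) {
            return 201;
        }
        return status;
    }
}
```

In this model a primary write and its hedge share one tracking ID, so the second commit attempt resolves to 201, while a create from an unrelated request with a different tracking ID still surfaces as a genuine 409.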
…abilityStrategyForPPAF
…abilityStrategyForPPAF
When write availability strategy hedges a write to a read region, the PPAF ConcurrentHashMap entry must be created only when the hedged request succeeds, not eagerly during routing.
Design:
- tryAddPartitionLevelLocationOverride routes the hedged write and captures the resolved PartitionKeyRangeWrapper on the CrossRegionAvailabilityContext
- doOnNext on the hedged Mono calls tryRecordSuccessfulWriteHedge only when the result is not an error, persisting the failover entry via computeIfAbsent
- Failed hedges leave no map entry, preventing bad regions from being persisted
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
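The success-only persistence described in this commit can be sketched with a plain `ConcurrentHashMap`. This is an illustration under simplifying assumptions: partitions and regions are plain strings, and the class and method names are hypothetical rather than the SDK's tryRecordSuccessfulWriteHedge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical recorder for per-partition failover overrides; not SDK code.
final class HedgeSuccessRecorderSketch {
    // partition key range id -> region that writes for this partition should use
    private final Map<String, String> partitionToRegion = new ConcurrentHashMap<>();

    // Called when a hedged write completes. Only a successful hedge creates an
    // entry, and computeIfAbsent keeps an existing entry untouched.
    void onHedgeResult(String partitionKeyRangeId, String hedgeRegion, boolean succeeded) {
        if (!succeeded) {
            return; // a failed hedge must not persist a possibly bad region
        }
        partitionToRegion.computeIfAbsent(partitionKeyRangeId, id -> hedgeRegion);
    }

    // Returns the recorded override, or null when none exists.
    String overrideFor(String partitionKeyRangeId) {
        return partitionToRegion.get(partitionKeyRangeId);
    }
}
```

A failed hedge to "West US" leaves `overrideFor` returning null, a later successful hedge records "West US", and a subsequent success for another region does not overwrite it in this sketch because `computeIfAbsent` only fills missing keys.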
/azp run java - cosmos - tests
Azure Pipelines successfully started running 1 pipeline(s).
Charts will be hosted as GitHub issue attachments instead of being committed to sdk/cosmos/docs/, which gets bundled into JARs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolves #43148
Description
Problem
During per-partition automatic failover (PPAF), the backend takes up to 60s to elect a new write region for an affected server partition. During this window the SDK retries writes round-robin across regions, which is slow and causes elevated latency for customers on single-writer accounts.
Solution
This PR adds write availability strategy (hedging) for PPAF-enabled single-writer accounts. When a write to the current write region is slow or fails with a PPAF-eligible error, the SDK hedges the write to a read region via the existing `ThresholdBasedAvailabilityStrategy` infrastructure.
How it works
- `COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF` (default: `true`) controls the feature.
- … `networkRequestTimeout + 1s`, threshold = `min(networkRequestTimeout/2, 1s)`, step = `min(threshold/2, 500ms)`.
- … `routeToLocation(RegionalRoutingContext)`, bypassing the excluded-regions mechanism, which cannot route writes to read regions on single-writer accounts.
- … `ConcurrentHashMap` entry is only created when the hedged write succeeds (via `doOnNext` callback), preventing bad regions from being persisted if the hedge fails.
- … `idempotentWriteRetries` enabled.
Files changed (8 files, +1279 / -12)
- `ClientRetryPolicy.java`: … `writeRegionRoutingContextForPpafAvailabilityStrategy`
- `Configs.java`: `COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF` (default: `true`)
- `CrossRegionAvailabilityContextForRxDocumentServiceRequest.java`: `volatile RegionalRoutingContext` field for hedged write target region + `PartitionKeyRangeWrapper` for success callback
- `RxDocumentClientImpl.java`: `evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites`, `enableAvailabilityStrategyForWrites`, region-to-routing-context map, hedged write setup with `doOnNext` success callback, relaxed write gates
- `GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java`: `tryAddPartitionLevelLocationOverride` routes hedged writes; `tryRecordSuccessfulWriteHedge` persists entry only on success
- `CHANGELOG.md`
- `PerPartitionAutomaticFailoverE2ETests.java`
- `IncrementalChangeFeedProcessorTest.java`
Failover Regression Test (DR Drill)
Date: 2026-03-21 12:42-13:28 UTC | Environment: Test14 | Regions: North Central US (Write), West US (Read), East Asia (Read) | Branch: `AzCosmos_WriteAvailabilityStrategyForPPAF@7523b1f7937` | SDK: `azure-cosmos 4.79.0-beta.1` (custom JAR with write hedge fix)
Accounts
- ppaf-strong-0321
- ppaf-session-0321
- noppaf-strong-0321
Workloads
Each account runs 2 workloads (Direct + Gateway mode), each with Create + Read + Query operations. User agent suffix identifies each workload in Kusto.
- dr-ppaf-strong-0321-direct
- dr-ppaf-strong-0321-gateway
- dr-ppaf-session-0321-direct
- dr-ppaf-session-0321-gateway
- dr-noppaf-strong-0321-direct
- dr-noppaf-strong-0321-gateway
Timeline
Region Distribution During Failover
Observation: During quorum loss (QL, 13:05-13:15), PPAF accounts shift writes to West US. The Session consistency account shifts cleanly; Strong consistency accounts experience elevated errors due to quorum requirements. The non-PPAF baseline scatters across all regions with no directed failover.
Success vs Errors
Backend success rates (BackendEndRequest5M):
[Table rows: ppaf-session-0321, ppaf-strong-0321, noppaf-strong-0321; success-rate values not recoverable]
Error Breakdown During Quorum Loss Window
[Table rows: ppaf-session-0321 (3), ppaf-strong-0321 (3), noppaf-strong-0321 (4); error values not recoverable]
Failover and Failback
Observation: PPAF accounts show clean failover (NCentUS drops, West US rises) and failback (NCentUS recovers post-QL). Non-PPAF baseline has no directed failover pattern.
PPAF vs No-PPAF Comparison
Verdict
Kusto queries (cluster: cosmosdbtest.kusto.windows.net, database: Test)
Region distribution (BackendEndRequest5M):
Error breakdown:
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines