Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes. by jeet1995 · Pull Request #48421 · Azure/azure-sdk-for-java

jeet1995 · 2026-03-14T02:19:59Z

Resolves #43148

Description

Problem

During per-partition automatic failover (PPAF), the backend takes up to 60s to elect a new write region for an affected server partition. During this window the SDK retries writes round-robin across regions, which is slow and causes elevated latency for customers on single-writer accounts.

Solution

This PR adds write availability strategy (hedging) for PPAF-enabled single-writer accounts. When a write to the current write region is slow or fails with a PPAF-eligible error, the SDK hedges the write to a read region via the existing ThresholdBasedAvailabilityStrategy infrastructure.

How it works

- New config flag: `COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF` (default: true) controls the feature.
- PPAF-enforced E2E timeout for writes: mirrors the existing read availability strategy policy. Timeout = networkRequestTimeout + 1s, threshold = min(networkRequestTimeout/2, 1s), step = min(threshold/2, 500ms).
- Region resolution: all account-level read regions (not just preferred regions) are used as hedge candidates, since PPAF failover can target any read region.
- Hedged write routing: hedged write requests are force-routed to a target read region via routeToLocation(RegionalRoutingContext), bypassing the excluded-regions mechanism, which cannot route writes to read regions on single-writer accounts.
- Success-only failover entry creation: the PPAF ConcurrentHashMap entry is only created when the hedged write succeeds (via doOnNext callback), preventing bad regions from being persisted if the hedge fails.
- Relaxed idempotent-write gate: PPAF provides partition-level exactly-once semantics, so write hedging is allowed even without explicit idempotentWriteRetries enabled.
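As a rough illustration of the timeout arithmetic in the list above, here is a minimal sketch that derives the three values from a network request timeout. The class and method names are hypothetical (they are not SDK types), and the 5-second input is only an example value, not a claim about the SDK default.

```java
import java.time.Duration;

/** Hypothetical holder for the three policy values; not an SDK type. */
final class PpafWriteE2ePolicySketch {
    final Duration timeout;
    final Duration threshold;
    final Duration thresholdStep;

    private PpafWriteE2ePolicySketch(Duration timeout, Duration threshold, Duration thresholdStep) {
        this.timeout = timeout;
        this.threshold = threshold;
        this.thresholdStep = thresholdStep;
    }

    /** Applies the formulas from the description to an assumed network request timeout. */
    static PpafWriteE2ePolicySketch from(Duration networkRequestTimeout) {
        // Timeout = networkRequestTimeout + 1s
        Duration timeout = networkRequestTimeout.plusSeconds(1);
        // threshold = min(networkRequestTimeout / 2, 1s)
        Duration threshold = min(networkRequestTimeout.dividedBy(2), Duration.ofSeconds(1));
        // step = min(threshold / 2, 500ms)
        Duration step = min(threshold.dividedBy(2), Duration.ofMillis(500));
        return new PpafWriteE2ePolicySketch(timeout, threshold, step);
    }

    private static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        PpafWriteE2ePolicySketch policy = from(Duration.ofSeconds(5));
        System.out.println("timeout=" + policy.timeout
            + " threshold=" + policy.threshold
            + " step=" + policy.thresholdStep);
    }
}
```

With these formulas the threshold is capped at 1s and the step at 500ms, so a larger network request timeout only lengthens the overall E2E timeout.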

Files changed (8 files, +1279 / -12)

| File | Change |
| --- | --- |
| `ClientRetryPolicy.java` | Comment update + PPAF write hedging routing via `writeRegionRoutingContextForPpafAvailabilityStrategy` |
| `Configs.java` | New system property `COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF` (default: `true`) |
| `CrossRegionAvailabilityContextForRxDocumentServiceRequest.java` | New `volatile RegionalRoutingContext` field for hedged write target region + `PartitionKeyRangeWrapper` for success callback |
| `RxDocumentClientImpl.java` | Core logic: `evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites`, `enableAvailabilityStrategyForWrites`, region-to-routing-context map, hedged write setup with `doOnNext` success callback, relaxed write gates |
| `GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java` | `tryAddPartitionLevelLocationOverride` routes hedged writes; `tryRecordSuccessfulWriteHedge` persists entry only on success |
| `CHANGELOG.md` | User-facing feature description |
| `PerPartitionAutomaticFailoverE2ETests.java` | +1054 lines: comprehensive E2E tests |
| `IncrementalChangeFeedProcessorTest.java` | Unrelated flaky test fix |

Failover Regression Test (DR Drill)

Date: 2026-03-21 12:42-13:28 UTC | Environment: Test14 | Regions: North Central US (Write), West US (Read), East Asia (Read) | Branch: AzCosmos_WriteAvailabilityStrategyForPPAF @ 7523b1f7937 | SDK: azure-cosmos 4.79.0-beta.1 (custom JAR with write hedge fix)

Accounts

| Account | Consistency | PPAF | Purpose |
| --- | --- | --- | --- |
| `ppaf-strong-0321` | Strong | ✅ Enabled | PPAF + Strong consistency |
| `ppaf-session-0321` | Session | ✅ Enabled | PPAF + Session consistency |
| `noppaf-strong-0321` | Strong | ❌ Disabled | Baseline (no PPAF) |

Workloads

Each account runs 2 workloads (Direct + Gateway mode), each with Create + Read + Query operations. User agent suffix identifies each workload in Kusto.

| User Agent | Account | Mode | Operations |
| --- | --- | --- | --- |
| `dr-ppaf-strong-0321-direct` | ppaf-strong-0321 | Direct | Create, Read, Query |
| `dr-ppaf-strong-0321-gateway` | ppaf-strong-0321 | Gateway | Create, Read, Query |
| `dr-ppaf-session-0321-direct` | ppaf-session-0321 | Direct | Create, Read, Query |
| `dr-ppaf-session-0321-gateway` | ppaf-session-0321 | Gateway | Create, Read, Query |
| `dr-noppaf-strong-0321-direct` | noppaf-strong-0321 | Direct | Create, Read, Query |
| `dr-noppaf-strong-0321-gateway` | noppaf-strong-0321 | Gateway | Create, Read, Query |

Timeline

| Time (UTC) | Event |
| --- | --- |
| 12:42:52 | Drill start: 6 workloads launched |
| 13:06:44 | Quorum loss injected on all 15 partitions across 3 accounts (North Central US, 10 min duration) |
| ~13:16:44 | Quorum loss resolved |
| 13:28:01 | Drill end: all 6 workloads completed cleanly |

Region Distribution During Failover

Observation: During the quorum loss (QL) window (13:05-13:15), PPAF accounts shift writes to West US. The Session consistency account shifts cleanly; the Strong consistency accounts experience elevated errors due to quorum requirements. The non-PPAF baseline scatters across all regions with no directed failover.

Success vs Errors

Backend success rates (BackendEndRequest5M):

| Account | Consistency | PPAF | Total Requests | Successes | Errors | Success Rate |
| --- | --- | --- | --- | --- | --- | --- |
| `ppaf-session-0321` | Session | ✅ | 12,185 | 12,019 | 166 | 98.64% |
| `ppaf-strong-0321` | Strong | ✅ | 37,369 | 18,853 | 18,516 | 50.45% |
| `noppaf-strong-0321` | Strong | ❌ | 128,777 | 22,287 | 106,490 | 17.31% |

Key finding: PPAF + Session consistency achieves a 98.64% success rate during quorum loss. PPAF + Strong reaches 50.45% (limited by quorum requirements for reads). Without PPAF, the success rate is only 17.31%, so PPAF + Session is roughly a 5.7x improvement over the baseline.
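The percentages and ratios quoted above can be re-derived from the request counts in the table. The following is a small sketch of that arithmetic; the class and method names are illustrative only.

```java
import java.util.Locale;

/** Re-derives the success rates and improvement ratios from the drill's request counts. */
final class DrillSuccessRates {
    static double successRatePercent(long successes, long totalRequests) {
        return 100.0 * successes / totalRequests;
    }

    public static void main(String[] args) {
        double session = successRatePercent(12_019, 12_185);    // ppaf-session-0321
        double strong = successRatePercent(18_853, 37_369);     // ppaf-strong-0321
        double baseline = successRatePercent(22_287, 128_777);  // noppaf-strong-0321

        System.out.printf(Locale.ROOT, "session=%.2f%% strong=%.2f%% baseline=%.2f%%%n",
            session, strong, baseline);
        System.out.printf(Locale.ROOT, "session/baseline=%.1fx strong/baseline=%.1fx%n",
            session / baseline, strong / baseline);
    }
}
```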

Error Breakdown During Quorum Loss Window

| Account | StatusCode/Sub | Region | Count | Explanation |
| --- | --- | --- | --- | --- |
| `ppaf-session-0321` | 403/3 | West US | 6 | Write forbidden during transition. Auto-retried. |
| `ppaf-session-0321` | 403/3 | East Asia | 3 | Brief write during transition. Auto-retried. |
| `ppaf-session-0321` | 404/1002 | NCentUS | 7 | ReadSessionNotAvailable. Auto-retried. |
| `ppaf-strong-0321` | 410/1022 | West US | 10,656 | PartitionKeyRangeGone; Strong requires quorum. |
| `ppaf-strong-0321` | 503/1337 | West US | 787 | Server busy during failover. |
| `ppaf-strong-0321` | 403/3 | West US | 27 | Write forbidden. Auto-retried. |
| `noppaf-strong-0321` | 410/1022 | West US | 61,432 | No directed failover; round-robin to all regions. |
| `noppaf-strong-0321` | 410/1022 | East Asia | 36,292 | Same; hits failing regions repeatedly. |
| `noppaf-strong-0321` | 503/1337 | West US | 5,554 | Server busy, no PPAF mitigation. |
| `noppaf-strong-0321` | 503/1337 | East Asia | 2,853 | Same. |

Failover and Failback

Observation: PPAF accounts show clean failover (NCentUS drops, West US rises) and failback (NCentUS recovers post-QL). Non-PPAF baseline has no directed failover pattern.

PPAF vs No-PPAF Comparison

Verdict

| Scenario | PPAF + Session | PPAF + Strong | No PPAF (baseline) | Result |
| --- | --- | --- | --- | --- |
| Quorum Loss (write region) | 98.64% success, clean failover | 50.45% success (quorum-limited) | 17.31% success | PPAF PASS |
| Failback | Clean return to NCentUS | Clean return | N/A | PASS |
| Write hedging | Writes shift to West US within 1 bucket | Writes shift with elevated 410s | No directed shift | PPAF PASS |

PPAF write availability strategy substantially improves resilience for Session consistency accounts (98.64% vs 17.31% baseline, roughly 5.7x). Strong consistency is inherently limited by quorum requirements during partition-level failures, but PPAF still provides roughly a 2.9x improvement (50.45% vs 17.31%).

Kusto queries (cluster: cosmosdbtest.kusto.windows.net, database: Test)

Region distribution (BackendEndRequest5M):

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-21 12:40) .. datetime(2026-03-21 13:30))
| where GlobalDatabaseAccountName in ('ppaf-strong-0321', 'ppaf-session-0321', 'noppaf-strong-0321')
| where ResourceType == 2
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, GlobalDatabaseAccountName
| render timechart

Error breakdown:

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-21 13:05) .. datetime(2026-03-21 13:20))
| where GlobalDatabaseAccountName in ('ppaf-strong-0321', 'ppaf-session-0321', 'noppaf-strong-0321')
| where ResourceType == 2 and StatusCode >= 400
| summarize ErrorCount=sum(SampleCount) by GlobalDatabaseAccountName, StatusCode, SubStatusCode, Region
| order by GlobalDatabaseAccountName, ErrorCount desc

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message.

Testing Guidelines

Pull request includes test coverage for the included changes.

…riter accounts

Enable proactive write hedging for Per-Partition Automatic Failover (PPAF) on single-writer Cosmos DB accounts. When a write to the primary region is slow or failing, the SDK now hedges the write to a read region, reducing time-to-recovery from 60-120s (retry-based) to the hedging threshold (~1s with default config).

## Problem

In PPAF-enabled single-writer accounts, when a partition fails over, the SDK waits for error signals (503, 408, 410) which can take 60-120s before marking a region as failed for that partition via the retry-based path in GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.

## Solution

Plug the existing availability strategy (hedging) machinery into the write path for PPAF:

1. **Speculation gating** (RxDocumentClientImpl.getApplicableRegionsForSpeculation):
   - Relax the canUseMultipleWriteLocations() gate for PPAF single-writer accounts
   - Relax the isIdempotentWriteRetriesEnabled gate (PPAF provides partition-level consistency)
   - Use ALL account-level read regions (getAvailableReadRoutingContexts) as hedge candidates, not just preferred regions, because PPAF failover can target any read region
2. **Routing** (tryAddPartitionLevelLocationOverride + CrossRegionAvailabilityContext):
   - Add ppafWriteHedgeTargetRegion field to CrossRegionAvailabilityContextForRxDocumentServiceRequest
   - In tryAddPartitionLevelLocationOverride: when ppafWriteHedgeTargetRegion is set, create the conchashmap entry via computeIfAbsent and route via hedgeFailoverInfo.getCurrent()
   - This is synchronous and deterministic: conchashmap updated in the same request pipeline
   - Thread safety: uses getCurrent() from the computeIfAbsent result (not raw hedgeTarget) to avoid routing to a region the concurrent retry path may have marked as failed
3. **Default E2E policy** (evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites):
   - Mirrors the read defaults exactly, giving symmetric hedging behavior for reads and writes
   - Only applied to point write operations (batch excluded via isPointOperation gate)
   - DIRECT: timeout=networkRequestTimeout+1s, threshold=min(timeout/2, 1s), step=500ms
   - GATEWAY: timeout=min(6s, httpTimeout), threshold=min(timeout/2, 1s), step=500ms
4. **Safety lever** (Configs.isWriteAvailabilityStrategyEnabledWithPpaf):
   - System property COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true)
   - Allows opt-out without code changes if regression is observed

## Files changed (6)

- Configs.java: Write availability strategy PPAF config flag
- RxDocumentClientImpl.java: Speculation gating, region resolution, write E2E policy
- CrossRegionAvailabilityContextForRxDocumentServiceRequest.java: ppafWriteHedgeTargetRegion field
- ClientRetryPolicy.java: Honor ppafWriteHedgeTargetRegion in tryAddPartitionLevelLocationOverride
- GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java: Hedge target handling in tryAddPartitionLevelLocationOverride with computeIfAbsent + getCurrent()
- PerPartitionAutomaticFailoverE2ETests.java: 26 new test cases

## Test coverage

| Op | DIRECT (mocked transport) | GATEWAY (mocked HttpClient) |
|---------|--------------------------|----------------------------|
| Create | 410/21005 + 503/21008 | delayed write region |
| Replace | 410/21005 | delayed write region |
| Upsert | 410/21005 | delayed write region |
| Delete | 410/21005 | delayed write region |
| Patch | 410/21005 | delayed write region |

Additional tests:

- Opt-out via COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false
- Batch bypass verification (batch uses retry-based PPAF, not hedging)
- Explicit conchashmap verification: after hedge success, asserts the PPAF manager's partitionKeyRangeToFailoverInfo entry points to a region != the failed write region

All assertions are exact match: 2 regions before failover, 1 region after failover. 165 tests total (existing + new), 0 regressions, 0 modifications to existing test logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
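For the opt-out lever described in this commit, a minimal sketch of reading a boolean JVM system property with a default of `true` might look like the following. Only the property name comes from this PR; the class and method here are hypothetical, and the real `Configs` implementation may read or cache the value differently.

```java
/** Hypothetical sketch of a default-true feature flag backed by a JVM system property. */
final class WriteHedgingFlagSketch {
    static final String PROPERTY = "COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";

    /** Returns true unless the property is explicitly set to something other than "true". */
    static boolean isWriteAvailabilityStrategyEnabledWithPpaf() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    public static void main(String[] args) {
        System.out.println("default: " + isWriteAvailabilityStrategyEnabledWithPpaf());

        // Opt out programmatically; the same effect is available on the command line
        // with -DCOSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false
        System.setProperty(PROPERTY, "false");
        System.out.println("after opt-out: " + isWriteAvailabilityStrategyEnabledWithPpaf());

        System.clearProperty(PROPERTY);
    }
}
```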

…ible error codes

Add 34 new test configurations to write availability strategy hedging tests covering all error codes from the base PPAF E2E test suite:

DIRECT mode:

- 503/21008 (SERVICE_UNAVAILABLE) for Replace, Upsert, Delete, Patch
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops

GATEWAY mode:

- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops
- 408/GATEWAY_ENDPOINT_READ_TIMEOUT (network error) for all 5 write ops
- 503/GATEWAY_ENDPOINT_UNAVAILABLE (network error) for all 5 write ops

Parameterize gateway test method to accept error codes instead of hardcoding 503. Extend setupHttpClientToThrowCosmosException to support combined delay + network error mode for gateway-specific fault types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…abilityStrategyForPPAF

validateChangeFeedProcessing called stop() with subscribe() (fire-and-forget), then returned. The caller immediately starts a full fidelity CFP on the same lease container. With the CI optimization (PR Azure#48259) making validateChangeFeedProcessing return faster, the async stop has not released its leases yet, causing the next start() to hang and time out.

Fix: change stop() from subscribe() to block() so leases are fully released before returning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
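The difference between a fire-and-forget stop and a blocking stop can be sketched with the JDK's `CompletableFuture`, used here as a stand-in for Reactor's `subscribe()` and `block()`. The class below and its 200 ms release delay are hypothetical; the sketch only illustrates why a caller that does not wait may still see the lease held.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a processor that holds a lease until its async stop completes. */
final class StopThenRestartSketch {
    private final AtomicBoolean leaseHeld = new AtomicBoolean(true);

    /** Asynchronous stop: the lease is released only when the returned future completes. */
    CompletableFuture<Void> stopAsync() {
        return CompletableFuture.runAsync(
            () -> leaseHeld.set(false),
            CompletableFuture.delayedExecutor(200, TimeUnit.MILLISECONDS));
    }

    boolean isLeaseHeld() {
        return leaseHeld.get();
    }

    public static void main(String[] args) {
        // Fire-and-forget: the caller moves on while the lease is most likely still held.
        StopThenRestartSketch fireAndForget = new StopThenRestartSketch();
        CompletableFuture<Void> pending = fireAndForget.stopAsync();
        System.out.println("lease held right after fire-and-forget stop: "
            + fireAndForget.isLeaseHeld());
        pending.join();

        // Blocking: join() returns only after the lease has been released.
        StopThenRestartSketch blocking = new StopThenRestartSketch();
        blocking.stopAsync().join();
        System.out.println("lease held after blocking stop: " + blocking.isLeaseHeld());
    }
}
```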

- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf for clarity (both in wrapPointOperationWithAvailabilityStrategy and getApplicableRegionsForSpeculation)
- Scope HashMap allocation: only create when PPAF write availability strategy is applicable, use Collections.emptyMap() otherwise to avoid per-request allocation for reads
- Update comments to use 'availability strategy' terminology

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf
- Rename ppafWriteHedgeTargetRegion to writeRegionRoutingContextForPpafAvailabilityStrategy
- Move field to top of CrossRegionAvailabilityContext class
- Replace conchashmap with ConcurrentHashMap in comments
- Scope HashMap allocation: Collections.emptyMap() for reads
- Fix unsafe computeIfAbsent: do not create failover override until hedge succeeds
- Verify read/write e2e policies produce identical values (confirmed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 · 2026-03-19T15:49:57Z

Investigation: `nonIdempotentWriteRetryPolicy` impact on PPAF write availability strategy

Finding

When PPAF write availability strategy is enabled, write hedging activates even when nonIdempotentWriteRetryPolicy is disabled (i.e., isIdempotentWriteRetriesEnabled=false). This is by design (lines 7818-7829 in getApplicableRegionsForSpeculation) — the comment states PPAF provides "exactly-once semantics for writes to failed-over partitions."

However, this creates a safety gap when both the primary and hedged writes succeed:

The Risk

| Scenario | `nonIdempotentWriteRetryPolicy=TRUE` | `nonIdempotentWriteRetryPolicy=FALSE` |
| --- | --- | --- |
| Tracking ID generated | ✅ UUID per request | ❌ None |
| Hedging fires (PPAF) | ✅ | ✅ (bypasses idempotent check) |
| Primary + hedge both commit | 409 on duplicate → read doc → verify `_trackingId` → return 201 | Both writes persist, no dedup |
| ClientRetryPolicy 503 retry | Cross-region retry allowed | PPAF override: always retries (line 601-603) |
| Cancellation safety | Safe: `_trackingId` catches late commits | Unsafe: no dedup mechanism |

Code Path

Mono.firstWithValue(monoList) fires primary + hedged requests in parallel
First response wins; slower subscription is cancelled
But cancellation is not guaranteed to prevent the write from committing — the request may already be in-flight at the backend
Without tracking IDs, if both writes commit:
- Create: duplicate document (or 409 conflict with no way to detect it's "our" duplicate)
- Replace/Patch: last-write-wins, but client only sees the first response
- Upsert: duplicate or overwrite depending on timing
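The race described above can be reproduced in miniature with JDK primitives. In this sketch `CompletableFuture.anyOf` stands in for Reactor's `Mono.firstWithValue`, a list stands in for the backend, and latches force the ordering; all names are hypothetical. It shows that cancelling the slower future after the hedge wins does not stop a write that is already in flight from committing.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class HedgedWriteRaceSketch {
    static final class Result {
        final String winner;
        final List<String> committed;

        Result(String winner, List<String> committed) {
            this.winner = winner;
            this.committed = committed;
        }
    }

    static Result runRace() {
        List<String> committed = new CopyOnWriteArrayList<>(); // stand-in for the backend
        CountDownLatch primaryInFlight = new CountDownLatch(1);
        CountDownLatch releasePrimary = new CountDownLatch(1);
        CountDownLatch primaryFinished = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Slow primary write: already in flight, commits only once released.
            CompletableFuture<String> primary = CompletableFuture.supplyAsync(() -> {
                primaryInFlight.countDown();
                awaitQuietly(releasePrimary);
                committed.add("primary");
                primaryFinished.countDown();
                return "primary";
            }, pool);
            // Hedged write: commits immediately.
            CompletableFuture<String> hedge = CompletableFuture.supplyAsync(() -> {
                committed.add("hedge");
                return "hedge";
            }, pool);

            awaitQuietly(primaryInFlight);
            String winner = (String) CompletableFuture.anyOf(primary, hedge).join();
            primary.cancel(true);        // marks the future cancelled; the running task continues
            releasePrimary.countDown();  // the in-flight primary write now reaches the backend
            awaitQuietly(primaryFinished);
            return new Result(winner, List.copyOf(committed));
        } finally {
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Result result = runRace();
        System.out.println("winner=" + result.winner + " committed=" + result.committed);
    }
}
```

The caller only ever observes the hedge's response, yet both writes end up in the simulated backend, which is the situation a dedup mechanism has to handle.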

Recommendation

When applyAvailabilityStrategyForWritesForPpaf=true and isIdempotentWriteRetriesEnabled=false, the PPAF write availability strategy should ensure tracking IDs are generated for write operations. Options:

1. In evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites: Return a policy config that also sets nonIdempotentWriteRetriesEnabled=true + useTrackingIds=true on the request options
2. In getApplicableRegionsForSpeculation: When the PPAF bypass fires (line 7827), also set tracking IDs on the request
3. In wrapPointOperationWithAvailabilityStrategy: When applyAvailabilityStrategyForWritesForPpaf=true, inject tracking ID generation into the callback

Option 3 is cleanest as it's scoped to the hedging path only.
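To make the tracking-ID idea concrete, here is a minimal in-memory sketch of the dedup flow from the risk table (409 on duplicate, read the document, compare the tracking ID, report 201). The store, method names, and status handling are hypothetical simplifications and do not reflect the SDK's actual implementation.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of create-with-tracking-ID dedup. */
final class TrackingIdDedupSketch {
    // Maps document id to the tracking ID stored with the committed document.
    private final Map<String, String> trackingIdByDocId = new ConcurrentHashMap<>();

    /** Raw create: 201 if the id was free, 409 if a document with that id already exists. */
    int create(String docId, String trackingId) {
        return trackingIdByDocId.putIfAbsent(docId, trackingId) == null ? 201 : 409;
    }

    /** Create that resolves a 409 by checking whether the stored document is "ours". */
    int createWithDedup(String docId, String trackingId) {
        int status = create(docId, trackingId);
        if (status == 409 && trackingId.equals(trackingIdByDocId.get(docId))) {
            // An earlier attempt of this same logical request already committed.
            return 201;
        }
        return status;
    }

    public static void main(String[] args) {
        TrackingIdDedupSketch store = new TrackingIdDedupSketch();
        String trackingId = UUID.randomUUID().toString(); // one tracking ID per logical request

        System.out.println("primary attempt: " + store.createWithDedup("doc-1", trackingId));
        System.out.println("hedged attempt:  " + store.createWithDedup("doc-1", trackingId));
        System.out.println("other writer:    "
            + store.createWithDedup("doc-1", UUID.randomUUID().toString()));
    }
}
```

Without a tracking ID the second attempt has no way to tell its own late commit apart from a genuine conflict, which is the gap the recommendation aims to close.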

…abilityStrategyForPPAF

When write availability strategy hedges a write to a read region, the PPAF ConcurrentHashMap entry must be created only when the hedged request succeeds, not eagerly during routing.

Design:

- tryAddPartitionLevelLocationOverride routes the hedged write and captures the resolved PartitionKeyRangeWrapper on the CrossRegionAvailabilityContext
- doOnNext on the hedged Mono calls tryRecordSuccessfulWriteHedge only when the result is not an error, persisting the failover entry via computeIfAbsent
- Failed hedges leave no map entry, preventing bad regions from being persisted

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
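The success-only design can be sketched with a plain `ConcurrentHashMap`. The class below is a hypothetical simplification: string keys stand in for `PartitionKeyRangeWrapper`, region names stand in for the failover info, and a boolean stands in for the outcome that the real code observes through `doOnNext`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model: the override entry is written only on the hedge's success path. */
final class SuccessOnlyFailoverMapSketch {
    private final Map<String, String> regionByPartitionKeyRange = new ConcurrentHashMap<>();

    /** Invoked with the hedged write's outcome; failures must not touch the map. */
    void onHedgedWriteResult(String partitionKeyRangeId, String hedgedRegion, boolean succeeded) {
        if (succeeded) {
            // computeIfAbsent keeps an entry that another request may already have recorded.
            regionByPartitionKeyRange.computeIfAbsent(partitionKeyRangeId, key -> hedgedRegion);
        }
    }

    /** Returns the recorded override region, or null when none exists. */
    String overrideFor(String partitionKeyRangeId) {
        return regionByPartitionKeyRange.get(partitionKeyRangeId);
    }

    public static void main(String[] args) {
        SuccessOnlyFailoverMapSketch manager = new SuccessOnlyFailoverMapSketch();

        manager.onHedgedWriteResult("pkRange-0", "East Asia", false);
        System.out.println("after failed hedge: " + manager.overrideFor("pkRange-0"));

        manager.onHedgedWriteResult("pkRange-0", "West US", true);
        System.out.println("after successful hedge: " + manager.overrideFor("pkRange-0"));
    }
}
```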

jeet1995 · 2026-03-21T04:09:49Z

/azp run java - cosmos - tests

azure-pipelines · 2026-03-21T04:10:20Z

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Charts will be hosted as GitHub issue attachments instead of committed to sdk/cosmos/docs/ which gets bundled into JARs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions bot added the Cosmos label Mar 14, 2026

jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch 2 times, most recently from 72cd8e0 to acbf49c Compare March 14, 2026 22:35

jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch from b07dde3 to 46125f4 Compare March 14, 2026 23:28

jeet1995 changed the title ~~Az cosmos write availability strategy for ppaf~~ Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes. Mar 16, 2026

jeet1995 and others added 2 commits March 16, 2026 14:21

docs(cosmos): add PPAF write availability strategy to CHANGELOG

629c7b2

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 mentioned this pull request Mar 17, 2026

E2E Test Infrastructure: JSON Config, Change Feed Workload, and Custom JAR Support jeet1995/ppaf-dr-drill#4

Open

7 tasks

jeet1995 and others added 4 commits March 18, 2026 19:47

Merge remote-tracking branch 'upstream/main' into AzCosmos_WriteAvail…

c5b74f6

…abilityStrategyForPPAF

jeet1995 and others added 3 commits March 20, 2026 19:38

Merge remote-tracking branch 'upstream/main' into AzCosmos_WriteAvail…

628e9c6

…abilityStrategyForPPAF

Merge remote-tracking branch 'upstream/main' into AzCosmos_WriteAvail…

621f203

…abilityStrategyForPPAF

jeet1995 and others added 2 commits March 21, 2026 09:36

Add PPAF write availability strategy DR drill charts (2026-03-21)

7523b1f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove DR drill charts from repo to avoid JAR bloating

311edfb

Charts will be hosted as GitHub issue attachments instead of committed to sdk/cosmos/docs/ which gets bundled into JARs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jeet1995 wants to merge 12 commits into Azure:main from jeet1995:AzCosmos_WriteAvailabilityStrategyForPPAF
