Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes. #48421

Draft
jeet1995 wants to merge 12 commits into Azure:main from jeet1995:AzCosmos_WriteAvailabilityStrategyForPPAF

jeet1995 (Member) commented Mar 14, 2026

Resolves #43148

Description

Problem

During per-partition automatic failover (PPAF), the backend takes up to 60s to elect a new write region for an affected server partition. During this window the SDK retries writes round-robin across regions, which is slow and causes elevated latency for customers on single-writer accounts.

Solution

This PR adds write availability strategy (hedging) for PPAF-enabled single-writer accounts. When a write to the current write region is slow or fails with a PPAF-eligible error, the SDK hedges the write to a read region via the existing ThresholdBasedAvailabilityStrategy infrastructure.

How it works

  1. New config flag COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true) controls the feature.
  2. PPAF-enforced E2E timeout for writes — mirrors the existing read availability strategy policy. Timeout = networkRequestTimeout + 1s, threshold = min(networkRequestTimeout/2, 1s), step = min(threshold/2, 500ms).
  3. Region resolution — all account-level read regions (not just preferred regions) are used as hedge candidates, since PPAF failover can target any read region.
  4. Hedged write routing — hedged write requests are force-routed to a target read region via routeToLocation(RegionalRoutingContext), bypassing the excluded-regions mechanism which cannot route writes to read regions on single-writer accounts.
  5. Success-only failover entry creation — the PPAF ConcurrentHashMap entry is only created when the hedged write succeeds (via doOnNext callback), preventing bad regions from being persisted if the hedge fails.
  6. Relaxed idempotent-write gate — PPAF provides partition-level exactly-once semantics, so write hedging is allowed even without explicit idempotentWriteRetries enabled.
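
The timeout arithmetic in step 2 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the SDK's implementation: the class `PpafWritePolicySketch`, the `Policy` holder, and `forNetworkTimeout` are hypothetical names, and the 5s network request timeout is only an example input.

```java
import java.time.Duration;

public class PpafWritePolicySketch {

    // Hypothetical holder for the three values named in step 2.
    static final class Policy {
        final Duration timeout;
        final Duration threshold;
        final Duration step;

        Policy(Duration timeout, Duration threshold, Duration step) {
            this.timeout = timeout;
            this.threshold = threshold;
            this.step = step;
        }

        @Override
        public String toString() {
            return "timeout=" + timeout + ", threshold=" + threshold + ", step=" + step;
        }
    }

    static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    // timeout   = networkRequestTimeout + 1s
    // threshold = min(networkRequestTimeout / 2, 1s)
    // step      = min(threshold / 2, 500ms)
    static Policy forNetworkTimeout(Duration networkRequestTimeout) {
        Duration timeout = networkRequestTimeout.plusSeconds(1);
        Duration threshold = min(networkRequestTimeout.dividedBy(2), Duration.ofSeconds(1));
        Duration step = min(threshold.dividedBy(2), Duration.ofMillis(500));
        return new Policy(timeout, threshold, step);
    }

    public static void main(String[] args) {
        // Example input only; the real value comes from the client's connection config.
        System.out.println(forNetworkTimeout(Duration.ofSeconds(5)));
    }
}
```

With these formulas the threshold is capped at 1s and the step at 500ms, so a larger network request timeout only lengthens the overall timeout.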

Files changed (8 files, +1279 / -12)

| File | Change |
|------|--------|
| ClientRetryPolicy.java | Comment update + PPAF write hedging routing via writeRegionRoutingContextForPpafAvailabilityStrategy |
| Configs.java | New system property COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true) |
| CrossRegionAvailabilityContextForRxDocumentServiceRequest.java | New volatile RegionalRoutingContext field for hedged write target region + PartitionKeyRangeWrapper for success callback |
| RxDocumentClientImpl.java | Core logic: evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites, enableAvailabilityStrategyForWrites, region-to-routing-context map, hedged write setup with doOnNext success callback, relaxed write gates |
| GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java | tryAddPartitionLevelLocationOverride routes hedged writes; tryRecordSuccessfulWriteHedge persists entry only on success |
| CHANGELOG.md | User-facing feature description |
| PerPartitionAutomaticFailoverE2ETests.java | +1054 lines: comprehensive E2E tests |
| IncrementalChangeFeedProcessorTest.java | Unrelated flaky test fix |

Failover Regression Test (DR Drill)

Date: 2026-03-21 12:42-13:28 UTC | Environment: Test14 | Regions: North Central US (Write), West US (Read), East Asia (Read) | Branch: AzCosmos_WriteAvailabilityStrategyForPPAF @ 7523b1f7937 | SDK: azure-cosmos 4.79.0-beta.1 (custom JAR with write hedge fix)

Accounts

| Account | Consistency | PPAF | Purpose |
|---------|-------------|------|---------|
| ppaf-strong-0321 | Strong | ✅ Enabled | PPAF + Strong consistency |
| ppaf-session-0321 | Session | ✅ Enabled | PPAF + Session consistency |
| noppaf-strong-0321 | Strong | ❌ Disabled | Baseline (no PPAF) |

Workloads

Each account runs 2 workloads (Direct + Gateway mode), each with Create + Read + Query operations. User agent suffix identifies each workload in Kusto.

| User Agent | Account | Mode | Operations |
|------------|---------|------|------------|
| dr-ppaf-strong-0321-direct | ppaf-strong-0321 | Direct | Create, Read, Query |
| dr-ppaf-strong-0321-gateway | ppaf-strong-0321 | Gateway | Create, Read, Query |
| dr-ppaf-session-0321-direct | ppaf-session-0321 | Direct | Create, Read, Query |
| dr-ppaf-session-0321-gateway | ppaf-session-0321 | Gateway | Create, Read, Query |
| dr-noppaf-strong-0321-direct | noppaf-strong-0321 | Direct | Create, Read, Query |
| dr-noppaf-strong-0321-gateway | noppaf-strong-0321 | Gateway | Create, Read, Query |

Timeline

| Time (UTC) | Event |
|------------|-------|
| 12:42:52 | Drill start — 6 workloads launched |
| 13:06:44 | Quorum loss injected on all 15 partitions across 3 accounts (North Central US, 10 min duration) |
| ~13:16:44 | Quorum loss resolved |
| 13:28:01 | Drill end — all 6 workloads completed cleanly |

Region Distribution During Failover

[Chart: Region Distribution]

Observation: During the quorum loss (QL) window (13:05-13:15), PPAF accounts shift writes to West US. The Session consistency account shifts cleanly; the Strong consistency accounts see elevated errors due to quorum requirements. The non-PPAF baseline scatters requests across all regions with no directed failover.


Success vs Errors

[Chart: Success vs Errors]

Backend success rates (BackendEndRequest5M):

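| Account | Consistency | PPAF | Total Requests | Successes | Errors | Success Rate |
|---------|-------------|------|----------------|-----------|--------|--------------|
| ppaf-session-0321 | Session | Enabled | 12,185 | 12,019 | 166 | 98.64% |
| ppaf-strong-0321 | Strong | Enabled | 37,369 | 18,853 | 18,516 | 50.45% |
| noppaf-strong-0321 | Strong | Disabled | 128,777 | 22,287 | 106,490 | 17.31% |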

Key finding: PPAF + Session consistency achieves 98.64% success rate during quorum loss. PPAF + Strong has 50.45% (limited by quorum requirements for reads). Without PPAF, only 17.31% — a 5.7x improvement with PPAF enabled.
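
As a quick cross-check of the figures above, the success rates and the improvement ratios can be recomputed from the raw request counts in the table. The class and method names here are illustrative only.

```java
public class SuccessRateCheck {

    // Success rate as a percentage of total requests.
    static double successRatePct(long successes, long total) {
        return 100.0 * successes / total;
    }

    public static void main(String[] args) {
        // Counts taken from the backend success-rate table.
        double session = successRatePct(12_019, 12_185);
        double strong = successRatePct(18_853, 37_369);
        double baseline = successRatePct(22_287, 128_777);

        System.out.printf("PPAF + Session: %.2f%%%n", session);
        System.out.printf("PPAF + Strong:  %.2f%%%n", strong);
        System.out.printf("No PPAF:        %.2f%%%n", baseline);
        System.out.printf("Session vs baseline: %.1fx%n", session / baseline);
        System.out.printf("Strong vs baseline:  %.1fx%n", strong / baseline);
    }
}
```

Note that the three accounts handled very different request volumes during the window (the baseline issued roughly ten times as many requests as the Session account, largely retries), so the ratios compare rates, not absolute throughput.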


Error Breakdown During Quorum Loss Window

[Chart: Error Breakdown]

| Account | StatusCode/Sub | Region | Count | Explanation |
|---------|----------------|--------|-------|-------------|
| ppaf-session-0321 | 403/3 | West US | 6 | Write forbidden during transition. Auto-retried. |
| ppaf-session-0321 | 403/3 | East Asia | 3 | Brief write during transition. Auto-retried. |
| ppaf-session-0321 | 404/1002 | NCentUS | 7 | ReadSessionNotAvailable. Auto-retried. |
| ppaf-strong-0321 | 410/1022 | West US | 10,656 | PartitionKeyRangeGone — Strong requires quorum. |
| ppaf-strong-0321 | 503/1337 | West US | 787 | Server busy during failover. |
| ppaf-strong-0321 | 403/3 | West US | 27 | Write forbidden. Auto-retried. |
| noppaf-strong-0321 | 410/1022 | West US | 61,432 | No directed failover — round-robin to all regions. |
| noppaf-strong-0321 | 410/1022 | East Asia | 36,292 | Same — hits failing regions repeatedly. |
| noppaf-strong-0321 | 503/1337 | West US | 5,554 | Server busy, no PPAF mitigation. |
| noppaf-strong-0321 | 503/1337 | East Asia | 2,853 | Same. |

Failover and Failback

[Chart: Failover Failback]

Observation: PPAF accounts show clean failover (NCentUS drops, West US rises) and failback (NCentUS recovers post-QL). Non-PPAF baseline has no directed failover pattern.


PPAF vs No-PPAF Comparison

[Chart: PPAF vs NoPPAF]


Verdict

| Scenario | PPAF + Session | PPAF + Strong | No PPAF (baseline) | Result |
|----------|----------------|---------------|--------------------|--------|
| Quorum Loss (write region) | 98.64% success, clean failover | 50.45% success (quorum-limited) | 17.31% success | PPAF PASS |
| Failback | Clean return to NCentUS | Clean return | N/A | PASS |
| Write hedging | Writes shift to West US within 1 bucket | Writes shift with elevated 410s | No directed shift | PPAF PASS |

PPAF write availability strategy raises the success rate for Session consistency accounts from 17.31% (baseline) to 98.64%. Strong consistency is inherently limited by quorum requirements during partition-level failures, but PPAF still provides a roughly 2.9x improvement (50.45% vs 17.31%).


Kusto queries (cluster: cosmosdbtest.kusto.windows.net, database: Test)

Region distribution (BackendEndRequest5M):

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-21 12:40) .. datetime(2026-03-21 13:30))
| where GlobalDatabaseAccountName in ('ppaf-strong-0321', 'ppaf-session-0321', 'noppaf-strong-0321')
| where ResourceType == 2
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, GlobalDatabaseAccountName
| render timechart

Error breakdown:

BackendEndRequest5M
| where TIMESTAMP between (datetime(2026-03-21 13:05) .. datetime(2026-03-21 13:20))
| where GlobalDatabaseAccountName in ('ppaf-strong-0321', 'ppaf-session-0321', 'noppaf-strong-0321')
| where ResourceType == 2 and StatusCode >= 400
| summarize ErrorCount=sum(SampleCount) by GlobalDatabaseAccountName, StatusCode, SubStatusCode, Region
| order by GlobalDatabaseAccountName, ErrorCount desc

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch 2 times, most recently from 72cd8e0 to acbf49c, on March 14, 2026 22:35
…riter accounts

Enable proactive write hedging for Per-Partition Automatic Failover (PPAF) on single-writer
Cosmos DB accounts. When a write to the primary region is slow or failing, the SDK now hedges
the write to a read region — reducing time-to-recovery from 60-120s (retry-based) to the
hedging threshold (~1s with default config).

## Problem

In PPAF-enabled single-writer accounts, when a partition fails over, the SDK waits for error
signals (503, 408, 410) which can take 60-120s before marking a region as failed for that
partition via the retry-based path in GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.

## Solution

Plug the existing availability strategy (hedging) machinery into the write path for PPAF:

1. **Speculation gating** (RxDocumentClientImpl.getApplicableRegionsForSpeculation):
   - Relax the canUseMultipleWriteLocations() gate for PPAF single-writer accounts
   - Relax the isIdempotentWriteRetriesEnabled gate (PPAF provides partition-level consistency)
   - Use ALL account-level read regions (getAvailableReadRoutingContexts) as hedge candidates,
     not just preferred regions — PPAF failover can target any read region

2. **Routing** (tryAddPartitionLevelLocationOverride + CrossRegionAvailabilityContext):
   - Add ppafWriteHedgeTargetRegion field to CrossRegionAvailabilityContextForRxDocumentServiceRequest
   - In tryAddPartitionLevelLocationOverride: when ppafWriteHedgeTargetRegion is set, create the
     conchashmap entry via computeIfAbsent and route via hedgeFailoverInfo.getCurrent()
   - This is synchronous and deterministic — conchashmap updated in the same request pipeline
   - Thread safety: uses getCurrent() from the computeIfAbsent result (not raw hedgeTarget)
     to avoid routing to a region the concurrent retry path may have marked as failed

3. **Default E2E policy** (evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites):
   - Mirrors the read defaults exactly — symmetric hedging behavior for reads and writes
   - Only applied to point write operations (batch excluded via isPointOperation gate)
   - DIRECT: timeout=networkRequestTimeout+1s, threshold=min(timeout/2, 1s), step=500ms
   - GATEWAY: timeout=min(6s, httpTimeout), threshold=min(timeout/2, 1s), step=500ms

4. **Safety lever** (Configs.isWriteAvailabilityStrategyEnabledWithPpaf):
   - System property COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF (default: true)
   - Allows opt-out without code changes if regression is observed
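
As a rough illustration of the safety lever in item 4, a default-true system property can be read and overridden as below. This is a sketch under assumptions: the commit does not show how Configs parses the value (or whether an environment variable is also honored), and the class `PpafWriteHedgingFlagSketch` is a hypothetical name. Only the property name comes from the commit message.

```java
public class PpafWriteHedgingFlagSketch {

    static final String PROPERTY = "COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF";

    // Reads the flag from JVM system properties; an unset property
    // falls back to "true", matching the documented default.
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(PROPERTY, "true"));
    }

    public static void main(String[] args) {
        System.out.println("default: " + isEnabled());

        // Opt out programmatically (assumed to be needed before the client is built).
        System.setProperty(PROPERTY, "false");
        System.out.println("after opt-out: " + isEnabled());

        System.clearProperty(PROPERTY);
    }
}
```

The same opt-out can be passed on the JVM command line as `-DCOSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false`, which avoids any code change.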

## Files changed (6)

- Configs.java: Write availability strategy PPAF config flag
- RxDocumentClientImpl.java: Speculation gating, region resolution, write E2E policy
- CrossRegionAvailabilityContextForRxDocumentServiceRequest.java: ppafWriteHedgeTargetRegion field
- ClientRetryPolicy.java: Honor ppafWriteHedgeTargetRegion in tryAddPartitionLevelLocationOverride
- GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java: Hedge target handling
  in tryAddPartitionLevelLocationOverride with computeIfAbsent + getCurrent()
- PerPartitionAutomaticFailoverE2ETests.java: 26 new test cases

## Test coverage

| Op      | DIRECT (mocked transport) | GATEWAY (mocked HttpClient) |
|---------|--------------------------|----------------------------|
| Create  | 410/21005 + 503/21008    | delayed write region       |
| Replace | 410/21005                | delayed write region       |
| Upsert  | 410/21005                | delayed write region       |
| Delete  | 410/21005                | delayed write region       |
| Patch   | 410/21005                | delayed write region       |

Additional tests:
- Opt-out via COSMOS.IS_WRITE_AVAILABILITY_STRATEGY_ENABLED_WITH_PPAF=false
- Batch bypass verification (batch uses retry-based PPAF, not hedging)
- Explicit conchashmap verification: after hedge success, asserts the PPAF manager's
  partitionKeyRangeToFailoverInfo entry points to a region != the failed write region

All assertions are exact match: 2 regions before failover, 1 region after failover.
165 tests total (existing + new), 0 regressions, 0 modifications to existing test logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 force-pushed the AzCosmos_WriteAvailabilityStrategyForPPAF branch from b07dde3 to 46125f4 on March 14, 2026 23:28
jeet1995 changed the title from "Az cosmos write availability strategy for ppaf" to "Per-Partition Automatic Failover: Faster detection of per-partition write region through availability strategy for writes." on Mar 16, 2026
jeet1995 and others added 2 commits March 16, 2026 14:21
…ible error codes

Add 34 new test configurations to write availability strategy hedging
tests covering all error codes from the base PPAF E2E test suite:

DIRECT mode:
- 503/21008 (SERVICE_UNAVAILABLE) for Replace, Upsert, Delete, Patch
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops

GATEWAY mode:
- 403/3 (FORBIDDEN_WRITEFORBIDDEN) for all 5 write ops
- 408/UNKNOWN (REQUEST_TIMEOUT) for all 5 write ops
- 408/GATEWAY_ENDPOINT_READ_TIMEOUT (network error) for all 5 write ops
- 503/GATEWAY_ENDPOINT_UNAVAILABLE (network error) for all 5 write ops

Parameterize gateway test method to accept error codes instead of
hardcoding 503. Extend setupHttpClientToThrowCosmosException to support
combined delay + network error mode for gateway-specific fault types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 and others added 4 commits March 18, 2026 19:47
validateChangeFeedProcessing called stop() with subscribe()
(fire-and-forget), then returned. The caller immediately starts
a full fidelity CFP on the same lease container. With the CI
optimization (PR Azure#48259) making validateChangeFeedProcessing
return faster, the async stop hasn't released leases yet, causing
the next start() to hang and time out.

Fix: change stop() from subscribe() to block() so leases are
fully released before returning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf
  for clarity (both in wrapPointOperationWithAvailabilityStrategy and
  getApplicableRegionsForSpeculation)
- Scope HashMap allocation: only create when PPAF write availability
  strategy is applicable, use Collections.emptyMap() otherwise to
  avoid per-request allocation for reads
- Update comments to use 'availability strategy' terminology

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Rename isPpafWriteHedging to applyAvailabilityStrategyForWritesForPpaf
- Rename ppafWriteHedgeTargetRegion to writeRegionRoutingContextForPpafAvailabilityStrategy
- Move field to top of CrossRegionAvailabilityContext class
- Replace conchashmap with ConcurrentHashMap in comments
- Scope HashMap allocation: Collections.emptyMap() for reads
- Fix unsafe computeIfAbsent: do not create failover override until hedge succeeds
- Verify read/write e2e policies produce identical values (confirmed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jeet1995 (Member, Author) commented:

Investigation: nonIdempotentWriteRetryPolicy impact on PPAF write availability strategy

Finding

When PPAF write availability strategy is enabled, write hedging activates even when nonIdempotentWriteRetryPolicy is disabled (i.e., isIdempotentWriteRetriesEnabled=false). This is by design (lines 7818-7829 in getApplicableRegionsForSpeculation) — the comment states PPAF provides "exactly-once semantics for writes to failed-over partitions."

However, this creates a safety gap when both the primary and hedged writes succeed:

The Risk

| Scenario | nonIdempotentWriteRetryPolicy=TRUE | nonIdempotentWriteRetryPolicy=FALSE |
|----------|------------------------------------|-------------------------------------|
| Tracking ID generated | ✅ UUID per request | ❌ None |
| Hedging fires (PPAF) | ✅ | (bypasses idempotent check) |
| Primary + hedge both commit | 409 on duplicate → read doc → verify _trackingId → return 201 | Both writes persist — no dedup |
| ClientRetryPolicy 503 retry | Cross-region retry allowed | PPAF override: always retries (line 601-603) |
| Cancellation safety | Safe — _trackingId catches late commits | Unsafe — no dedup mechanism |

Code Path

  1. Mono.firstWithValue(monoList) fires primary + hedged requests in parallel
  2. First response wins; slower subscription is cancelled
  3. But cancellation is not guaranteed to prevent the write from committing — the request may already be in-flight at the backend
  4. Without tracking IDs, if both writes commit:
    • Create: duplicate document (or 409 conflict with no way to detect it's "our" duplicate)
    • Replace/Patch: last-write-wins, but client only sees the first response
    • Upsert: duplicate or overwrite depending on timing
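
The code path above can be simulated with plain JDK futures to show why cancelling the loser does not undo its commit. This is a simplified stand-in, not the SDK code: `CompletableFuture.anyOf` takes the place of Reactor's `Mono.firstWithValue`, and the in-memory `committed` list and `write` helper are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

public class HedgeRaceSketch {

    // Stand-in for the backend: every write that reaches it is committed.
    static final List<String> committed = new CopyOnWriteArrayList<>();

    // Commits immediately, but delivers the response only after responseDelayMs.
    static CompletableFuture<String> write(String region, String docId, long responseDelayMs) {
        committed.add(region + ":" + docId);
        return CompletableFuture.supplyAsync(
            () -> region,
            CompletableFuture.delayedExecutor(responseDelayMs, TimeUnit.MILLISECONDS));
    }

    // Fires primary and hedge in parallel and returns the region that answered first.
    static String race(String docId) {
        CompletableFuture<String> primary = write("North Central US", docId, 2_000);
        CompletableFuture<String> hedge = write("West US", docId, 10);

        String winner = (String) CompletableFuture.anyOf(primary, hedge).join();
        primary.cancel(true); // cancels only the client-side wait, not the commit
        return winner;
    }

    public static void main(String[] args) {
        String winner = race("doc-1");
        System.out.println("winner=" + winner + ", committed=" + committed);
    }
}
```

The client observes a single response from the faster region, yet the stand-in backend holds two committed writes, which is the situation step 4 describes.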

Recommendation

When applyAvailabilityStrategyForWritesForPpaf=true and isIdempotentWriteRetriesEnabled=false, the PPAF write availability strategy should ensure tracking IDs are generated for write operations. Options:

  1. In evaluatePpafEnforcedE2eLatencyPolicyCfgForWrites: Return a policy config that also sets nonIdempotentWriteRetriesEnabled=true + useTrackingIds=true on the request options
  2. In getApplicableRegionsForSpeculation: When the PPAF bypass fires (line 7827), also set tracking IDs on the request
  3. In wrapPointOperationWithAvailabilityStrategy: When applyAvailabilityStrategyForWritesForPpaf=true, inject tracking ID generation into the callback

Option 3 is cleanest as it's scoped to the hedging path only.
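
For reference, the dedup flow that tracking IDs enable (409 on duplicate, read the document, verify `_trackingId`, return 201) can be sketched as follows. The in-memory `store` and both method names are hypothetical, and the sketch ignores partition keys, ETags, and networking entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TrackingIdDedupSketch {

    // Stand-in backend: document id -> tracking id stored with the document.
    static final Map<String, String> store = new HashMap<>();

    // Returns 201 when the document is created, 409 when the id already exists.
    static int create(String docId, String trackingId) {
        return store.putIfAbsent(docId, trackingId) == null ? 201 : 409;
    }

    // On 409, read the document back; if it carries our tracking id, an earlier
    // attempt of this same logical write committed it, so report 201.
    static int createWithTrackingId(String docId, String trackingId) {
        int status = create(docId, trackingId);
        if (status == 409 && trackingId.equals(store.get(docId))) {
            return 201;
        }
        return status;
    }

    public static void main(String[] args) {
        String trackingId = UUID.randomUUID().toString();
        int hedge = createWithTrackingId("doc-1", trackingId);   // hedge commits first
        int primary = createWithTrackingId("doc-1", trackingId); // late primary hits 409, resolved to 201
        int other = createWithTrackingId("doc-1", UUID.randomUUID().toString()); // genuine conflict
        System.out.println(hedge + " " + primary + " " + other);
    }
}
```

Without a tracking ID the late primary has no way to tell its own duplicate from a genuine conflict, which is the gap the recommendation targets.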

jeet1995 and others added 3 commits March 20, 2026 19:38
When write availability strategy hedges a write to a read region, the
PPAF ConcurrentHashMap entry must be created only when the hedged request
succeeds — not eagerly during routing.

Design:
- tryAddPartitionLevelLocationOverride routes the hedged write and captures
  the resolved PartitionKeyRangeWrapper on the CrossRegionAvailabilityContext
- doOnNext on the hedged Mono calls tryRecordSuccessfulWriteHedge only when
  the result is not an error, persisting the failover entry via computeIfAbsent
- Failed hedges leave no map entry, preventing bad regions from being persisted

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
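
The success-only design in the commit above can be sketched with JDK types. This is an approximation under assumptions: `CompletableFuture.thenApply` stands in for Reactor's `doOnNext`, the `overrides` map is a hypothetical stand-in for the PPAF ConcurrentHashMap, and plain strings replace PartitionKeyRangeWrapper and RegionalRoutingContext.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class SuccessOnlyFailoverEntrySketch {

    // Hypothetical stand-in for the PPAF map: partition key range id -> override region.
    static final Map<String, String> overrides = new ConcurrentHashMap<>();

    // Attaches the success callback to a hedged write. thenApply runs only on
    // normal completion, so a failed hedge never touches the map.
    static CompletableFuture<String> recordOnSuccess(
            CompletableFuture<String> hedgedWrite, String pkRangeId, String hedgeRegion) {
        return hedgedWrite.thenApply(response -> {
            overrides.computeIfAbsent(pkRangeId, id -> hedgeRegion);
            return response;
        });
    }

    public static void main(String[] args) {
        // Successful hedge: the override entry is created.
        recordOnSuccess(CompletableFuture.completedFuture("201"), "pkr-0", "West US").join();

        // Failed hedge: the callback is skipped and no entry is created.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("503"));
        recordOnSuccess(failed, "pkr-1", "East Asia")
            .exceptionally(t -> "error")
            .join();

        System.out.println(overrides);
    }
}
```

Because the entry is written from the completion callback and not during routing, a hedge target that turns out to be unhealthy is never persisted as the partition's override.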
@jeet1995 (Member, Author) commented:

/azp run java - cosmos - tests

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).

jeet1995 and others added 2 commits March 21, 2026 09:36
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Charts will be hosted as GitHub issue attachments instead of
committed to sdk/cosmos/docs/ which gets bundled into JARs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[FEATURE REQ]: Parallel write region discovery during per-partition automatic failover.
