CBG-5220: Freezable cluster compat version by bbrks · Pull Request #8236 · couchbase/sync_gateway

bbrks · 2026-05-06T14:30:34Z

Adds the ability to freeze the current cluster compatibility version (CCV), "pinning" a cluster to a given version so that CCV does not roll forwards.
This allows supported downgrades/rollbacks even when every node in the cluster has been upgraded, because the nodes stay pinned behind the frozen version.

REST API changes

GET /_cluster_compat_version returns the cluster-wide version, per-node versions, and the frozen value if set.
POST /_cluster_compat_version/freeze pins the version to the current value to preserve rollback capability across upgrades.
POST /_cluster_compat_version/unfreeze clears the frozen version.

Implementation details

The trickier half of the change is making the existing ClusterCompatVersionHWM ratchet behave correctly in the presence of a freeze. Because HWM is monotonic and gates downgrades, advancing it off transient state would permanently lock the cluster above the frozen version.

This implementation:

- defers the HWM ratchet from synchronous startup registration until after StartOnlineProcesses
- gates each per-bucket ratchet on at least one database in that bucket having reached DBOnline (computed once per Refresh to avoid re-acquiring _databasesLock)
- clamps HWM to the freeze version inside RegisterNodeVersion so the single CAS-checked write is the only place HWM can move.
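
The "computed once per Refresh" gating could look roughly like the sketch below. The name `ratchetEligibleBuckets` appears in the commit history for this PR, but the signature, the `database` and `dbState` types, and the absence of locking are assumptions made for illustration only.

```go
package main

import "fmt"

// dbState is a hypothetical stand-in for a database's run state.
type dbState int

const (
	dbStarting dbState = iota
	dbOnline
)

// database pairs a database name with its bucket and state (illustrative only).
type database struct {
	name   string
	bucket string
	state  dbState
}

// ratchetEligibleBuckets makes one pass over the databases and returns the
// set of buckets that have at least one online database.
func ratchetEligibleBuckets(dbs []database) map[string]bool {
	eligible := make(map[string]bool)
	for _, db := range dbs {
		if db.state == dbOnline {
			eligible[db.bucket] = true
		}
	}
	return eligible
}

func main() {
	dbs := []database{
		{"db1", "bucketA", dbOnline},
		{"db2", "bucketA", dbStarting},
		{"db3", "bucketB", dbStarting},
	}
	eligible := ratchetEligibleBuckets(dbs)
	for _, bucket := range []string{"bucketA", "bucketB"} {
		// Every bucket would still get its heartbeat refreshed;
		// only the HWM ratchet is gated on eligibility.
		fmt.Printf("%s: ratchetHWM=%v\n", bucket, eligible[bucket])
	}
}
```

Building the set once means each tracked bucket is answered with a map lookup rather than a fresh scan of the database list.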

Both Freeze and Unfreeze are "success-on-all" at the bucket level, since a cluster can have multiple registries. If some bucket operations fail while others succeed, a 503 is returned and the caller is expected to retry the operation until it succeeds.

Diagrams

Freeze and unfreeze admin flow

Both admin endpoints fan out across every tracked bucket and persist a CAS-checked mutation into each bucket's _sync:registry document.

  sequenceDiagram
      participant Admin
      participant Handler as handler_cluster_compat
      participant Mgr as clusterCompatManager
      participant Boot as bootstrapContext
      participant Bkts as Tracked buckets (registry docs)

      Admin->>Handler: POST /freeze
      Handler->>Mgr: Freeze
      Mgr->>Mgr: RLock and snapshot cachedVersion
      loop each tracked bucket
          Mgr->>Boot: SetRegistryFreeze(bucket, version)
          Boot->>Bkts: CAS write Frozen{version, FrozenAt}
          Boot-->>Mgr: freeze record
      end
      Mgr->>Mgr: mergeFreeze aggregate, update cachedFreeze and cachedVersion
      Mgr-->>Handler: aggregate
      Handler-->>Admin: 200 ClusterCompatVersionState

      Admin->>Handler: POST /unfreeze
      Handler->>Mgr: Unfreeze
      Mgr->>Mgr: snapshot previousFreeze
      loop each tracked bucket
          Mgr->>Boot: ClearRegistryFreeze(bucket)
          Boot->>Bkts: CAS write removing Frozen
      end
      Mgr->>Mgr: cachedFreeze = nil, recompute cachedVersion
      Mgr-->>Handler: previousFreeze
      Handler-->>Admin: 200
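
The "CAS write" steps in the diagram follow the usual read, mutate, compare-and-swap retry pattern. The sketch below shows that pattern against a toy in-memory document. The `store`, `registry`, and `updateWithCAS` names are hypothetical, the version string is a made-up example, and the code is single-threaded for brevity. It is not the bootstrapContext implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errCASMismatch = errors.New("cas mismatch")

// registry is a toy stand-in for a bucket's registry document.
type registry struct {
	Frozen string // empty means no freeze record
}

// store is an in-memory document with a CAS counter (illustrative only).
type store struct {
	doc registry
	cas uint64
}

func (s *store) get() (registry, uint64) { return s.doc, s.cas }

// replace writes doc only if cas still matches the stored value.
func (s *store) replace(doc registry, cas uint64) error {
	if cas != s.cas {
		return errCASMismatch
	}
	s.doc = doc
	s.cas++
	return nil
}

// updateWithCAS re-reads and re-applies mutate until the write lands or the
// attempt budget is exhausted.
func updateWithCAS(s *store, maxAttempts int, mutate func(*registry)) error {
	for i := 0; i < maxAttempts; i++ {
		doc, cas := s.get()
		mutate(&doc)
		err := s.replace(doc, cas)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errCASMismatch) {
			return err
		}
	}
	return fmt.Errorf("registry update failed after %d attempts", maxAttempts)
}

func main() {
	s := &store{}
	// Freeze: set the record. Unfreeze: remove it again.
	err := updateWithCAS(s, 5, func(r *registry) { r.Frozen = "4.0" })
	fmt.Println("frozen at:", s.doc.Frozen, "err:", err)
	err = updateWithCAS(s, 5, func(r *registry) { r.Frozen = "" })
	fmt.Println("cleared:", s.doc.Frozen == "", "err:", err)
}
```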

HWM ratchet under a freeze

ClusterCompatVersionHWM is a monotonic high-water mark: the highest value each bucket has ever observed its live-node minimum reach. It is also the field the downgrade guardrail checks. Once HWM has advanced, the gate permanently refuses any node whose version is below it.
The freeze acts as a per-bucket cap on HWM incrementing, which is why the freeze operation has to be success-on-all. A bucket that didn't get the freeze record skips the clamp and keeps ratcheting HWM forward on the next Refresh tick, and HWM can't be pulled back.

  flowchart TD
      A[Each Refresh tick, per bucket] --> B{Bucket has a freeze?}
      B -- yes --> C[HWM capped at freeze version]
      B -- no --> D[HWM ratchets up to live-node minimum]
      C --> E[Per-bucket registry write]
      D --> E
      E -.later.-> F{Incoming node version below HWM?}
      F -- yes --> G[Rejected — rollback no longer possible]
      F -- no --> H[Accepted]
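
The cap and the gate in the flowchart can be sketched in a few lines of Go. The `version` type, `nextHWM`, and `admits` are hypothetical names and the version numbers are made up. The example compares a bucket that holds the freeze record with one that missed it, after every node has upgraded.

```go
package main

import "fmt"

// version is a simplified major.minor compat version (illustrative only).
type version struct{ major, minor int }

func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	return v.minor < o.minor
}

// nextHWM returns the high-water mark after one refresh tick. The target is
// the live-node minimum, capped at the freeze version when a freeze is set.
// The result never moves below the current HWM.
func nextHWM(current, liveMin version, freeze *version) version {
	target := liveMin
	if freeze != nil && freeze.less(target) {
		target = *freeze
	}
	if current.less(target) {
		return target
	}
	return current
}

// admits reports whether a node at nodeVersion passes the downgrade gate.
func admits(hwm, nodeVersion version) bool {
	return !nodeVersion.less(hwm)
}

func main() {
	v40, v41 := version{4, 0}, version{4, 1}

	// Every node has upgraded to 4.1, so the live-node minimum is 4.1.
	frozen := nextHWM(v40, v41, &v40)  // bucket holding the freeze record
	unfrozen := nextHWM(v40, v41, nil) // bucket that missed the freeze

	fmt.Println("frozen bucket HWM:", frozen, "rollback to 4.0 admitted:", admits(frozen, v40))
	fmt.Println("unfrozen bucket HWM:", unfrozen, "rollback to 4.0 admitted:", admits(unfrozen, v40))
}
```

In this sketch the bucket without the freeze record ratchets to 4.1 and then rejects a 4.0 node, which is the situation the success-on-all requirement is meant to prevent.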

Dependencies (if applicable)

CBG-5266: Prevent downgrades across cluster compatibility versions #8235

Integration Tests

https://jenkins.sgwdev.com/job/SyncGatewayIntegration/646/

github-actions · 2026-05-11T17:27:20Z

Redocly previews

Copilot

Pull request overview

This PR introduces an admin-controlled “freeze” mechanism for the Sync Gateway cluster compatibility version (CCV), intended to keep CCV from advancing during upgrades so rollbacks/downgrades remain possible.

Changes:

Adds new admin endpoints to read CCV state and to freeze/unfreeze it, including audit events.
Persists a freeze record into each bucket’s registry document and updates CCV computation to account for the frozen value.
Extends OpenAPI documentation and adds unit + audit REST tests for the new behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file

| File | Description |
| --- | --- |
| rest/routing.go | Registers new admin routes for cluster compat version read/freeze/unfreeze. |
| rest/handler_cluster_compat.go | Implements REST handlers and response payload for CCV state + freeze/unfreeze actions. |
| rest/config_registry.go | Extends the persisted registry document schema with a freeze record pointer. |
| rest/config_manager.go | Adds CAS-retry helpers to set/clear the freeze record in a bucket registry. |
| rest/cluster_compat.go | Adds freeze-aware CCV computation and manager Freeze/Unfreeze operations. |
| rest/cluster_compat_test.go | Adds unit tests for freeze/unfreeze and pinning behavior. |
| rest/cluster_compat_audit_test.go | Adds end-to-end REST + audit emission coverage for the new endpoints. |
| docs/api/paths/admin/_cluster_compat_version.yaml | Documents the new GET endpoint. |
| docs/api/paths/admin/_cluster_compat_version-freeze.yaml | Documents the new freeze endpoint. |
| docs/api/paths/admin/_cluster_compat_version-unfreeze.yaml | Documents the new unfreeze endpoint. |
| docs/api/components/schemas.yaml | Adds new OpenAPI schemas and updates GatewayRegistry schema with freeze record. |
| docs/api/admin.yaml | Wires new paths into the admin OpenAPI spec. |
| base/version_cluster_compat.go | Adds the `RegistryFreeze` type used to persist the freeze record. |
| base/audit_events.go | Adds new audit IDs/events for CCV read/freeze/unfreeze. |

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

Documents three new admin endpoints under /_cluster_compat_version: GET returns the cluster-wide version, per-node versions, and the frozen value if set; POST /freeze pins the version to the current value to preserve rollback capability across upgrades; POST /unfreeze clears the freeze. Adds ClusterCompatVersionState response schema and RegistryFreeze record on GatewayRegistry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an admin-controlled freeze for the cluster compatibility version, allowing an operator to pin the reported version to its current value across rolling upgrades and preserve the option to roll back a node.
Storage: GatewayRegistry gains a Frozen *RegistryFreeze field stored per bucket; the cluster-wide freeze is the aggregate (any bucket frozen means the cluster is held back). New CAS-safe SetRegistryFreeze and ClearRegistryFreeze methods on bootstrapContext mirror the existing node-registration helpers.
Manager: clusterCompatManager tracks the auto-computed live-node minimum and the aggregate freeze separately. ClusterCompatVersion() reports the frozen value when set, otherwise the auto minimum. Refresh and RegisterBucket pick up the freeze record from each tracked registry. Freeze fans out to all tracked buckets and is success-on-any (safe direction); Unfreeze is success-on-all and returns the residual freeze on partial failure.
REST: three new admin endpoints under /_cluster_compat_version (GET, POST /freeze, POST /unfreeze), DevOps-permission gated. Unfreeze returns the current state in a 503 body when partially applied. Three new audit events cover the read and state-changing operations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Align GatewayRegistry's Frozen JSON tag with the OpenAPI spec (frozen_cluster_compat_version).
- Audit unfreeze attempts unconditionally so partial failures still produce an audit trail.
- Verify each REST endpoint emits its audit event in the round-trip test by wiring it through the EE audit-logging test harness.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

While a freeze is in effect, the freeze version is a ceiling on ClusterCompatVersionHWM advancement. Without this, all nodes upgrading past the frozen version would ratchet HWM forward, and the downgrade gate would then block rolling any node back to the frozen value — defeating the freeze's whole purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror Unfreeze's contract: Freeze now requires every tracked bucket to accept the freeze. If one or more buckets fail to accept it, the new ErrFreezePartial is returned alongside whatever aggregate freeze did take effect, and the REST handler responds 503 with the current ClusterCompatVersionState body so the admin can see what is pinned. ErrFreezeNoBucketsWritten is retained only for the "no tracked buckets at all" case; the previously-conflated "every bucket write failed" case is now ErrFreezePartial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Declare MandatoryFields/OptionalFields on the freeze/unfreeze audit events so audit field validation covers the new fields.
- Tighten the GET /_cluster_compat_version 503 description to reflect the actual condition (CCV tracking not enabled on this node).
- Document the unfreeze 500 response and broaden the 503 body schema to oneOf (state | HTTP-Error).
- Fix the GatewayRegistry.frozen_cluster_compat_version description to reference the schema property path (frozen_cluster_compat_version.version) rather than the historical frozen.version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hold m.mu.RLock across Freeze's bucket write loop so the snapshotted cluster compat version cannot shift relative to a concurrent Refresh. Change Unfreeze to return the cleared freeze in addition to any residual so the unfreeze audit no longer relies on a separate cached peek that could race with Refresh. Document the cross-bucket drift on retry and the Refresh write-back race window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s unknown Unfreeze previously returned ErrUnfreezePartial in two cases that the REST handler rendered identically: a verified residual freeze on re-read, and a total failure where the post-clear re-read also failed. In the second case the cache was wiped to nil and the 503 body showed no frozen_cluster_compat_version, leaving admins unable to distinguish "fully cleared" from "we have no idea". Unfreeze now preserves the pre-op cache when residual state can't be verified, and the handler branches on residual: residual != nil keeps the state-body 503, residual == nil returns an HTTP-Error naming the previously-frozen version so the admin has a recovery target. Both body shapes are already covered by the 503 oneOf in the OpenAPI spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
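
The handler branching described in this commit message might be pictured as below. This is only a sketch of the decision: the `unfreezeOutcome` type, the `respond` function, and the body strings are hypothetical, and the real handler returns structured JSON bodies.

```go
package main

import (
	"fmt"
	"net/http"
)

// freeze is a simplified freeze record (illustrative only).
type freeze struct{ Version string }

// unfreezeOutcome is a hypothetical summary of what Unfreeze returned.
type unfreezeOutcome struct {
	partial  bool    // not every bucket was cleared
	previous *freeze // freeze in effect before the call
	residual *freeze // freeze verified to remain after the call, if any
}

// respond maps an outcome to a status code and a short description of the body.
func respond(o unfreezeOutcome) (int, string) {
	switch {
	case !o.partial:
		return http.StatusOK, "unfrozen"
	case o.residual != nil:
		// Verified leftover freeze: report the state that is still pinned.
		return http.StatusServiceUnavailable, "state: still frozen at " + o.residual.Version
	case o.previous != nil:
		// Residual state could not be verified: name the previous freeze
		// so the admin has a recovery target.
		return http.StatusServiceUnavailable, "error: freeze state unknown, previously frozen at " + o.previous.Version
	default:
		return http.StatusServiceUnavailable, "error: freeze state unknown"
	}
}

func main() {
	f := &freeze{Version: "4.0"}
	for _, o := range []unfreezeOutcome{
		{partial: false, previous: f},
		{partial: true, previous: f, residual: f},
		{partial: true, previous: f},
	} {
		status, body := respond(o)
		fmt.Println(status, body)
	}
}
```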

Splits RegisterNodeVersion via a new ratchetHWM bool so the first registration (RegisterBucket from _applyConfig) only refreshes the node heartbeat — ClusterCompatVersionHWM advancement is held until the database has stabilized. RatchetClusterCompatHWMForBucket runs at end of StartOnlineProcesses (sync + async paths); periodic Refresh still ratchets. HWM is monotonic, so committing an advance off transient startup state would lock the cluster at a too-high value forever — gating prevents that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… hook The previous post-StartOnlineProcesses ratchet wrote the registry from inside ReloadDatabaseWithConfig — which itself runs inside UpdateConfig's callback — bumping the registry CAS between UpdateConfig's read and its own subsequent write, exhausting its 5-attempt CAS retry and surfacing as "UpdateConfig failed to persist updated registry after 5 attempts" (500) on any config change that triggered a reload. Drop the synchronous hook entirely. Refresh now decides per-bucket whether to pass ratchetHWM=true by inspecting whether any database on that bucket has reached DBOnline (isBucketRatchetEligible reads sc._databases directly — no shadow set to keep in sync with DB state). Heartbeat refresh still happens unconditionally so node entries stay fresh for not-yet-online buckets. Net effect: the cluster-compat manager no longer writes the registry from any code path nested inside UpdateConfig, so the CAS collision is gone; HWM ratcheting is bounded by config_update_frequency (default 10s) after a database transitions to online, which is the same window any future online-dependent input (e.g. legacy-node detection via ISGR/cbgt) would need to wait for anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Freeze previously overwrote cachedFreeze with the aggregate result unconditionally; if every tracked bucket failed (succeeded==0, aggregate==nil), a real persistent freeze was wiped from the reporting endpoint until the next Refresh. Gate the cache mutation on succeeded>0 so transient bucket-I/O failures don't erase visible state. Adds a regression test driving the all-buckets-fail branch via the existing corrupt-registry helper. Also rename Unfreeze's first return from `cleared` to `previousFreeze` and extend the unfreeze audit event description to spell out that cluster_compat_version and frozen_at describe the freeze that was lifted, not the time of the unfreeze action. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the per-bucket isBucketRatchetEligible call with a single ratchetEligibleBuckets sweep so refreshNodeRegistrations doesn't re-acquire _databasesLock and rescan _databases for every tracked bucket. Also clarify the unfreeze OpenAPI 202 description to cover the case where the residual freeze state could not be verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gregns1

Looks good, just one small question on a test assertion, then it should be good to go.

bbrks changed the title ~~Cbg 5220~~ CBG-5220: Freezable cluster compat version May 6, 2026

bbrks force-pushed the CBG-5220 branch from dfff503 to 1a94966 Compare May 6, 2026 14:56

bbrks self-assigned this May 6, 2026

bbrks force-pushed the CBG-5266 branch from 7ab3f01 to de3fecb Compare May 11, 2026 14:30

Base automatically changed from CBG-5266 to main May 11, 2026 15:45

bbrks force-pushed the CBG-5220 branch from bf1202f to 2412c1e Compare May 11, 2026 17:27

bbrks marked this pull request as ready for review May 11, 2026 17:27

Copilot AI review requested due to automatic review settings May 11, 2026 17:27

Copilot started reviewing on behalf of bbrks May 11, 2026 17:28

Copilot AI reviewed May 11, 2026

bbrks force-pushed the CBG-5220 branch 2 times, most recently from 1806fd4 to c56b7f6 Compare May 19, 2026 19:02

bbrks assigned gregns1 and unassigned bbrks May 19, 2026

bbrks requested review from Copilot and gregns1 May 19, 2026 19:11

Copilot started reviewing on behalf of bbrks May 19, 2026 19:12

Copilot AI reviewed May 19, 2026

Comment thread rest/cluster_compat.go

Comment thread rest/cluster_compat.go Outdated

Comment thread docs/api/paths/admin/_cluster_compat_version-unfreeze.yaml Outdated

bbrks assigned bbrks and gregns1 and unassigned gregns1 and bbrks May 19, 2026

bbrks and others added 7 commits May 22, 2026 12:20

goimports

6429105

move test to non-race to support audit assertions

d676c17

bbrks and others added 8 commits May 22, 2026 12:23

post-rebase fix

8e2a206

bbrks force-pushed the CBG-5220 branch from 15baa69 to 8e2a206 Compare May 22, 2026 11:28

gregns1 requested changes May 22, 2026

Comment thread rest/cluster_compat_test.go

address comments

0151115

bbrks requested a review from gregns1 May 22, 2026 13:51

gregns1 approved these changes May 22, 2026

bbrks merged commit f36131e into main May 22, 2026
51 checks passed

bbrks deleted the CBG-5220 branch May 22, 2026 14:07

bbrks merged 16 commits into main from CBG-5220

3 participants
