Pull request overview
This PR introduces an admin-controlled “freeze” mechanism for the Sync Gateway cluster compatibility version (CCV), intended to keep CCV from advancing during upgrades so rollbacks/downgrades remain possible.
Changes:
- Adds new admin endpoints to read CCV state and to freeze/unfreeze it, including audit events.
- Persists a freeze record into each bucket’s registry document and updates CCV computation to account for the frozen value.
- Extends OpenAPI documentation and adds unit + audit REST tests for the new behavior.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| rest/routing.go | Registers new admin routes for cluster compat version read/freeze/unfreeze. |
| rest/handler_cluster_compat.go | Implements REST handlers and response payload for CCV state + freeze/unfreeze actions. |
| rest/config_registry.go | Extends the persisted registry document schema with a freeze record pointer. |
| rest/config_manager.go | Adds CAS-retry helpers to set/clear the freeze record in a bucket registry. |
| rest/cluster_compat.go | Adds freeze-aware CCV computation and manager Freeze/Unfreeze operations. |
| rest/cluster_compat_test.go | Adds unit tests for freeze/unfreeze and pinning behavior. |
| rest/cluster_compat_audit_test.go | Adds end-to-end REST + audit emission coverage for the new endpoints. |
| docs/api/paths/admin/_cluster_compat_version.yaml | Documents the new GET endpoint. |
| docs/api/paths/admin/_cluster_compat_version-freeze.yaml | Documents the new freeze endpoint. |
| docs/api/paths/admin/_cluster_compat_version-unfreeze.yaml | Documents the new unfreeze endpoint. |
| docs/api/components/schemas.yaml | Adds new OpenAPI schemas and updates GatewayRegistry schema with freeze record. |
| docs/api/admin.yaml | Wires new paths into the admin OpenAPI spec. |
| base/version_cluster_compat.go | Adds the RegistryFreeze type used to persist the freeze record. |
| base/audit_events.go | Adds new audit IDs/events for CCV read/freeze/unfreeze. |
1806fd4 to c56b7f6
Documents three new admin endpoints under /_cluster_compat_version: GET returns the cluster-wide version, per-node versions, and the frozen value if set; POST /freeze pins the version to the current value to preserve rollback capability across upgrades; POST /unfreeze clears the freeze. Adds ClusterCompatVersionState response schema and RegistryFreeze record on GatewayRegistry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an admin-controlled freeze for the cluster compatibility version, allowing an operator to pin the reported version to its current value across rolling upgrades and preserve the option to roll back a node. Storage: GatewayRegistry gains a Frozen *RegistryFreeze field stored per bucket; the cluster-wide freeze is the aggregate (any bucket frozen means the cluster is held back). New CAS-safe SetRegistryFreeze and ClearRegistryFreeze methods on bootstrapContext mirror the existing node-registration helpers. Manager: clusterCompatManager tracks the auto-computed live-node minimum and the aggregate freeze separately. ClusterCompatVersion() reports the frozen value when set, otherwise the auto minimum. Refresh and RegisterBucket pick up the freeze record from each tracked registry. Freeze fans out to all tracked buckets and is success-on-any (safe direction); Unfreeze is success-on-all and returns the residual freeze on partial failure. REST: three new admin endpoints under /_cluster_compat_version (GET, POST /freeze, POST /unfreeze), DevOps-permission gated. Unfreeze returns the current state in a 503 body when partially applied. Three new audit events cover the read and state-changing operations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
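The reporting rule described in this commit (frozen value when set, otherwise the auto minimum) can be sketched roughly as below. This is a minimal illustration, not the PR's code: `Version`, `effectiveVersion`, and `aggregateFreeze` are hypothetical names, and the choice that the lowest per-bucket frozen version wins the aggregate is an assumption the commit message does not confirm.

```go
package main

import "fmt"

// Version is a hypothetical stand-in for the real cluster compat version type.
type Version struct{ Major, Minor int }

// Less reports whether v orders before o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// effectiveVersion returns the frozen version when a freeze is set,
// otherwise the auto-computed live-node minimum.
func effectiveVersion(autoMin Version, frozen *Version) Version {
	if frozen != nil {
		return *frozen
	}
	return autoMin
}

// aggregateFreeze folds per-bucket freeze records into one cluster-wide value.
// Any frozen bucket makes the result non-nil. Assumption: the lowest frozen
// version wins when buckets disagree.
func aggregateFreeze(perBucket []*Version) *Version {
	var agg *Version
	for _, f := range perBucket {
		if f == nil {
			continue
		}
		if agg == nil || f.Less(*agg) {
			v := *f // copy so the caller's record is not aliased
			agg = &v
		}
	}
	return agg
}

func main() {
	autoMin := Version{4, 1}
	frozen := aggregateFreeze([]*Version{nil, {4, 0}})
	fmt.Println("with freeze:", effectiveVersion(autoMin, frozen))
	fmt.Println("without freeze:", effectiveVersion(autoMin, nil))
}
```

Keeping the auto minimum and the freeze as separate inputs mirrors the commit's note that the manager tracks them separately, so clearing the freeze only requires dropping the pointer.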
- Align GatewayRegistry's Frozen JSON tag with the OpenAPI spec (frozen_cluster_compat_version). - Audit unfreeze attempts unconditionally so partial failures still produce an audit trail. - Verify each REST endpoint emits its audit event in the round-trip test by wiring it through the EE audit-logging test harness. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a freeze is in effect, the freeze version is a ceiling on ClusterCompatVersionHWM advancement. Without this, all nodes upgrading past the frozen version would ratchet HWM forward, and the downgrade gate would then block rolling any node back to the frozen value — defeating the freeze's whole purpose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
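A minimal sketch of the ceiling described here, assuming versions can be modelled as plain integers. `nextHWM` is a hypothetical helper written for illustration; the PR's actual clamp lives in its own registry write path.

```go
package main

import "fmt"

// nextHWM returns the high-water mark a bucket should persist on this tick.
// Versions are plain ints here purely for illustration (hypothetical).
func nextHWM(currentHWM, liveMin int, frozen *int) int {
	candidate := liveMin
	if frozen != nil && *frozen < candidate {
		candidate = *frozen // the freeze is a ceiling on advancement
	}
	if candidate > currentHWM {
		return candidate
	}
	return currentHWM // HWM is monotonic and never moves backwards
}

func main() {
	frozen := 40
	// All nodes have upgraded to 41, but the freeze holds HWM at 40.
	fmt.Println(nextHWM(40, 41, &frozen))
	// Without a freeze the same tick ratchets HWM forward.
	fmt.Println(nextHWM(40, 41, nil))
}
```

Note that the clamp only limits advancement: an HWM that is already above the frozen version stays where it is.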
Mirror Unfreeze's contract: Freeze now requires every tracked bucket to accept the freeze. If one or more buckets fail to accept it, the new ErrFreezePartial is returned alongside whatever aggregate freeze did take effect, and the REST handler responds 503 with the current ClusterCompatVersionState body so the admin can see what is pinned. ErrFreezeNoBucketsWritten is retained only for the "no tracked buckets at all" case; the previously-conflated "every bucket write failed" case is now ErrFreezePartial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
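The success-on-all contract could look roughly like the following. The error names come from this commit message, but their messages, the `writeFreeze` callback, and the `freezeAll` helper are assumptions made for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// Error names follow the commit message; the messages are assumed.
	ErrFreezeNoBucketsWritten = errors.New("no tracked buckets to freeze")
	ErrFreezePartial          = errors.New("freeze not accepted by every tracked bucket")
)

// writeFreeze is a hypothetical stand-in for the per-bucket CAS-checked write.
type writeFreeze func(bucket string, version int) error

// freezeAll attempts the freeze on every tracked bucket and reports how many
// accepted it. Anything short of all buckets is ErrFreezePartial, including
// the case where every write failed.
func freezeAll(buckets []string, version int, write writeFreeze) (int, error) {
	if len(buckets) == 0 {
		return 0, ErrFreezeNoBucketsWritten
	}
	succeeded := 0
	for _, b := range buckets {
		if err := write(b, version); err == nil {
			succeeded++
		}
	}
	if succeeded < len(buckets) {
		return succeeded, ErrFreezePartial
	}
	return succeeded, nil
}

func main() {
	failB := func(bucket string, _ int) error {
		if bucket == "b" {
			return errors.New("simulated bucket I/O failure")
		}
		return nil
	}
	n, err := freezeAll([]string{"a", "b"}, 40, failB)
	fmt.Println(n, errors.Is(err, ErrFreezePartial))
}
```

Continuing the loop after a failed write, rather than returning early, is what lets the caller report whatever aggregate freeze did take effect.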
- Declare MandatoryFields/OptionalFields on the freeze/unfreeze audit events so audit field validation covers the new fields. - Tighten the GET /_cluster_compat_version 503 description to reflect the actual condition (CCV tracking not enabled on this node). - Document the unfreeze 500 response and broaden the 503 body schema to oneOf (state | HTTP-Error). - Fix the GatewayRegistry.frozen_cluster_compat_version description to reference the schema property path (frozen_cluster_compat_version.version) rather than the historical frozen.version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hold m.mu.RLock across Freeze's bucket write loop so the snapshotted cluster compat version cannot shift relative to a concurrent Refresh. Change Unfreeze to return the cleared freeze in addition to any residual so the unfreeze audit no longer relies on a separate cached peek that could race with Refresh. Document the cross-bucket drift on retry and the Refresh write-back race window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s unknown Unfreeze previously returned ErrUnfreezePartial in two cases that the REST handler rendered identically: a verified residual freeze on re-read, and a total failure where the post-clear re-read also failed. In the second case the cache was wiped to nil and the 503 body showed no frozen_cluster_compat_version, leaving admins unable to distinguish "fully cleared" from "we have no idea". Unfreeze now preserves the pre-op cache when residual state can't be verified, and the handler branches on residual: residual != nil keeps the state-body 503, residual == nil returns an HTTP-Error naming the previously-frozen version so the admin has a recovery target. Both body shapes are already covered by the 503 oneOf in the OpenAPI spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
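The handler's branch on `residual` might be sketched as below. Only the two 503 shapes named in this commit are modelled; `freezeRecord`, `unfreezeBody`, and `partialUnfreezeBody` are hypothetical names, and the real handler writes JSON bodies rather than returning a struct.

```go
package main

import "fmt"

// freezeRecord is a hypothetical stand-in for the persisted freeze record.
type freezeRecord struct{ Version string }

// unfreezeBody describes which 503 body shape would be rendered.
type unfreezeBody struct {
	Kind    string // "state" or "http-error"
	Version string // version carried in the body
}

// partialUnfreezeBody picks the 503 body for a partially applied unfreeze.
// A verified residual freeze is shown as current state; an unverifiable one
// falls back to an error naming the previously frozen version.
func partialUnfreezeBody(previous, residual *freezeRecord) unfreezeBody {
	if residual != nil {
		return unfreezeBody{Kind: "state", Version: residual.Version}
	}
	if previous != nil {
		return unfreezeBody{Kind: "http-error", Version: previous.Version}
	}
	return unfreezeBody{Kind: "http-error"}
}

func main() {
	prev := &freezeRecord{Version: "4.0"}
	fmt.Println(partialUnfreezeBody(prev, prev))
	fmt.Println(partialUnfreezeBody(prev, nil))
}
```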
Splits RegisterNodeVersion via a new ratchetHWM bool so the first registration (RegisterBucket from _applyConfig) only refreshes the node heartbeat — ClusterCompatVersionHWM advancement is held until the database has stabilized. RatchetClusterCompatHWMForBucket runs at end of StartOnlineProcesses (sync + async paths); periodic Refresh still ratchets. HWM is monotonic, so committing an advance off transient startup state would lock the cluster at a too-high value forever — gating prevents that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… hook The previous post-StartOnlineProcesses ratchet wrote the registry from inside ReloadDatabaseWithConfig — which itself runs inside UpdateConfig's callback — bumping the registry CAS between UpdateConfig's read and its own subsequent write, exhausting its 5-attempt CAS retry and surfacing as "UpdateConfig failed to persist updated registry after 5 attempts" (500) on any config change that triggered a reload. Drop the synchronous hook entirely. Refresh now decides per-bucket whether to pass ratchetHWM=true by inspecting whether any database on that bucket has reached DBOnline (isBucketRatchetEligible reads sc._databases directly — no shadow set to keep in sync with DB state). Heartbeat refresh still happens unconditionally so node entries stay fresh for not-yet-online buckets. Net effect: the cluster-compat manager no longer writes the registry from any code path nested inside UpdateConfig, so the CAS collision is gone; HWM ratcheting is bounded by config_update_frequency (default 10s) after a database transitions to online, which is the same window any future online-dependent input (e.g. legacy-node detection via ISGR/cbgt) would need to wait for anyway. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Freeze previously overwrote cachedFreeze with the aggregate result unconditionally; if every tracked bucket failed (succeeded==0, aggregate==nil), a real persistent freeze was wiped from the reporting endpoint until the next Refresh. Gate the cache mutation on succeeded>0 so transient bucket-I/O failures don't erase visible state. Adds a regression test driving the all-buckets-fail branch via the existing corrupt-registry helper. Also rename Unfreeze's first return from `cleared` to `previousFreeze` and extend the unfreeze audit event description to spell out that cluster_compat_version and frozen_at describe the freeze that was lifted, not the time of the unfreeze action. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the per-bucket isBucketRatchetEligible call with a single ratchetEligibleBuckets sweep so refreshNodeRegistrations doesn't re-acquire _databasesLock and rescan _databases for every tracked bucket. Also clarify the unfreeze OpenAPI 202 description to cover the case where the residual freeze state could not be verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
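A rough sketch of the single-sweep idea, assuming a simplified server context. The `serverContext` and `dbState` types and their fields are hypothetical; only the function name `ratchetEligibleBuckets` comes from this commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// dbState is a hypothetical, simplified view of a database's state.
type dbState struct {
	Bucket string
	Online bool
}

// serverContext is a hypothetical stand-in holding databases behind one lock.
type serverContext struct {
	mu        sync.RWMutex
	databases map[string]dbState // keyed by database name
}

// ratchetEligibleBuckets takes the lock once and returns every bucket that
// has at least one online database, instead of rescanning per bucket.
func (sc *serverContext) ratchetEligibleBuckets() map[string]bool {
	sc.mu.RLock()
	defer sc.mu.RUnlock()
	eligible := make(map[string]bool)
	for _, db := range sc.databases {
		if db.Online {
			eligible[db.Bucket] = true
		}
	}
	return eligible
}

func main() {
	sc := &serverContext{databases: map[string]dbState{
		"db1": {Bucket: "a", Online: true},
		"db2": {Bucket: "b", Online: false},
	}}
	eligible := sc.ratchetEligibleBuckets()
	fmt.Println(eligible["a"], eligible["b"])
}
```

The caller can then look up each tracked bucket in the returned set when deciding whether to pass `ratchetHWM=true`.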
gregns1
requested changes
May 22, 2026
Contributor
gregns1
left a comment
There was a problem hiding this comment.
Looks good, just one small question on a test assertion, then it should be good to go.
gregns1
approved these changes
May 22, 2026
CBG-5220
Add ability to freeze current cluster compatibility version to "pin" a cluster to a given version and avoid rolling CCV forwards.
This allows supported downgrades/rollbacks even when all nodes in the cluster have been upgraded, because they remain pinned behind the frozen version.
REST API changes
- `GET /_cluster_compat_version` returns the cluster-wide version, per-node versions, and the frozen value if set.
- `POST /_cluster_compat_version/freeze` pins the version to the current value to preserve rollback capability across upgrades.
- `POST /_cluster_compat_version/unfreeze` clears the frozen version.

Implementation details
The trickier half of the change is making the existing `ClusterCompatVersionHWM` ratchet behave correctly in the presence of a freeze. Because HWM is monotonic and gates downgrades, advancing it off transient state would permanently lock the cluster above the frozen version.
This implementation:
- `StartOnlineProcesses`
- `DBOnline` (computed once per Refresh to avoid re-acquiring `_databasesLock`)
- `RegisterNodeVersion` so the single CAS-checked write is the only place HWM can move.

Both Freeze and Unfreeze are "success-on-all" at the bucket level, since there can be multiple registries for a given cluster. A 503 is returned if some bucket operations fail while others succeed, and the operation is expected to be retried until it succeeds.
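From an admin client's point of view, the retry-until-successful expectation could be sketched like this. `retryUntilOK` and its `op` callback are hypothetical; a real caller would issue the POST to the freeze or unfreeze endpoint inside `op` and return the response status.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// retryUntilOK calls op until it stops returning 503 or attempts run out.
// op is a hypothetical callback that performs the request and returns the
// HTTP status code.
func retryUntilOK(attempts int, wait time.Duration, op func() (int, error)) (int, error) {
	status := 0
	for i := 0; i < attempts; i++ {
		var err error
		status, err = op()
		if err != nil {
			return status, err // transport-level failure, give up
		}
		if status != http.StatusServiceUnavailable {
			return status, nil // fully applied, or a non-retryable status
		}
		if i < attempts-1 {
			time.Sleep(wait) // partially applied, wait and try again
		}
	}
	return status, errors.New("operation still partially applied after retries")
}

func main() {
	calls := 0
	status, err := retryUntilOK(5, 10*time.Millisecond, func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusServiceUnavailable, nil
		}
		return http.StatusOK, nil
	})
	fmt.Println(status, err, calls)
}
```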
Diagrams
Freeze and unfreeze admin flow
Both admin endpoints fan out across every tracked bucket and persist a CAS-checked mutation into each bucket's `_sync:registry` document.

```mermaid
sequenceDiagram
    participant Admin
    participant Handler as handler_cluster_compat
    participant Mgr as clusterCompatManager
    participant Boot as bootstrapContext
    participant Bkts as Tracked buckets (registry docs)
    Admin->>Handler: POST /freeze
    Handler->>Mgr: Freeze
    Mgr->>Mgr: RLock and snapshot cachedVersion
    loop each tracked bucket
        Mgr->>Boot: SetRegistryFreeze(bucket, version)
        Boot->>Bkts: CAS write Frozen{version, FrozenAt}
        Boot-->>Mgr: freeze record
    end
    Mgr->>Mgr: mergeFreeze aggregate, update cachedFreeze and cachedVersion
    Mgr-->>Handler: aggregate
    Handler-->>Admin: 200 ClusterCompatVersionState
    Admin->>Handler: POST /unfreeze
    Handler->>Mgr: Unfreeze
    Mgr->>Mgr: snapshot previousFreeze
    loop each tracked bucket
        Mgr->>Boot: ClearRegistryFreeze(bucket)
        Boot->>Bkts: CAS write removing Frozen
    end
    Mgr->>Mgr: cachedFreeze = nil, recompute cachedVersion
    Mgr-->>Handler: previousFreeze
    Handler-->>Admin: 200
```

HWM ratchet under a freeze
`ClusterCompatVersionHWM` is the monotonic high-water mark each bucket has ever observed its live-node minimum reach, and it's the field the downgrade guardrail uses. Once HWM is advanced, the gate permanently refuses any node whose version is below it.

The freeze acts as a per-bucket cap on HWM advancement, which is why the freeze operation has to be success-on-all. A bucket that didn't get the freeze record skips the clamp and keeps ratcheting HWM forward on the next Refresh tick, and HWM can't be pulled back.

```mermaid
flowchart TD
    A[Each Refresh tick, per bucket] --> B{Bucket has a freeze?}
    B -- yes --> C[HWM capped at freeze version]
    B -- no --> D[HWM ratchets up to live-node minimum]
    C --> E[Per-bucket registry write]
    D --> E
    E -.later.-> F{Incoming node version below HWM?}
    F -- yes --> G[Rejected: rollback no longer possible]
    F -- no --> H[Accepted]
```

Dependencies (if applicable)
Integration Tests