feat: add two-phase extension upgrade with spec.schemaVersion#328
feat: add two-phase extension upgrade with spec.schemaVersion#328WentingWu666666 wants to merge 22 commits intodocumentdb:mainfrom
Conversation
c0d35b1 to
525700b
Compare
Add spec.schemaVersion field to DocumentDBSpec to decouple binary (image) upgrades from schema (ALTER EXTENSION) upgrades. This provides a rollback-safe window between deploying a new binary and committing the schema change. Three modes: - Empty (default): Two-phase mode. Schema stays at current version until user explicitly sets schemaVersion. Safe by default for production. - "auto": Auto-finalize. Schema updates to match binary version automatically. Simple mode for development and testing. - Explicit version: Schema updates to exactly that version. Must be <= binary. Changes: - api/preview/documentdb_types.go: Add SchemaVersion to DocumentDBSpec - internal/controller/documentdb_controller.go: Add determineSchemaTarget() function, modify upgradeDocumentDBIfNeeded() to gate ALTER EXTENSION on spec.schemaVersion value - internal/utils/util.go: Add SemverToExtensionVersion() inverse conversion - Regenerated CRDs (config/crd + helm chart) - Added unit tests for all three modes and edge cases - Created public upgrade documentation (docs/operator-public-documentation/) - Added Upgrading DocumentDB page to mkdocs.yml navigation Closes documentdb#271 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Add a ValidatingWebhookConfiguration that enforces: - schemaVersion must be <= binary version (on create and update) - Image rollback below installed schema version is blocked (on update) Components added: - internal/webhook/documentdb_webhook.go: ValidateCreate/ValidateUpdate handlers - internal/webhook/documentdb_webhook_test.go: 18 unit tests - Helm template 10_documentdb_webhook.yaml: Issuer, Certificate, Service, ValidatingWebhookConfiguration with cert-manager CA injection - Updated 09_documentdb_operator.yaml: webhook port, cert volume mount, args - Updated cmd/main.go: webhook registration The webhook runs inside the existing operator process on port 9443 using cert-manager for TLS (same pattern as the sidecar injector). failurePolicy is set to Fail for database safety. The controller retains defense-in-depth checks as a secondary safety net. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Restructure webhook to use two validation function slices following the
CNPG ClusterCustomValidator pattern:
- validate(db) spec-level rules run on both create and update
Contains: validateSchemaVersionNotExceedsBinary
- validateChanges(new, old) update-only rules comparing old vs new
Contains: validateImageRollback
This makes it easy to add new validation rules just append a function
to the appropriate slice. Each validator has a consistent signature
returning field.ErrorList, and errors from all validators are aggregated.
Also adds var _ webhook.CustomValidator = &DocumentDBValidator{} compile
check and uses apierrors.NewInvalid for proper Kubernetes error format.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Remove Step 1b (image rollback blocking) from controller webhook handles it - Simplify determineSchemaTarget: keep lightweight defense-in-depth guard - Rename Pg-suffixed variables to full names (schemaExtensionVersion, etc.) - Refactor webhook tests to Ginkgo/Gomega (matching CNPG and controller patterns) - Add suite_test.go for webhook package Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Reframe upgrade types as Operator (control plane) vs DocumentDB (data plane) - Explain documentDBVersion vs schemaVersion relationship clearly - Reorganize data plane upgrade as step-by-step walkthrough with production/dev tabs - Add Multi-Region Upgrades section with coordination guidance - Move Advanced Image Overrides to its own section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
CNPG default is in-place restart. Switchover promotes a replica first, then restarts the old primary as replica, minimizing write downtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Move CNPG callout from overview to operator upgrade section - Fix overview table header and clarify optional schemaVersion - Reword cluster upgrade intro for clarity - Reorder multi-region steps: backup first, binary, then schema - Minor wording improvements throughout Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Add Steps 5-8 to the upgrade & rollback CI workflow: - Step 5: Re-upgrade binary after rollback tests - Step 6: Schema finalization via spec.schemaVersion - Step 7: Webhook rejection for rollback below schema - Step 8: Webhook rejection for schema exceeding binary Also strengthen Step 2 to assert schema version is unchanged after binary upgrade (validates two-phase default behavior). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
6c308c6 to
1ba5f67
Compare
Add tests for uncovered error branches: - Webhook: type assertion errors in ValidateCreate/ValidateUpdate - Webhook: empty binary version in validateImageRollback - Webhook: unparseable versions in validateSchemaVersionNotExceedsBinary - Webhook: unparseable versions in validateImageRollback - Controller: unparseable binary version in determineSchemaTarget Patch coverage: 86.1% -> 91.7% (99/108 statements) Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
There was a problem hiding this comment.
Pull request overview
Adds spec.schemaVersion to decouple DocumentDB binary (image) upgrades from irreversible schema upgrades (ALTER EXTENSION ... UPDATE), enabling a two-phase, rollback-safer upgrade flow. This is implemented end-to-end across API/CRDs, controller logic, admission validation, tests, docs, Helm, and CI.
Changes:
- Introduces
spec.schemaVersionwith modes: default two-phase (manual finalize),"auto"(auto-finalize), and explicit pin (<= binary). - Updates controller upgrade logic to gate
ALTER EXTENSIONbased onspec.schemaVersion, and adds conversion helpers for version formats. - Adds a validating webhook to enforce schema<=binary and block image rollbacks below installed schema; updates Helm manifests and CI/docs accordingly.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/src/internal/webhook/suite_test.go | Adds Ginkgo suite entrypoint for webhook tests. |
| operator/src/internal/webhook/documentdb_webhook.go | Implements validating webhook for schema/binary constraints and rollback blocking. |
| operator/src/internal/webhook/documentdb_webhook_test.go | Adds unit tests for webhook validation behaviors. |
| operator/src/internal/utils/util.go | Adds SemverToExtensionVersion() conversion helper. |
| operator/src/internal/utils/util_test.go | Adds unit tests for semver→extension version conversion. |
| operator/src/internal/controller/documentdb_controller.go | Adds determineSchemaTarget() and gates ALTER EXTENSION on spec.schemaVersion. |
| operator/src/internal/controller/documentdb_controller_test.go | Adds/updates controller tests for two-phase schema upgrade behaviors. |
| operator/src/internal/cnpg/cnpg_cluster.go | Sets CNPG PrimaryUpdateMethod to Switchover. |
| operator/src/config/crd/bases/documentdb.io_dbs.yaml | Regenerates CRD with new spec.schemaVersion field and description. |
| operator/src/cmd/main.go | Registers the validating webhook with the controller manager. |
| operator/src/api/preview/documentdb_types.go | Adds SchemaVersion to DocumentDBSpec with validation pattern and docs. |
| operator/documentdb-helm-chart/templates/10_documentdb_webhook.yaml | Adds cert-manager TLS + Service + ValidatingWebhookConfiguration resources. |
| operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml | Mounts webhook TLS secret and configures webhook cert path/port in the operator deployment. |
| operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml | Regenerates Helm-packaged CRD with spec.schemaVersion. |
| docs/operator-public-documentation/preview/operations/upgrades.md | Documents two-phase schema upgrades and rollback rules. |
| .github/workflows/test-upgrade-and-rollback.yml | Extends upgrade/rollback CI flow to validate two-phase behavior + webhook enforcement. |
- Webhook: reject (not silently allow) when version comparison fails with explicit schemaVersion set or during rollback validation - resolveBinaryVersion: handle digest-only image references (@sha256:) and strip architecture suffixes from tags (e.g., 0.112.0-amd64) - CI: extract semver from image tag, stripping arch suffix - Controller: reduce SchemaUpdateAvailable event spam by only emitting when schema upgrade is newly detected, and downgrade log to V(1) - Add tests for new extractSemver helper, digest refs, and arch suffixes Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
When spec.schemaVersion is set to an explicit version but neither spec.documentDBVersion nor spec.documentDBImage is specified, the webhook cannot validate the schemaVersion <= binary invariant. Previously this was silently allowed (controller handled it at runtime). Now the webhook rejects with a clear error message telling the user to also set documentDBVersion or documentDBImage. Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
xgerman
left a comment
There was a problem hiding this comment.
Code Review: PR #328 — feat: add two-phase extension upgrade with spec.schemaVersion
Files changed: 16 (+1761, -65) | Closes: #271
Overview
The two-phase approach is the right design — "safe by default" for a database operator. Architecture is sound. The webhook, controller changes, and test coverage are well-structured. Below are findings organized by severity.
🔴 Critical (1)
C1. failurePolicy: Fail blocks all DocumentDB operations if cert-manager is unhealthy
File: operator/documentdb-helm-chart/templates/10_documentdb_webhook.yaml, line 77
File: operator/src/internal/webhook/documentdb_webhook.go, line 25
The webhook uses failurePolicy: Fail. If cert-manager has not issued the certificate yet (e.g., during initial Helm install, cert-manager pod restart, or quota issues), the webhook Service has no valid TLS cert → the API server cannot reach it → all DocumentDB CREATE and UPDATE operations are rejected cluster-wide.
This is particularly dangerous on fresh installs: the Helm chart creates the ValidatingWebhookConfiguration, Certificate, and DocumentDB CRD in one release — but cert-manager Certificate issuance is asynchronous. There is a race window where the webhook is registered but not yet serving.
Recommendation: Either:
- (a) Add a
values.yamltoggle (webhook.enabled: true/webhook.failurePolicy: Fail|Ignore) with a default ofIgnoreduring preview, switching toFailat GA, OR - (b) Document the cert-manager dependency explicitly in the upgrade guide and README prerequisites, AND add a Helm pre-install/pre-upgrade hook that waits for the Certificate to be Ready before proceeding, OR
- (c) At minimum, add a comment in the template explaining why
Failis chosen and the cert-manager timing dependency.
🟠 Major (5)
M1. Breaking change needs CHANGELOG entry and migration section
The PR body acknowledges this is a breaking change — default behavior shifts from auto-update to two-phase. However:
- No
CHANGELOG.mdentry is included - The upgrade docs mention "safe-by-default" but do not have a dedicated "Migrating from pre-0.X to 0.X" section
- Existing users who run
helm upgradewill find their schema version frozen
Recommendation: Add a CHANGELOG entry under ## [Unreleased] with a ### Breaking Changes section. Add a migration callout box in upgrades.md explaining: "If you are upgrading from operator version < X.Y.Z, add spec.schemaVersion: "auto" to your DocumentDB resources to preserve previous behavior."
M2. PrimaryUpdateMethod: cnpgv1.PrimaryUpdateMethodSwitchover is unrelated to schema versioning
File: operator/src/internal/cnpg/cnpg_cluster.go, line 885
This sets the CNPG cluster primary update strategy to switchover (vs. restart). While likely a good change, it is orthogonal to two-phase schema upgrades and should be in a separate PR or at minimum called out in the PR description under a separate "Also includes" section.
Risk: If this PR needs to be reverted for a schema-version bug, the switchover change gets reverted too.
M3. SQL construction uses fmt.Sprintf — add safety comment
File: operator/src/internal/controller/documentdb_controller.go, line ~1055
return targetPgVersion, fmt.Sprintf("ALTER EXTENSION documentdb UPDATE TO '%s'", targetPgVersion)The input is CRD-validated (pattern: ^(auto|[0-9]+\.[0-9]+\.[0-9]+)?$) and then transformed by SemverToExtensionVersion (digits + dots + hyphen only), so SQL injection risk is effectively zero. However, this pattern (string formatting SQL) will get flagged by security scanners and could concern future contributors.
Recommendation: Add a comment explaining why string formatting is safe here:
// Safe: targetPgVersion is derived from CRD-validated spec.schemaVersion
// (pattern: ^(auto|[0-9]+\.[0-9]+\.[0-9]+)?$) via SemverToExtensionVersion.
// ALTER EXTENSION ... TO does not support parameterized queries in PostgreSQL.M4. resolveBinaryVersion returns empty for digest-only images without documentDBVersion
File: operator/src/internal/webhook/documentdb_webhook.go, lines ~191-207
if strings.Contains(ref, "@sha256:") {
return db.Spec.DocumentDBVersion
}If a user specifies documentDBImage: "registry.io/image@sha256:abc..." without setting documentDBVersion, the function returns "". The webhook then compares "" against the schema version, which may produce confusing validation errors.
Recommendation: Add a defensive check in the webhook: if resolveBinaryVersion returns "", reject with a clear message like "cannot determine binary version from image reference; set spec.documentDBVersion explicitly when using digest-only image references."
M5. Webhook test suite uses full envtest — consider lighter approach
File: operator/src/internal/webhook/suite_test.go
The test suite sets up a full envtest environment (API server + etcd) for webhook tests that are primarily unit-testing admission logic. This adds significant test execution time and CI resource usage.
Recommendation: Consider whether the admission handler tests truly need envtest, or if they can use a fake client. The extractSemver, resolveBinaryVersion, and validation functions are pure logic and should be tested without envtest.
🟡 Minor (6)
m1. Event emission guard may fire confusingly on brand-new clusters
File: operator/src/internal/controller/documentdb_controller.go, line ~1035
if r.Recorder != nil && documentdb.Status.SchemaVersion != util.ExtensionVersionToSemver(installedVersion) {On initial cluster creation, status.schemaVersion is empty and installedVersion is the just-installed version. The condition "" != "0.109.0" is true, so the event fires — creating a confusing "Schema update available" event on brand-new clusters where there is nothing to upgrade.
Recommendation: Add a check: only emit the event when installedVersion != binaryVersion (i.e., when the binary is actually ahead of the schema).
m2. extractSemver does not validate that parts[0] and parts[1] are numeric
File: operator/src/internal/webhook/documentdb_webhook.go, lines ~215-231
The function validates that parts[2] starts with digits, but parts[0] and parts[1] could be non-numeric (e.g., tag "abc.def.123" would return "abc.def.123"). While unlikely in practice, it is a gap in defensive validation.
Recommendation: Add a quick strconv.Atoi check for parts[0] and parts[1], or use a semver regex.
m3. Two-phase mode test does not verify the "SchemaUpdateAvailable" event
File: operator/src/internal/controller/documentdb_controller_test.go
The test "should skip schema upgrade when schemaVersion is not set (two-phase mode)" verifies no SQL is called and status is correct, but does not assert that a SchemaUpdateAvailable event was recorded. This is a key user-visible behavior of two-phase mode.
Recommendation: Add an assertion on the fake recorder to verify the event message and reason.
m4. Helm template hardcodes webhook service name
File: operator/documentdb-helm-chart/templates/10_documentdb_webhook.yaml
The service name documentdb-webhook-service is hardcoded. Other templates in the chart use {{ .Release.Name }} for naming. Consider using a consistent naming pattern (e.g., {{ .Release.Name }}-webhook) for multi-instance deployments.
m5. SemverToExtensionVersion does not validate input format
File: operator/src/internal/utils/util.go
The function uses strings.LastIndex(version, ".") to replace the last dot with a hyphen. If the input does not contain a dot (e.g., "auto" or garbage input), LastIndex returns -1 and the function returns the input unchanged. While the caller validates input, a defensive check would improve robustness.
Recommendation: Add input validation or document the expected format in the function comment.
m6. Webhook Go file kubebuilder marker duplicates Helm template
File: operator/src/internal/webhook/documentdb_webhook.go, line 25
The //+kubebuilder:webhook marker generates a kustomize webhook config, but the actual deployment uses the Helm template (10_documentdb_webhook.yaml). The marker serves as documentation but could lead to confusion about the source of truth.
Recommendation: Add a comment clarifying that the Helm template is the authoritative webhook configuration.
🟢 Nitpick (4)
n1. upgrades.md uses future tense in multiple places
Per Microsoft Style Guide, prefer present tense:
- "the operator will wait" → "the operator waits"
- "you will need" → "you need"
n2. Content tabs use inconsistent heading levels
The === "Two-phase" tabs contain #### headings inside, which may not render in mkdocs nav. Consider bold text instead of headings inside tabs.
n3. CI workflow steps have redundant echo statements
File: .github/workflows/test-upgrade-and-rollback.yml
Steps 5-8 have descriptive names, but the echo statements inside repeat the step name. The redundant echo "=== Step N: ..." lines can be removed since GitHub Actions shows the step name.
n4. admissionReviewVersions only lists v1
Acceptable for K8s 1.30+ target per project compatibility matrix. No action needed unless earlier K8s versions need support.
✅ Positive Feedback
- Two-phase design is excellent. This is exactly the right pattern for a database operator — schema changes are irreversible, so decoupling them from binary upgrades provides a critical safety net.
- Test coverage is thorough. 6 new controller tests + 20+ webhook tests cover all three modes and edge cases. Existing tests were updated to add
SchemaVersion: "auto"to maintain backward compatibility — good attention to detail. - Webhook implementation is clean.
resolveBinaryVersionhandles multiple image reference formats (tag, digest, arch suffix) gracefully. TheextractSemverfunction is well-focused. - Documentation is well-structured. The
upgrades.mdrewrite with content tabs for Production/Rolling/Development modes is excellent UX. - PR description is exemplary. Clear summary, mode table, user flow example, and explicit backward compatibility callout.
Summary
| Severity | Count | Action |
|---|---|---|
| 🔴 Critical | 1 | Address webhook failurePolicy race with cert-manager |
| 🟠 Major | 5 | CHANGELOG, PrimaryUpdateMethod separation, SQL comment, digest fallback, test weight |
| �� Minor | 6 | Event guard, validation gaps, naming consistency |
| 🟢 Nitpick | 4 | Documentation style |
Overall assessment: Well-designed feature with high code quality and solid test coverage. The main concerns are operational (C1: cert-manager timing) and communication (M1: breaking change). Recommend addressing C1 and M1 before merge; remaining items can be follow-ups.
/request-changes
- C1: Add readiness/liveness/startup probes + optional cert secret to operator deployment (CNPG pattern). Keep failurePolicy: Fail with webhook StartedChecker() readyz gate. - M1: Add [Unreleased] CHANGELOG entry + migration callout in upgrades.md - M3: Add SQL safety comment for ALTER EXTENSION fmt.Sprintf - m1: Fix event guard to check binaryVersion != installedVersion (avoids firing on new clusters with empty status) - m2: Add strconv.Atoi validation for major/minor in extractSemver - m3: Add SchemaUpdateAvailable event assertion to two-phase test - m6: Add comment clarifying kubebuilder marker vs Helm template - n1: Verified docs already use present tense no changes needed - Added single-instance comments to webhook Service and config Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
There was a problem hiding this comment.
Thanks for the thorough review! Here's how I addressed each finding in commit 9c4822b:
Fixed
| ID | Finding | Resolution |
|---|---|---|
| C1 | FailurePolicy: Fail blocks ops if cert-manager unhealthy |
Kept Fail (aligns with CNPG pattern). Added readiness/liveness/startup probes + optional: true on cert secret + webhookServer.StartedChecker() readyz gate. Pod stays out of Service endpoints until TLS cert is loaded. |
| M1 | Breaking change needs CHANGELOG | Added [Unreleased] section with feature + breaking change entries. Added migration callout in upgrades.md. |
| M3 | SQL construction safety | Added comment explaining why fmt.Sprintf is safe - input is CRD-regex-validated + webhook-parsed, contains only digits/dots/hyphens. |
| M4 | Digest fallback returns empty version | Already fixed in prior commit (288f5c4) webhook rejects explicit schemaVersion when binary version is unresolvable. |
| m1 | Event fires on new clusters | Changed guard to binaryVersion != installedVersion instead of status mismatch. |
| m2 | ExtractSemver missing numeric validation | Added strconv.Atoi checks for major and minor parts. |
| m3 | No event assertion in test | Added SchemaUpdateAvailable event assertion to two-phase mode test. |
| m6 | Kubebuilder marker confusion | Added comment clarifying Helm template is authoritative for deployments; marker is for make run. |
| n1 | Future tense in docs | Verified docs already use present tense throughout - no changes needed. |
Won't fix (with rationale)
| ID | Finding | Rationale |
|---|---|---|
| M2 | PrimaryUpdateMethod in separate PR | Switchover is directly related - it enables rolling restarts without downtime during binary phase. Keeping it in this PR. |
| M5 | Webhook tests should use envtest | Incorrect premise - suite_test.go is a plain Ginkgo suite with no envtest/API server/etcd. The webhook tests are pure unit tests. |
| m4 | Hardcoded webhook service name | Acceptable for current single-instance design. Added comments noting what to template if multi-instance is needed in the future. |
| m5 | SemverToExtensionVersion input validation | Callers already validate via CRD regex pattern. Adding redundant validation inside would violate single-responsibility. |
| n2 | Heading levels in tabs | mkdocs-material handles tab-scoped headings correctly. |
| n3 | Redundant echo in CI | Useful for CI log readability when debugging step boundaries. |
| n4 | admissionReviewVersions should include v1beta1 | Our minimum is K8s 1.30+ where v1 is the standard. v1beta1 was removed in K8s 1.22. |
Summary
Add
spec.schemaVersionfield to DocumentDBSpec to decouple binary (image) upgrades from schema (ALTER EXTENSION) upgrades. This provides a rollback-safe window between deploying a new binary and committing the schema change.Closes #271
Three Modes
User Flow (Two-Phase)
Changes
Backward Compatibility
Breaking change: Default behavior changes from auto-update to two-phase. Existing users upgrading to this operator version will need to set schemaVersion: "auto" to restore previous behavior, or adopt two-phase upgrades. This is intentional -- safe by default for a database operator.