feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE]#66
Open
mike-ainsel wants to merge 40 commits into
Add optional migration init containers to the Helm chart and corresponding CloudFormation parameters for migrating from an existing single-server Platforma installation.
Migration flow:
1. Download database dump from user's S3 (aws-cli init container)
2. Restore database using platforma --restore-db (platforma init container)
3. Invalidate caches using platforma --invalidate-caches (platforma init container)
4. Sync primary storage from old S3 bucket to new (aws-cli init container)
5. Start Platforma normally
All steps are idempotent via a marker file on the database PVC. Set migration.enabled: false after successful migration and redeploy.
…rker path
- Rename migration → dataMigration in helm values and templates
- Separate credentials for database dump and storage sync buckets
- Use env var for database path instead of hardcoding in scripts
- Increase migration-tmp emptyDir to 30Gi
- Align CF template with renamed helm values key
Migration runs as standalone kubectl pods in the CF buildspec before helm install. Four sequential steps:
1. Download database dump from S3
2. Restore database using platforma --restore-db
3. Sync primary storage between S3 buckets
4. Invalidate caches using platforma --invalidate-caches
Supports cross-account via MigrationStorageUserArn bucket policy. Idempotent via marker file on the database PVC.
…at/migration-from-single-server
Pre-helm: download dump, restore DB, sync storage
Post-helm: wait for Platforma migrations, invalidate caches, restart
This fixes the RocksDB migration conflict: Platforma must apply its internal DB migrations before cache invalidation can run.
ProvisioningRequest (prevents Kueue resource fragmentation):
- Install CRD before Kueue (version-aligned with CA)
- Kueue: ProvisioningACC gate, memory 512Mi→1Gi
- CA: --enable-provisioning-requests + --kube-api-content-type=application/json
- CA RBAC for provisioningrequests, /status, podtemplates
- Platforma: provisioningRequest.enabled=true
ECR pull-through cache (optional, EnableECRCache parameter):
- Pull-through rule for quay.io with 'quay' prefix
- Lifecycle policy: 90-day expiry
- Node IAM: BatchImportUpstreamImage + CreateRepository
- --default-docker-registry when enabled
Migration: disabled (prints info message). Parameters remain in template for schema compatibility but code blocks are no-op. Safe to apply on existing deployments without triggering migration.
Without healthcheck-path and success-codes, ALB defaults to HTTP GET / which always fails on a gRPC backend. This causes ALB to mark targets unhealthy even when the pod is running, preventing client connections. Critical for license renewal: when license expires and is renewed on the server, Platforma restarts and recovers, but ALB never routes traffic back without a proper health check.
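For illustration only, the two annotations named above could be applied to an ALB Ingress along these lines. The annotation keys are the AWS Load Balancer Controller's; the health check path, the success code, the function name, and the use of kubectl annotate (rather than chart values) are assumptions and may not match what this PR actually sets.

```shell
#!/bin/sh
# Sketch: give an ALB Ingress a gRPC-aware health check.
# ASSUMPTIONS: the backend serves the standard gRPC health service at
# /grpc.health.v1.Health/Check and reports gRPC status 0 when healthy.
# annotate_grpc_healthcheck is a hypothetical helper, not part of the PR.
annotate_grpc_healthcheck() {
  ingress=$1
  namespace=$2
  kubectl annotate ingress "$ingress" -n "$namespace" --overwrite \
    "alb.ingress.kubernetes.io/backend-protocol-version=GRPC" \
    "alb.ingress.kubernetes.io/healthcheck-path=/grpc.health.v1.Health/Check" \
    "alb.ingress.kubernetes.io/success-codes=0"
}

# Usage (ingress and namespace names are placeholders):
#   annotate_grpc_healthcheck my-ingress my-namespace
```

Annotations applied by hand are overwritten by the next Helm release if the chart renders the Ingress, so a durable fix belongs in the chart values.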
Adds ProvisioningRequest documentation to the migration branch:
- advanced-installation.md: CRD install, CA flags, RBAC, config table
- kueue-values.yaml: memory bump to 1Gi, ProvisioningACC feature gate
- upgrade-provisioning-request.md: upgrade guide for manual deployments
- provisioning-request-test-results.md: test results from lab cluster
…ningRequest CRD, CA flags, RBAC)
- Replace ForceUpdateInfra with version string properties on TriggerHelmDeploy so any infra version bump auto-triggers the HelmDeployer
- Add GKE upgrade guide (native ProvisioningRequest, no CRD/CA setup needed)
- Update AWS upgrade guide to reference new trigger mechanism
kubectl apply fails with "cannot change roleRef" when the cluster-autoscaler-provisioning-requests ClusterRoleBinding already exists bound to a different ClusterRole (e.g. the manual provisioning-request-admin role from pre-3.2.x testing). Delete the binding when its roleRef drifts and drop the legacy orphan ClusterRole before re-applying.
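A minimal sketch of the drift check described above, assuming the expected ClusterRole name is passed in by the caller (the PR text does not name it). The binding name and the legacy role name come from the commit message; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: delete the ClusterRoleBinding only when its roleRef has drifted,
# then drop the legacy ClusterRole so a later kubectl apply can succeed.
# fix_rolebinding_drift is a hypothetical helper, not the PR's code.
CRB_NAME="cluster-autoscaler-provisioning-requests"

fix_rolebinding_drift() {
  expected_role=$1   # ClusterRole the binding should point at (assumed input)
  legacy_role=$2     # e.g. provisioning-request-admin
  # Empty result means the binding does not exist yet.
  current_role=$(kubectl get clusterrolebinding "$CRB_NAME" \
    -o jsonpath='{.roleRef.name}' 2>/dev/null || true)
  if [ -n "$current_role" ] && [ "$current_role" != "$expected_role" ]; then
    echo "roleRef drift: $current_role -> $expected_role, deleting binding"
    kubectl delete clusterrolebinding "$CRB_NAME"
  fi
  # roleRef is immutable, so the binding must be recreated by the re-apply.
  kubectl delete clusterrole "$legacy_role" --ignore-not-found
}
```

The function is meant to run immediately before the kubectl apply of the RBAC manifests; when the binding already points at the expected role it is left alone.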
Two fixes for the EnableECRCache toggle when flipped true->false on a live stack:
- EmptyOnDelete on ECRCacheRepository: lets CF delete the managed quay/milaboratories/pl-containers repo even when it holds cached images. Without this CF refuses the delete and rolls the update back.
- EnableECRCache on TriggerPlatformaDeploy: the platforma-deployer build reads ECR_CACHE_ENABLED to decide whether to pass --default-docker-registry to the Platforma chart. Propagating the parameter onto the trigger's properties makes flipping EnableECRCache re-invoke the build automatically, so the running release reconfigs in the same stack update instead of needing a separate ForceUpdatePlatforma bump.
Introduces a client-side-only kill switch for the ECR pull-through cache. EnableECRCache controls whether the cache resources (rule, repo, IAM policy) exist. UseECRCacheForPlatforma controls whether Platforma is configured with --default-docker-registry pointing at the cache. The second flag is only meaningful when the first is true; otherwise it is ignored. Lets operators roll the client back to upstream pulls without tearing down cached images (which take hours to repopulate), and makes debugging cache-specific issues safe. Propagated as an env var on the platforma-deployer CodeBuild and as a property on TriggerPlatformaDeploy so toggling it re-invokes the build automatically.
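The interaction of the two flags can be sketched as a small decision function. This is an illustration under assumptions: the function name, the `--flag=value` form, and the registry URL passed in are hypothetical, and the real platforma-deployer build may wire the value differently.

```shell
#!/bin/sh
# Sketch of the two-flag decision: the registry flag is emitted only when the
# cache exists (EnableECRCache) AND the client is told to use it
# (UseECRCacheForPlatforma). ecr_registry_arg is a hypothetical helper.
ecr_registry_arg() {
  cache_enabled=$1   # value of EnableECRCache
  use_cache=$2       # value of UseECRCacheForPlatforma
  registry=$3        # assumed form: <account>.dkr.ecr.<region>.amazonaws.com/quay
  if [ "$cache_enabled" = "true" ] && [ "$use_cache" = "true" ]; then
    printf '%s\n' "--default-docker-registry=$registry"
  fi
  # Otherwise print nothing: Platforma keeps pulling from upstream while the
  # cached images stay in place.
}
```

With `true false` the cache resources stay provisioned but no flag is produced, which is the rollback path the commit describes.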
Standalone walkthrough of the two-phase migration implemented by infrastructure/aws/migration.sh: pre-helm DB restore + S3 sync, helm install, post-helm cache invalidation. Includes the pod YAML the script generates so the procedure can be run manually or wired into Terraform. Adds a reference section covering all Cluster Autoscaler extraArgs, Kueue Helm values + chart-level kueue.* knobs, and the AppWrapper install (incl. webhook-deletion workaround).
…side wiring
Step 1.2.1: dedicated 'platforma-migration' service account with an IAM
policy granting read on the source dump+storage buckets and read/write on
the destination — wired into mgr-download and mgr-sync via
serviceAccountName, replacing inline AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
for the same-account case. Cross-account still supported by re-enabling
the env block.
Component Options Reference rewritten around a single question: what does
the operator have to add to the cluster so the chart's auto-rendered
ProvisioningRequestConfig + AdmissionCheck actually fire?
- Cluster Autoscaler: --enable-provisioning-requests=true,
--kube-api-content-type=application/json (workaround for #8855), the
ProvisioningRequest CRD, and the ClusterRole/Binding granting
provisioningrequests + podtemplates verbs. IAM scoping note about
autoscaling:ResourceTag vs aws:ResourceTag.
- Kueue: ProvisioningACC feature gate + the rendered
ProvisioningRequestConfig / AdmissionCheck / ClusterQueue
admissionChecksStrategy as audit reference.
- AppWrapper: install + webhook-deletion workaround (no ProvReq config
of its own).
- End-to-end verification chain: AppWrapper -> Workload ->
ProvisioningRequest -> ClusterQueue/AdmissionCheck Active.
- Add Phase 0 covering source-side database dump (curl /db/state_raw) and identification of the source primary-storage bucket.
- Swap aws s3 sync for s5cmd in Step 1.5 (10-30x faster for many-small-object buckets); document trailing-slash gotcha for SOURCE_PREFIX.
- Add Step 1.5b: S3 Batch Replication for multi-TB buckets: versioning prerequisites, replication IAM role, bucket replication config, Batch Operations IAM role, batch job creation with auto-generated manifest, monitoring, and teardown.
- Decision table in Step 1.5 directs operators to s5cmd or Batch Replication based on bucket size.
- Variables block now exports CLUSTER_NAME, AWS_ACCOUNT_ID, MI_LICENSE, and splits PLATFORMA_VERSION (image) from CHART_VERSION (Helm).
- Correct the idempotency claim: only the final invalidate step writes .migration-complete; intermediate steps gate on it.
- Gate Step 1.6 cleanup on .migration-complete for consistency.
- Comment out mgr-sync env block by default (matches mgr-download IRSA pattern; explicit env would override IRSA credentials).
- Drop See Also section that linked the internal core/pl repo.
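The corrected idempotency rule above (intermediate steps only check .migration-complete, the final invalidate step is the only one that writes it) might be sketched as follows. The function names and the directory argument are hypothetical and are not taken from migration.sh.

```shell
#!/bin/sh
# Sketch of marker-gated migration steps. run_gated and finish_migration are
# hypothetical helpers; the directory argument stands in for the database PVC
# mount path.
MARKER_NAME=".migration-complete"

# Intermediate step: run the command only if the marker is absent.
# Never writes the marker itself.
run_gated() {
  db_dir=$1; shift
  if [ -f "$db_dir/$MARKER_NAME" ]; then
    echo "skip: migration already complete"
    return 0
  fi
  "$@"
}

# Final step: run the command, then write the marker only on success, so a
# failed run leaves every earlier step eligible to re-run.
finish_migration() {
  db_dir=$1; shift
  if [ -f "$db_dir/$MARKER_NAME" ]; then
    echo "skip: migration already complete"
    return 0
  fi
  "$@" && date -u +%Y-%m-%dT%H:%M:%SZ > "$db_dir/$MARKER_NAME"
}
```

Writing the marker last means a crash between restore and invalidate repeats both on the next run rather than silently skipping the invalidation.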
Primary storage and the database are both required for Platforma to serve projects. Restoring the DB alone isn't enough, since blocks reference objects that must already exist in the destination bucket. The storage sync is also the long-pole step (hours for TB-scale via Batch Replication, longer via s5cmd).
Move the storage sync ahead of the DB download/restore in Phase 1:
- Step 1.3 = Sync Primary Storage (was 1.5)
- Step 1.3b = S3 Batch Replication (was 1.5b)
- Step 1.4 = Download DB dump (was 1.3)
- Step 1.5 = Restore DB (was 1.4)
This way operators kick off the slow sync first and run the much faster DB steps in parallel. Updated cross-references in Phase 0, Phase 1 intro, and Cross-Account Sync.
Batch Replication is entirely AWS-side: it does not need the Kubernetes cluster, so it can (and should) be kicked off in Phase 0 alongside the DB dump. For multi-TB source buckets this is the critical change: AWS copies in the background for hours while the operator stands up the cluster, downloads/restores the DB, etc., instead of serializing those steps.
- Move Step 1.3b -> Step 0.6 (and substeps 0.6.1-0.6.7).
- Step 1.3 (s5cmd) now scoped explicitly to < 1 TB buckets.
- Decision table moved to Step 1.3; Phase 0 intro lists the running Batch job as a third artefact.
- Updated Phase 1 intro and Step 0.5 cross-refs.
Real bug: mgr-sync (Step 1.3) mounted the platforma-database PVC just to check the .migration-complete marker, but mgr-download/restore/invalidate also mount the same RWO PVC. K8s would refuse to schedule mgr-sync to a different node while a DB pod is running, breaking the 'runs in parallel' claim. s5cmd is already idempotent via ETag comparison, so the marker check was redundant; drop the PVC mount.
Other fixes from the review:
- Idempotency intro now distinguishes PVC-bound steps (marker-gated) from Phase 0 / Step 1.3 (idempotent by underlying op).
- Step 0.4 makes the forward-reference to Step 1.2.1 explicit.
- Step 0.6.3 reworked: heading marked optional/recommended, lead with the rationale (safety net) instead of contradicting it in a callout.
- Step 0.6.7 comment clarifies it deletes the whole replication config.
- Drop stale 'prefer Batch Replication' tuning note in Step 1.3. Batch is now Step 0.6, redirected from the decision table at the top.
- Rollback claim about untouched source qualified with Step 0.2 caveat.
- Phrasing nits (asynchronously/in-the-background redundancy, duplicate 'long-pole' framing across sections).
This section duplicated content from infrastructure/aws/advanced-installation.md (Cluster Autoscaler ProvisioningRequest setup, Kueue feature gates, AppWrapper webhook workaround, end-to-end verification). The Prerequisites table already directs readers to advanced-installation.md Steps 1-8, which is the source of truth — keeping a parallel copy in MIGRATION.md risks drift.
- Correct the re-run safety story for S3 Batch Replication: the ClientRequestToken uses $(date +%s), so create-job is NOT idempotent. Caller must check existing job status before re-running.
- Split the dense idempotency paragraph in the intro into a bulleted list of the four distinct mechanisms in play.
- Drop s3:PutInventoryConfiguration from the Batch Operations role: the job uses S3JobManifestGenerator, not an S3 Inventory report.
- Tighten Step 3.2's 'no-op on re-run' claim: only PVC-mounting steps are marker-gated; Step 1.3 and Phase 0 use other mechanisms.
- Drop redundant AWS_REGION env in mgr-sync (already covered by --source-region and --destination-region flags).
- Shrink mgr-sync timeout from 24h to 6h (matches the < 1 TB scope).
- 'k8s' -> 'Kubernetes'.
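Because a timestamp-based ClientRequestToken makes every create-job call a new job, the "check existing job status" step could look roughly like this. It is a sketch under assumptions: jobs are identified by a description string the operator chose at creation time, and the function names are hypothetical; only aws s3control list-jobs and its --job-statuses values are real CLI surface.

```shell
#!/bin/sh
# Sketch: look for an in-flight or finished Batch Operations job before
# calling create-job again. Matching on Description is an ASSUMPTION about
# how the original job was labelled; both helpers are hypothetical.
find_existing_batch_job() {
  account_id=$1
  description=$2
  aws s3control list-jobs \
    --account-id "$account_id" \
    --job-statuses New Preparing Ready Active Completing Complete \
    --query "Jobs[?Description=='$description'].JobId" \
    --output text
}

# Returns 0 when it is safe to create a new job, 1 when one already exists.
should_create_batch_job() {
  existing=$(find_existing_batch_job "$1" "$2")
  if [ -n "$existing" ] && [ "$existing" != "None" ]; then
    echo "existing job(s): $existing" >&2
    return 1
  fi
  return 0
}

# Usage (account ID and description are placeholders):
#   should_create_batch_job 111122223333 platforma-migration && echo "create"
```

Failed and cancelled jobs are deliberately left out of the status filter, so a re-run after a failure is allowed to create a fresh job.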
Summary
Client deployment template with all latest infrastructure improvements. Migration code is disabled for safe redeployment.
Changes
ProvisioningRequest (prevents Kueue resource fragmentation)
- Kueue: ProvisioningACC feature gate, memory increased to 1Gi (was OOMKilled at 512Mi)
- CA: --enable-provisioning-requests + --kube-api-content-type=application/json
- CA RBAC for provisioningrequests, /status, podtemplates
- Platforma: provisioningRequest.enabled: true
Tested on platforma-cluster-lab: scale-from-zero, fragmentation, pool-at-capacity scenarios all pass.
ECR Pull-Through Cache (optional)
- EnableECRCache (default: false)
- Pull-through rule for quay.io, lifecycle policy (90-day expiry)
- Node IAM: BatchImportUpstreamImage + CreateRepository
- --default-docker-registry when enabled
Migration (DISABLED)
- Marker file (.migration-complete) provides additional safety
Safe to Apply
A CF stack update with this template will:
Test Plan
EnableECRCache=true on test cluster