feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE] by mike-ainsel · Pull Request #66 · milaboratory/platforma-helm

mike-ainsel · 2026-04-17T11:43:49Z

Summary

Client deployment template with all latest infrastructure improvements. Migration code is disabled for safe redeployment.

Changes

ProvisioningRequest (prevents Kueue resource fragmentation)

Install ProvisioningRequest CRD before Kueue (version-aligned with CA)
Kueue: ProvisioningACC feature gate, memory increased to 1Gi (was OOMKilled at 512Mi)
Cluster Autoscaler: --enable-provisioning-requests + --kube-api-content-type=application/json
CA RBAC: supplementary ClusterRole for provisioningrequests, /status, podtemplates
Platforma values: provisioningRequest.enabled: true

Tested on platforma-cluster-lab: scale-from-zero, fragmentation, pool-at-capacity scenarios all pass.

ECR Pull-Through Cache (optional)

New parameter EnableECRCache (default: false)
Creates pull-through rule for quay.io, lifecycle policy (90-day expiry)
Node IAM for BatchImportUpstreamImage + CreateRepository
Sets --default-docker-registry when enabled

Migration (DISABLED)

Migration code blocks replaced with info messages
Parameters remain in template for schema compatibility
Safe to apply on existing deployments — migration will NOT run regardless of parameter values
The idempotency marker (.migration-complete) provides additional safety

Safe to Apply

A CF stack update with this template will:

Install ProvisioningRequest CRD + upgrade Kueue (memory, feature gate)
Upgrade CA with provisioning flags + add RBAC
Upgrade Platforma chart with ProvisioningRequest enabled
NOT run any migration code

Test Plan

ProvisioningRequest: scale-from-zero, fragmentation, pool-at-capacity
Migration disabled: code blocks are no-op
ECR pull-through: deploy with EnableECRCache=true on test cluster
Full stack update on client cluster

Add optional migration init containers to the Helm chart and corresponding CloudFormation parameters for migrating from an existing single-server Platforma installation. Migration flow: 1. Download database dump from user's S3 (aws-cli init container) 2. Restore database using platforma --restore-db (platforma init container) 3. Invalidate caches using platforma --invalidate-caches (platforma init container) 4. Sync primary storage from old S3 bucket to new (aws-cli init container) 5. Start Platforma normally All steps are idempotent via a marker file on the database PVC. Set migration.enabled: false after successful migration and redeploy.

…rker path - Rename migration → dataMigration in helm values and templates - Separate credentials for database dump and storage sync buckets - Use env var for database path instead of hardcoding in scripts - Increase migration-tmp emptyDir to 30Gi - Align CF template with renamed helm values key

Migration runs as standalone kubectl pods in the CF buildspec before helm install. Four sequential steps: 1. Download database dump from S3 2. Restore database using platforma --restore-db 3. Sync primary storage between S3 buckets 4. Invalidate caches using platforma --invalidate-caches Supports cross-account via MigrationStorageUserArn bucket policy. Idempotent via marker file on the database PVC.

…ssion

…at/migration-from-single-server

Pre-helm: download dump, restore DB, sync storage Post-helm: wait for Platforma migrations, invalidate caches, restart This fixes the RocksDB migration conflict — Platforma must apply its internal DB migrations before cache invalidation can run.

ProvisioningRequest (prevents Kueue resource fragmentation): - Install CRD before Kueue (version-aligned with CA) - Kueue: ProvisioningACC gate, memory 512Mi→1Gi - CA: --enable-provisioning-requests + --kube-api-content-type=application/json - CA RBAC for provisioningrequests, /status, podtemplates - Platforma: provisioningRequest.enabled=true ECR pull-through cache (optional, EnableECRCache parameter): - Pull-through rule for quay.io with 'quay' prefix - Lifecycle policy: 90-day expiry - Node IAM: BatchImportUpstreamImage + CreateRepository - --default-docker-registry when enabled Migration: disabled (prints info message). Parameters remain in template for schema compatibility but code blocks are no-op. Safe to apply on existing deployments without triggering migration.

Without healthcheck-path and success-codes, ALB defaults to HTTP GET / which always fails on a gRPC backend. This causes ALB to mark targets unhealthy even when the pod is running, preventing client connections. Critical for license renewal: when license expires and is renewed on the server, Platforma restarts and recovers, but ALB never routes traffic back without a proper health check.

…port)

Adds ProvisioningRequest documentation to the migration branch: - advanced-installation.md: CRD install, CA flags, RBAC, config table - kueue-values.yaml: memory bump to 1Gi, ProvisioningACC feature gate - upgrade-provisioning-request.md: upgrade guide for manual deployments - provisioning-request-test-results.md: test results from lab cluster

…ningRequest CRD, CA flags, RBAC)

…ontroller

- Replace ForceUpdateInfra with version string properties on TriggerHelmDeploy so any infra version bump auto-triggers the HelmDeployer - Add GKE upgrade guide (native ProvisioningRequest, no CRD/CA setup needed) - Update AWS upgrade guide to reference new trigger mechanism

kubectl apply fails with "cannot change roleRef" when the cluster-autoscaler-provisioning-requests ClusterRoleBinding already exists bound to a different ClusterRole (e.g. the manual provisioning-request-admin role from pre-3.2.x testing). Delete the binding when its roleRef drifts and drop the legacy orphan ClusterRole before re-applying.

Two fixes for the EnableECRCache toggle when flipped true->false on a live stack: - EmptyOnDelete on ECRCacheRepository: lets CF delete the managed quay/milaboratories/pl-containers repo even when it holds cached images. Without this CF refuses the delete and rolls the update back. - EnableECRCache on TriggerPlatformaDeploy: the platforma-deployer build reads ECR_CACHE_ENABLED to decide whether to pass --default-docker-registry to the Platforma chart. Propagating the parameter onto the trigger's properties makes flipping EnableECRCache re-invoke the build automatically, so the running release reconfigs in the same stack update instead of needing a separate ForceUpdatePlatforma bump.

Introduces a client-side-only kill switch for the ECR pull-through cache. EnableECRCache controls whether the cache resources (rule, repo, IAM policy) exist. UseECRCacheForPlatforma controls whether Platforma is configured with --default-docker-registry pointing at the cache. The second flag is only meaningful when the first is true; otherwise it is ignored. Lets operators roll the client back to upstream pulls without tearing down cached images (which take hours to repopulate), and makes debugging cache-specific issues safe. Propagated as an env var on the platforma-deployer CodeBuild and as a property on TriggerPlatformaDeploy so toggling it re-invokes the build automatically.

Standalone walkthrough of the two-phase migration implemented by infrastructure/aws/migration.sh: pre-helm DB restore + S3 sync, helm install, post-helm cache invalidation. Includes the pod YAML the script generates so the procedure can be run manually or wired into Terraform. Adds a reference section covering all Cluster Autoscaler extraArgs, Kueue Helm values + chart-level kueue.* knobs, and the AppWrapper install (incl. webhook-deletion workaround).

…side wiring Step 1.2.1: dedicated 'platforma-migration' service account with an IAM policy granting read on the source dump+storage buckets and read/write on the destination — wired into mgr-download and mgr-sync via serviceAccountName, replacing inline AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY for the same-account case. Cross-account still supported by re-enabling the env block. Component Options Reference rewritten around a single question: what does the operator have to add to the cluster so the chart's auto-rendered ProvisioningRequestConfig + AdmissionCheck actually fire? - Cluster Autoscaler: --enable-provisioning-requests=true, --kube-api-content-type=application/json (workaround for #8855), the ProvisioningRequest CRD, and the ClusterRole/Binding granting provisioningrequests + podtemplates verbs. IAM scoping note about autoscaling:ResourceTag vs aws:ResourceTag. - Kueue: ProvisioningACC feature gate + the rendered ProvisioningRequestConfig / AdmissionCheck / ClusterQueue admissionChecksStrategy as audit reference. - AppWrapper: install + webhook-deletion workaround (no ProvReq config of its own). - End-to-end verification chain: AppWrapper -> Workload -> ProvisioningRequest -> ClusterQueue/AdmissionCheck Active.

- Add Phase 0 covering source-side database dump (curl /db/state_raw) and identification of the source primary-storage bucket. - Swap aws s3 sync for s5cmd in Step 1.5 (10-30x faster for many-small- object buckets); document trailing-slash gotcha for SOURCE_PREFIX. - Add Step 1.5b: S3 Batch Replication for multi-TB buckets - versioning prerequisites, replication IAM role, bucket replication config, Batch Operations IAM role, batch job creation with auto-generated manifest, monitoring, and teardown. - Decision table in Step 1.5 directs operators to s5cmd or Batch Replication based on bucket size. - Variables block now exports CLUSTER_NAME, AWS_ACCOUNT_ID, MI_LICENSE, and splits PLATFORMA_VERSION (image) from CHART_VERSION (Helm). - Correct the idempotency claim: only the final invalidate step writes .migration-complete; intermediate steps gate on it. - Gate Step 1.6 cleanup on .migration-complete for consistency. - Comment out mgr-sync env block by default (matches mgr-download IRSA pattern; explicit env would override IRSA credentials). - Drop See Also section that linked the internal core/pl repo.

Primary storage and the database are both required for Platforma to serve projects — restoring the DB alone isn't enough, since blocks reference objects that must already exist in the destination bucket. The storage sync is also the long-pole step (hours for TB-scale via Batch Replication, longer via s5cmd). Move the storage sync ahead of the DB download/restore in Phase 1: - Step 1.3 = Sync Primary Storage (was 1.5) - Step 1.3b = S3 Batch Replication (was 1.5b) - Step 1.4 = Download DB dump (was 1.3) - Step 1.5 = Restore DB (was 1.4) This way operators kick off the slow sync first and run the much faster DB steps in parallel. Updated cross-references in Phase 0, Phase 1 intro, and Cross-Account Sync.

Batch Replication is entirely AWS-side — it does not need the Kubernetes cluster, so it can (and should) be kicked off in Phase 0 alongside the DB dump. For multi-TB source buckets this is the critical change: AWS copies in the background for hours while the operator stands up the cluster, downloads/restores the DB, etc., instead of serializing those steps. - Move Step 1.3b -> Step 0.6 (and substeps 0.6.1-0.6.7). - Step 1.3 (s5cmd) now scoped explicitly to < 1 TB buckets. - Decision table moved to Step 1.3; Phase 0 intro lists the running Batch job as a third artefact. - Updated Phase 1 intro and Step 0.5 cross-refs.

Real bug: mgr-sync (Step 1.3) mounted the platforma-database PVC just to check the .migration-complete marker, but mgr-download/restore/ invalidate also mount the same RWO PVC. K8s would refuse to schedule mgr-sync to a different node while a DB pod is running, breaking the 'runs in parallel' claim. s5cmd is already idempotent via ETag comparison, so the marker check was redundant — drop the PVC mount. Other fixes from the review: - Idempotency intro now distinguishes PVC-bound steps (marker-gated) from Phase 0 / Step 1.3 (idempotent by underlying op). - Step 0.4 makes the forward-reference to Step 1.2.1 explicit. - Step 0.6.3 reworked: heading marked optional/recommended, lead with the rationale (safety net) instead of contradicting it in a callout. - Step 0.6.7 comment clarifies it deletes the whole replication config. - Drop stale 'prefer Batch Replication' tuning note in Step 1.3 — Batch is now Step 0.6, redirected from the decision table at the top. - Rollback claim about untouched source qualified with Step 0.2 caveat. - Phrasing nits (asynchronously/in-the-background redundancy, duplicate 'long-pole' framing across sections).

This section duplicated content from infrastructure/aws/advanced-installation.md (Cluster Autoscaler ProvisioningRequest setup, Kueue feature gates, AppWrapper webhook workaround, end-to-end verification). The Prerequisites table already directs readers to advanced-installation.md Steps 1-8, which is the source of truth — keeping a parallel copy in MIGRATION.md risks drift.

- Correct the re-run safety story for S3 Batch Replication: the ClientRequestToken uses $(date +%s), so create-job is NOT idempotent. Caller must check existing job status before re-running. - Split the dense idempotency paragraph in the intro into a bulleted list of the four distinct mechanisms in play. - Drop s3:PutInventoryConfiguration from the Batch Operations role — the job uses S3JobManifestGenerator, not an S3 Inventory report. - Tighten Step 3.2's 'no-op on re-run' claim: only PVC-mounting steps are marker-gated; Step 1.3 and Phase 0 use other mechanisms. - Drop redundant AWS_REGION env in mgr-sync (already covered by --source-region and --destination-region flags). - Shrink mgr-sync timeout from 24h to 6h (matches the < 1 TB scope). - 'k8s' -> 'Kubernetes'.

mike-ainsel added 20 commits March 24, 2026 17:29

fix: compact migration script to fit CodeBuild 25.6KB buildspec limit

a1e217e

fix: rewrite migration.sh with proper YAML and Helm PVC labels

5891e96

fix: rewrite migration.sh with proper YAML generation via functions

fe76f4a

fix: guard against empty PLATFORMA_VERSION in migration script

66a1524

fix: indent script in pod YAML, stop on migration failure

9aeaed0

fix: use /app/platforma full path in migration pods

4f32a99

fix: add --ignore-not-found to pod delete in wait_pod

55870ce

fix: add --copy-props none to s3 sync to avoid GetObjectTagging permi…

9a537b8

…ssion

fix: create gp3 storage class before PVC in migration script

168a3e5

fix: add /data/main emptyDir to platforma pods for --main-root

ed62d01

Merge branch 'main' of github.com:milaboratory/platforma-helm into fe…

a4355db

…at/migration-from-single-server

fix: scale down Platforma before cache invalidation (RWO PVC)

269d956

fix: derive AWS_ACCOUNT_ID at runtime for ECR registry URL

ab33446

feat: bump Platforma chart to 3.2.2 (includes ProvisioningRequest sup…

23accda

…port)

mike-ainsel changed the title ~~feat: client deployment with ProvisioningRequest + ECR cache (migration disabled)~~ feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE] Apr 17, 2026

mike-ainsel added 9 commits April 17, 2026 17:08

fix: correct chart version to 3.2.2 in upgrade guide

d5e0545

fix: bump ForceUpdateInfra to trigger HelmDeployer (installs Provisio…

479bedf

…ningRequest CRD, CA flags, RBAC)

docs: add ForceUpdateInfra explanation for CF infra re-deployment

2f51ce8

fix: restart Kueue after CRD install to pick up ProvisioningRequest c…

e9779b8

…ontroller

chore: bump ForceUpdatePlatforma to re-pull chart 3.3.0

6fac31d

mike-ainsel and others added 11 commits April 21, 2026 16:36

Adjust RAM limits to fit the real limitation on nodes in AWS cluster

599ae0c

bump build-spec-revision

cc528ff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE]#66

feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE]#66
mike-ainsel wants to merge 40 commits into
mainfrom
feat/migration-from-single-server

mike-ainsel commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mike-ainsel commented Apr 17, 2026

Summary

Changes

ProvisioningRequest (prevents Kueue resource fragmentation)

ECR Pull-Through Cache (optional)

Migration (DISABLED)

Safe to Apply

Test Plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants