feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE]#66

Open
mike-ainsel wants to merge 40 commits into main from feat/migration-from-single-server

@mike-ainsel
Member

Summary

Client deployment template with all the latest infrastructure improvements. Migration code is disabled so the template can be redeployed safely.

Changes

ProvisioningRequest (prevents Kueue resource fragmentation)

  • Install ProvisioningRequest CRD before Kueue (version-aligned with CA)
  • Kueue: ProvisioningACC feature gate, memory increased to 1Gi (was OOMKilled at 512Mi)
  • Cluster Autoscaler: --enable-provisioning-requests + --kube-api-content-type=application/json
  • CA RBAC: supplementary ClusterRole for provisioningrequests, provisioningrequests/status, and podtemplates
  • Platforma values: provisioningRequest.enabled: true

Tested on platforma-cluster-lab: scale-from-zero, fragmentation, pool-at-capacity scenarios all pass.
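The two Cluster Autoscaler flags listed above could be passed through the chart's `extraArgs` map. This is a minimal sketch, assuming CA is installed from the upstream cluster-autoscaler Helm chart, which renders each `extraArgs` key as a `--key=value` container flag. The helper name `ca_provisioning_set_args` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the Helm arguments (one token per line) that
# enable ProvisioningRequest support in Cluster Autoscaler.
# Assumption: the chart maps each extraArgs key to a --key=value flag.
ca_provisioning_set_args() {
  printf '%s\n' \
    --set-string 'extraArgs.enable-provisioning-requests=true' \
    --set-string 'extraArgs.kube-api-content-type=application/json'
}
```

One way to use it is `mapfile -t ca_args < <(ca_provisioning_set_args)` followed by `helm upgrade <release> <chart> --reuse-values "${ca_args[@]}"`, where the release and chart names are placeholders for whatever the stack actually uses.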

ECR Pull-Through Cache (optional)

  • New parameter EnableECRCache (default: false)
  • Creates pull-through rule for quay.io, lifecycle policy (90-day expiry)
  • Node IAM permissions for ecr:BatchImportUpstreamImage and ecr:CreateRepository
  • Sets --default-docker-registry when enabled
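For reference, a pull-through cache rule is addressed through the account's private ECR registry host followed by the rule's repository prefix. The sketch below builds that string. The helper name is hypothetical, and the default prefix `quay` is taken from the commit message further down this PR, not from the template itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the registry prefix exposed by an ECR
# pull-through cache rule. Private ECR registries are addressed as
# <account-id>.dkr.ecr.<region>.amazonaws.com; the rule's repository
# prefix follows the host.
# Usage: ecr_cache_registry <account-id> <region> [prefix]
ecr_cache_registry() {
  local account_id=$1 region=$2 prefix=${3:-quay}
  if [[ ! $account_id =~ ^[0-9]{12}$ ]]; then
    echo "account id must be 12 digits: $account_id" >&2
    return 1
  fi
  printf '%s.dkr.ecr.%s.amazonaws.com/%s\n' "$account_id" "$region" "$prefix"
}
```

For example, `ecr_cache_registry 123456789012 eu-central-1` prints `123456789012.dkr.ecr.eu-central-1.amazonaws.com/quay`; the account ID and region here are placeholders.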

Migration (DISABLED)

  • Migration code blocks replaced with info messages
  • Parameters remain in template for schema compatibility
  • Safe to apply on existing deployments — migration will NOT run regardless of parameter values
  • The idempotency marker (.migration-complete) provides additional safety

Safe to Apply

A CF stack update with this template will:

  1. Install ProvisioningRequest CRD + upgrade Kueue (memory, feature gate)
  2. Upgrade CA with provisioning flags + add RBAC
  3. Upgrade Platforma chart with ProvisioningRequest enabled
  4. NOT run any migration code

Test Plan

  • ProvisioningRequest: scale-from-zero, fragmentation, pool-at-capacity
  • Migration disabled: code blocks are no-ops
  • ECR pull-through: deploy with EnableECRCache=true on test cluster
  • Full stack update on client cluster

Add optional migration init containers to the Helm chart and
corresponding CloudFormation parameters for migrating from an
existing single-server Platforma installation.

Migration flow:
1. Download database dump from user's S3 (aws-cli init container)
2. Restore database using platforma --restore-db (platforma init container)
3. Invalidate caches using platforma --invalidate-caches (platforma init container)
4. Sync primary storage from old S3 bucket to new (aws-cli init container)
5. Start Platforma normally

All steps are idempotent via a marker file on the database PVC.
Set migration.enabled: false after successful migration and redeploy.

…rker path

- Rename migration → dataMigration in helm values and templates
- Separate credentials for database dump and storage sync buckets
- Use env var for database path instead of hardcoding in scripts
- Increase migration-tmp emptyDir to 30Gi
- Align CF template with renamed helm values key

Migration runs as standalone kubectl pods in the CF buildspec before
helm install. Four sequential steps:
1. Download database dump from S3
2. Restore database using platforma --restore-db
3. Sync primary storage between S3 buckets
4. Invalidate caches using platforma --invalidate-caches

Supports cross-account via MigrationStorageUserArn bucket policy.
Idempotent via marker file on the database PVC.

Pre-helm: download dump, restore DB, sync storage
Post-helm: wait for Platforma migrations, invalidate caches, restart

This fixes the RocksDB migration conflict — Platforma must apply its
internal DB migrations before cache invalidation can run.
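The ordering constraint above (wait for Platforma's own migrations, then invalidate caches) amounts to polling a readiness probe with a deadline. This is a minimal sketch of such a wait helper; it is not the code in the buildspec, and the probe command is left to the caller because this PR does not say how readiness is detected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a probe command until it succeeds or a deadline
# passes, so a later step only runs once the probe reports success.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

A post-helm step could then read `wait_until 600 10 platforma_migrations_done && invalidate_caches`, where both command names are placeholders for whatever probe and invalidation call the deployment uses.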
ProvisioningRequest (prevents Kueue resource fragmentation):
- Install CRD before Kueue (version-aligned with CA)
- Kueue: ProvisioningACC gate, memory 512Mi→1Gi
- CA: --enable-provisioning-requests + --kube-api-content-type=application/json
- CA RBAC for provisioningrequests, provisioningrequests/status, and podtemplates
- Platforma: provisioningRequest.enabled=true

ECR pull-through cache (optional, EnableECRCache parameter):
- Pull-through rule for quay.io with 'quay' prefix
- Lifecycle policy: 90-day expiry
- Node IAM: BatchImportUpstreamImage + CreateRepository
- --default-docker-registry when enabled

Migration: disabled (prints info message). Parameters remain in
template for schema compatibility, but the code blocks are no-ops.
Safe to apply on existing deployments without triggering migration.

Without healthcheck-path and success-codes, ALB defaults to HTTP GET /
which always fails on a gRPC backend. This causes ALB to mark targets
unhealthy even when the pod is running, preventing client connections.

Critical for license renewal: when license expires and is renewed on
the server, Platforma restarts and recovers, but ALB never routes
traffic back without a proper health check.
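The two settings named in this commit message correspond to AWS Load Balancer Controller annotations. The sketch below only renders them as `key=value` pairs. The helper name is hypothetical, the path and success codes are caller-supplied because Platforma's actual health endpoint is not stated here, and the PR may well set these through Helm values instead of `kubectl annotate`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the two AWS Load Balancer Controller health-check
# annotations as key=value pairs, in the form `kubectl annotate` accepts.
# Usage: alb_healthcheck_annotations <health-check-path> <success-codes>
alb_healthcheck_annotations() {
  local path=$1 codes=$2
  printf 'alb.ingress.kubernetes.io/healthcheck-path=%s\n' "$path"
  printf 'alb.ingress.kubernetes.io/success-codes=%s\n' "$codes"
}
```

As an illustration only, `kubectl annotate ingress <name> --overwrite $(alb_healthcheck_annotations <path> <codes>)` would apply them to an existing Ingress; the Ingress name, path, and codes all have to come from the real deployment.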
@mike-ainsel changed the title from "feat: client deployment with ProvisioningRequest + ECR cache (migration disabled)" to "feat: client deployment with ProvisioningRequest + ECR cache (migration disabled) [DON'T MERGE]" on Apr 17, 2026

Adds ProvisioningRequest documentation to the migration branch:
- advanced-installation.md: CRD install, CA flags, RBAC, config table
- kueue-values.yaml: memory bump to 1Gi, ProvisioningACC feature gate
- upgrade-provisioning-request.md: upgrade guide for manual deployments
- provisioning-request-test-results.md: test results from lab cluster

- Replace ForceUpdateInfra with version string properties on TriggerHelmDeploy
  so any infra version bump auto-triggers the HelmDeployer
- Add GKE upgrade guide (native ProvisioningRequest, no CRD/CA setup needed)
- Update AWS upgrade guide to reference new trigger mechanism

kubectl apply fails with "cannot change roleRef" when the
cluster-autoscaler-provisioning-requests ClusterRoleBinding already
exists bound to a different ClusterRole (e.g. the manual
provisioning-request-admin role from pre-3.2.x testing). Delete the
binding when its roleRef drifts and drop the legacy orphan ClusterRole
before re-applying.
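The drift check described above can be sketched as follows. Because a binding's roleRef is immutable, the only way to repoint it is to delete the binding and let the next apply recreate it. The function name is hypothetical and the expected ClusterRole name is a parameter, since the commit message does not name the new role.

```shell
#!/usr/bin/env bash
# Sketch: delete a ClusterRoleBinding when it references a different
# ClusterRole than expected, so a following `kubectl apply` can recreate it.
# Usage: reset_binding_on_roleref_drift <binding-name> <expected-clusterrole>
reset_binding_on_roleref_drift() {
  local binding=$1 expected=$2 current
  # Empty output means the binding does not exist yet; nothing to fix.
  current=$(kubectl get clusterrolebinding "$binding" \
    -o jsonpath='{.roleRef.name}' 2>/dev/null) || current=""
  if [[ -n $current && $current != "$expected" ]]; then
    echo "roleRef drift on $binding: $current -> $expected, deleting binding"
    kubectl delete clusterrolebinding "$binding"
  fi
}
```

The legacy role mentioned above could be dropped in the same pass with `kubectl delete clusterrole provisioning-request-admin --ignore-not-found`, which succeeds whether or not the role is still present.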
Two fixes for the EnableECRCache toggle when flipped true->false on a
live stack:

- EmptyOnDelete on ECRCacheRepository: lets CF delete the managed
  quay/milaboratories/pl-containers repo even when it holds cached
  images. Without this CF refuses the delete and rolls the update back.
- EnableECRCache on TriggerPlatformaDeploy: the platforma-deployer
  build reads ECR_CACHE_ENABLED to decide whether to pass
  --default-docker-registry to the Platforma chart. Propagating the
  parameter onto the trigger's properties makes flipping EnableECRCache
  re-invoke the build automatically, so the running release reconfigures
  in the same stack update instead of needing a separate
  ForceUpdatePlatforma bump.

mike-ainsel and others added 11 commits on April 21, 2026, 16:36

Introduces a client-side-only kill switch for the ECR pull-through cache.
EnableECRCache controls whether the cache resources (rule, repo, IAM
policy) exist. UseECRCacheForPlatforma controls whether Platforma is
configured with --default-docker-registry pointing at the cache. The
second flag is only meaningful when the first is true; otherwise it is
ignored.

Lets operators roll the client back to upstream pulls without tearing
down cached images (which take hours to repopulate), and makes debugging
cache-specific issues safe. Propagated as an env var on the
platforma-deployer CodeBuild and as a property on TriggerPlatformaDeploy
so toggling it re-invokes the build automatically.
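The interaction of the two flags can be sketched as a small decision function. This is an illustration under assumptions: the function name is hypothetical, the flags are passed as the literal strings `true`/`false`, and the `--default-docker-registry=<value>` form is a guess at how the argument is spelled when handed to Platforma.

```shell
#!/usr/bin/env bash
# Sketch of the two-flag decision: print the extra Platforma argument only
# when both switches are on, and print nothing otherwise.
# Usage: platforma_registry_arg <enable-ecr-cache> <use-cache-for-platforma> <registry>
platforma_registry_arg() {
  local enable_cache=$1 use_for_platforma=$2 registry=$3
  # The second flag only matters when the cache resources exist.
  if [[ $enable_cache == true && $use_for_platforma == true ]]; then
    printf -- '--default-docker-registry=%s\n' "$registry"
  fi
}
```

With `EnableECRCache=true` and `UseECRCacheForPlatforma=false` the function prints nothing, which matches the rollback case described above: the cached images stay in place while Platforma pulls from upstream.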
Standalone walkthrough of the two-phase migration implemented by
infrastructure/aws/migration.sh: pre-helm DB restore + S3 sync,
helm install, post-helm cache invalidation. Includes the pod YAML
the script generates so the procedure can be run manually or wired
into Terraform. Adds a reference section covering all Cluster
Autoscaler extraArgs, Kueue Helm values + chart-level kueue.* knobs,
and the AppWrapper install (incl. webhook-deletion workaround).

…side wiring

Step 1.2.1: dedicated 'platforma-migration' service account with an IAM
policy granting read on the source dump+storage buckets and read/write on
the destination — wired into mgr-download and mgr-sync via
serviceAccountName, replacing inline AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
for the same-account case. Cross-account still supported by re-enabling
the env block.
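For the same-account case, the service account is typically tied to its IAM role through the standard IRSA annotation `eks.amazonaws.com/role-arn`. The sketch below renders such a manifest. The service account name comes from the text above; the helper name, namespace, and role ARN are placeholders, and the actual walkthrough may create the account differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: renders a ServiceAccount manifest for EKS IRSA.
# The eks.amazonaws.com/role-arn annotation names the IAM role whose
# credentials are injected into pods that use this service account.
# Usage: render_migration_sa <namespace> <role-arn>
render_migration_sa() {
  local namespace=$1 role_arn=$2
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: platforma-migration
  namespace: ${namespace}
  annotations:
    eks.amazonaws.com/role-arn: ${role_arn}
EOF
}
```

Piping the output to `kubectl apply -f -` would create the account; pods then opt in through `serviceAccountName: platforma-migration`, with no AWS access keys in their env block.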

Component Options Reference rewritten around a single question: what does
the operator have to add to the cluster so the chart's auto-rendered
ProvisioningRequestConfig + AdmissionCheck actually fire?

  - Cluster Autoscaler: --enable-provisioning-requests=true,
    --kube-api-content-type=application/json (workaround for #8855), the
    ProvisioningRequest CRD, and the ClusterRole/Binding granting
    provisioningrequests + podtemplates verbs. IAM scoping note about
    autoscaling:ResourceTag vs aws:ResourceTag.
  - Kueue: ProvisioningACC feature gate + the rendered
    ProvisioningRequestConfig / AdmissionCheck / ClusterQueue
    admissionChecksStrategy as audit reference.
  - AppWrapper: install + webhook-deletion workaround (no ProvReq config
    of its own).
  - End-to-end verification chain: AppWrapper -> Workload ->
    ProvisioningRequest -> ClusterQueue/AdmissionCheck Active.

- Add Phase 0 covering source-side database dump (curl /db/state_raw)
  and identification of the source primary-storage bucket.
- Swap aws s3 sync for s5cmd in Step 1.5 (10-30x faster for many-small-
  object buckets); document trailing-slash gotcha for SOURCE_PREFIX.
- Add Step 1.5b: S3 Batch Replication for multi-TB buckets - versioning
  prerequisites, replication IAM role, bucket replication config,
  Batch Operations IAM role, batch job creation with auto-generated
  manifest, monitoring, and teardown.
- Decision table in Step 1.5 directs operators to s5cmd or Batch
  Replication based on bucket size.
- Variables block now exports CLUSTER_NAME, AWS_ACCOUNT_ID, MI_LICENSE,
  and splits PLATFORMA_VERSION (image) from CHART_VERSION (Helm).
- Correct the idempotency claim: only the final invalidate step writes
  .migration-complete; intermediate steps gate on it.
- Gate Step 1.6 cleanup on .migration-complete for consistency.
- Comment out mgr-sync env block by default (matches mgr-download IRSA
  pattern; explicit env would override IRSA credentials).
- Drop See Also section that linked the internal core/pl repo.
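The corrected idempotency rule in the list above (intermediate steps only check `.migration-complete`; the final invalidate step is the one that writes it) can be sketched like this. The function names are hypothetical, and `MARKER_DIR` stands in for the mount point of the database PVC, whose real path is not given in this PR.

```shell
#!/usr/bin/env bash
# Sketch of marker-file gating. MARKER_DIR is an assumed stand-in for the
# database PVC mount point.
MARKER_FILE="${MARKER_DIR:-/data}/.migration-complete"

# Run an intermediate step unless the migration has already finished.
run_gated_step() {
  local name=$1; shift
  if [[ -f $MARKER_FILE ]]; then
    echo "skip $name: migration already complete"
    return 0
  fi
  "$@"
}

# Run the final step and write the marker only if it succeeded.
run_final_step() {
  local name=$1; shift
  if [[ -f $MARKER_FILE ]]; then
    echo "skip $name: migration already complete"
    return 0
  fi
  "$@" && touch "$MARKER_FILE"
}
```

Under this scheme a failure anywhere before the final step leaves no marker behind, so a re-run repeats every PVC-bound step rather than skipping a half-finished one.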
Primary storage and the database are both required for Platforma to
serve projects — restoring the DB alone isn't enough, since blocks
reference objects that must already exist in the destination bucket.
The storage sync is also the long-pole step (hours for TB-scale via
Batch Replication, longer via s5cmd).

Move the storage sync ahead of the DB download/restore in Phase 1:
- Step 1.3 = Sync Primary Storage (was 1.5)
- Step 1.3b = S3 Batch Replication (was 1.5b)
- Step 1.4 = Download DB dump (was 1.3)
- Step 1.5 = Restore DB (was 1.4)

This way operators kick off the slow sync first and run the much
faster DB steps in parallel. Updated cross-references in Phase 0,
Phase 1 intro, and Cross-Account Sync.

Batch Replication is entirely AWS-side; it does not need the
Kubernetes cluster, so it can (and should) be kicked off in Phase 0
alongside the DB dump. For multi-TB source buckets this is the
critical change: AWS copies in the background for hours while the
operator stands up the cluster, downloads/restores the DB, etc.,
instead of serializing those steps.

- Move Step 1.3b -> Step 0.6 (and substeps 0.6.1-0.6.7).
- Step 1.3 (s5cmd) now scoped explicitly to < 1 TB buckets.
- Decision table moved to Step 1.3; Phase 0 intro lists the running
  Batch job as a third artefact.
- Updated Phase 1 intro and Step 0.5 cross-refs.
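The size-based choice that the decision table encodes can be sketched as a single comparison. This is an illustration only: the function name is hypothetical, and "1 TB" is taken here as 10^12 bytes, which may not match the cutoff the walkthrough actually uses.

```shell
#!/usr/bin/env bash
# Sketch of the size-based choice between the two copy paths.
# Assumption: the 1 TB cutoff is interpreted as 10^12 bytes.
ONE_TB_BYTES=1000000000000

# Usage: choose_sync_method <bucket-size-in-bytes>
choose_sync_method() {
  local bucket_bytes=$1
  if (( bucket_bytes < ONE_TB_BYTES )); then
    echo "s5cmd"               # Step 1.3, runs from the cluster
  else
    echo "batch-replication"   # Step 0.6, runs entirely AWS-side
  fi
}
```

The bucket size itself would have to come from elsewhere, for example the `BucketSizeBytes` storage metric that S3 publishes to CloudWatch.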
Real bug: mgr-sync (Step 1.3) mounted the platforma-database PVC just
to check the .migration-complete marker, but mgr-download/restore/
invalidate also mount the same RWO PVC. Kubernetes would refuse to schedule
mgr-sync to a different node while a DB pod is running, breaking the
'runs in parallel' claim. s5cmd is already idempotent via ETag
comparison, so the marker check was redundant — drop the PVC mount.

Other fixes from the review:
- Idempotency intro now distinguishes PVC-bound steps (marker-gated)
  from Phase 0 / Step 1.3 (idempotent by underlying op).
- Step 0.4 makes the forward-reference to Step 1.2.1 explicit.
- Step 0.6.3 reworked: heading marked optional/recommended, lead with
  the rationale (safety net) instead of contradicting it in a callout.
- Step 0.6.7 comment clarifies it deletes the whole replication config.
- Drop stale 'prefer Batch Replication' tuning note in Step 1.3 — Batch
  is now Step 0.6, redirected from the decision table at the top.
- Rollback claim about untouched source qualified with Step 0.2 caveat.
- Phrasing nits (asynchronously/in-the-background redundancy, duplicate
  'long-pole' framing across sections).

This section duplicated content from infrastructure/aws/advanced-installation.md
(Cluster Autoscaler ProvisioningRequest setup, Kueue feature gates,
AppWrapper webhook workaround, end-to-end verification). The Prerequisites
table already directs readers to advanced-installation.md Steps 1-8, which
is the source of truth — keeping a parallel copy in MIGRATION.md risks
drift.

- Correct the re-run safety story for S3 Batch Replication: the
  ClientRequestToken uses $(date +%s), so create-job is NOT
  idempotent. Caller must check existing job status before re-running.
- Split the dense idempotency paragraph in the intro into a bulleted
  list of the four distinct mechanisms in play.
- Drop s3:PutInventoryConfiguration from the Batch Operations role —
  the job uses S3JobManifestGenerator, not an S3 Inventory report.
- Tighten Step 3.2's 'no-op on re-run' claim: only PVC-mounting steps
  are marker-gated; Step 1.3 and Phase 0 use other mechanisms.
- Drop redundant AWS_REGION env in mgr-sync (already covered by
  --source-region and --destination-region flags).
- Shrink mgr-sync timeout from 24h to 6h (matches the < 1 TB scope).
- 'k8s' -> 'Kubernetes'.
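Since a time-based ClientRequestToken makes every create-job call look new, the "check existing job status before re-running" rule from the first bullet above could be sketched as below. The function names are hypothetical, and the check is deliberately coarse: it looks at every unfinished S3 Batch Operations job in the account, so a real guard would probably also filter on the job description or manifest.

```shell
#!/usr/bin/env bash
# Sketch: list unfinished S3 Batch Operations jobs before creating another.
# Prints one job ID per line, or nothing if there are none.
# Usage: unfinished_batch_jobs <account-id>
unfinished_batch_jobs() {
  local account_id=$1
  aws s3control list-jobs \
    --account-id "$account_id" \
    --job-statuses New Preparing Ready Active \
    --query 'Jobs[].JobId' \
    --output text | tr '\t' '\n' | sed -e '/^$/d' -e '/^None$/d'
}

# Succeeds only when no unfinished job exists, i.e. creating one is safe.
safe_to_create_batch_job() {
  [[ -z $(unfinished_batch_jobs "$1") ]]
}
```

A re-run would then guard the create step with `safe_to_create_batch_job "$AWS_ACCOUNT_ID" && <create-job command>`, leaving the actual create-job invocation as it is in the walkthrough.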

2 participants