Adds an opt-in mechanism to route a configurable percentage of
organizations onto the compute (MicroVM) backing of their region at
trigger time, without changing their stored region settings.

Routing is gated by three global feature flags -
`computeMigrationEnabled`, `computeMigrationFreePercentage`,
`computeMigrationPaidPercentage` - plus a per-org
`computeMigrationEnabled` override that wins in both directions. A
region's compute backing is resolved from a new
`WorkerInstanceGroup.region` column: a container group and its MicroVM
group share one geo `region`, so the migration swaps the resolved worker
queue to the backing group's queue. Orgs are bucketed deterministically
by id, so ramping a percentage down keeps a strict subset rather than
reshuffling, and a region with no compute backing is never touched.
Everything is off by default - behaviour is unchanged unless the flags
are set.
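
A minimal sketch of the routing decision described above, not the actual implementation. Only the flag names and `WorkerInstanceGroup.region` come from this description; the type shapes, the `masterQueue` and `backing` fields, the FNV-1a hash, the 100-bucket scheme, and the exact precedence between the override and the global switch are assumptions for illustration.

```typescript
// Hypothetical shapes: only the flag names and `region` come from the description.
type GlobalFlags = {
  computeMigrationEnabled: boolean;
  computeMigrationFreePercentage: number; // 0..100
  computeMigrationPaidPercentage: number; // 0..100
};

type Org = {
  id: string;
  isPaid: boolean;
  computeMigrationEnabled?: boolean; // per-org override, undefined = no override
};

type WorkerInstanceGroup = {
  masterQueue: string; // assumed field name
  region: string; // geo region shared by a container group and its MicroVM group
  backing: "container" | "compute"; // assumed discriminator
};

// Stable bucket in [0, 100) derived from the org id (FNV-1a, 32-bit).
// Any stable hash would do; this one is chosen only to keep the sketch dependency-free.
function bucketFor(orgId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < orgId.length; i++) {
    hash ^= orgId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

function shouldMigrate(org: Org, flags: GlobalFlags, hasComputeBacking: boolean): boolean {
  // A region with no compute backing is never touched.
  if (!hasComputeBacking) return false;
  // Assumption: the per-org override wins in both directions, even over the global switch.
  if (org.computeMigrationEnabled !== undefined) return org.computeMigrationEnabled;
  if (!flags.computeMigrationEnabled) return false;
  const percentage = org.isPaid
    ? flags.computeMigrationPaidPercentage
    : flags.computeMigrationFreePercentage;
  // Comparing a fixed bucket against the percentage means lowering the
  // percentage can only remove orgs; it never reshuffles who is in.
  return bucketFor(org.id) < percentage;
}

// Swap the resolved queue to the compute group that shares the same geo region.
function resolveWorkerQueue(
  org: Org,
  flags: GlobalFlags,
  current: WorkerInstanceGroup,
  groups: WorkerInstanceGroup[]
): string {
  const compute = groups.find((g) => g.backing === "compute" && g.region === current.region);
  if (compute !== undefined && shouldMigrate(org, flags, true)) {
    return compute.masterQueue;
  }
  return current.masterQueue;
}
```

Because each org's bucket is fixed, the set of orgs migrated at a lower percentage is always contained in the set migrated at a higher one, which is the subset property the ramp-down behaviour relies on.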

The flags and the worker-region groups are read on the trigger hot path
from in-memory snapshots rather than the database: a small
`createReloadingRegistry` helper loads each at startup and refreshes
them on an interval, so no per-trigger query is added and a percentage
or kill-switch change propagates within the reload interval. A cold
replica whose snapshot hasn't loaded yet reads as not-migrated (the
container path) and self-corrects on the next load - the same cold-start
contract as the datastore / LLM-pricing registries, with a
`reloading_registry_loaded` metric so a never-loaded registry is
alertable.
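
The snapshot pattern above might look roughly like the following. This is a sketch under assumptions, not the real `createReloadingRegistry`: the function name `createReloadingRegistrySketch`, its option names, and the returned methods are all hypothetical, and metric emission is reduced to an `isLoaded()` accessor that a gauge such as `reloading_registry_loaded` could read.

```typescript
// Hypothetical interface; the real helper's signature is not shown in this description.
type ReloadingRegistry<T> = {
  get(): T; // synchronous read for the hot path, never hits the database
  isLoaded(): boolean; // could back a "loaded" gauge for alerting
  reload(): Promise<void>;
  start(): void;
  stop(): void;
};

function createReloadingRegistrySketch<T>(opts: {
  load: () => Promise<T>; // e.g. a database query
  fallback: T; // what a cold replica reads before the first successful load
  intervalMs: number;
}): ReloadingRegistry<T> {
  let snapshot: T = opts.fallback;
  let loaded = false;
  let timer: ReturnType<typeof setInterval> | undefined;

  async function reload(): Promise<void> {
    try {
      snapshot = await opts.load();
      loaded = true;
    } catch {
      // Assumption: on a failed load, keep serving the last good snapshot
      // (or the fallback) and let the next tick retry.
    }
  }

  return {
    get: () => snapshot,
    isLoaded: () => loaded,
    reload,
    start() {
      if (timer !== undefined) return;
      void reload(); // initial load at startup
      timer = setInterval(() => void reload(), opts.intervalMs);
    },
    stop() {
      if (timer !== undefined) {
        clearInterval(timer);
        timer = undefined;
      }
    },
  };
}
```

With a fallback that means "not migrated", a replica that has not loaded yet stays on the container path and corrects itself on the next successful reload.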

The same migration decision is consulted at deploy-time template
creation so a migrated org gets a compute template built ahead of its
first run. This runs in shadow mode (best-effort, never fails the
deploy) by default, or - when the `computeMigrationRequireTemplate` flag
is on - in required mode, built synchronously at deploy so the first run
never builds on-demand and template errors surface at deploy time.
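
The difference between the two modes can be sketched as below. The function and option names are hypothetical, and whether shadow mode awaits the build or runs it in the background is not stated here; this sketch assumes it awaits and swallows the error.

```typescript
type TemplateOutcome = "skipped" | "shadow" | "required";

// Hypothetical helper: `requireTemplate` stands in for the
// `computeMigrationRequireTemplate` flag, `buildTemplate` for the real build call.
async function ensureComputeTemplate(opts: {
  migrated: boolean;
  requireTemplate: boolean;
  buildTemplate: () => Promise<void>;
  onShadowError?: (err: unknown) => void;
}): Promise<TemplateOutcome> {
  // Orgs that are not migrated never get a compute template.
  if (!opts.migrated) return "skipped";

  if (opts.requireTemplate) {
    // Required mode: build synchronously; an error rejects and fails the deploy.
    await opts.buildTemplate();
    return "required";
  }

  // Shadow mode: best effort, a build failure is reported but never fails the deploy.
  try {
    await opts.buildTemplate();
  } catch (err) {
    opts.onShadowError?.(err);
  }
  return "shadow";
}
```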

To let operators keep track of which runs ran where while customers only
see geography, the run's actual worker queue is stored raw, and the geo
region is stamped separately on `TaskRun.region` (and a new ClickHouse
`region` column) at trigger time. Read surfaces - the dashboard, the
API, and the Query/Logs page - show the geo region, falling back to the
worker queue for runs written before the column existed.
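
The read-side fallback can be sketched as a single helper. The `workerQueue` field name and the treatment of an empty string as "no region" (for example a column default on older rows) are assumptions; only `TaskRun.region` is named in this description.

```typescript
// Hypothetical read model covering only the two fields this sketch needs.
type RunRegionFields = {
  workerQueue: string; // raw queue the run actually used (operator-facing)
  region?: string | null; // geo region stamped at trigger time, absent on older runs
};

// Prefer the geo region; fall back to the worker queue for runs written
// before the region column existed.
function displayRegion(run: RunRegionFields): string {
  return run.region != null && run.region !== "" ? run.region : run.workerQueue;
}
```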

Minor follow-ups left out of scope: the percentage flags render as text
inputs on the admin flags page (the catalog UI has no numeric control
type yet), and `createReloadingRegistry` could later gain pub/sub for
sub-second cross-replica propagation if the reload interval proves too
slow.