feat(cloudrun): Cloud Run hub deployment with co-located GKE broker by scion-gteam[bot] · Pull Request #145 · ptone/scion

scion-gteam · 2026-06-05T22:49:01Z

Summary

Adds scripts and config for deploying the Scion hub as a Cloud Run service (min=max=1, SQLite, authenticated-only) with a co-located GKE broker targeting a GKE Autopilot cluster. Also adds /health endpoint alias for Cloud Run compatibility.

Reshaping of upstream PR GoogleCloudPlatform#307 — this PR contains only the pure Cloud Run deployment infrastructure. The dual-layer auth / OIDC transport work from GoogleCloudPlatform#307 has been separated out and is now handled by PR GoogleCloudPlatform#310 (scion/auth-proxy-mode), which provides a more comprehensive pluggable transport layer.

What's included

scripts/cloudrun/ — End-to-end deployment:
- Dockerfile: 3-stage build (Node web assets → Go binary → slim runtime with gke-gcloud-auth-plugin)
- deploy.sh: Creates service account, builds image via Docker, generates kubeconfig from live cluster (endpoint+CA, ADC handles credentials), stores secrets in Secret Manager, deploys Cloud Run service
- entrypoint.sh: Copies secret-mounted settings via cat to bypass symlink limitations
- hub-settings-template.yaml: Hub config with session secret placeholder, SQLite, GKE broker profile
- README.md: Architecture diagram, prerequisites, quick start, configuration reference
/health endpoint alias (pkg/hub/web.go, pkg/hub/auth.go) — Cloud Run's Google Frontend intercepts /healthz before it reaches the container and returns 404. /health is an unauthenticated alias that passes through.
hubclient /health fallback (pkg/hubclient/client.go) — Falls back to /health when /healthz returns 404, for backward-compatible Cloud Run support.
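
The client-side fallback described above can be sketched as follows. This is a minimal illustration, not the code in pkg/hubclient/client.go: the function name `checkHealth` and the test server `fakeCloudRun` are hypothetical, and the only behavior taken from the description is "try /healthz, fall back to /health on 404".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkHealth (hypothetical name) probes /healthz first and moves on to
// /health only when the probe answers 404. It returns the path that
// answered 200.
func checkHealth(client *http.Client, baseURL string) (string, error) {
	for _, path := range []string{"/healthz", "/health"} {
		resp, err := client.Get(baseURL + path)
		if err != nil {
			return "", err
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK:
			return path, nil
		case http.StatusNotFound:
			continue // try the next candidate path
		default:
			return "", fmt.Errorf("%s returned %d", path, resp.StatusCode)
		}
	}
	return "", fmt.Errorf("no health endpoint answered")
}

// fakeCloudRun (hypothetical) imitates a frontend that answers 404 for
// /healthz while letting /health reach the container.
func fakeCloudRun() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Any other path, including /healthz, gets the mux's default 404.
	return httptest.NewServer(mux)
}

func main() {
	srv := fakeCloudRun()
	defer srv.Close()

	path, err := checkHealth(srv.Client(), srv.URL)
	fmt.Println(path, err)
}
```

Falling back only on 404 keeps other failures (5xx, auth errors) visible instead of masking them behind a second probe.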

What was separated out (now in PR GoogleCloudPlatform#310)

- OIDC identity token transport (pkg/sciontool/hub/oidc.go): PR GoogleCloudPlatform/scion#310 ("feat(hub): auth proxy mode (Google IAP)") reimplemented this with a pluggable architecture supporting both hub-minted (injected) and metadata-server token sources.
- Dual-layer auth concept: PR GoogleCloudPlatform/scion#310 provides the full auth proxy mode with IAP JWT verification and transport token minting.

Test plan

- go build ./... passes
- GET /health returns 200 without auth; GET /healthz returns 200 with auth
- ./scripts/cloudrun/deploy.sh deploys successfully
- GKE broker pod schedules agents on target cluster
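
The health-endpoint item in the test plan can be exercised against a small stand-in router. The sketch below is an assumption about the shape of the change, not the code in pkg/hub/web.go or pkg/hub/auth.go: `publicPaths`, `requireAuth`, and `newRouter` are hypothetical names, the auth check is reduced to "an Authorization header is present", and the 401 for an unauthenticated /healthz is inferred from the test plan rather than stated in it.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// publicPaths (hypothetical) lists routes served without authentication.
var publicPaths = map[string]bool{"/health": true}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "ok")
}

// requireAuth is a simplified stand-in for the hub's auth middleware: it
// lets public paths through and rejects other requests that carry no
// Authorization header.
func requireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !publicPaths[r.URL.Path] && r.Header.Get("Authorization") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// newRouter registers /health as an alias of /healthz behind the middleware.
func newRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthHandler)
	mux.HandleFunc("/health", healthHandler)
	return requireAuth(mux)
}

// status issues an in-memory GET and returns the response code.
func status(h http.Handler, path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newRouter()
	fmt.Println(status(h, "/health", ""))             // 200
	fmt.Println(status(h, "/healthz", ""))            // 401
	fmt.Println(status(h, "/healthz", "Bearer demo")) // 200
}
```

This only models the container's own routing; on Cloud Run the /healthz request would not reach the container at all.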

…GoogleCloudPlatform#303)

* fix: atomic session-guarded broker disconnect to prevent reconnect race (#131)

  The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects rapidly, the stale disconnect's offline stamp can clobber the new connection's online status because UpdateRuntimeBrokerHeartbeat has no session guard; it unconditionally overwrites status. Provider statuses are also clobbered and never restored by heartbeats, leaving the broker permanently invisible until hub restart.

  Add ReleaseAndMarkBrokerOffline, which atomically clears affinity AND stamps status=offline in a single CAS write. If a concurrent reconnect has already claimed the broker with a new session, the compare fails and the callback is a no-op. Also add a re-check guard before updating provider statuses.

* docs: add project log for broker disconnect race fix unification
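
The session guard described in this commit can be illustrated with an in-memory compare-and-swap. This is a sketch under stated assumptions: the `store` and `brokerRecord` types and the method names are hypothetical, and a mutex stands in for what would presumably be a single conditional database write in the real hub.

```go
package main

import (
	"fmt"
	"sync"
)

// brokerRecord is a hypothetical, reduced view of a broker row.
type brokerRecord struct {
	SessionID string
	Status    string
}

type store struct {
	mu      sync.Mutex
	brokers map[string]*brokerRecord
}

// connect claims the broker for a new session and marks it online.
func (s *store) connect(brokerID, sessionID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.brokers[brokerID] = &brokerRecord{SessionID: sessionID, Status: "online"}
}

// releaseAndMarkOffline clears the session and stamps status=offline only if
// the stored session still matches the caller's session. It reports whether
// the write was applied. In a SQL store the same guard would be a WHERE
// clause on the session column (an assumption, not taken from the PR).
func (s *store) releaseAndMarkOffline(brokerID, sessionID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.brokers[brokerID]
	if !ok || b.SessionID != sessionID {
		return false // a newer session owns the broker; do nothing
	}
	b.SessionID = ""
	b.Status = "offline"
	return true
}

func (s *store) status(brokerID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if b, ok := s.brokers[brokerID]; ok {
		return b.Status
	}
	return ""
}

func main() {
	s := &store{brokers: map[string]*brokerRecord{}}
	s.connect("b1", "session-1")
	s.connect("b1", "session-2") // rapid reconnect wins the broker

	// The stale disconnect callback for session-1 fires late.
	applied := s.releaseAndMarkOffline("b1", "session-1")
	fmt.Println(applied, s.status("b1")) // false online
}
```

Because the check and the write happen under one lock, the stale callback cannot interleave with the reconnect, which is the property the two separate calls lacked.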

…rm#301)

* docs(design): reduced resource clone/delete design (resolved review)

* refactor: remove dead Locked field from Template and HarnessConfig models

  Remove the Locked bool field, all 16 enforcement sites across 6 handler files, the force query parameter from delete endpoints, and 3 locked-template tests, and add a DB migration to drop the column. No production code ever set Locked=true, so this simplifies the handlers for the upcoming clone/delete feature.

* feat: add harness-config clone endpoint, authz hardening, and slug uniqueness

  - Add handleHarnessConfigClone mirroring template clone
  - Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone
  - Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id)
  - Return 409 Conflict on slug collision during clone
  - Add clone failure cleanup
  - Add tests for clone, authz, and slug collision

* feat(web): add Clone/Delete row actions and clone-from-global to resource list

  - Add Clone and Delete action menu to shared resource-list component
  - Add delete confirmation dialog with deleteFiles checkbox (default on)
  - Add clone dialog with name input and 409 collision handling
  - Add clone-from-global picker in project settings view
  - Unify on resource-changed event (migrate resource-imported)
  - Gate actions on capabilities (canClone, canDelete properties)

* fix: address PR review - cleanup orphaned files on DB create failure, remove redundant clone method

  - Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails after files were already copied (prevents orphaned storage files)
  - Remove redundant confirmCloneFromGlobal method; confirmClone already handles cross-scope clone via the component's scope/scopeId properties

* fix: adapt Locked removal and slug constraint to Ent-based schema

  Remove Locked references from entadapter, remove stale sqlite.go (replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id) to Ent schema indexes, and regenerate Ent code.

* fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked)

  - Use api.NewUUID() for all test entity IDs (Ent enforces UUID format)
  - Remove Locked field from entadapter create/update calls
  - Remove stale sqlite.go (replaced by Ent ORM upstream)
  - Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
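
The interaction between the (slug, scope, scope_id) uniqueness constraint, the 409 response, and the orphaned-file cleanup can be sketched with an in-memory store. Everything here is hypothetical: `templateStore`, `slugKey`, the file path layout, and the 201 success status are illustrative assumptions, and a map lookup stands in for the database constraint.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// slugKey models the UNIQUE(slug, scope, scope_id) index as a map key.
type slugKey struct{ Slug, Scope, ScopeID string }

var errSlugTaken = errors.New("slug already exists in scope")

// templateStore is a hypothetical in-memory stand-in for the DB plus the
// file storage backend.
type templateStore struct {
	bySlug map[slugKey]string // unique index: key -> template ID
	files  map[string]bool    // stored file paths
}

func (s *templateStore) create(id string, k slugKey) error {
	if _, exists := s.bySlug[k]; exists {
		return errSlugTaken
	}
	s.bySlug[k] = id
	return nil
}

// deletePrefix removes every stored file whose path starts with prefix.
func (s *templateStore) deletePrefix(prefix string) {
	for name := range s.files {
		if strings.HasPrefix(name, prefix) {
			delete(s.files, name)
		}
	}
}

// clone copies files first, then creates the record; if the create fails,
// it removes the copied files and maps a slug collision to 409 Conflict.
func (s *templateStore) clone(newID string, k slugKey) int {
	prefix := "templates/" + newID + "/"
	s.files[prefix+"config.yaml"] = true // stand-in for copying source files
	if err := s.create(newID, k); err != nil {
		s.deletePrefix(prefix) // avoid orphaned storage files
		if errors.Is(err, errSlugTaken) {
			return http.StatusConflict
		}
		return http.StatusInternalServerError
	}
	return http.StatusCreated
}

func main() {
	s := &templateStore{bySlug: map[slugKey]string{}, files: map[string]bool{}}
	k := slugKey{Slug: "default", Scope: "project", ScopeID: "p1"}
	fmt.Println(s.clone("id-1", k), len(s.files)) // 201 1
	fmt.Println(s.clone("id-2", k), len(s.files)) // 409 1
}
```

The second clone leaves the file count unchanged, which is the point of the cleanup: a rejected clone should not leave copied files behind.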

…form#309)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

  OAuth login behind the load balancer intermittently failed with state_mismatch: the CSRF state token (and the entire web session) was stored in a gorilla FilesystemStore on the handling replica's local disk, while the browser only carried a session-ID cookie. When the LB routed /auth/login and /auth/callback to different replicas, the callback replica had no matching session file -> empty state -> state_mismatch. It only "worked" when both hops happened to hit the same backend.

  The same flaw affected the post-login session: sessionToBearerMiddleware reads the Hub access/refresh JWTs from that disk-local store on every API request, so sessions silently dropped whenever a follow-up request landed on a different replica.

  Replace the FilesystemStore with an encrypted, signed gorilla CookieStore so the whole session lives in the client's cookie and any replica sharing SESSION_SECRET can read it. Keys are derived deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte AES-256 encryption key, domain-separated). No DB, no migration; works with N replicas.

  The original switch to disk was motivated by a "JWT tokens exceed 4096 bytes" concern. Measured against the current compact HS256 tokens the full session (identity + access + refresh) encodes to ~2.6 KB, well under the browser's ~4 KB per-cookie cap, so the securecookie length limit is left in force (oversize would now error+log, not silently drop).

  Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica round-trip regression test (cookie minted by replica A decodes on replica B with the same secret; carries OAuth state + post-login tokens) plus a negative test (a different secret cannot decode the cookie).
* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

  The cookie-store fix (0515e2a) made the web session replica-portable, but the Hub JWT *inside* the cookie is still signed with a per-replica key: ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and hubID = sha256(hostname)[:12]. The integration env runs two replicas of one logical hub behind a single LB, sharing one Postgres DB and one SESSION_SECRET but with different hostnames -> different hubIDs -> different HS256 signing keys.

  So a user JWT minted on replica A failed signature verification on replica B (go-jose: error in cryptographic primitive); refresh failed too (refresh token signed with the same foreign key), so sessionToBearerMiddleware declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1) and returned session_expired. The cookie deletion turns it into a redirect loop: dashboard flashes, then /login?error=session_expired.

  Fix: extend the 0515e2a approach (replica-portable via the shared secret) from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret; when set, ensureSigningKey derives the agent and user signing keys deterministically from it (domain-separated by key name) and bypasses per-host secret-backend storage. cmd feeds the same --session-secret / SESSION_SECRET value into both the web cookie store and the hub config via a new resolveSessionSecret() helper. Empty secret keeps the existing per-hub behavior (no regression for single-node/local dev).

  Tests: cross-replica round trip (different hubID + same secret -> identical keys, token minted on A validates on B; different secret cannot) plus pre-configured-key precedence.

  Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so existing web/CLI tokens are invalidated once and users re-login.

---------

Co-authored-by: Scion <agent@scion.dev>

Add scripts and config for deploying the Scion hub as a Cloud Run service (min=max=1, SQLite, authenticated-only) with a co-located GKE broker targeting a GKE Autopilot cluster.

What's included:

- scripts/cloudrun/: Dockerfile (3-stage: Node web → Go binary → slim runtime), deploy.sh (service account, Cloud Build, kubeconfig from live cluster, Secret Manager, Cloud Run deploy), entrypoint.sh (symlink-safe secret copy), hub-settings-template.yaml, README.
- /health endpoint alias: Cloud Run's Google Frontend intercepts /healthz before it reaches the container; /health passes through. Added as route, public route, and health endpoint check.
- hubclient /health fallback: falls back to /health when /healthz returns 404 for Cloud Run compatibility.

Auth transport (OIDC identity token for agents) is handled separately by the auth-proxy-mode PR (GoogleCloudPlatform#310), which provides a more comprehensive pluggable transport layer.

ptone and others added 4 commits June 5, 2026 13:54

scion-gteam[bot] wants to merge 4 commits into main from scion/cloudrun-deploy
