feat(cloudrun): Cloud Run hub deployment with co-located GKE broker #145

Open
scion-gteam[bot] wants to merge 4 commits into main from scion/cloudrun-deploy

scion-gteam[bot] commented Jun 5, 2026

Summary

Adds scripts and config for deploying the Scion hub as a Cloud Run service (min=max=1 instance, SQLite, authenticated-only) with a co-located GKE broker targeting a GKE Autopilot cluster. Also adds a /health endpoint alias for Cloud Run compatibility.

This is a reshaping of upstream PR GoogleCloudPlatform#307 that contains only the Cloud Run deployment infrastructure. The dual-layer auth / OIDC transport work from GoogleCloudPlatform#307 has been separated out and is now handled by PR GoogleCloudPlatform#310 (scion/auth-proxy-mode), which provides a more comprehensive pluggable transport layer.

What's included

  • scripts/cloudrun/ — End-to-end deployment:

    • Dockerfile: 3-stage build (Node web assets → Go binary → slim runtime with gke-gcloud-auth-plugin)
    • deploy.sh: Creates service account, builds image via Docker, generates kubeconfig from live cluster (endpoint+CA, ADC handles credentials), stores secrets in Secret Manager, deploys Cloud Run service
    • entrypoint.sh: Copies secret-mounted settings via cat to bypass symlink limitations
    • hub-settings-template.yaml: Hub config with session secret placeholder, SQLite, GKE broker profile
    • README.md: Architecture diagram, prerequisites, quick start, configuration reference
  • /health endpoint alias (pkg/hub/web.go, pkg/hub/auth.go) — Cloud Run's Google Frontend intercepts /healthz before it reaches the container and returns 404. /health is an unauthenticated alias that passes through.

  • hubclient /health fallback (pkg/hubclient/client.go) — Falls back to /health when /healthz returns 404, for backward-compatible Cloud Run support.
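
The /health alias and the client fallback described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in pkg/hub/web.go or pkg/hubclient/client.go: the names newHubMux, frontend, checkHealth, and probe are hypothetical, the frontend wrapper merely simulates a proxy that answers /healthz with 404, and the auth check on /healthz is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHubMux registers one handler under both /healthz and the /health alias.
// (Hypothetical helper; the real routes also differ in their auth handling.)
func newHubMux() *http.ServeMux {
	mux := http.NewServeMux()
	health := func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "ok")
	}
	mux.HandleFunc("/healthz", health)
	mux.HandleFunc("/health", health)
	return mux
}

// frontend simulates a proxy that intercepts /healthz and returns 404
// before the request reaches the container.
func frontend(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/healthz" {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// checkHealth tries /healthz first and falls back to /health only on a 404.
// It returns the path that answered with 200.
func checkHealth(c *http.Client, baseURL string) (string, error) {
	for _, path := range []string{"/healthz", "/health"} {
		resp, err := c.Get(baseURL + path)
		if err != nil {
			return "", err
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return path, nil
		}
		if resp.StatusCode != http.StatusNotFound {
			return "", fmt.Errorf("%s returned %d", path, resp.StatusCode)
		}
	}
	return "", fmt.Errorf("no health endpoint responded")
}

// probe starts a throwaway test server and reports which path answered.
func probe(behindFrontend bool) (string, error) {
	var h http.Handler = newHubMux()
	if behindFrontend {
		h = frontend(h)
	}
	srv := httptest.NewServer(h)
	defer srv.Close()
	return checkHealth(srv.Client(), srv.URL)
}

func main() {
	direct, _ := probe(false)
	proxied, _ := probe(true)
	fmt.Println(direct, proxied) // prints "/healthz /health"
}
```

Falling back only on 404 keeps existing non-Cloud Run hubs on /healthz and avoids masking other failures (for example a 401 or 500) behind the alias.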

What was separated out (now in PR GoogleCloudPlatform#310)

  • The dual-layer auth / OIDC transport work from GoogleCloudPlatform#307, now handled by the pluggable transport layer in scion/auth-proxy-mode.

Test plan

  • go build ./... passes
  • GET /health returns 200 without auth; GET /healthz returns 200 with auth
  • ./scripts/cloudrun/deploy.sh deploys successfully
  • GKE broker pod schedules agents on target cluster

ptone and others added 4 commits June 5, 2026 13:54
…GoogleCloudPlatform#303)

* fix: atomic session-guarded broker disconnect to prevent reconnect race (#131)

The onDisconnect callback previously used separate ReleaseRuntimeBrokerConnection
and UpdateRuntimeBrokerHeartbeat calls. When a broker disconnects and reconnects
rapidly, the stale disconnect's offline stamp can clobber the new connection's
online status because UpdateRuntimeBrokerHeartbeat has no session guard — it
unconditionally overwrites status. Provider statuses are also clobbered and never
restored by heartbeats, leaving the broker permanently invisible until hub restart.

Add ReleaseAndMarkBrokerOffline which atomically clears affinity AND stamps
status=offline in a single CAS write. If a concurrent reconnect has already
claimed the broker with a new session, the compare fails and the callback is
a no-op. Also add a re-check guard before updating provider statuses.
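
A minimal sketch of the session-guarded compare-and-set described above, assuming an in-memory store protected by a mutex (the real store, its schema, and the signature of ReleaseAndMarkBrokerOffline are not shown in this PR; the types and method signatures below are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// broker holds the fields the guard needs: the owning session and status.
type broker struct {
	sessionID string
	status    string
}

// store is a hypothetical in-memory stand-in for the hub's broker table.
type store struct {
	mu      sync.Mutex
	brokers map[string]*broker
}

func newStore() *store { return &store{brokers: map[string]*broker{}} }

// Connect claims the broker for a new session and marks it online.
func (s *store) Connect(id, session string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.brokers[id] = &broker{sessionID: session, status: "online"}
}

// ReleaseAndMarkBrokerOffline clears the session affinity and stamps
// status=offline in one step, but only if the broker is still owned by the
// given session. It reports whether the write was applied.
func (s *store) ReleaseAndMarkBrokerOffline(id, session string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.brokers[id]
	if !ok || b.sessionID != session {
		return false // a newer session owns the broker: stale disconnect is a no-op
	}
	b.sessionID = ""
	b.status = "offline"
	return true
}

// Status returns the broker's current status, or "" if it is unknown.
func (s *store) Status(id string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if b, ok := s.brokers[id]; ok {
		return b.status
	}
	return ""
}

func main() {
	s := newStore()
	s.Connect("broker-1", "session-A")
	s.Connect("broker-1", "session-B") // rapid reconnect with a new session

	// The stale disconnect for session-A must not clobber session-B.
	fmt.Println(s.ReleaseAndMarkBrokerOffline("broker-1", "session-A"), s.Status("broker-1")) // false online
	// The disconnect for the current session does take effect.
	fmt.Println(s.ReleaseAndMarkBrokerOffline("broker-1", "session-B"), s.Status("broker-1")) // true offline
}
```

In a database-backed store the same guard would typically be a single conditional UPDATE keyed on the session ID, so the compare and the write cannot be interleaved with a reconnect.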

* docs: add project log for broker disconnect race fix unification
…rm#301)

* docs(design): reduced resource clone/delete design (resolved review)

* refactor: remove dead Locked field from Template and HarnessConfig models

Remove the Locked bool field, all 16 enforcement sites across 6 handler
files, the force query parameter from delete endpoints, 3 locked-template
tests, and add a DB migration to drop the column. No production code ever
set Locked=true — this simplifies the handlers for the upcoming clone/delete
feature.

* feat: add harness-config clone endpoint, authz hardening, and slug uniqueness

- Add handleHarnessConfigClone mirroring template clone
- Add CheckAccess authz to deleteTemplateV2, handleTemplateClone, deleteHarnessConfig, handleHarnessConfigClone
- Add DB migration V55: UNIQUE constraint on (slug, scope, scope_id)
- Return 409 Conflict on slug collision during clone
- Add clone failure cleanup
- Add tests for clone, authz, and slug collision
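
For illustration, the slug-collision behavior can be sketched as below. This is a simplified in-memory model, not the handler code: configStore, slugKey, and cloneStatus are hypothetical names, the 201 success status is an assumption, and in the real system the uniqueness is enforced by the database constraint rather than a map.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// slugKey mirrors the UNIQUE(slug, scope, scope_id) constraint.
type slugKey struct{ slug, scope, scopeID string }

var errSlugConflict = errors.New("slug already exists in scope")

// configStore is a hypothetical, non-concurrent stand-in for the DB table.
type configStore struct{ taken map[slugKey]bool }

func newConfigStore() *configStore { return &configStore{taken: map[slugKey]bool{}} }

// create rejects a second row with the same (slug, scope, scope_id).
func (s *configStore) create(slug, scope, scopeID string) error {
	k := slugKey{slug, scope, scopeID}
	if s.taken[k] {
		return errSlugConflict
	}
	s.taken[k] = true
	return nil
}

// cloneStatus maps the store result to the HTTP status a clone handler
// would return: 409 Conflict on a slug collision.
func cloneStatus(s *configStore, slug, scope, scopeID string) int {
	err := s.create(slug, scope, scopeID)
	switch {
	case errors.Is(err, errSlugConflict):
		return http.StatusConflict
	case err != nil:
		return http.StatusInternalServerError
	}
	return http.StatusCreated // assumed success status
}

func main() {
	s := newConfigStore()
	fmt.Println(cloneStatus(s, "default", "project", "p1")) // 201
	fmt.Println(cloneStatus(s, "default", "project", "p1")) // 409
	fmt.Println(cloneStatus(s, "default", "project", "p2")) // 201: different scope_id
}
```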

* feat(web): add Clone/Delete row actions and clone-from-global to resource list

- Add Clone and Delete action menu to shared resource-list component
- Add delete confirmation dialog with deleteFiles checkbox (default on)
- Add clone dialog with name input and 409 collision handling
- Add clone-from-global picker in project settings view
- Unify on resource-changed event (migrate resource-imported)
- Gate actions on capabilities (canClone, canDelete properties)

* fix: address PR review — cleanup orphaned files on DB create failure, remove redundant clone method

- Add stor.DeletePrefix cleanup when CreateTemplate/CreateHarnessConfig fails
  after files were already copied (prevents orphaned storage files)
- Remove redundant confirmCloneFromGlobal method — confirmClone already
  handles cross-scope clone via the component's scope/scopeId properties

* fix: adapt Locked removal and slug constraint to Ent-based schema

Remove Locked references from entadapter, remove stale sqlite.go
(replaced by Ent ORM upstream), add UNIQUE(slug, scope, scope_id)
to Ent schema indexes, and regenerate Ent code.

* fix: adapt tests and entadapter for Ent-based store (UUID IDs, no Locked)

- Use api.NewUUID() for all test entity IDs (Ent enforces UUID format)
- Remove Locked field from entadapter create/update calls
- Remove stale sqlite.go (replaced by Ent ORM upstream)
- Add UNIQUE(slug, scope, scope_id) to Ent schema indexes
…form#309)

* fix(hub): make web session replica-portable to fix OAuth state_mismatch

OAuth login behind the load balancer intermittently failed with
state_mismatch: the CSRF state token (and the entire web session) was
stored in a gorilla FilesystemStore on the handling replica's local
disk, while the browser only carried a session-ID cookie. When the LB
routed /auth/login and /auth/callback to different replicas, the
callback replica had no matching session file -> empty state ->
state_mismatch. It only "worked" when both hops happened to hit the
same backend.

The same flaw affected the post-login session: sessionToBearerMiddleware
reads the Hub access/refresh JWTs from that disk-local store on every API
request, so sessions silently dropped whenever a follow-up request
landed on a different replica.

Replace the FilesystemStore with an encrypted, signed gorilla
CookieStore so the whole session lives in the client's cookie and any
replica sharing SESSION_SECRET can read it. Keys are derived
deterministically from SESSION_SECRET (32-byte HMAC auth key + 32-byte
AES-256 encryption key, domain-separated). No DB, no migration; works
with N replicas.
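
One way to derive two domain-separated 32-byte keys from a single secret is sketched below. The commit does not spell out the derivation, so the use of HMAC-SHA256 over a label, the label strings, and the function names deriveKey, cookieKeys, and sameKeys are all assumptions made for illustration.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// deriveKey returns a 32-byte key bound to both the secret and a label.
// Different labels yield independent keys (domain separation).
func deriveKey(secret, label string) []byte {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(label))
	return mac.Sum(nil)
}

// cookieKeys derives the HMAC auth key and the AES-256 encryption key.
// The label strings are hypothetical.
func cookieKeys(secret string) (authKey, encKey []byte) {
	return deriveKey(secret, "web-session/auth"), deriveKey(secret, "web-session/enc")
}

// sameKeys reports whether two replicas configured with the given secrets
// would end up with identical cookie keys.
func sameKeys(secretA, secretB string) bool {
	authA, encA := cookieKeys(secretA)
	authB, encB := cookieKeys(secretB)
	return hmac.Equal(authA, authB) && hmac.Equal(encA, encB)
}

func main() {
	auth, enc := cookieKeys("example-session-secret")
	fmt.Println(len(auth), len(enc))                // 32 32
	fmt.Println(hmac.Equal(auth, enc))              // false: the two keys are distinct
	fmt.Println(sameKeys("shared", "shared"))       // true: any replica derives the same keys
	fmt.Println(sameKeys("shared", "other-secret")) // false
}
```

The resulting pair has the shape gorilla's sessions.NewCookieStore expects (an authentication key followed by an encryption key, where a 32-byte encryption key selects AES-256).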

The original switch to disk was motivated by a "JWT tokens exceed 4096
bytes" concern. Measured against the current compact HS256 tokens the
full session (identity + access + refresh) encodes to ~2.6 KB, well
under the browser's ~4 KB per-cookie cap, so the securecookie length
limit is left in force (oversize would now error+log, not silently drop).

Tests: replace the obsolete NoMaxLengthLimit test with a cross-replica
round-trip regression test (cookie minted by replica A decodes on
replica B with the same secret; carries OAuth state + post-login tokens)
plus a negative test (a different secret cannot decode the cookie).

* fix(hub): derive JWT signing keys from shared SESSION_SECRET to fix cross-replica login loop

The cookie-store fix (0515e2a) made the web session replica-portable, but
the Hub JWT *inside* the cookie is still signed with a per-replica key:
ensureSigningKey scopes signing keys to (scope=hub, scope_id=hubID) and
hubID = sha256(hostname)[:12]. The integration env runs two replicas of one
logical hub behind a single LB, sharing one Postgres DB and one
SESSION_SECRET but with different hostnames -> different hubIDs -> different
HS256 signing keys.

So a user JWT minted on replica A failed signature verification on replica B
(go-jose: error in cryptographic primitive); refresh failed too (refresh
token signed with the same foreign key), so sessionToBearerMiddleware
declared the session irrecoverably invalid, DELETED the cookie (MaxAge=-1)
and returned session_expired. The cookie deletion turns it into a redirect
loop: dashboard flashes, then /login?error=session_expired.

Fix: extend the 0515e2a approach (replica-portable via the shared secret)
from the cookie to the keys inside it. Add ServerConfig.SharedSigningSecret;
when set, ensureSigningKey derives the agent and user signing keys
deterministically from it (domain-separated by key name) and bypasses
per-host secret-backend storage. cmd feeds the same --session-secret /
SESSION_SECRET value into both the web cookie store and the hub config via a
new resolveSessionSecret() helper. Empty secret keeps the existing per-hub
behavior (no regression for single-node/local dev).
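
The difference between host-scoped and secret-derived signing keys can be sketched as follows. This is a simplified model under stated assumptions: hubID is taken to be the first 12 hex characters of sha256(hostname), a bare HMAC-SHA256 tag stands in for an HS256 JWT, and signingKey, sign, verify, and validatesAcrossReplicas are hypothetical names rather than the hub's actual functions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hubID models the per-host identifier (assumed: first 12 hex characters
// of sha256(hostname)), which differs between replicas.
func hubID(hostname string) string {
	sum := sha256.Sum256([]byte(hostname))
	return hex.EncodeToString(sum[:])[:12]
}

// signingKey derives a key from the shared secret, domain-separated by
// key name (for example "agent" or "user"), independent of the hostname.
func signingKey(sharedSecret, keyName string) []byte {
	mac := hmac.New(sha256.New, []byte(sharedSecret))
	mac.Write([]byte("signing-key/" + keyName))
	return mac.Sum(nil)
}

// sign returns a hex-encoded HMAC-SHA256 tag over the payload.
func sign(key []byte, payload string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify checks the tag in constant time.
func verify(key []byte, payload, sig string) bool {
	want, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return hmac.Equal(mac.Sum(nil), want)
}

// validatesAcrossReplicas mints a token on a replica configured with
// secretA and verifies it on a replica configured with secretB.
func validatesAcrossReplicas(secretA, secretB string) bool {
	sig := sign(signingKey(secretA, "user"), "sub=alice")
	return verify(signingKey(secretB, "user"), "sub=alice", sig)
}

func main() {
	// Host-derived IDs differ per replica, so keys scoped to them differ too.
	fmt.Println(hubID("replica-a"), hubID("replica-b"))
	fmt.Println(validatesAcrossReplicas("shared", "shared")) // true
	fmt.Println(validatesAcrossReplicas("shared", "other"))  // false
}
```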

Tests: cross-replica round trip (different hubID + same secret -> identical
keys, token minted on A validates on B; different secret cannot) plus
pre-configured-key precedence.

Note: rollout rotates the signing keys (now derived from SESSION_SECRET), so
existing web/CLI tokens are invalidated once and users re-login.

---------

Co-authored-by: Scion <agent@scion.dev>
Add scripts and config for deploying the Scion hub as a Cloud Run
service (min=max=1, SQLite, authenticated-only) with a co-located
GKE broker targeting a GKE Autopilot cluster.

What's included:
- scripts/cloudrun/: Dockerfile (3-stage: Node web → Go binary → slim
  runtime), deploy.sh (service account, Cloud Build, kubeconfig from
  live cluster, Secret Manager, Cloud Run deploy), entrypoint.sh
  (symlink-safe secret copy), hub-settings-template.yaml, README.
- /health endpoint alias: Cloud Run's Google Frontend intercepts
  /healthz before it reaches the container; /health passes through.
  Added as route, public route, and health endpoint check.
- hubclient /health fallback: falls back to /health when /healthz
  returns 404 for Cloud Run compatibility.

Auth transport (OIDC identity token for agents) is handled separately
by the auth-proxy-mode PR (GoogleCloudPlatform#310) which provides a more comprehensive
pluggable transport layer.