automation: add cloud run deploy pattern for hub by ptone · Pull Request #330 · GoogleCloudPlatform/scion

ptone · 2026-06-06T16:42:17Z

initial MVP scale of Cloud Run hub deploy

google-cla · 2026-06-06T16:42:27Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gemini-code-assist

Code Review

This pull request implements Phase 1 of the Auth Proxy Mode, enabling Google IAP signed-header authentication. It introduces the IAPAuthenticator with JWKS caching, extracts a shared provisionUser method to deduplicate user provisioning across OAuth and proxy paths, and adds Cloud Run deployment scripts. Feedback on the changes highlights a critical performance bottleneck where a write lock is held during external HTTP requests in the JWKS cache, a potential hang due to the lack of a timeout on the default HTTP client, and a memory leak in the proxy user cache where expired entries are never evicted.

gemini-code-assist · 2026-06-06T16:44:25Z

+func (c *jwksCache) refresh() error {
+	c.mu.Lock()
+	defer c.mu.Unlock()
+
+	// Debounce: don't re-fetch if we just fetched very recently
+	if time.Since(c.lastFetched) < 5*time.Second {
+		return nil
+	}
+
+	resp, err := c.client.Get(c.url)
+	if err != nil {
+		slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err)
+		return err
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
+		slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body))
+		return fmt.Errorf("jwks fetch returned %d", resp.StatusCode)
+	}
+
+	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit
+	if err != nil {
+		return fmt.Errorf("jwks read body: %w", err)
+	}
+
+	var jwks jose.JSONWebKeySet
+	if err := json.Unmarshal(body, &jwks); err != nil {
+		return fmt.Errorf("jwks parse: %w", err)
+	}
+
+	newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys))
+	for _, k := range jwks.Keys {
+		if k.KeyID != "" {
+			newKeys[k.KeyID] = k
+		}
+	}
+
+	c.keys = newKeys
+	c.lastFetched = time.Now()
+
+	slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys))
+	return nil
+}


Holding a write lock (c.mu.Lock()) across an external network request (c.client.Get(c.url)) is a major performance bottleneck and availability risk. If the JWKS endpoint is slow or hangs, all other authentication requests will block indefinitely on c.mu.RLock() in GetKey.

Additionally, if we simply release the lock during the HTTP request, multiple concurrent requests could trigger duplicate HTTP requests (cache stampede) because c.lastFetched is only updated after the request completes.

To resolve both issues, we can introduce an isFetching boolean flag to track active fetches, release the lock during the HTTP request, and re-acquire it only when updating the cache.

func (c *jwksCache) refresh() error { c.mu.Lock() if c.isFetching { c.mu.Unlock() return nil } // Debounce: don't re-fetch if we just fetched very recently if time.Since(c.lastFetched) < 5*time.Second { c.mu.Unlock() return nil } c.isFetching = true c.mu.Unlock() defer func() { c.mu.Lock() c.isFetching = false c.mu.Unlock() }() resp, err := c.client.Get(c.url) if err != nil { slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err) return err } defer resp.Body.Close() if resp.StatusCode != http.StatusOK { body, _ := io.ReadAll(io.LimitReader(resp.Body, 512)) slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body)) return fmt.Errorf("jwks fetch returned %d", resp.StatusCode) } body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit if err != nil { return fmt.Errorf("jwks read body: %w", err) } var jwks jose.JSONWebKeySet if err := json.Unmarshal(body, &jwks); err != nil { return fmt.Errorf("jwks parse: %w", err) } newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys)) for _, k := range jwks.Keys { if k.KeyID != "" { newKeys[k.KeyID] = k } } c.mu.Lock() c.keys = newKeys c.lastFetched = time.Now() c.mu.Unlock() slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys)) return nil }

gemini-code-assist · 2026-06-06T16:44:25Z

+func (a *IAPAuthenticator) resolveHTTPClient() *http.Client {
+	if a.HTTPClient != nil {
+		return a.HTTPClient
+	}
+	return http.DefaultClient
+}


Using http.DefaultClient with no timeout can cause the application to hang indefinitely if the JWKS endpoint becomes unresponsive. It is highly recommended to use a default client with a reasonable timeout (e.g., 10 seconds).

var defaultHTTPClient = &http.Client{ Timeout: 10 * time.Second, } func (a *IAPAuthenticator) resolveHTTPClient() *http.Client { if a.HTTPClient != nil { return a.HTTPClient } return defaultHTTPClient }

gemini-code-assist · 2026-06-06T16:44:25Z

+// Get returns a cached user identity if present and not expired.
+func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {
+	c.mu.RLock()
+	defer c.mu.RUnlock()
+	entry, ok := c.cache[email]
+	if !ok || time.Now().After(entry.expiresAt) {
+		return nil, false
+	}
+	return entry.identity, true
+}


The ProxyUserCache map only grows and never evicts expired entries. Over time, as unique users authenticate, this will cause a slow memory leak. We can evict expired entries on-access during Get by upgrading to a write lock and deleting the expired key.

Suggested change

// Get returns a cached user identity if present and not expired.

func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {

c.mu.RLock()

defer c.mu.RUnlock()

entry, ok := c.cache[email]

if !ok || time.Now().After(entry.expiresAt) {

return nil, false

}

return entry.identity, true

}

// Get returns a cached user identity if present and not expired.

func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {

c.mu.RLock()

entry, ok := c.cache[email]

if !ok {

c.mu.RUnlock()

return nil, false

}

if time.Now().After(entry.expiresAt) {

c.mu.RUnlock()

c.mu.Lock()

if entry, ok = c.cache[email]; ok && time.Now().After(entry.expiresAt) {

delete(c.cache, email)

}

c.mu.Unlock()

return nil, false

}

defer c.mu.RUnlock()

return entry.identity, true

}

ptone · 2026-06-06T17:29:01Z

/gemini review

gemini-code-assist

Code Review

This pull request adds support for deploying the Scion Hub as a Cloud Run service with Identity-Aware Proxy (IAP) authentication, including a new /health endpoint, a multi-stage Dockerfile, and deployment scripts. The review feedback highlights two important improvements in deploy.sh: ensuring kubectl commands use the generated kubeconfig to avoid running against the wrong local context, and optimizing GKE cluster description calls to reduce network overhead.

gemini-code-assist · 2026-06-06T17:30:20Z

+log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
+kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true


The script runs kubectl create namespace and kubectl apply without specifying the target cluster context or using the generated kubeconfig. This means it will run against the user's currently active local kubectl context, which could be a completely different cluster (or even a production cluster), leading to accidental modifications or failures. Since the script already generates a complete KUBECONFIG_CONTENT on lines 134-155, we should write it to a temporary file and use it for the kubectl commands to guarantee they run against the correct GKE_CLUSTER.

Suggested change

log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"

kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true

log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"

LOCAL_KUBECONFIG=$(mktemp)

echo "$KUBECONFIG_CONTENT" > "$LOCAL_KUBECONFIG"

KUBECONFIG="$LOCAL_KUBECONFIG" kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | KUBECONFIG="$LOCAL_KUBECONFIG" kubectl apply -f - || true

rm -f "$LOCAL_KUBECONFIG"

gemini-code-assist · 2026-06-06T17:30:20Z

+ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \
+  --region "$REGION" --project "$PROJECT" \
+  --format="value(endpoint)")
+CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \
+  --region "$REGION" --project "$PROJECT" \
+  --format="value(masterAuth.clusterCaCertificate)")


The script makes two separate external API calls to gcloud container clusters describe to fetch the endpoint and the CA certificate. Each external call is slow and adds network overhead. You can optimize this into a single call and extract both values using bash's read command.

Suggested change

ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \

--region "$REGION" --project "$PROJECT" \

--format="value(endpoint)")

CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \

--region "$REGION" --project "$PROJECT" \

--format="value(masterAuth.clusterCaCertificate)")

read -r ENDPOINT CA_CERT < <(gcloud container clusters describe "$GKE_CLUSTER" \

--region "$REGION" --project "$PROJECT" \

--format="value(endpoint,masterAuth.clusterCaCertificate)")

Add --no-cpu-throttling to keep CPU allocated even between requests. Cloud Run's default CPU throttling (only during request processing) causes the localhost WebSocket control channel between the co-located hub and broker to drop, because: 1. Between external requests, CPU is throttled to near-zero 2. The broker's localhost heartbeat HTTP requests fail with EOF 3. The control channel WebSocket misses pings and gets closed 4. During agent creation, if the WebSocket drops while waiting for GKE Autopilot node provisioning (~30s), the dispatch fails with "connection closed" and returns 502 With CPU always allocated, the broker heartbeat and control channel remain stable, and agent dispatch can wait for Autopilot without the connection dropping.

- Release write lock during JWKS response body processing (not just the HTTP GET) to avoid blocking auth requests on JSON parsing - Add expired-entry eviction to ProxyUserCache.Get via double-check locking to prevent unbounded memory growth - Use generated kubeconfig for kubectl commands in deploy.sh to avoid running against the wrong cluster context - Consolidate two gcloud cluster-describe calls into one

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

scion-gteam Bot force-pushed the scion/cloudrun-iap-deploy branch from 3ef548e to 5bae128 Compare June 6, 2026 17:05

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

Scion and others added 2 commits June 6, 2026 22:27

scion-gteam Bot force-pushed the scion/cloudrun-iap-deploy branch from 5bae128 to bc0ba87 Compare June 6, 2026 22:27

ptone merged commit 2b71d8c into GoogleCloudPlatform:main Jun 7, 2026
4 of 5 checks passed

ptone mentioned this pull request Jun 7, 2026

feat(cloudrun): Cloud Run hub deployment with co-located GKE broker ptone/scion#145

Closed

4 tasks

scion-gteam Bot deleted the scion/cloudrun-iap-deploy branch June 7, 2026 02:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

automation: add cloud run deploy pattern for hub#330

automation: add cloud run deploy pattern for hub#330
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/cloudrun-iap-deploy

ptone commented Jun 6, 2026

Uh oh!

google-cla Bot commented Jun 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

ptone commented Jun 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
		kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml \| kubectl apply -f - \|\| true

-log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
-kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true
+log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
+LOCAL_KUBECONFIG=$(mktemp)
+echo "$KUBECONFIG_CONTENT" > "$LOCAL_KUBECONFIG"
+KUBECONFIG="$LOCAL_KUBECONFIG" kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | KUBECONFIG="$LOCAL_KUBECONFIG" kubectl apply -f - || true
+rm -f "$LOCAL_KUBECONFIG"

Conversation

ptone commented Jun 6, 2026

Uh oh!

google-cla Bot commented Jun 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

ptone commented Jun 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant