Skip to content

automation: add cloud run deploy pattern for hub#330

Merged
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/cloudrun-iap-deploy
Jun 7, 2026
Merged

automation: add cloud run deploy pattern for hub#330
ptone merged 2 commits into
GoogleCloudPlatform:mainfrom
ptone:scion/cloudrun-iap-deploy

Conversation

@ptone
Copy link
Copy Markdown
Member

@ptone ptone commented Jun 6, 2026

initial MVP scale of Cloud Run hub deploy

@google-cla
Copy link
Copy Markdown

google-cla Bot commented Jun 6, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Phase 1 of the Auth Proxy Mode, enabling Google IAP signed-header authentication. It introduces the IAPAuthenticator with JWKS caching, extracts a shared provisionUser method to deduplicate user provisioning across OAuth and proxy paths, and adds Cloud Run deployment scripts. Feedback on the changes highlights a critical performance bottleneck where a write lock is held during external HTTP requests in the JWKS cache, a potential hang due to the lack of a timeout on the default HTTP client, and a memory leak in the proxy user cache where expired entries are never evicted.

Comment thread pkg/hub/proxyauth.go
Comment on lines +280 to +324
func (c *jwksCache) refresh() error {
c.mu.Lock()
defer c.mu.Unlock()

// Debounce: don't re-fetch if we just fetched very recently
if time.Since(c.lastFetched) < 5*time.Second {
return nil
}

resp, err := c.client.Get(c.url)
if err != nil {
slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err)
return err
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body))
return fmt.Errorf("jwks fetch returned %d", resp.StatusCode)
}

body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit
if err != nil {
return fmt.Errorf("jwks read body: %w", err)
}

var jwks jose.JSONWebKeySet
if err := json.Unmarshal(body, &jwks); err != nil {
return fmt.Errorf("jwks parse: %w", err)
}

newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys))
for _, k := range jwks.Keys {
if k.KeyID != "" {
newKeys[k.KeyID] = k
}
}

c.keys = newKeys
c.lastFetched = time.Now()

slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys))
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Holding a write lock (c.mu.Lock()) across an external network request (c.client.Get(c.url)) is a major performance bottleneck and availability risk. If the JWKS endpoint is slow or hangs, all other authentication requests will block indefinitely on c.mu.RLock() in GetKey.

Additionally, if we simply release the lock during the HTTP request, multiple concurrent requests could trigger duplicate HTTP requests (cache stampede) because c.lastFetched is only updated after the request completes.

To resolve both issues, we can introduce an isFetching boolean flag to track active fetches, release the lock during the HTTP request, and re-acquire it only when updating the cache.

func (c *jwksCache) refresh() error {
	c.mu.Lock()
	if c.isFetching {
		c.mu.Unlock()
		return nil
	}
	// Debounce: don't re-fetch if we just fetched very recently
	if time.Since(c.lastFetched) < 5*time.Second {
		c.mu.Unlock()
		return nil
	}
	c.isFetching = true
	c.mu.Unlock()

	defer func() {
		c.mu.Lock()
		c.isFetching = false
		c.mu.Unlock()
	}()

	resp, err := c.client.Get(c.url)
	if err != nil {
		slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err)
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
		slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body))
		return fmt.Errorf("jwks fetch returned %d", resp.StatusCode)
	}

	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit
	if err != nil {
		return fmt.Errorf("jwks read body: %w", err)
	}

	var jwks jose.JSONWebKeySet
	if err := json.Unmarshal(body, &jwks); err != nil {
		return fmt.Errorf("jwks parse: %w", err)
	}

	newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys))
	for _, k := range jwks.Keys {
		if k.KeyID != "" {
			newKeys[k.KeyID] = k
		}
	}

	c.mu.Lock()
	c.keys = newKeys
	c.lastFetched = time.Now()
	c.mu.Unlock()

	slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys))
	return nil
}

Comment thread pkg/hub/proxyauth.go
Comment on lines +205 to +210
func (a *IAPAuthenticator) resolveHTTPClient() *http.Client {
if a.HTTPClient != nil {
return a.HTTPClient
}
return http.DefaultClient
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using http.DefaultClient with no timeout can cause the application to hang indefinitely if the JWKS endpoint becomes unresponsive. It is highly recommended to use a default client with a reasonable timeout (e.g., 10 seconds).

var defaultHTTPClient = &http.Client{
	Timeout: 10 * time.Second,
}

func (a *IAPAuthenticator) resolveHTTPClient() *http.Client {
	if a.HTTPClient != nil {
		return a.HTTPClient
	}
	return defaultHTTPClient
}

Comment thread pkg/hub/auth.go
Comment on lines +530 to +539
// Get returns a cached user identity if present and not expired.
func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
entry, ok := c.cache[email]
if !ok || time.Now().After(entry.expiresAt) {
return nil, false
}
return entry.identity, true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The ProxyUserCache map only grows and never evicts expired entries. Over time, as unique users authenticate, this will cause a slow memory leak. We can evict expired entries on-access during Get by upgrading to a write lock and deleting the expired key.

Suggested change
// Get returns a cached user identity if present and not expired.
func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {
c.mu.RLock()
defer c.mu.RUnlock()
entry, ok := c.cache[email]
if !ok || time.Now().After(entry.expiresAt) {
return nil, false
}
return entry.identity, true
}
// Get returns a cached user identity if present and not expired.
func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) {
c.mu.RLock()
entry, ok := c.cache[email]
if !ok {
c.mu.RUnlock()
return nil, false
}
if time.Now().After(entry.expiresAt) {
c.mu.RUnlock()
c.mu.Lock()
if entry, ok = c.cache[email]; ok && time.Now().After(entry.expiresAt) {
delete(c.cache, email)
}
c.mu.Unlock()
return nil, false
}
defer c.mu.RUnlock()
return entry.identity, true
}

@scion-gteam scion-gteam Bot force-pushed the scion/cloudrun-iap-deploy branch from 3ef548e to 5bae128 Compare June 6, 2026 17:05
@ptone
Copy link
Copy Markdown
Member Author

ptone commented Jun 6, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for deploying the Scion Hub as a Cloud Run service with Identity-Aware Proxy (IAP) authentication, including a new /health endpoint, a multi-stage Dockerfile, and deployment scripts. The review feedback highlights two important improvements in deploy.sh: ensuring kubectl commands use the generated kubeconfig to avoid running against the wrong local context, and optimizing GKE cluster description calls to reduce network overhead.

Comment thread scripts/cloudrun/deploy.sh Outdated
Comment on lines +184 to +185
log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The script runs kubectl create namespace and kubectl apply without specifying the target cluster context or using the generated kubeconfig. This means it will run against the user's currently active local kubectl context, which could be a completely different cluster (or even a production cluster), leading to accidental modifications or failures. Since the script already generates a complete KUBECONFIG_CONTENT on lines 134-155, we should write it to a temporary file and use it for the kubectl commands to guarantee they run against the correct GKE_CLUSTER.

Suggested change
log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true
log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}"
LOCAL_KUBECONFIG=$(mktemp)
echo "$KUBECONFIG_CONTENT" > "$LOCAL_KUBECONFIG"
KUBECONFIG="$LOCAL_KUBECONFIG" kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | KUBECONFIG="$LOCAL_KUBECONFIG" kubectl apply -f - || true
rm -f "$LOCAL_KUBECONFIG"

Comment thread scripts/cloudrun/deploy.sh Outdated
Comment on lines +124 to +129
ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \
--region "$REGION" --project "$PROJECT" \
--format="value(endpoint)")
CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \
--region "$REGION" --project "$PROJECT" \
--format="value(masterAuth.clusterCaCertificate)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script makes two separate external API calls to gcloud container clusters describe to fetch the endpoint and the CA certificate. Each external call is slow and adds network overhead. You can optimize this into a single call and extract both values using bash's read command.

Suggested change
ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \
--region "$REGION" --project "$PROJECT" \
--format="value(endpoint)")
CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \
--region "$REGION" --project "$PROJECT" \
--format="value(masterAuth.clusterCaCertificate)")
read -r ENDPOINT CA_CERT < <(gcloud container clusters describe "$GKE_CLUSTER" \
--region "$REGION" --project "$PROJECT" \
--format="value(endpoint,masterAuth.clusterCaCertificate)")

Scion and others added 2 commits June 6, 2026 22:27
Add --no-cpu-throttling to keep CPU allocated even between requests.
Cloud Run's default CPU throttling (only during request processing)
causes the localhost WebSocket control channel between the co-located
hub and broker to drop, because:

1. Between external requests, CPU is throttled to near-zero
2. The broker's localhost heartbeat HTTP requests fail with EOF
3. The control channel WebSocket misses pings and gets closed
4. During agent creation, if the WebSocket drops while waiting for
   GKE Autopilot node provisioning (~30s), the dispatch fails with
   "connection closed" and returns 502

With CPU always allocated, the broker heartbeat and control channel
remain stable, and agent dispatch can wait for Autopilot without
the connection dropping.
- Release write lock during JWKS response body processing (not just the
  HTTP GET) to avoid blocking auth requests on JSON parsing
- Add expired-entry eviction to ProxyUserCache.Get via double-check
  locking to prevent unbounded memory growth
- Use generated kubeconfig for kubectl commands in deploy.sh to avoid
  running against the wrong cluster context
- Consolidate two gcloud cluster-describe calls into one
@scion-gteam scion-gteam Bot force-pushed the scion/cloudrun-iap-deploy branch from 5bae128 to bc0ba87 Compare June 6, 2026 22:27
@ptone ptone merged commit 2b71d8c into GoogleCloudPlatform:main Jun 7, 2026
4 of 5 checks passed
@scion-gteam scion-gteam Bot deleted the scion/cloudrun-iap-deploy branch June 7, 2026 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant