automation: add cloud run deploy pattern for hub#330
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request implements Phase 1 of the Auth Proxy Mode, enabling Google IAP signed-header authentication. It introduces the IAPAuthenticator with JWKS caching, extracts a shared provisionUser method to deduplicate user provisioning across OAuth and proxy paths, and adds Cloud Run deployment scripts. Feedback on the changes highlights a critical performance bottleneck where a write lock is held during external HTTP requests in the JWKS cache, a potential hang due to the lack of a timeout on the default HTTP client, and a memory leak in the proxy user cache where expired entries are never evicted.
| func (c *jwksCache) refresh() error { | ||
| c.mu.Lock() | ||
| defer c.mu.Unlock() | ||
|
|
||
| // Debounce: don't re-fetch if we just fetched very recently | ||
| if time.Since(c.lastFetched) < 5*time.Second { | ||
| return nil | ||
| } | ||
|
|
||
| resp, err := c.client.Get(c.url) | ||
| if err != nil { | ||
| slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err) | ||
| return err | ||
| } | ||
| defer resp.Body.Close() | ||
|
|
||
| if resp.StatusCode != http.StatusOK { | ||
| body, _ := io.ReadAll(io.LimitReader(resp.Body, 512)) | ||
| slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body)) | ||
| return fmt.Errorf("jwks fetch returned %d", resp.StatusCode) | ||
| } | ||
|
|
||
| body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit | ||
| if err != nil { | ||
| return fmt.Errorf("jwks read body: %w", err) | ||
| } | ||
|
|
||
| var jwks jose.JSONWebKeySet | ||
| if err := json.Unmarshal(body, &jwks); err != nil { | ||
| return fmt.Errorf("jwks parse: %w", err) | ||
| } | ||
|
|
||
| newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys)) | ||
| for _, k := range jwks.Keys { | ||
| if k.KeyID != "" { | ||
| newKeys[k.KeyID] = k | ||
| } | ||
| } | ||
|
|
||
| c.keys = newKeys | ||
| c.lastFetched = time.Now() | ||
|
|
||
| slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys)) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Holding a write lock (c.mu.Lock()) across an external network request (c.client.Get(c.url)) is a major performance bottleneck and availability risk. If the JWKS endpoint is slow or hangs, all other authentication requests will block indefinitely on c.mu.RLock() in GetKey.
Additionally, if we simply release the lock during the HTTP request, multiple concurrent requests could trigger duplicate HTTP requests (cache stampede) because c.lastFetched is only updated after the request completes.
To resolve both issues, we can introduce an isFetching boolean flag to track active fetches, release the lock during the HTTP request, and re-acquire it only when updating the cache.
func (c *jwksCache) refresh() error {
c.mu.Lock()
if c.isFetching {
c.mu.Unlock()
return nil
}
// Debounce: don't re-fetch if we just fetched very recently
if time.Since(c.lastFetched) < 5*time.Second {
c.mu.Unlock()
return nil
}
c.isFetching = true
c.mu.Unlock()
defer func() {
c.mu.Lock()
c.isFetching = false
c.mu.Unlock()
}()
resp, err := c.client.Get(c.url)
if err != nil {
slog.Warn("jwks fetch failed, serving last-good keys", "url", c.url, "error", err)
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
slog.Warn("jwks fetch non-200", "url", c.url, "status", resp.StatusCode, "body", string(body))
return fmt.Errorf("jwks fetch returned %d", resp.StatusCode)
}
body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // 1MB limit
if err != nil {
return fmt.Errorf("jwks read body: %w", err)
}
var jwks jose.JSONWebKeySet
if err := json.Unmarshal(body, &jwks); err != nil {
return fmt.Errorf("jwks parse: %w", err)
}
newKeys := make(map[string]jose.JSONWebKey, len(jwks.Keys))
for _, k := range jwks.Keys {
if k.KeyID != "" {
newKeys[k.KeyID] = k
}
}
c.mu.Lock()
c.keys = newKeys
c.lastFetched = time.Now()
c.mu.Unlock()
slog.Debug("jwks cache refreshed", "url", c.url, "keyCount", len(newKeys))
return nil
}| func (a *IAPAuthenticator) resolveHTTPClient() *http.Client { | ||
| if a.HTTPClient != nil { | ||
| return a.HTTPClient | ||
| } | ||
| return http.DefaultClient | ||
| } |
There was a problem hiding this comment.
Using http.DefaultClient with no timeout can cause the application to hang indefinitely if the JWKS endpoint becomes unresponsive. It is highly recommended to use a default client with a reasonable timeout (e.g., 10 seconds).
var defaultHTTPClient = &http.Client{
Timeout: 10 * time.Second,
}
func (a *IAPAuthenticator) resolveHTTPClient() *http.Client {
if a.HTTPClient != nil {
return a.HTTPClient
}
return defaultHTTPClient
}| // Get returns a cached user identity if present and not expired. | ||
| func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) { | ||
| c.mu.RLock() | ||
| defer c.mu.RUnlock() | ||
| entry, ok := c.cache[email] | ||
| if !ok || time.Now().After(entry.expiresAt) { | ||
| return nil, false | ||
| } | ||
| return entry.identity, true | ||
| } |
There was a problem hiding this comment.
The ProxyUserCache map only grows and never evicts expired entries. Over time, as unique users authenticate, this will cause a slow memory leak. We can evict expired entries on-access during Get by upgrading to a write lock and deleting the expired key.
| // Get returns a cached user identity if present and not expired. | |
| func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) { | |
| c.mu.RLock() | |
| defer c.mu.RUnlock() | |
| entry, ok := c.cache[email] | |
| if !ok || time.Now().After(entry.expiresAt) { | |
| return nil, false | |
| } | |
| return entry.identity, true | |
| } | |
| // Get returns a cached user identity if present and not expired. | |
| func (c *ProxyUserCache) Get(email string) (UserIdentity, bool) { | |
| c.mu.RLock() | |
| entry, ok := c.cache[email] | |
| if !ok { | |
| c.mu.RUnlock() | |
| return nil, false | |
| } | |
| if time.Now().After(entry.expiresAt) { | |
| c.mu.RUnlock() | |
| c.mu.Lock() | |
| if entry, ok = c.cache[email]; ok && time.Now().After(entry.expiresAt) { | |
| delete(c.cache, email) | |
| } | |
| c.mu.Unlock() | |
| return nil, false | |
| } | |
| defer c.mu.RUnlock() | |
| return entry.identity, true | |
| } |
3ef548e to
5bae128
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds support for deploying the Scion Hub as a Cloud Run service with Identity-Aware Proxy (IAP) authentication, including a new /health endpoint, a multi-stage Dockerfile, and deployment scripts. The review feedback highlights two important improvements in deploy.sh: ensuring kubectl commands use the generated kubeconfig to avoid running against the wrong local context, and optimizing GKE cluster description calls to reduce network overhead.
| log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}" | ||
| kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true |
There was a problem hiding this comment.
The script runs kubectl create namespace and kubectl apply without specifying the target cluster context or using the generated kubeconfig. This means it will run against the user's currently active local kubectl context, which could be a completely different cluster (or even a production cluster), leading to accidental modifications or failures. Since the script already generates a complete KUBECONFIG_CONTENT on lines 134-155, we should write it to a temporary file and use it for the kubectl commands to guarantee they run against the correct GKE_CLUSTER.
| log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}" | |
| kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - || true | |
| log "Ensuring namespace ${K8S_NAMESPACE} exists in ${GKE_CLUSTER}" | |
| LOCAL_KUBECONFIG=$(mktemp) | |
| echo "$KUBECONFIG_CONTENT" > "$LOCAL_KUBECONFIG" | |
| KUBECONFIG="$LOCAL_KUBECONFIG" kubectl create namespace "$K8S_NAMESPACE" --dry-run=client -o yaml | KUBECONFIG="$LOCAL_KUBECONFIG" kubectl apply -f - || true | |
| rm -f "$LOCAL_KUBECONFIG" |
| ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \ | ||
| --region "$REGION" --project "$PROJECT" \ | ||
| --format="value(endpoint)") | ||
| CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \ | ||
| --region "$REGION" --project "$PROJECT" \ | ||
| --format="value(masterAuth.clusterCaCertificate)") |
There was a problem hiding this comment.
The script makes two separate external API calls to gcloud container clusters describe to fetch the endpoint and the CA certificate. Each external call is slow and adds network overhead. You can optimize this into a single call and extract both values using bash's read command.
| ENDPOINT=$(gcloud container clusters describe "$GKE_CLUSTER" \ | |
| --region "$REGION" --project "$PROJECT" \ | |
| --format="value(endpoint)") | |
| CA_CERT=$(gcloud container clusters describe "$GKE_CLUSTER" \ | |
| --region "$REGION" --project "$PROJECT" \ | |
| --format="value(masterAuth.clusterCaCertificate)") | |
| read -r ENDPOINT CA_CERT < <(gcloud container clusters describe "$GKE_CLUSTER" \ | |
| --region "$REGION" --project "$PROJECT" \ | |
| --format="value(endpoint,masterAuth.clusterCaCertificate)") |
Add --no-cpu-throttling to keep CPU allocated even between requests. Cloud Run's default CPU throttling (only during request processing) causes the localhost WebSocket control channel between the co-located hub and broker to drop, because: 1. Between external requests, CPU is throttled to near-zero 2. The broker's localhost heartbeat HTTP requests fail with EOF 3. The control channel WebSocket misses pings and gets closed 4. During agent creation, if the WebSocket drops while waiting for GKE Autopilot node provisioning (~30s), the dispatch fails with "connection closed" and returns 502 With CPU always allocated, the broker heartbeat and control channel remain stable, and agent dispatch can wait for Autopilot without the connection dropping.
- Release write lock during JWKS response body processing (not just the HTTP GET) to avoid blocking auth requests on JSON parsing - Add expired-entry eviction to ProxyUserCache.Get via double-check locking to prevent unbounded memory growth - Use generated kubeconfig for kubectl commands in deploy.sh to avoid running against the wrong cluster context - Consolidate two gcloud cluster-describe calls into one
5bae128 to
bc0ba87
Compare
initial MVP scale of Cloud Run hub deploy