Add retry transport for transient Google API errors #84

Open
c1-dev-bot[bot] wants to merge 2 commits into main from fix/retry-transient-google-api-errors

c1-dev-bot (bot) commented Feb 20, 2026

Summary

  • Adds a retryTransport HTTP round-tripper that wraps the Google API client's transport with exponential backoff retry logic for transient network errors (EOF, connection reset, broken pipe, etc.)
  • Updates wrapGoogleApiErrorWithContext to wrap transient network errors as gRPC Unavailable status, so the baton framework treats them as retryable even after the retry transport exhausts its attempts
  • Includes comprehensive tests for transient error detection and retry behavior

Root Cause

The Google Admin SDK's generated code uses gensupport.SendRequest (without retry), not gensupport.SendRequestWithRetry. This means transient network errors like io.EOF and connection resets propagate directly as unrecoverable sync failures. The existing wrapGoogleApiErrorWithContext helper only handled *googleapi.Error (HTTP errors with status codes) and returned raw network errors unchanged, so the baton framework couldn't identify them as retryable.
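The missing classification step can be illustrated with a minimal, standard-library-only sketch. It assumes a helper shaped like the `isTransientError` named in the test plan; the target list, the `samples` table, and the `transientTargets` name are illustrative assumptions, not the PR's actual code, and the PR's "other temporary network errors" are not covered here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"syscall"
)

// transientTargets lists the sentinel errors this sketch treats as retryable.
// The list is an assumption based on the errors named in the PR description.
var transientTargets = []error{
	io.EOF,
	io.ErrUnexpectedEOF,
	net.ErrClosed,
	syscall.ECONNRESET,
	syscall.ECONNREFUSED,
	syscall.EPIPE,
}

// isTransientError reports whether err, or any error it wraps, matches one
// of the targets above. errors.Is walks the Unwrap chain, so errors wrapped
// with %w or inside *net.OpError are matched as well.
func isTransientError(err error) bool {
	for _, target := range transientTargets {
		if errors.Is(err, target) {
			return true
		}
	}
	return false
}

// samples is a small table of inputs and the classification expected here.
var samples = []struct {
	name string
	err  error
	want bool
}{
	{"EOF", io.EOF, true},
	{"wrapped connection reset", fmt.Errorf("list users: %w",
		&net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET}), true},
	{"broken pipe", syscall.EPIPE, true},
	{"non-transient", errors.New("permission denied"), false},
}

func main() {
	for _, s := range samples {
		fmt.Printf("%-26s transient=%v\n", s.name, isTransientError(s.err))
	}
}
```

Matching with `errors.Is` rather than comparing error strings keeps the check stable when callers add context to the error.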

Changes

  1. retry_transport.go — New retryTransport wrapping the outermost HTTP transport (around oauth2) with up to 3 retries and exponential backoff with jitter. Handles io.EOF, io.ErrUnexpectedEOF, ECONNRESET, ECONNREFUSED, EPIPE, net.ErrClosed, and other temporary network errors.

  2. connector.go — Wraps the Google API HTTP client's transport with newRetryTransport() so all Google Admin SDK calls benefit from retry logic.

  3. error_helpers.go — wrapGoogleApiErrorWithContext now detects transient network errors and wraps them as codes.Unavailable gRPC status, providing a safety net if retries are exhausted.

  4. retry_transport_test.go — Tests for transient error detection and retry behavior including EOF, connection reset, retry exhaustion, and non-transient error passthrough.
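The retry behavior described in item 1 can be sketched with the standard library alone. This is a rough illustration, not the PR's implementation: the field names, the 1 ms base delay, the reduced `isTransient` predicate, and the `flakyTransport` and `demo` helpers are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"math/rand"
	"net/http"
	"syscall"
	"time"
)

// isTransient is a deliberately small stand-in for the PR's isTransientError.
func isTransient(err error) bool {
	return errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, syscall.ECONNRESET) || errors.Is(err, syscall.EPIPE)
}

// retryTransport retries round trips that fail with a transient error.
// Field names and delays are assumptions made for this sketch.
type retryTransport struct {
	base       http.RoundTripper
	maxRetries int
	baseDelay  time.Duration
}

func (t *retryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= t.maxRetries; attempt++ {
		r := req
		if attempt > 0 {
			// Exponential backoff: baseDelay, 2x, 4x, ... plus random jitter.
			delay := t.baseDelay << (attempt - 1)
			if delay > 0 {
				delay += time.Duration(rand.Int63n(int64(delay)))
			}
			select {
			case <-req.Context().Done():
				return nil, req.Context().Err()
			case <-time.After(delay):
			}
			// A consumed body must be recreated before the request is resent.
			if req.Body != nil && req.Body != http.NoBody {
				if req.GetBody == nil {
					return nil, lastErr
				}
				body, err := req.GetBody()
				if err != nil {
					return nil, err
				}
				r = req.Clone(req.Context())
				r.Body = body
			}
		}
		resp, err := t.base.RoundTrip(r)
		if err == nil || !isTransient(err) {
			return resp, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("giving up after %d retries: %w", t.maxRetries, lastErr)
}

// flakyTransport fails the first `failures` calls with io.EOF, then succeeds.
type flakyTransport struct{ failures, calls int }

func (f *flakyTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	f.calls++
	if f.calls <= f.failures {
		return nil, io.EOF
	}
	return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody, Request: req}, nil
}

// demo sends one GET through a retryTransport and reports how many calls
// reached the underlying transport.
func demo(failures int) (int, error) {
	flaky := &flakyTransport{failures: failures}
	client := &http.Client{Transport: &retryTransport{base: flaky, maxRetries: 3, baseDelay: time.Millisecond}}
	resp, err := client.Get("http://example.invalid/")
	if err == nil {
		resp.Body.Close()
	}
	return flaky.calls, err
}

func main() {
	fmt.Println(demo(2))  // two EOFs are absorbed: 3 calls, nil error
	fmt.Println(demo(10)) // retries exhausted: 4 calls and a wrapped EOF
}
```

With three retries the underlying transport sees at most four calls per request. Requests with a body are resent only when `GetBody` is available, since a body that has already been consumed cannot be replayed.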

Test plan

  • All existing tests pass (go test ./...)
  • New unit tests for isTransientError cover EOF, connection reset, broken pipe, wrapped errors
  • New unit tests for retryTransport verify retry on EOF, no retry on non-transient, retry exhaustion, connection reset recovery, success on first attempt
  • go vet ./... passes cleanly
  • Manual testing with a real Google Workspace account to verify intermittent EOF errors are retried

Automated PR Notice

This PR was automatically created by c1-dev-bot as a potential implementation.

This code requires:

  • Human review of the implementation approach
  • Manual testing to verify correctness
  • Approval from the appropriate team before merging

…reset)

The Google Admin SDK's generated code uses gensupport.SendRequest (without
retry), so transient network errors like EOF and connection reset propagate
directly as sync failures. This is the root cause of intermittent sync
failures reported against the GCP/Google Workspace connector.

This change:
- Adds a retryTransport that wraps the HTTP transport used by Google API
  clients with exponential backoff retry logic for transient errors (EOF,
  connection reset, broken pipe, etc.)
- Updates wrapGoogleApiErrorWithContext to wrap transient network errors
  as gRPC Unavailable status, so the baton framework treats them as
  retryable even if the retry transport exhausts its attempts
- Includes comprehensive tests for transient error detection and retry
  behavior
c1-dev-bot (bot) requested a review from a team on February 20, 2026 at 20:24