fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE #606
raballew wants to merge 5 commits into jumpstarter-dev:main
  if e.code() == grpc.StatusCode.UNAVAILABLE:
      remaining = deadline - time.monotonic()
      if remaining <= 0:
          logger.debug(

Suggested change:
-         logger.debug(
+         logger.warning(

Maybe even a warning?
          )
          raise
      delay = min(base_delay * (2**attempt), max_delay, remaining)
      logger.debug(

Suggested change:
-     logger.debug(
+     logger.warning(

? WDYT?
- if condition_present_and_equal(
-     result.conditions, "Unsatisfiable", "True", "NoExporter"
- ):
+ if condition_present_and_equal(result.conditions, "Unsatisfiable", "True", "NoExporter"):

Unrelated format change. We should avoid format changes: they make patches harder to backport later and increase the chance of conflicts with other patches. Unless there is a good reason, of course (e.g. a broken linter).
  logger.debug(
      "Exporter not ready and dial timeout (%.1fs) exceeded after %d attempts",
-     self.dial_timeout, attempt + 1
+     self.dial_timeout,
  )
  raise
- delay = min(base_delay * (2 ** attempt), max_delay, remaining)
+ delay = min(base_delay * (2**attempt), max_delay, remaining)

  delay = min(base_delay * (2**attempt), max_delay, remaining)
  logger.debug(
      "Exporter not ready, retrying Dial in %.1fs (attempt %d, %.1fs remaining)",
-     delay, attempt + 1, remaining

  logger.warning(
-     "Lease %s has been transferred to another client. "
-     "Your session is no longer valid.",
+     "Lease %s has been transferred to another client. Your session is no longer valid.",
Instead of immediately marking the connection as permanently lost on a single gRPC UNAVAILABLE error, the poll loop now retries up to 10 times (mirroring the existing DEADLINE_EXCEEDED retry pattern). This prevents premature lease termination when an exporter briefly restarts. The retry counter resets on any successful GetStatus response. Only sustained failures (10+ consecutive UNAVAILABLE) mark connection_lost.

Fixes jumpstarter-dev#242
Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
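The counting rule in this commit message can be sketched roughly as follows. This is a simplified, synchronous illustration and not the code from the PR: `Unavailable`, `poll`, and `MAX_UNAVAILABLE_RETRIES` are hypothetical names, and whether the real loop gives up on exactly the 10th or the 11th consecutive failure is an assumption here.

```python
class Unavailable(Exception):
    """Stand-in for a gRPC error with StatusCode.UNAVAILABLE (hypothetical)."""


# Threshold taken from the commit message; the exact boundary is an assumption.
MAX_UNAVAILABLE_RETRIES = 10


def poll(get_status, max_polls):
    """Poll up to max_polls times; return (connection_lost, statuses_seen)."""
    unavailable_count = 0
    seen = []
    for _ in range(max_polls):
        try:
            seen.append(get_status())
            # Any successful response resets the consecutive-failure counter.
            unavailable_count = 0
        except Unavailable:
            unavailable_count += 1
            if unavailable_count >= MAX_UNAVAILABLE_RETRIES:
                # Sustained failure: treat the connection as lost.
                return True, seen
    return False, seen
```

With this rule, nine failures followed by one success and nine more failures never trips the threshold, while ten failures in a row do.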
When the exporter briefly restarts, the Dial RPC may fail with UNAVAILABLE. Instead of immediately giving up, retry with exponential backoff bounded by the existing dial_timeout parameter. This mirrors the existing FAILED_PRECONDITION retry logic.

Fixes jumpstarter-dev#242
Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
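A minimal sketch of deadline-bounded exponential backoff, built around the `delay = min(base_delay * (2**attempt), max_delay, remaining)` line from the diff. Everything else is an assumption for illustration: `Unavailable`, `dial_with_backoff`, the default values of `base_delay` and `max_delay`, and the injectable `sleep`/`clock` parameters are not taken from the PR.

```python
import time


class Unavailable(Exception):
    """Stand-in for a gRPC error with StatusCode.UNAVAILABLE (hypothetical)."""


def dial_with_backoff(dial, dial_timeout, base_delay=0.5, max_delay=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call dial() until it succeeds or dial_timeout seconds have elapsed."""
    deadline = clock() + dial_timeout
    attempt = 0
    while True:
        try:
            return dial()
        except Unavailable:
            remaining = deadline - clock()
            if remaining <= 0:
                # Time budget exhausted: re-raise the last error to the caller.
                raise
            # Exponential backoff, capped by max_delay and by the time left.
            delay = min(base_delay * (2**attempt), max_delay, remaining)
            sleep(delay)
            attempt += 1
```

Capping the delay by `remaining` keeps the last sleep from overshooting the deadline, so the total wait never exceeds `dial_timeout` by more than the duration of one `dial()` call.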
…ction in test

Make UNAVAILABLE timeout in handle_async raise instead of returning silently, matching the FAILED_PRECONDITION timeout behavior. Add assertion that connect_router_stream is called after successful UNAVAILABLE retry.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ll loop

Remove the `continue` statement from the UNAVAILABLE handler in _poll_loop so it falls through to the standard sleep block. Previously, UNAVAILABLE retries had no delay between attempts, so 10 retries could be exhausted in under 1ms -- far too fast to tolerate an exporter restart that takes several seconds. Now retries use the poll_interval sleep, making the 10-retry threshold span a meaningful duration.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
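The effect of dropping `continue` can be illustrated with a small asyncio sketch. It is not the actual `_poll_loop`: `Unavailable`, `poll_loop`, the `max_retries` default, and the injectable `sleep` are hypothetical, and the loop only returns how many sleeps happened before giving up.

```python
import asyncio


class Unavailable(Exception):
    """Stand-in for a gRPC error with StatusCode.UNAVAILABLE (hypothetical)."""


async def poll_loop(get_status, poll_interval, max_retries=10, sleep=asyncio.sleep):
    """Poll until max_retries consecutive failures; return the number of sleeps."""
    failures = 0
    sleeps = 0
    while True:
        try:
            await get_status()
            failures = 0
        except Unavailable:
            failures += 1
            if failures >= max_retries:
                return sleeps
            # No `continue` here: control falls through to the sleep below,
            # so each retry is separated by poll_interval.
        await sleep(poll_interval)
        sleeps += 1
```

Under these assumptions, ten consecutive failures are separated by nine sleeps, so the retries span at least `9 * poll_interval` seconds rather than finishing almost instantly.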
Revert unrelated formatting changes to minimize backport conflicts. Change UNAVAILABLE timeout log from debug to warning per reviewer request. Restore removed comment for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5cd25c3 to 519bcf2
Summary
dial_timeout, mirroring existing FAILED_PRECONDITION retry logic
continue
Closes #242
Test plan
make pkg-test-jumpstarter

🤖 Generated with Claude Code