Skip to content

fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE#606

Draft
raballew wants to merge 5 commits intojumpstarter-dev:mainfrom
raballew:242-unavailable-retry-resilience
Draft

fix: retry Dial and StatusMonitor poll on transient UNAVAILABLE#606
raballew wants to merge 5 commits intojumpstarter-dev:mainfrom
raballew:242-unavailable-retry-resilience

Conversation

@raballew
Copy link
Copy Markdown
Member

Summary

  • Retry Dial on transient UNAVAILABLE with exponential backoff bounded by dial_timeout, mirroring existing FAILED_PRECONDITION retry logic
  • StatusMonitor poll loop now retries up to 10 times on UNAVAILABLE (matching DEADLINE_EXCEEDED pattern) instead of immediately marking connection lost
  • Add inter-retry delay for UNAVAILABLE errors by removing premature continue

Closes #242

Test plan

  • Verify Dial retries on UNAVAILABLE and succeeds after exporter restart
  • Verify StatusMonitor tolerates transient UNAVAILABLE without terminating lease
  • Run make pkg-test-jumpstarter

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f554a632-86ec-4a27-81ba-6b8c5754f289

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

if e.code() == grpc.StatusCode.UNAVAILABLE:
remaining = deadline - time.monotonic()
if remaining <= 0:
logger.debug(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(
logger.warning(

May be even a warning?

)
raise
delay = min(base_delay * (2**attempt), max_delay, remaining)
logger.debug(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.debug(
logger.warning(

? WDYT?

Copy link
Copy Markdown
Member

@mangelajo mangelajo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nits

if condition_present_and_equal(
result.conditions, "Unsatisfiable", "True", "NoExporter"
):
if condition_present_and_equal(result.conditions, "Unsatisfiable", "True", "NoExporter"):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated format change (we should avoid format changes, it makes patches harder to backport later in time, or increase the chances of conflict with other patches), unless there is a good reason of course (linter broken)..

logger.debug(
"Exporter not ready and dial timeout (%.1fs) exceeded after %d attempts",
self.dial_timeout, attempt + 1
self.dial_timeout,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated format change

)
raise
delay = min(base_delay * (2 ** attempt), max_delay, remaining)
delay = min(base_delay * (2**attempt), max_delay, remaining)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated format change

delay = min(base_delay * (2**attempt), max_delay, remaining)
logger.debug(
"Exporter not ready, retrying Dial in %.1fs (attempt %d, %.1fs remaining)",
delay, attempt + 1, remaining
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated format change

logger.warning(
"Lease %s has been transferred to another client. "
"Your session is no longer valid.",
"Lease %s has been transferred to another client. Your session is no longer valid.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated format change

raballew and others added 5 commits April 17, 2026 12:29
Instead of immediately marking the connection as permanently lost on a
single gRPC UNAVAILABLE error, the poll loop now retries up to 10 times
(mirroring the existing DEADLINE_EXCEEDED retry pattern). This prevents
premature lease termination when an exporter briefly restarts.

The retry counter resets on any successful GetStatus response. Only
sustained failures (10+ consecutive UNAVAILABLE) mark connection_lost.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the exporter briefly restarts, the Dial RPC may fail with
UNAVAILABLE. Instead of immediately giving up, retry with exponential
backoff bounded by the existing dial_timeout parameter. This mirrors
the existing FAILED_PRECONDITION retry logic.

Fixes jumpstarter-dev#242

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction in test

Make UNAVAILABLE timeout in handle_async raise instead of returning silently,
matching the FAILED_PRECONDITION timeout behavior. Add assertion that
connect_router_stream is called after successful UNAVAILABLE retry.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ll loop

Remove the `continue` statement from the UNAVAILABLE handler in
_poll_loop so it falls through to the standard sleep block. Previously,
UNAVAILABLE retries had no delay between attempts, so 10 retries could
be exhausted in under 1ms -- far too fast to tolerate an exporter
restart that takes several seconds. Now retries use the poll_interval
sleep, making the 10-retry threshold span a meaningful duration.

Generated-By: Forge/20260416_202053_681470_11575359_i242
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert unrelated formatting changes to minimize backport conflicts.
Change UNAVAILABLE timeout log from debug to warning per reviewer request.
Restore removed comment for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@raballew raballew force-pushed the 242-unavailable-retry-resilience branch from 5cd25c3 to 519bcf2 Compare April 17, 2026 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hooks enabled: restarting an exporter is causing the client to kill the lease prematurely

2 participants