E2E test failure: Dial rejected with Available status despite server-side retry fix #538

@ambient-code

Summary

The E2E test "can lease and connect to exporters" is failing in PR #535 with repeated "Dial rejected due to exporter status" errors, followed by authentication failures that report "context canceled".

Error Details

Failed run: https://github.com/jumpstarter-dev/jumpstarter/actions/runs/24201578910/job/70646355079?pr=535#step:6:698

Test: Core E2E Tests > Lease and connect > can lease and connect to exporters
Location: /home/runner/work/jumpstarter/jumpstarter/e2e/test/e2e_test.go:406
Duration: Failed after 31.5 seconds

Key error logs:

1. Repeated Dial rejections with Available status:

2026-04-09T16:50:00Z  INFO  Dial rejected due to exporter status  
  {"peer": "10.244.0.1:21066", 
   "client": {"name":"test-client-oidc","namespace":"jumpstarter-lab"}, 
   "lease": {"name":"019d7326-6a7f-711e-bfed-b5bd8bbd90f4","namespace":"jumpstarter-lab"}, 
   "status": "Available", 
   "error": "rpc error: code = FailedPrecondition desc = exporter is not ready (status: Available)"}

2. Authentication failure with context canceled:

2026-04-09T16:50:10Z  ERROR  unable to authenticate client  
  {"peer": "10.244.0.1:21066", 
   "error": "Get \"https://10.96.0.1:443/apis/jumpstarter.dev/v1alpha1/namespaces/jumpstarter-lab/clients/test-client-oidc\": context canceled"}

Final error:

Error: Connection to exporter lost

Timeline:

  1. 16:49:50 - Lease acquired successfully
  2. 16:49:50 - Client waits for ready connection at socket
  3. 16:50:00 - Multiple Dial rejections with "Available" status (~10 seconds after lease)
  4. 16:50:10 - Dial rejections continue + authentication errors with "context canceled" (~20 seconds after lease)
  5. 16:50:10 - Connection lost, test fails

Context

Related PR #440 (MERGED)

PR #440 specifically addressed this race condition by adding server-side retry in the controller's Dial handler:

  • Up to 10 retry attempts with 300ms delay (~3 second max)
  • Only retries for Available status (transient state during lease setup)
  • Other statuses (Offline, HookFailed) fail immediately

However, the error logs show the Dial is still being rejected — suggesting either:

  1. The server-side retry isn't being triggered
  2. The retry timeout (3s) is insufficient for this scenario
  3. The exporter is stuck in Available status longer than expected
  4. A regression was introduced in PR #535 (Remove opendal dependency from QEMU driver)
  5. The context is being canceled before retries can complete

Expected Behavior

The server-side retry logic from PR #440 should handle the transient Available status by retrying the Dial request until the exporter transitions to LeaseReady (within the 3-second retry window).

Actual Behavior

The Dial requests are rejected repeatedly between roughly 10 and 20 seconds after the lease is acquired, all with the same Available status error. At ~20 seconds, authentication attempts fail with "context canceled", and the connection is then lost.

Questions to Investigate

  1. Is the context deadline too short? The "context canceled" error suggests the context might be timing out before the exporter can transition to LeaseReady
  2. Is the server-side retry logic actually executing? (Check controller logs for retry attempts)
  3. Why is the exporter remaining in Available status for 20+ seconds?
  4. Is the authentication failure a symptom or cause of the connection failure?
  5. Is there a timing issue in the client?
  6. Are multiple concurrent Dial attempts exhausting the context deadline?

Steps to Reproduce

Run the e2e tests on PR #535:

make e2e-run

The test should fail with the Dial rejection errors followed by authentication failures as shown above.
