e2e: jmp shell fails with "Connection to exporter lost" after ~20s waiting for ready connection #425

@ambient-code

Summary

The e2e tests "can lease and connect to exporters" (test 47) and "can lease and connect to exporters by name" (test 48) fail intermittently with "Error: Connection to exporter lost". The client successfully acquires a lease but then times out waiting for the ready connection on the Unix socket. It never reaches beforeLease hook completion or LEASE_READY status monitoring on the client side.

Failing CI Run

Reproduction Timeline (test 47)

08:23:01  INFO  [jumpstarter.client.lease] Acquiring lease 019d6c30-0a9b-7b81-bc30-cbd918008be8
08:23:01  INFO  Lease acquired successfully! (0:00:00)
08:23:01  INFO  Waiting for ready connection at /run/user/1001/jumpstarter-nkjy63wk/socket
           ← 20 seconds of silence — no beforeLease hook log, no status_monitor update on the client side
08:23:21  INFO  Releasing Lease 019d6c30-0a9b-7b81-bc30-cbd918008be8
           Error: Connection to exporter lost

The same pattern repeats for test 48 (jmp shell --client test-client-oidc --name test-exporter-oidc j power on), also dying after ~20s with the same error.
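
To confirm the ~20s figure from a captured client log rather than by eye, the gap can be measured with a small script. This is only a sketch: the helper name `silent_gap` is hypothetical, and it assumes every relevant log line starts with an `HH:MM:SS` timestamp as in the timeline above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Client log lines copied from the test 47 timeline in this issue.
CLIENT_LOG = """\
08:23:01  INFO  [jumpstarter.client.lease] Acquiring lease 019d6c30-0a9b-7b81-bc30-cbd918008be8
08:23:01  INFO  Lease acquired successfully! (0:00:00)
08:23:01  INFO  Waiting for ready connection at /run/user/1001/jumpstarter-nkjy63wk/socket
08:23:21  INFO  Releasing Lease 019d6c30-0a9b-7b81-bc30-cbd918008be8
           Error: Connection to exporter lost
"""


def silent_gap(log_text: str) -> Optional[timedelta]:
    """Time between 'Waiting for ready connection' and 'Releasing Lease'.

    Returns None if either line is missing. Lines without a leading
    HH:MM:SS timestamp (such as the final error line) are skipped.
    """
    waiting = None
    for line in log_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) < 2:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S")
        except ValueError:
            continue  # not a timestamped line
        if waiting is None and "Waiting for ready connection" in parts[1]:
            waiting = stamp
        elif waiting is not None and "Releasing Lease" in parts[1]:
            return stamp - waiting
    return None


print(silent_gap(CLIENT_LOG))  # prints 0:00:20
```

Running the same helper over the test 48 client log would show whether its gap is also exactly 20s, which would point toward a fixed timeout rather than a variable delay.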

Observations

  1. The exporter side appears healthy: the exporter logs from this test show it had been successfully handling previous leases (sessions created, power on commands executed, sessions closed cleanly).

  2. No beforeLease hook activity on the client: in passing runs, the client logs show "Waiting for beforeLease hook to complete..." followed by "Status changed: None -> LEASE_READY". In the failing run, neither message appears; the client goes straight from "Waiting for ready connection" to "Releasing Lease" after ~20s.

  3. No exporter-side log entry for the failing lease: the exporter logs dumped on failure don't show a "Starting new lease: 019d6c30-0a9b..." entry, suggesting the exporter never received or processed the lease assignment for this specific lease.

  4. Flaky, not deterministic: the re-run (run 24125318534) passed all 52 tests on the same commit, suggesting a race condition or transient infrastructure issue.
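
Observations 2 and 3 can be turned into a quick triage check for future CI failures. The sketch below is hypothetical tooling, not part of the e2e suite. The function names are made up, and it assumes the log messages quoted above appear verbatim and that the exporter's "Starting new lease:" entry is followed by the full lease ID.

```python
def classify_client_log(log_text: str) -> str:
    """Classify a client log by the messages quoted in this issue."""
    waiting = "Waiting for ready connection" in log_text
    hook = "Waiting for beforeLease hook to complete" in log_text
    ready = "LEASE_READY" in log_text
    lost = "Connection to exporter lost" in log_text
    if waiting and lost and not hook and not ready:
        # Same signature as this issue: no hook or status messages at all.
        return "matches-issue"
    if hook and ready:
        return "reached-lease-ready"
    return "unknown"


def exporter_saw_lease(exporter_log: str, lease_id: str) -> bool:
    """True if the exporter logged that it started the given lease.

    Assumes the entry has the form 'Starting new lease: <lease id>'.
    """
    return f"Starting new lease: {lease_id}" in exporter_log


failing_client_log = (
    "Waiting for ready connection at /run/user/1001/jumpstarter-nkjy63wk/socket\n"
    "Releasing Lease 019d6c30-0a9b-7b81-bc30-cbd918008be8\n"
    "Error: Connection to exporter lost\n"
)
print(classify_client_log(failing_client_log))
```

If a failure is classified as "matches-issue" and `exporter_saw_lease` is also false for the same lease ID, the run shows both symptoms described here.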

Possible Root Causes

  • Race condition in lease routing: The controller assigned the lease, but the exporter hadn't fully re-registered after the previous lease teardown, causing the router to fail to connect the client to the exporter.
  • Socket readiness timeout: The client may have a hardcoded ~20s timeout waiting for the Unix socket to become ready, and the exporter-side session setup took too long or never started.
  • Router/controller propagation delay: The lease was marked Ready in k8s but the router hadn't yet updated its routing table for the new lease.
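
To make the "socket readiness timeout" hypothesis concrete, here is a generic sketch of a bounded wait on a Unix socket. It is not Jumpstarter's actual implementation, which this issue has not inspected. The function name `wait_for_unix_socket`, the polling approach, and the 20s default are all assumptions chosen to mirror the observed behaviour.

```python
import os
import socket
import time


def wait_for_unix_socket(path: str, timeout: float = 20.0, interval: float = 0.1) -> bool:
    """Poll until a Unix stream socket at `path` accepts a connection.

    Returns True once a connect succeeds, False if `timeout` seconds
    pass first. The 20s default is an assumption based on the gap seen
    in the failing run, not a value taken from the client code.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as probe:
                try:
                    probe.connect(path)
                    return True
                except OSError:
                    pass  # socket file exists but nothing is listening yet
        time.sleep(interval)
    return False
```

If the client has a wait of this shape, a failure where the exporter never starts the lease would surface exactly as seen here: silence until the deadline, then lease release. Making such a timeout configurable, or logging when it expires, would help separate this cause from the routing hypotheses.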

Environment

  • Runner: ubuntu-24.04 (x86_64)
  • Test file: e2e/tests.bats, lines 471-484
