
Flaky E2E: listen queue race causes 'Connection to exporter lost' in Dial/Listen handoff #572

@raballew

Description

Summary

The "can lease and connect to exporters" E2E test fails intermittently on main with Error: Connection to exporter lost. The root cause is a race condition on the listenQueues sync.Map in controller_service.go that can lose router tokens during the Dial/Listen handoff.

Symptoms

  • Test Core E2E Tests > Lease and connect > can lease and connect to exporters fails at e2e/test/e2e_test.go:406
  • Controller logs show repeated "Exporter in Available status, waiting for lease setup" retries (up to 9 attempts over 30 seconds)
  • Exporter never transitions from Available to LeaseReady
  • Client eventually gives up: Error: Connection to exporter lost

Evidence on main

  • 2026-04-15 upstream run 24468864095 failed with this exact pattern
  • 2026-04-16 fork stress test run 24512850927 reproduced it (job 71649063409)
  • The previous fix attempt (commit f473ede1, "Increase Dial retry timeout from 10s to 30s") only widened the retry window but does not address the underlying race

Root cause

Listen() and Dial() in controller/internal/service/controller_service.go both use sync.Map.LoadOrStore on listenQueues to share a buffered channel keyed by lease name. There are two race scenarios:

Scenario 1: Two readers, one channel

When the exporter's Listen gRPC stream disconnects and reconnects, the new Listen() call hits LoadOrStore, which returns the existing channel (the old entry was never removed). Both the old goroutine (still exiting) and the new goroutine now read from the same channel. When Dial sends a router token, the old goroutine can consume it, attempt stream.Send() on its dead gRPC stream, hit an error, and discard the token. The new goroutine never sees it.

Scenario 2: Orphaned queue

If the old Listen() goroutine exits before the new one connects, the channel sits in the map with no active reader. Dial() sends a token into the buffered channel successfully (no blocking), but nothing ever drains it.

In both cases the exporter never receives the router token, never establishes a connection, and stays in Available status indefinitely.

Vulnerable code

// Listen() -- line 442
queue, _ := s.listenQueues.LoadOrStore(leaseName, make(chan *pb.ListenResponse, 8))
for {
    select {
    case <-ctx.Done():
        return nil  // exits without removing queue from map
    case msg := <-queue.(chan *pb.ListenResponse):
        if err := stream.Send(msg); err != nil {
            return err  // exits without removing queue from map
        }
    }
}

// Dial() -- line 735
queue, _ := s.listenQueues.LoadOrStore(leaseName, make(chan *pb.ListenResponse, 8))
select {
case <-ctx.Done():
    return nil, ctx.Err()
case queue.(chan *pb.ListenResponse) <- response:  // sends to potentially stale queue
}

The fundamental issue: Listen() never cleans up its queue entry on exit, and there is no mechanism to detect or prevent a stale reader from consuming tokens.

Reproduction

A unit test confirms this is not a rare edge case. With two goroutines reading from the same channel (modeling old + new Listen), the wrong goroutine consumes the token ~50% of the time:

=== RUN   TestListenQueueRace_OverlapStress
    old goroutine stole the token: 2510 / 5000 (50.2%)
    new goroutine got the token:   2490 / 5000 (49.8%)
    RACE CONFIRMED: in 50.2% of iterations the dying Listen goroutine
    consumed the Dial token before the live one could
--- PASS: TestListenQueueRace_OverlapStress (3.78s)

The test file is at controller/internal/service/listen_queue_race_test.go on the fix-listen-queue-race branch.
