## Summary
The `can lease and connect to exporters` E2E test fails intermittently on main with `Error: Connection to exporter lost`. The root cause is a race condition in the `listenQueues` sync.Map in `controller_service.go` that can lose router tokens during the Dial/Listen handoff.
## Symptoms

- Test `Core E2E Tests > Lease and connect > can lease and connect to exporters` fails at `e2e/test/e2e_test.go:406`
- Controller logs show repeated `Exporter in Available status, waiting for lease setup` retries (up to 9 attempts over 30 seconds)
- Exporter never transitions from `Available` to `LeaseReady`
- Client eventually gives up: `Error: Connection to exporter lost`
## Evidence on main

- 2026-04-15 upstream run 24468864095 failed with this exact pattern
- 2026-04-16 fork stress test run 24512850927 reproduced it (job 71649063409)
- The previous fix attempt (commit `f473ede1`, "Increase Dial retry timeout from 10s to 30s") only widened the retry window but does not address the underlying race
## Root cause

`Listen()` and `Dial()` in `controller/internal/service/controller_service.go` both use `sync.Map.LoadOrStore` on `listenQueues` to share a buffered channel keyed by lease name. There are two race scenarios:
### Scenario 1: Two readers, one channel

When the exporter's Listen gRPC stream disconnects and reconnects, the new `Listen()` call hits `LoadOrStore`, which returns the existing channel (the old entry was never removed). Both the old goroutine (still exiting) and the new goroutine read from the same channel. When `Dial()` sends a router token, the old goroutine can consume it, attempt `stream.Send()` on its dead gRPC stream, get an error, and discard the token. The new goroutine never sees it.
### Scenario 2: Orphaned queue

If the old `Listen()` goroutine exits before the new one connects, the channel sits in the map with no active reader. `Dial()` sends a token into the buffered channel successfully (no blocking), but nothing ever drains it.

In both cases the exporter never receives the router token, never establishes a connection, and stays in `Available` status indefinitely.
## Vulnerable code

```go
// Listen() -- line 442
queue, _ := s.listenQueues.LoadOrStore(leaseName, make(chan *pb.ListenResponse, 8))
for {
	select {
	case <-ctx.Done():
		return nil // exits without removing queue from map
	case msg := <-queue.(chan *pb.ListenResponse):
		if err := stream.Send(msg); err != nil {
			return err // exits without removing queue from map
		}
	}
}
```

```go
// Dial() -- line 735
queue, _ := s.listenQueues.LoadOrStore(leaseName, make(chan *pb.ListenResponse, 8))
select {
case <-ctx.Done():
	return nil, ctx.Err()
case queue.(chan *pb.ListenResponse) <- response: // sends to potentially stale queue
}
```
The fundamental issue: `Listen()` never cleans up its queue entry on exit, and there is no mechanism to detect or prevent a stale reader from consuming tokens.
## Reproduction

A unit test confirms this is not a rare edge case. With two goroutines reading from the same channel (modeling the old and new Listen), the wrong goroutine consumes the token ~50% of the time:

```
=== RUN   TestListenQueueRace_OverlapStress
    old goroutine stole the token: 2510 / 5000 (50.2%)
    new goroutine got the token:   2490 / 5000 (49.8%)
    RACE CONFIRMED: in 50.2% of iterations the dying Listen goroutine
    consumed the Dial token before the live one could
--- PASS: TestListenQueueRace_OverlapStress (3.78s)
```

The test file is at `controller/internal/service/listen_queue_race_test.go` on the `fix-listen-queue-race` branch.
## Related
- `f473ede1` increased the Dial retry timeout as a workaround but does not fix the race