Skip to content

fix(remoting): use atomic.Bool for ExchangeClient.init to fix data race#3438

Open
Aias00 wants to merge 5 commits into
apache:developfrom
Aias00:worktree-fix-exchange-client-init-race
Open

fix(remoting): use atomic.Bool for ExchangeClient.init to fix data race#3438
Aias00 wants to merge 5 commits into
apache:developfrom
Aias00:worktree-fix-exchange-client-init-race

Conversation

@Aias00

@Aias00 Aias00 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Problem

ExchangeClient.init is a plain bool field with a // FIXME atomic operation comment acknowledging the race. It is written in doInit/Close and read (indirectly via doInit) in Request/AsyncRequest/Send without any synchronization.

Race scenarios:

  • Multiple goroutines concurrently call Request/AsyncRequest/Send → all enter doInit → all see init == falseConnect is called multiple times
  • Close() writes init = false concurrently with doInit reading/writing init → memory visibility issue

Fix

Replace init bool with uatomic.Bool (already imported in this file as go.uber.org/atomic), consistent with the existing activeNum uatomic.Uint32 field in the same struct:

Change Detail
init boolinit uatomic.Bool Same package as activeNum, already imported
doInit check-then-set → CAS(false, true) Atomic compare-and-swap ensures only one goroutine performs Connect
CAS failure → spin-wait Goroutines that lose the CAS wait for init to complete instead of skipping or duplicating
Connect failure → Store(false) Reset flag on failure so future calls can retry (important behavior missing in original code)
CloseStore(false) Atomic write ensures visibility

This follows the same pattern used by BaseInvoker (uatomic.Bool), BaseClusterInvoker (atomic.Bool with CAS), and Directory (atomic.Bool with CAS + mutex) elsewhere in the codebase.

Tests

  • TestExchangeClientConcurrentDoInit: 20 goroutines concurrently call doInit — verifies Connect is called exactly once
  • TestExchangeClientDoInitFailureResets: verifies failed doInit resets the init flag and retry succeeds
  • All existing tests pass with -race flag enabled

🤖 Generated with Claude Code

The init field was a plain bool with a FIXME comment acknowledging the
race. It is written in doInit/Close and read in Request/AsyncRequest/Send
without synchronization. This causes a data race when multiple goroutines
concurrently trigger lazy initialization.

Replace init bool with uatomic.Bool (already imported in this file) and:
- Use CAS in doInit to ensure only one goroutine performs Connect
- Spin-wait for in-flight initialization to complete
- Reset init flag on Connect failure so retries can succeed
- Use atomic Store in Close

This follows the same pattern used by BaseInvoker (uatomic.Bool),
BaseClusterInvoker (atomic.Bool with CAS), and Directory
(atomic.Bool with CAS + mutex) elsewhere in the codebase.

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 17, 2026 07:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR aims to make ExchangeClient initialization safe under concurrent access by switching the init flag to an atomic type and adding tests for concurrent initialization and retry behavior.

Changes:

  • Replace ExchangeClient.init from bool to uatomic.Bool and use atomic operations in doInit/Close.
  • Add a concurrency test to ensure only one Connect occurs under concurrent doInit calls.
  • Add a failure/retry test to ensure a failed doInit resets state and can be retried.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
remoting/exchange_client.go Converts init flag to atomic and adds CAS-based coordination for doInit, plus reset on failure/close.
remoting/exchange_client_test.go Updates existing init assertion and adds concurrent-init + failure-retry tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 84 to 108
func (cl *ExchangeClient) doInit(url *common.URL) error {
if cl.init {
if cl.init.Load() {
return nil
}
// Use CompareAndSwap to ensure only one goroutine performs initialization.
// If CAS fails, another goroutine already initialized; spin-wait until init completes.
if !cl.init.CAS(false, true) {
// Another goroutine is initializing; wait for it to complete.
for !cl.init.Load() {
time.Sleep(10 * time.Millisecond)
}
return nil
}
// This goroutine won the CAS — perform the actual initialization.
if cl.client.Connect(url) != nil {
// retry for a while
time.Sleep(100 * time.Millisecond)
if cl.client.Connect(url) != nil {
logger.Errorf("[Remoting] failed to connect server, url=%v", url.Location)
cl.init.Store(false) // reset on failure so future calls can retry
return errors.New("Failed to connect server " + url.Location)
}
}
// FIXME atomic operation
cl.init = true
return nil
}
Comment thread remoting/exchange_client.go Outdated
Comment on lines +92 to +94
for !cl.init.Load() {
time.Sleep(10 * time.Millisecond)
}
Comment on lines +116 to +124
for i := 0; i < goroutines; i++ {
go func() {
defer wg.Done()
// All goroutines trigger lazy init concurrently via Request-like paths
if err := ec.doInit(testURL()); err != nil {
t.Errorf("doInit failed: %v", err)
}
}()
}
Address Copilot review feedback:
- Replace uatomic.Bool with uatomic.Int32 for 3-state init tracking
  (0=uninitialized, 1=initializing, 2=initialized) to prevent callers
  from observing a partially-initialized state
- Replace spin-wait with sync.Cond for proper coordination: waiters
  block until signaled instead of polling
- All concurrent waiters now receive the same error result when init
  fails, rather than silently returning nil
- Close() properly resets state and broadcasts to wake any waiters
- Fix test: use buffered channel for concurrent error collection instead
  of calling t.Errorf from multiple goroutines

Co-Authored-By: Claude <noreply@anthropic.com>
@Aias00 Aias00 force-pushed the worktree-fix-exchange-client-init-race branch from 2f7d6cb to 8bb4ab5 Compare June 17, 2026 08:15
@Aias00

Aias00 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Addressed Copilot Review Feedback

All three review comments have been addressed in the latest commit (8bb4ab5):

1. Three-state atomic instead of two-state bool

Before: uatomic.Bool with CAS(false→true) before Connect, causing waiters to see init==true before initialization completes.

After: Replaced with uatomic.Int32 using three states:

  • 0 = uninitialized
  • 1 = initializing
  • 2 = initialized

CAS now transitions 0→1 (not 0→2), so waiters never observe a "ready" state until Connect actually succeeds. On failure, state resets to 0 for retry.

2. sync.Cond instead of spin-wait

Before: for !cl.init.Load() { time.Sleep(10ms) } — polling with fixed sleep.

After: sync.Cond with Broadcast() when initialization completes (success or failure). Waiters call cl.initCond.Wait() and are woken immediately when the initializing goroutine finishes, eliminating both latency and CPU waste.

3. Test: buffered channel instead of t.Errorf from goroutines

Before: t.Errorf called directly from concurrent goroutines.

After: Errors are collected into a buffered channel and asserted from the main test goroutine after wg.Wait().

Additionally, all concurrent waiters now receive the same error result when initialization fails (previously they silently returned nil).

Aias00 and others added 2 commits June 17, 2026 16:20
testifylint requires require.Error/require.NoError instead of
assert.Error/assert.NoError for error assertions.

Co-Authored-By: Claude <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 53.60%. Comparing base (60d1c2a) to head (68ccd9a).
⚠️ Report is 861 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3438      +/-   ##
===========================================
+ Coverage    46.76%   53.60%   +6.83%     
===========================================
  Files          295      493     +198     
  Lines        17172    38411   +21239     
===========================================
+ Hits          8031    20589   +12558     
- Misses        8287    16171    +7884     
- Partials       854     1651     +797     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants