Skip to content

fix(cluster): eliminate data race on StickyInvoker with atomic.Value#3439

Open
Aias00 wants to merge 7 commits into
apache:developfrom
Aias00:worktree-fix-sticky-invoker-race
Open

fix(cluster): eliminate data race on StickyInvoker with atomic.Value#3439
Aias00 wants to merge 7 commits into
apache:developfrom
Aias00:worktree-fix-sticky-invoker-race

Conversation

@Aias00

@Aias00 Aias00 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Problem

StickyInvoker in BaseClusterInvoker was a plain base.Invoker field read and written by multiple goroutines without any synchronization, causing a data race on the RPC hot path.

Race details

  • Write: DoSelect() sets StickyInvoker = nil (line 104) and StickyInvoker = selectedInvoker (line 115)
  • Read: DoSelect() reads StickyInvoker multiple times (lines 103, 108-110), and IsAvailable() reads it (lines 65-66)
  • Concurrent scenario: failbackClusterInvoker calls DoSelect() from both the main Invoke() goroutine (line 143) and the retry goroutine spawned via go invoker.tryTimerTaskProc() (line 118, calling DoSelect at line 85). Even without failback, concurrent Invoke() calls on the same invoker would also race.

Fix

Replace the unexported StickyInvoker field with sync/atomic.Value, accessed via getStickyInvoker()/setStickyInvoker() helpers:

Change Description
StickyInvoker base.InvokerstickyInvoker atomic.Value Field uses atomic.Value, unexported to prevent direct access
Add getStickyInvoker() / setStickyInvoker() Lock-free read/write wrapper methods
IsAvailable() uses getStickyInvoker() Eliminates read race
DoSelect() uses local variable + atomic accessors Eliminates read/write race; local variable ensures consistent logic within one call
Add TestStickyConcurrentDoSelect 100 goroutines concurrently call DoSelect with sticky=true
Add TestStickyConcurrentIsAvailableAndDoSelect 100 goroutines concurrently read sticky invoker while 100 goroutines call DoSelect

Both new tests pass with go test -race.

Semantic note

The sticky invoker mechanism is inherently "best-effort" — concurrent goroutines may select different invokers in the same call, but this is acceptable since sticky is merely a preference, not a correctness guarantee. The atomic.Value approach eliminates the undefined behavior (Go data race) while preserving the sticky semantics.

Test results

$ go test ./cluster/cluster/base/... -race -count=1 -v
=== RUN   TestStickyNormal
--- PASS
=== RUN   TestStickyNormalWhenError
--- PASS
=== RUN   TestStickyConcurrentDoSelect
--- PASS
=== RUN   TestStickyConcurrentIsAvailableAndDoSelect
--- PASS
PASS

All other cluster package tests also pass with -race (the pre-existing TestFailbackOutOfLimit race on the ticker field in failback is unrelated).

StickyInvoker in BaseClusterInvoker was a plain field read/written by
multiple goroutines (e.g. failback retry goroutine + Invoke goroutine)
without any synchronization, causing a data race on the RPC hot path.

Replace the unexported StickyInvoker field with an atomic.Value,
accessed via getStickyInvoker()/setStickyInvoker() helpers. This
eliminates the race while keeping the lock-free read path fast.

Key changes:
- StickyInvoker base.Invoker -> stickyInvoker atomic.Value (unexported)
- Add getStickyInvoker() / setStickyInvoker() helpers
- Update IsAvailable() and DoSelect() to use atomic accessors
- DoSelect reads sticky invoker into a local variable to avoid
  redundant atomic loads and ensure consistent logic within one call
- Add concurrent tests (TestStickyConcurrentDoSelect,
  TestStickyConcurrentIsAvailableAndDoSelect) that pass with -race

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 17, 2026 07:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR aims to make BaseClusterInvoker’s sticky invoker selection safe under concurrent access and adds race-focused tests around sticky selection.

Changes:

  • Replaced the StickyInvoker field with an atomic-backed storage and added getter/setter helpers.
  • Updated IsAvailable and DoSelect to use the atomic sticky invoker accessors.
  • Added concurrent tests intended to surface data races in sticky invoker selection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File Description
cluster/cluster/base/cluster_invoker.go Switches sticky invoker storage to an atomic container and updates call sites to use it.
cluster/cluster/base/cluster_invoker_test.go Adds concurrency tests intended to detect race conditions around sticky selection.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cluster/cluster/base/cluster_invoker.go Outdated
Comment thread cluster/cluster/base/cluster_invoker.go Outdated
Comment thread cluster/cluster/base/cluster_invoker.go Outdated
Comment on lines 131 to 134
selectedInvoker = invoker.doSelectInvoker(lb, invocation, invokers, invoked)
if sticky {
invoker.StickyInvoker = selectedInvoker
invoker.setStickyInvoker(selectedInvoker)
}
Comment thread cluster/cluster/base/cluster_invoker.go Outdated
Comment thread cluster/cluster/base/cluster_invoker_test.go Outdated
Comment on lines +110 to +111
// TestStickyConcurrentIsAvailableAndDoSelect verifies that concurrent
// IsAvailable and DoSelect calls do not cause a data race on StickyInvoker.
Comment thread cluster/cluster/base/cluster_invoker_test.go Outdated
…and type mismatch panics

Address Copilot review feedback:
1. atomic.Value.Store(nil) panics - wrap base.Invoker in stickyInvokerWrapper
   struct so nil invoker is stored as {invoker: nil} (non-nil wrapper)
2. atomic.Value requires consistent concrete type - the wrapper struct
   ensures the same type is always stored regardless of Invoker impl
3. Fix test TestStickyConcurrentIsAvailableAndDoSelect to actually call
   IsAvailable() with a proper mock Directory (instead of getStickyInvoker)
4. Fix imports formatting per dubbo-go convention (separate import groups)

Co-Authored-By: Claude <noreply@anthropic.com>
@Aias00

Aias00 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

Addressing Copilot Review Feedback

Thanks for the thorough review! All three issues have been fixed in the latest push:

1. atomic.Value.Store(nil) panic ✅ Fixed

Problem: atomic.Value.Store(nil) panics at runtime.

Fix: Introduced stickyInvokerWrapper struct to wrap base.Invoker:

type stickyInvokerWrapper struct {
    invoker base.Invoker
}

Now clearing the sticky invoker stores stickyInvokerWrapper{invoker: nil} (non-nil wrapper with nil inner value), which is safe.

2. Inconsistent concrete type panic ✅ Fixed

Problem: atomic.Value requires all stored values to have the same concrete dynamic type. Different base.Invoker implementations would cause a panic.

Fix: The stickyInvokerWrapper type is always the same regardless of the concrete base.Invoker implementation stored inside.

3. Test calling getStickyInvoker() instead of IsAvailable() ✅ Fixed

Problem: TestStickyConcurrentIsAvailableAndDoSelect called the unexported getStickyInvoker() instead of the public API IsAvailable().

Fix: Test now uses NewBaseClusterInvoker() with a proper mockDirectory implementation so IsAvailable() works end-to-end without panicking on nil Directory.

4. Import formatting ✅ Fixed

Ran make fmt (imports-formatter) to match project convention (separate import groups for stdlib, third-party, and internal packages).

CI Notes

  • Check Code Format: Should pass now after running make fmt
  • Integration Test failure: Pre-existing infrastructure timeout, not related to this change

Co-Authored-By: Claude <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.75%. Comparing base (60d1c2a) to head (d506e66).
⚠️ Report is 861 commits behind head on develop.

Files with missing lines Patch % Lines
cluster/cluster/base/cluster_invoker.go 76.92% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3439      +/-   ##
===========================================
+ Coverage    46.76%   53.75%   +6.99%     
===========================================
  Files          295      491     +196     
  Lines        17172    37824   +20652     
===========================================
+ Hits          8031    20334   +12303     
- Misses        8287    15857    +7570     
- Partials       854     1633     +779     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alanxtl Alanxtl added ☢️ Bug 3.3.2 version 3.3.2 labels Jun 18, 2026
@Alanxtl

Alanxtl commented Jun 18, 2026

Copy link
Copy Markdown
Member

这版把一个很小的 sticky 状态抽成了 atomic.Value + stickyInvokerWrapper + getter/setter,太java了,不太像 Go。这里其实只需要一个并发安全的指针状态,建议直接用 atomic.Pointer ,这里的 holder 只是为了配合 atomic,业务语义上没必要。要么直接用更轻的并发原语,要么把 holder 压到只剩一个字段,别再往上叠 getter/setter 了。

别为了适配 atomic.Value 再多包一层。顺手也建议把 BaseClusterInvoker 的使用方式收敛成只按指针持有,避免后面被值拷贝埋坑。

@Aias00

Aias00 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

这版把一个很小的 sticky 状态抽成了 atomic.Value + stickyInvokerWrapper + getter/setter,太java了,不太像 Go。这里其实只需要一个并发安全的指针状态,建议直接用 atomic.Pointer ,这里的 holder 只是为了配合 atomic,业务语义上没必要。要么直接用更轻的并发原语,要么把 holder 压到只剩一个字段,别再往上叠 getter/setter 了。

别为了适配 atomic.Value 再多包一层。顺手也建议把 BaseClusterInvoker 的使用方式收敛成只按指针持有,避免后面被值拷贝埋坑。

好 我处理下

Aias00 and others added 2 commits June 18, 2026 13:28
…ClusterInvoker from constructor

- Replace atomic.Value + stickyInvokerWrapper + getter/setter with
  atomic.Pointer[base.Invoker] — idiomatic Go, nil-safe, type-safe,
  no wrapper needed
- NewBaseClusterInvoker now returns *BaseClusterInvoker (pointer),
  preventing accidental value copies of the struct containing atomic
  fields
- All 9 cluster implementations updated to use pointer embedding
  (*base.BaseClusterInvoker) to match the new constructor return type
- StickyInvoker field remains exported (atomic.Pointer is safe for
  concurrent access by design)

Co-Authored-By: Claude <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.3.2 version 3.3.2 ☢️ Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants