Ignore tsan warning due to the rocm barrier #618

Open
alekstheod wants to merge 6 commits into rocm-jaxlib-v0.8.0 from suppress_known_tsan_warning

@alekstheod (Collaborator) commented Feb 4, 2026

Cursor's reasoning for this error:

  The callback eventually goes through:
  1. DoHostCallback() → registers via hipLaunchHostFunc (ROCm/HIP API)
  2. GPU runtime executes the host callback on its thread
  3. Host callback calls callback_thread_->Schedule()
  4. Worker thread picks up the task and calls SetStateConcrete()
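
  The four steps above can be modelled in plain C++ to make the thread hops explicit. This is only a sketch: `FakeLaunchHostFunc` and `FakeSchedule` are hypothetical stand-ins for `hipLaunchHostFunc` and `callback_thread_->Schedule()`, and they use `std::thread`, whose creation and join TSAN does understand. In the real path the hop in step 2 happens inside the GPU runtime, which is where TSAN loses track of the ordering.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for the GPU runtime (step 2): runs the host callback
// on a separate "runtime" thread. The real hipLaunchHostFunc enqueues the
// function on a stream instead; only the thread hop is modelled here.
inline void FakeLaunchHostFunc(std::function<void()> host_fn) {
  std::thread runtime_thread(std::move(host_fn));
  runtime_thread.join();
}

// Hypothetical stand-in for callback_thread_->Schedule() (step 3): hands the
// task to a worker thread.
inline void FakeSchedule(std::function<void()> task) {
  std::thread worker(std::move(task));
  worker.join();
}

// Walks the chain caller -> runtime thread -> worker thread and returns the
// value the worker observed (step 4, where SetStateConcrete() would run).
inline int RunCallbackChain() {
  int payload = 42;  // state constructed on the calling thread
  std::promise<int> observed;
  std::future<int> result = observed.get_future();
  FakeLaunchHostFunc([&] {
    FakeSchedule([&] { observed.set_value(payload); });
  });
  return result.get();
}
```

  Calling `RunCallbackChain()` returns the payload written by the calling thread, and every hop in this model is ordered by thread creation, so TSAN would have nothing to report here.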


  The Problem

  TSAN doesn't understand GPU callback synchronization.
  The HIP/CUDA LaunchHostFunc APIs use internal synchronization mechanisms that aren't visible to TSAN. While the code is logically correct (construction happens-before DoHostCallback is called, and the callback only runs after registration), TSAN can't see the synchronization that the GPU runtime provides.
  This is essentially a false positive from TSAN's perspective because:
  • The AsyncValue is fully constructed before DoHostCallback() returns
  • The callback captures a reference-counted pointer (AsyncValueRef), keeping the object alive
  • The GPU runtime must synchronize to execute the callback with its captured data

  However, TSAN only tracks synchronization it can instrument or intercept (C++ atomics, mutexes, thread creation and join, and explicit annotations), not GPU driver-level synchronization.

  Possible Fixes

  1. Add TSAN annotations to suppress this known false positive for GPU callbacks
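  2. Add explicit C++ synchronization visible to TSAN, such as a release store to an atomic flag before DoHostCallback and an acquire load of that flag at callback start. A bare pair of fences is not enough: fences only synchronize through an atomic object that one thread writes and the other reads, and TSAN's support for std::atomic_thread_fence is limited.


  // In ThenExecuteCallback, before DoHostCallback
  // (ready is a hypothetical std::atomic<bool> shared with the callback):
  ready.store(true, std::memory_order_release);
  // In the callback, before accessing shared state:
  (void)ready.load(std::memory_order_acquire);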

  3. Use a TSAN suppression file (passed at run time via TSAN_OPTIONS=suppressions=<path>) to ignore this specific pattern:


  race:tsl::AsyncValue::NotifyAvailable
  race:DoHostCallbackWithStatus

  The most pragmatic solution is likely option 3 (suppression) since this is a known limitation of TSAN with GPU runtimes, and the code logic is correct.
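
  For comparison, option 2 can be sketched as a release/acquire handoff through an atomic flag, which gives TSAN a happens-before edge it can track. The names `SharedState`, `Publish`, and `Consume` are hypothetical and do not come from the XLA code; the consumer thread plays the role of the GPU callback.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical shared state: a plain field guarded by an atomic flag.
struct SharedState {
  int value = 0;
  std::atomic<bool> published{false};
};

// Producer side: write the plain field, then publish with a release store.
inline void Publish(SharedState& s, int v) {
  s.value = v;
  s.published.store(true, std::memory_order_release);
}

// Consumer side: wait until the acquire load observes the release store.
// After that, the plain read of `value` is ordered after the write.
inline int Consume(SharedState& s) {
  while (!s.published.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return s.value;
}

// Runs the consumer on another thread, in place of the callback thread.
inline int PublishAndConsumeAcrossThreads() {
  SharedState s;
  int seen = 0;
  std::thread consumer([&] { seen = Consume(s); });
  Publish(s, 7);
  consumer.join();
  return seen;
}
```

  The cost of this approach is an extra atomic per callback in production code purely to satisfy the tool, which is one reason a suppression entry may be preferable.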

Second ignore:

  Root Cause

  The callback scheduling goes through RunCallbackOnStream:

   xla/pjrt/gpu/se_gpu_pjrt_client.cc lines 146-158

  absl::Status RunCallbackOnStream(se::Stream* stream,
                                   tsl::thread::ThreadPool* thread_pool,
                                   absl::AnyInvocable<void() &&> callback) {
    return stream->DoHostCallbackWithStatus(
        [cb = std::move(callback), thread_pool]() mutable {
          thread_pool->Schedule(
              [cb_ptr = new absl::AnyInvocable<void() &&>(std::move(cb))]() {
                std::move(*cb_ptr)();
                delete cb_ptr;
              });
          return absl::OkStatus();
        });
  }

  The lambda is registered via DoHostCallbackWithStatus() on a GPU stream. When GPU operations complete, the callback runs on a GPU runtime thread and schedules the actual work on a thread pool.
  The problem: GPU host callbacks (CUDA/HIP) synchronize GPU operations but don't necessarily establish C++ memory visibility (happens-before relationships) that TSAN can recognize. TSAN doesn't understand the GPU runtime's synchronization mechanisms, so it reports a race even though the memory accesses are likely properly ordered by the stream callback mechanism.
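
  The `new` / `delete` pair inside RunCallbackOnStream is easier to follow in isolation. The sketch below assumes the reason for it is that the scheduling API takes a copyable `std::function<void()>`, which cannot hold a move-only callable directly; that assumption is not stated in the snippet above. `MoveOnlyTask` and `WrapForCopyableQueue` are hypothetical names, with `MoveOnlyTask` standing in for `absl::AnyInvocable<void() &&>`.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical move-only callable that may only be invoked as an rvalue,
// mirroring the `void() &&` signature of the original callback.
struct MoveOnlyTask {
  std::unique_ptr<int> input;  // makes the type move-only
  int* out;
  void operator()() && { *out = *input + 1; }
};

// Moves the task to the heap and captures only the raw pointer, so the
// resulting lambda is copyable and fits into std::function<void()>.
// The wrapper must be invoked exactly once: never invoking it leaks the
// task, and invoking it twice would use and delete a dangling pointer.
inline std::function<void()> WrapForCopyableQueue(MoveOnlyTask task) {
  MoveOnlyTask* task_ptr = new MoveOnlyTask(std::move(task));
  return [task_ptr]() {
    std::move(*task_ptr)();
    delete task_ptr;
  };
}

// Builds a task, wraps it, runs it once, and returns what it wrote.
inline int RunWrappedOnce() {
  int result = 0;
  MoveOnlyTask task{std::make_unique<int>(41), &result};
  std::function<void()> fn = WrapForCopyableQueue(std::move(task));
  fn();
  return result;
}
```

  None of this adds synchronization; it only explains the ownership handoff, so the ordering between the stream callback and the thread pool task still depends on what the runtime and `Schedule()` provide.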

  Verdict

  This is most likely a false positive caused by TSAN not understanding the GPU stream callback synchronization. The DoHostCallbackWithStatus ensures the callback runs after the stream operations complete, which implicitly orders the accesses, but TSAN can't see through the GPU runtime internals.
  You can suppress this with a TSAN suppression file or by adding TSAN annotations to mark the synchronization points. Given your branch name suppress_known_tsan_warning, you might want to add an entry like:

  race:xla::EventPool::Handle::Handle

  to a TSAN suppressions file.
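
  The annotation alternative mentioned above could look roughly like the following. This is a sketch under assumptions: the `SKETCH_` macros are hypothetical names, and the code relies on the `__tsan_release` / `__tsan_acquire` entry points that the TSAN runtime declares in `<sanitizer/tsan_interface.h>`. In builds without TSAN the macros compile to nothing.

```cpp
#include <cassert>
#include <thread>

// Detect a TSAN build with either Clang (__has_feature) or GCC
// (__SANITIZE_THREAD__).
#if defined(__has_feature)
#  if __has_feature(thread_sanitizer)
#    define SKETCH_TSAN_ENABLED 1
#  endif
#endif
#if defined(__SANITIZE_THREAD__) && !defined(SKETCH_TSAN_ENABLED)
#  define SKETCH_TSAN_ENABLED 1
#endif

#ifdef SKETCH_TSAN_ENABLED
// Declared by the TSAN runtime; see <sanitizer/tsan_interface.h>.
extern "C" void __tsan_acquire(void* addr);
extern "C" void __tsan_release(void* addr);
#  define SKETCH_HAPPENS_BEFORE(addr) __tsan_release(addr)
#  define SKETCH_HAPPENS_AFTER(addr) __tsan_acquire(addr)
#else
#  define SKETCH_HAPPENS_BEFORE(addr) ((void)(addr))
#  define SKETCH_HAPPENS_AFTER(addr) ((void)(addr))
#endif

// Marks a handoff of `payload` from the calling thread to another thread.
// In the real code the release annotation would sit before the host
// callback is registered and the acquire annotation at the start of the
// callback body.
inline int AnnotatedHandoff() {
  int payload = 5;
  SKETCH_HAPPENS_BEFORE(&payload);
  int seen = 0;
  std::thread consumer([&] {
    SKETCH_HAPPENS_AFTER(&payload);
    seen = payload;
  });
  consumer.join();
  return seen;
}
```

  Annotations only tell TSAN that an ordering exists; they do not create one, so they are appropriate only when the GPU runtime really does guarantee the ordering.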

@i-chaochen (Collaborator) commented

Is this PR still needed?


2 participants