Skip to content

Fix stream race in dispatch/combine with deterministic algorithms#600

Open
ccs668899 wants to merge 1 commit intodeepseek-ai:mainfrom
ccs668899:fix/stream-race-nan-deterministic
Open

Fix stream race in dispatch/combine with deterministic algorithms#600
ccs668899 wants to merge 1 commit intodeepseek-ai:mainfrom
ccs668899:fix/stream-race-nan-deterministic

Conversation

@ccs668899
Copy link
Copy Markdown

The cause of #589 so far as I can tell is as follows in deep_ep.cpp:

// The comm stream waits for everything on the compute stream.
stream_wait(comm_stream, compute_stream)

// The output tensor is allocated, but in deterministic mode a NaN-fill is enqueued on the compute stream.
torch::empty(recv_x)

// The communication kernel writes recv_x on to the comm stream.
intranode::combine(..., recv_x, ..., comm_stream)

Now this could be fixed by putting the stream wait until after the empty allocation, but we have the comment "NOTES: do not allocate tensors upfront!". I don't want to mess up the stream ownership so I just add a second sync stream wait right before the intranode combine:

// The output tensor is allocated, but in deterministic mode a NaN-fill is enqueued on the compute stream.
torch::empty(recv_x)

// Ensure comm_stream observes all compute_stream work above this point.
if (!allocate_on_comm_stream)
    stream_wait(comm_stream, compute_stream);

// The communication kernel writes recv_x on to the comm stream.
intranode::combine(..., recv_x, ..., comm_stream)

The tests should deterministically catch the bug once the fix is in. I don't have a multi-GPU setup on hand to validate myself, any recommendations?

Add stream_wait(comm_stream, compute_stream) after torch::empty()
allocations and before kernel launches in all four dispatch/combine
functions. When fill_uninitialized_memory=True, torch::empty() enqueues
a NaN-fill kernel on compute_stream after the initial stream_wait
snapshot, causing a write-write race with the communication kernel on
comm_stream. The second sync captures these fill kernels.

Add regression test with 4-config control matrix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant