Issue Description
I've encountered an issue where values copied via cp_async4_stream are not visible in shared memory when accessed later in the kernel. Specifically:
In the original code sequence (L659-665 in marlin/marlin_cuda_kernel.cu):

```cuda
// For per-column scales, we only fetch them here in the final step before write-out
if (group_blocks == -1 && last) {
  if (s_sh_wr_pred) // s_sh_wr_pred = threadIdx.x < s_sh_stride = 32 * thread_n_blocks / 8
    cp_async4_stream(&sh_s[s_sh_wr], &s[s_gl_rd]);
  cp_async_fence();
}
thread_block_reduce(); // contains __syncthreads()
```
The shared memory values (sh_s) appear zeroed when accessed later, even though the source global memory (s) contains valid data.
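If `cp_async4_stream` is, as in similar kernels, a thin wrapper over the Ampere `cp.async` PTX instruction, then the copied data is only guaranteed to be visible in shared memory after an explicit `cp.async.wait_group`, not after a plain `__syncthreads()`. A hedged sketch of that commit/wait protocol (the wrapper below is modeled on common `cp.async` helpers and is an assumption, not Marlin's exact code):

```cuda
// Sketch: cp.async (sm_80+) copies are only guaranteed complete after an
// explicit wait; __syncthreads() alone does not wait for in-flight groups.
__device__ inline void cp_async4(void* smem_ptr, const void* glob_ptr) {
  unsigned smem = static_cast<unsigned>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(smem), "l"(glob_ptr));
}

// Required read-back ordering:
//   cp_async4(dst, src);                       // launch async copy
//   asm volatile("cp.async.commit_group;\n");  // = cp_async_fence()
//   asm volatile("cp.async.wait_group 0;\n");  // wait until the copy lands
//   __syncthreads();                           // make it visible block-wide
//   ... only now is sh_s[...] safe to read ...
```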
However, when I modify the sequence to:

```cuda
thread_block_reduce(); // moved before the async copy
if (group_blocks == -1 && last) {
  if (s_sh_wr_pred) // s_sh_wr_pred = threadIdx.x < s_sh_stride = 32 * thread_n_blocks / 8
    cp_async4_stream(&sh_s[s_sh_wr], &s[s_gl_rd]);
  cp_async_fence();
}
```
The values are correctly populated in shared memory. This suggests a potential data race or synchronization issue with the asynchronous copy.
Debug output before the fix:

```
##### cp_async4_stream
threadIdx.x=4 ## s_sh_wr=4, s_gl_rd=4 ##
sh_s[s_sh_wr].x=0, ...    // all zeros
s[s_gl_rd].x=15899, ...   // we arranged for exactly one non-zero value in the scales
```
Questions for Maintainers

Correctness Validation:
- Is swapping thread_block_reduce() and the async operations safe?
- Could this change break other execution paths or edge cases?

Performance Impact:
- How might this affect computation/communication overlap?
- Does moving the synchronization earlier impact pipeline efficiency?
Environment
This appears related to CUDA async copy scheduling, where a synchronization operation issued immediately after launching an async copy may not actually wait for that copy to complete. Would appreciate your insight on the correct synchronization pattern for this case.
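For comparison, the CUDA C++ pipeline API expresses the required ordering explicitly: the consumer must wait on the committed copy before the block barrier. A minimal sketch of that pattern (the kernel shape, sizes, and names here are illustrative, not Marlin's):

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative kernel: fetch 32 int4 scale values into shared memory,
// then read them back only after the copy has provably completed.
__global__ void fetch_scales_sketch(const int4* s, int4* out) {
  __shared__ int4 sh_s[32];
  auto block = cg::this_thread_block();
  // Thread-scope pipeline: roughly the role of cp_async4_stream + cp_async_fence.
  auto pipe = cuda::make_pipeline();
  if (threadIdx.x < 32) {
    pipe.producer_acquire();
    cuda::memcpy_async(&sh_s[threadIdx.x], &s[threadIdx.x], sizeof(int4), pipe);
    pipe.producer_commit();
    pipe.consumer_wait();   // the wait the original sequence is missing
  }
  block.sync();             // now sh_s is safe to read block-wide
  if (threadIdx.x < 32)
    out[threadIdx.x] = sh_s[threadIdx.x];
}
```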