What is your question?
I've been studying the CUTLASS source code and have a question. When multiple threads in a single block call atomicAdd on the same shared memory address, is the order in which the additions are applied deterministic? This matters because floating-point addition is not associative, so a different order can give a different result. Additionally, does tma_reduce_add guarantee a fixed accumulation order? Thank you.