[QST] Memory Ordering of atomicAdd and tma_reduce_add

**What is your question?**
I've been studying the CUTLASS source code and have a question about ordering. When multiple threads within a single block call `atomicAdd` on the same shared memory address, is the order in which the additions are applied deterministic? This matters because floating-point addition is not associative, so a different order can produce a different result. Additionally, does `tma_reduce_add` guarantee a fixed accumulation order? Thank you.
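To make the concern concrete, here is a minimal host-side C++ sketch of why accumulation order matters for `float`. It does not call `atomicAdd` or any TMA instruction. It only models "the same values arriving in a different order" with a plain loop, and the names `accumulate_in_order` and `demo` are hypothetical helpers made up for this illustration. It assumes IEEE 754 single precision with round-to-nearest and no `-ffast-math`.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical helper: adds the values one at a time in the given order,
// the way a sequence of atomic adds to one address would be applied.
float accumulate_in_order(const std::vector<float>& values) {
    volatile float acc = 0.0f;  // volatile forces rounding to float after each add
    for (float v : values) {
        acc = acc + v;
    }
    return acc;
}

// Hypothetical demo: the same three values, summed in two different orders.
void demo() {
    // 1e8f is exactly representable, but neighbouring floats are 8 apart,
    // so adding 1.0f to it is rounded away.
    float order_a = accumulate_in_order({1e8f, -1e8f, 1.0f});  // (1e8 - 1e8) + 1
    float order_b = accumulate_in_order({1e8f, 1.0f, -1e8f});  // (1e8 + 1) - 1e8

    std::printf("order A: %g\n", order_a);
    std::printf("order B: %g\n", order_b);
    assert(order_a != order_b);  // same inputs, different sums
}
```

Calling `demo()` from `main` shows the two orders disagreeing, which is why a guarantee (or lack of one) on the order of `atomicAdd` and `tma_reduce_add` affects bitwise reproducibility.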