
Implement a sparse alltoall exchange pattern #959

Open
wence- wants to merge 5 commits into rapidsai:main from wence-:wence/fea/neighbour-alltoall

Conversation

@wence-
Contributor

@wence- wence- commented Apr 10, 2026

In this collective, every rank advertises the destination ranks it will send
to and the source ranks it will receive from (these have to match up
collectively; no error is raised if they do not). The caller can then insert
messages to particular ranks, followed by a final `insert_finished()` call.

On the receive side, after waiting for completion, we can extract received
messages by rank. The receive side message order is defined by the
insertion order on the send side. That is, if rank-A inserts messages in
order [A0, A1, A2] to rank-B, then when rank-B calls `extract(rank-A)` it
will see the same order (even if the messages were sent in a different order).

wence- added 5 commits April 10, 2026 16:03
@wence- wence- requested review from a team as code owners April 10, 2026 16:33
@wence-
Contributor Author

wence- commented Apr 10, 2026

I can split these into bits for review purposes if that is useful

@wence- wence- added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Apr 10, 2026
Comment on lines +96 to +97
* @note Concurrent insertion by multiple threads is supported, the caller must ensure
* that `insert_finished()` is called _after_ all `insert()` calls have completed.
Member

They seem like two separate notes, or am I misreading? It seems that the multi-threaded insertion comment and the insert_finished() requirement are independent of each other.

Member

This relates to the note in insert_finished, they are indeed dependent. I would rephrase this slightly to make dependence clearer:

Suggested change
* @note Concurrent insertion by multiple threads is supported, the caller must ensure
* that `insert_finished()` is called _after_ all `insert()` calls have completed.
* @note Concurrent insertion by multiple threads is supported, the caller must ensure
* all `insert()` calls (concurrent or not) have completed before calling
* `insert_finished()`.

packed_data_vector_to_list)


cdef class SparseAlltoall:
Member

I know this isn't common, but especially with Python free threading do we want to also add notes about multithreading support here?


plc.Table(
[plc.Column.from_array(np.array([29], dtype=np.int32), stream=stream)]
),
)
Member

This test is OK, but it only exercises exactly 2 ranks, while all other ranks are no-ops. I think we could also use a test that exercises as many ranks as are available.

);
}

void SparseAlltoall::insert_finished() {
Member

We should probably add a check to prevent calling insert_finished() more than once.

Contributor

Maybe we can check `locally_finished_ == false`.

Member

A few tests I think we're missing include:

  • insert() after insert_finished()
  • multi-threaded insert()
  • extract() with invalid source rank

RAPIDSMPF_EXPECTS_FATAL(
event_.is_set(),
"~SparseAlltoall: not all notification tasks complete, did you forget to await "
"this->wait() or to call this->insert_finished()?"
Member

Is this->wait() a mistake? I don't think we have a wait() method.

Contributor

@nirandaperera nirandaperera left a comment

Had some comments for the normal impl. I will check the coroutine impl later today.

std::uint64_t received_count{0};
std::vector<std::unique_ptr<detail::Chunk>> chunks;

[[nodiscard]] bool ready() const noexcept {
Contributor

Nit.

Suggested change
[[nodiscard]] bool ready() const noexcept {
[[nodiscard]] bool constexpr ready() const noexcept {

void send_ready_messages();
void receive_metadata_messages();
void receive_data_messages();
void complete_data_messages();
Contributor

It would be nice to add some @brief docstrings for these methods (for future reference), either here or in the cpp file.

RAPIDSMPF_EXPECTS(br_ != nullptr, "the buffer resource pointer cannot be null");
auto const size = comm_->nranks();
auto const self = comm_->rank();
for (auto src : srcs_) {
Contributor

Nit

Suggested change
for (auto src : srcs_) {
source_states_.reserve(srcs_.size());
for (auto src : srcs_) {

);
source_states_.emplace(src, SourceState{});
}
for (auto dst : dsts_) {
Contributor

Nit.

Suggested change
for (auto dst : dsts_) {
next_ordinal_per_dst_.reserve(dsts_.size());
for (auto dst : dsts_) {

Comment on lines +63 to +67
SparseAlltoall::~SparseAlltoall() noexcept {
RAPIDSMPF_EXPECTS_FATAL(
locally_finished_.load(std::memory_order_acquire),
"Destroying SparseAlltoall without `insert_finished()`"
);
Contributor

Marked as noexcept but throwing.

Tag const metadata_tag{op_id_, 0};
for (auto src : srcs_) {
auto& state = source_states_.at(src);
while (!state.ready()) {
Contributor

Wouldn't this while-not-ready loop hog the progress thread until all the messages are received from all sources? I feel like this would be unfair to other concurrent collectives, wouldn't it?

Contributor

Did you mean to use an if instead?

);
state.expected_count = chunk->sequence();
} else {
incoming_by_src_.at(src).push_back(std::move(chunk));
Contributor

Nit

Suggested change
incoming_by_src_.at(src).push_back(std::move(chunk));
incoming_by_src_[src].push_back(std::move(chunk));

src >= 0 && src < size && src != self, "SparseAlltoall invalid source rank."
);
RAPIDSMPF_EXPECTS(
incoming_by_src_.emplace(src, std::vector<std::unique_ptr<detail::Chunk>>{})
Contributor

Why can't incoming_by_src_ be a class member of SourceState? I feel like it is the received metadata queue from a particular source, isn't it?

);
}
}
queue.erase(queue.begin(), queue.begin() + processed);
Contributor

Wouldn't a std::deque/queue be better here? The unprocessed chunks will always be shifted by `processed` elements in each progress iteration, won't they?

}
processed++;
if (chunk->data_size() == 0) {
auto& state = source_states_.at(chunk->origin());
Contributor

Suggested change
auto& state = source_states_.at(chunk->origin());
auto& state = source_states_[chunk->origin()];
