
[Perf] Re-land Streams 1-4 with bug fixes #653

Open

hughperkins wants to merge 10 commits into main from hp/streams-re-land

Conversation

@hughperkins
Collaborator

Summary

  • Re-lands the streams merge ([Perf] Streams 1-4 #410) that was reverted in [Bug] Revert "[Perf] Streams 1-4 (#410)" #650
  • Fixes arg_buffer race when the same kernel is called on different explicit streams (per-handle persistent buffers instead of launcher-global)
  • Fixes adstack overflow diagnostic losing kernel name on fast GPUs (peek-then-sync instead of consuming both slots atomically)
  • Fixes use-after-free in KernelLauncher: contexts_ changed from std::vector to std::deque so that recursive register_llvm_kernel calls from publish_adstack_metadata's host-eval path cannot invalidate references held by a parent launch_llvm_kernel frame

Test plan

  • test_adstack_triangular_ndrange_self_referential_push_idempotency passes 10/10 with QD_OFFLINE_CACHE=0
  • Full adstack test suite (1108 tests) passes
  • CI passes

Made with Cursor

The per-handle persistent arg_buffer/runtime_context in the CUDA and
AMDGPU kernel launchers is reused across launches.  When the same
kernel is dispatched on two different explicit streams (qd_stream),
the second call's memcpy can overwrite the buffer while the first
kernel is still reading it, causing data corruption.

Fix: when active_stream != nullptr (explicit stream), allocate a
per-call ephemeral buffer instead of reusing the persistent one.
The stream-ordered mem_free_async ensures the memory stays live
until the kernel finishes reading.  Default-stream launches keep
the existing persistent-buffer fast path.
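The buffer-selection logic described above can be sketched in isolation. This is a hypothetical stand-in (`select_arg_buffer` is not the real launcher API, and `malloc` stands in for the stream-ordered `mem_alloc_async`/`mem_free_async` pair):

```cpp
#include <cstdlib>

// Hypothetical sketch: explicit-stream launches get a fresh per-call
// buffer; default-stream launches keep the persistent fast path.
void *select_arg_buffer(void *persistent, size_t size, void *active_stream) {
  if (active_stream != nullptr) {
    // Explicit stream: a per-call buffer avoids the second launch's
    // memcpy racing with a still-running kernel that is reading the
    // persistent buffer. The real code frees this with a stream-ordered
    // mem_free_async so it stays live until the kernel finishes.
    return std::malloc(size);
  }
  // Default stream: launches are serialized, so reuse is safe.
  return persistent;
}
```

The key design point is that only explicit-stream launches pay the allocation cost; the common default-stream path is unchanged.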

The per-launch check_adstack_overflow poll (called without prior sync)
could observe the overflow flag before the companion task_id write had
been flushed from the device to pinned host memory.  The old code
consumed both slots via exchange(0), so by the time qd.sync() ran the
flag was already clear and the identity was lost — producing a
diagnostic with no kernel name.

Fix: peek the flag with a relaxed load first; if set, synchronize() to
ensure the task_id is visible, then consume both slots.  The sync only
fires on the rare overflow path, so the fast path stays zero-cost.
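The peek-then-sync pattern can be modeled with host-side atomics. This is a toy sketch, not the real implementation: `flag` and `task_id` stand in for the two pinned-host slots the device writes, and `synchronize` stands in for qd.sync():

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Toy model of the two pinned-host slots written by the device.
struct AdstackOverflowSlots {
  std::atomic<uint64_t> flag{0};
  std::atomic<uint64_t> task_id{0};
};

bool check_overflow(AdstackOverflowSlots &slots,
                    const std::function<void()> &synchronize,
                    uint64_t &task_out) {
  // Fast path: relaxed peek only — nothing is consumed and no sync runs.
  if (slots.flag.load(std::memory_order_relaxed) == 0)
    return false;
  // Rare overflow path: synchronize first so the companion task_id write
  // is visible, *then* consume both slots. Consuming before the sync is
  // exactly the old bug: the flag clears but the identity is lost.
  synchronize();
  slots.flag.exchange(0);
  task_out = slots.task_id.exchange(0);
  return true;
}
```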

Re-add the "no sync drain", "where the kernel that wrote it ran", and
the reinterpret_cast / Itanium ABI / MSVC ABI portability note that
were inadvertently dropped when rewriting the comment block.

…eque

publish_adstack_metadata's host-eval branch recursively registers
snode-reader kernels via register_llvm_kernel, which calls
contexts_.resize(). When contexts_ is a std::vector, resize can
reallocate, invalidating the launcher_ctx reference held by the parent
launch_llvm_kernel frame. std::deque guarantees reference stability on
push_back / resize, eliminating the bug without per-launch copy overhead.
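The reference-stability guarantee can be demonstrated in isolation with a toy stand-in for contexts_ (the real element type is the launcher's context struct, not int):

```cpp
#include <deque>

// A reference taken before growth (like the launcher_ctx held by the
// parent launch_llvm_kernel frame) must survive recursive
// register_llvm_kernel calls that resize the container.
bool deque_refs_survive_growth() {
  std::deque<int> contexts;
  contexts.push_back(42);
  int &parent_ref = contexts.front();  // held across the recursion

  // Simulated recursive registrations: with std::vector this resize
  // could reallocate and leave parent_ref dangling; std::deque keeps
  // references to existing elements valid when growing at the ends.
  contexts.resize(100000);

  return &parent_ref == &contexts.front() && parent_ref == 42;
}
```

Note that std::deque growth invalidates iterators but not references to existing elements, which is exactly the property the launcher relies on.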
@hughperkins hughperkins marked this pull request as draft May 7, 2026 22:31

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f5f6d0dd9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

for (size_t j = group_start; j < i; j++) {
  int sid = offloaded_tasks[j].stream_parallel_group_id;
  if (stream_by_id.find(sid) == stream_by_id.end()) {
    stream_by_id[sid] = CUDAContext::get_instance().acquire_stream();

P1: Chain stream-parallel worker streams to the launch stream

When a kernel containing qd.stream_parallel() is launched with qd_stream=s after earlier work has been queued on s, these newly acquired non-blocking worker streams start launching immediately and never wait for active_stream. That breaks the normal stream-ordering expectation for produce(qd_stream=s); consume_parallel(qd_stream=s): the stream-parallel loops can read inputs before produce has completed. The worker streams need an event/wait dependency on active_stream before their launches (and the AMDGPU path has the same pattern).


@github-actions

github-actions Bot commented May 7, 2026

@github-actions

github-actions Bot commented May 7, 2026

Resolves conflicts from the adstack file split (llvm_adstack_lazy_claim.cpp
-> adstack_lazy_claim/{heap_grow,bound_eval,metadata_publish}.cpp) and
max-reducer dispatch additions, preserving both the stream-parallel paths
and the new max-reducer infrastructure.
@hughperkins hughperkins marked this pull request as ready for review May 8, 2026 21:12
@github-actions

github-actions Bot commented May 8, 2026


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cbef4561b7


if self._handle != 0 and self._prog_ref is not None:
    prog = self._destroy_prog()
    if prog is not None:
        prog.stream_destroy(self._handle)


P2: Synchronize before destroying managed streams

When a managed stream is used only through the documented context-manager pattern, e.g. with qd.create_stream() as s: fill(qd_stream=s), __exit__ reaches this path and immediately drops the handle without waiting for the queued work. CUDA/HIP stream destruction is not a host-side synchronization point, and qd.sync()/field reads later only wait on the default stream, so the caller can observe stale data or race with the still-running explicit-stream kernel despite the docs saying the context manager waits for in-flight work. Please synchronize the stream before destroying managed handles, at least on explicit destroy()/__exit__.


@hughperkins
Collaborator Author

hughperkins commented May 8, 2026

Benchmarks ok:

20260508-1705-streamsagain

@hughperkins
Collaborator Author

Genesis unit tests pass:

(screenshot: Genesis unit test results, 2026-05-08 18:06)

@github-actions

github-actions Bot commented May 8, 2026
