
[Perf] Re-land Streams 1-4 with bug fixes #653

Open

hughperkins wants to merge 10 commits into main from hp/streams-re-land

Conversation

@hughperkins
Collaborator

Summary

  • Re-lands the streams merge ([Perf] Streams 1-4 #410) that was reverted in [Bug] Revert "[Perf] Streams 1-4 (#410)" #650
  • Fixes arg_buffer race when the same kernel is called on different explicit streams (per-handle persistent buffers instead of launcher-global)
  • Fixes adstack overflow diagnostic losing kernel name on fast GPUs (peek-then-sync instead of consuming both slots atomically)
  • Fixes use-after-free in KernelLauncher: contexts_ changed from std::vector to std::deque so that recursive register_llvm_kernel calls from publish_adstack_metadata's host-eval path cannot invalidate references held by a parent launch_llvm_kernel frame

Test plan

  • test_adstack_triangular_ndrange_self_referential_push_idempotency passes 10/10 with QD_OFFLINE_CACHE=0
  • Full adstack test suite (1108 tests) passes
  • CI passes

Made with Cursor

The per-handle persistent arg_buffer/runtime_context in the CUDA and
AMDGPU kernel launchers is reused across launches.  When the same
kernel is dispatched on two different explicit streams (qd_stream),
the second call's memcpy can overwrite the buffer while the first
kernel is still reading it, causing data corruption.

Fix: when active_stream != nullptr (explicit stream), allocate a
per-call ephemeral buffer instead of reusing the persistent one.
The stream-ordered mem_free_async ensures the memory stays live
until the kernel finishes reading.  Default-stream launches keep
the existing persistent-buffer fast path.
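The buffer-selection logic described above can be sketched in isolation. This is a hypothetical stand-in (`select_arg_buffer` is not the real launcher API, and `malloc` stands in for the stream-ordered `mem_alloc_async`/`mem_free_async` pair):

```cpp
#include <cstdlib>

// Hypothetical sketch: explicit-stream launches get a fresh per-call
// buffer; default-stream launches keep the persistent fast path.
void *select_arg_buffer(void *persistent, size_t size, void *active_stream) {
  if (active_stream != nullptr) {
    // Explicit stream: a per-call buffer avoids the second launch's
    // memcpy racing with a still-running kernel that is reading the
    // persistent buffer. The real code frees this with a stream-ordered
    // mem_free_async so it stays live until the kernel finishes.
    return std::malloc(size);
  }
  // Default stream: launches are serialized, so reuse is safe.
  return persistent;
}
```

The key design point is that only explicit-stream launches pay the allocation cost; the common default-stream path is unchanged.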

The per-launch check_adstack_overflow poll (called without prior sync)
could observe the overflow flag before the companion task_id write had
been flushed from the device to pinned host memory.  The old code
consumed both slots via exchange(0), so by the time qd.sync() ran the
flag was already clear and the identity was lost — producing a
diagnostic with no kernel name.

Fix: peek the flag with a relaxed load first; if set, synchronize() to
ensure the task_id is visible, then consume both slots.  The sync only
fires on the rare overflow path, so the fast path stays zero-cost.
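The peek-then-sync pattern can be modeled with host-side atomics. This is a toy sketch, not the real implementation: `flag` and `task_id` stand in for the two pinned-host slots the device writes, and `synchronize` stands in for qd.sync():

```cpp
#include <atomic>
#include <cstdint>
#include <functional>

// Toy model of the two pinned-host slots written by the device.
struct AdstackOverflowSlots {
  std::atomic<uint64_t> flag{0};
  std::atomic<uint64_t> task_id{0};
};

bool check_overflow(AdstackOverflowSlots &slots,
                    const std::function<void()> &synchronize,
                    uint64_t &task_out) {
  // Fast path: relaxed peek only — nothing is consumed and no sync runs.
  if (slots.flag.load(std::memory_order_relaxed) == 0)
    return false;
  // Rare overflow path: synchronize first so the companion task_id write
  // is visible, *then* consume both slots. Consuming before the sync is
  // exactly the old bug: the flag clears but the identity is lost.
  synchronize();
  slots.flag.exchange(0);
  task_out = slots.task_id.exchange(0);
  return true;
}
```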

Re-add the "no sync drain", "where the kernel that wrote it ran", and
the reinterpret_cast / Itanium ABI / MSVC ABI portability note that
were inadvertently dropped when rewriting the comment block.

…eque

publish_adstack_metadata's host-eval branch recursively registers
snode-reader kernels via register_llvm_kernel, which calls
contexts_.resize(). When contexts_ is a std::vector, resize can
reallocate, invalidating the launcher_ctx reference held by the parent
launch_llvm_kernel frame. std::deque guarantees reference stability on
push_back / resize, eliminating the bug without per-launch copy overhead.
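The reference-stability guarantee can be demonstrated in isolation with a toy stand-in for contexts_ (the real element type is the launcher's context struct, not int):

```cpp
#include <deque>

// A reference taken before growth (like the launcher_ctx held by the
// parent launch_llvm_kernel frame) must survive recursive
// register_llvm_kernel calls that resize the container.
bool deque_refs_survive_growth() {
  std::deque<int> contexts;
  contexts.push_back(42);
  int &parent_ref = contexts.front();  // held across the recursion

  // Simulated recursive registrations: with std::vector this resize
  // could reallocate and leave parent_ref dangling; std::deque keeps
  // references to existing elements valid when growing at the ends.
  contexts.resize(100000);

  return &parent_ref == &contexts.front() && parent_ref == 42;
}
```

Note that std::deque growth invalidates iterators but not references to existing elements, which is exactly the property the launcher relies on.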
@hughperkins hughperkins marked this pull request as draft May 7, 2026 22:31

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f5f6d0dd9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

for (size_t j = group_start; j < i; j++) {
  int sid = offloaded_tasks[j].stream_parallel_group_id;
  if (stream_by_id.find(sid) == stream_by_id.end()) {
    stream_by_id[sid] = CUDAContext::get_instance().acquire_stream();

P1: Chain stream-parallel worker streams to the launch stream

When a kernel containing qd.stream_parallel() is launched with qd_stream=s after earlier work has been queued on s, these newly acquired non-blocking worker streams start launching immediately and never wait for active_stream. That breaks the normal stream-ordering expectation for produce(qd_stream=s); consume_parallel(qd_stream=s): the stream-parallel loops can read inputs before produce has completed. The worker streams need an event/wait dependency on active_stream before their launches (and the AMDGPU path has the same pattern).


@github-actions

github-actions Bot commented May 7, 2026

@github-actions

github-actions Bot commented May 7, 2026

Resolves conflicts from the adstack file split (llvm_adstack_lazy_claim.cpp
-> adstack_lazy_claim/{heap_grow,bound_eval,metadata_publish}.cpp) and
max-reducer dispatch additions, preserving both the stream-parallel paths
and the new max-reducer infrastructure.
@hughperkins hughperkins marked this pull request as ready for review May 8, 2026 21:12
@github-actions

github-actions Bot commented May 8, 2026


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cbef4561b7


if self._handle != 0 and self._prog_ref is not None:
    prog = self._destroy_prog()
    if prog is not None:
        prog.stream_destroy(self._handle)


P2: Synchronize before destroying managed streams

When a managed stream is used only through the documented context-manager pattern, e.g. with qd.create_stream() as s: fill(qd_stream=s), __exit__ reaches this path and immediately drops the handle without waiting for the queued work. CUDA/HIP stream destruction is not a host-side synchronization point, and qd.sync()/field reads later only wait on the default stream, so the caller can observe stale data or race with the still-running explicit-stream kernel despite the docs saying the context manager waits for in-flight work. Please synchronize the stream before destroying managed handles, at least on explicit destroy()/__exit__.


@hughperkins
Collaborator Author

hughperkins commented May 8, 2026

Benchmarks ok:

20260508-1705-streamsagain

@hughperkins
Collaborator Author

Genesis unit tests pass:

(screenshot: Genesis unit test results, 2026-05-08 18:06)

@github-actions

github-actions Bot commented May 8, 2026
