Skip to content

perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360

Open
leaiss wants to merge 2 commits into
mainfrom
perf/vk-native-renderer-semaphore-chain
Open

perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360
leaiss wants to merge 2 commits into
mainfrom
perf/vk-native-renderer-semaphore-chain

Conversation

@leaiss
Copy link
Copy Markdown
Collaborator

@leaiss leaiss commented May 28, 2026

Summary

Tier 3 #10 (partial): eliminates the first of two per-frame CPU stalls in the in-process vk_native compositor by chaining the renderer's atlas-blit submit and the compositor's pre-DP submit via a binary semaphore instead of a vkQueueWaitIdle between them. The second stall (pre-DP → DP self-submit) requires a DP vtable change and is deferred.

Before

renderer.draw():
    vkQueueSubmit(atlas blit)
    vkQueueWaitIdle             ← CPU stall A (eliminated by this PR)
    free cmd buffer

compositor frame loop:
    vkQueueSubmit(pre-DP cmds: WS-layer composite + atlas crop)
    vkQueueWaitIdle             ← CPU stall B (deferred — needs DP vtable change)
    display_processor.process_atlas(NULL cmd)  // self-submits internally

After

renderer.draw():
    vkQueueSubmit(atlas blit, signal=frame_done_sem, fence=frame_done_fence)
    return  // no waitIdle, no cmd buffer free

compositor frame loop:
    vkQueueSubmit(pre-DP cmds, wait=frame_done_sem @ TRANSFER|FRAGMENT_SHADER|COLOR_ATT)
    vkQueueWaitIdle             ← stall B still present
    display_processor.process_atlas(NULL cmd)

The CPU now skips stall A entirely. Stall B remains until the DP gets a wait-semaphore parameter on its self-submit path — that's a separate plug-in-coordinated PR.

Implementation

Renderer (comp_vk_native_renderer.{h,c})

  • frame_done_sem + frame_done_fence allocated at create time. Fence starts signaled so the first draw() doesn't stall waiting on a non-existent prior submit.
  • draw() waits on the fence at the start (back-pressure from previous frame), frees pending_cmd from previous frame, resets fence, records & submits, defers the cmd-buffer free.
  • A defensive drain at the top of draw() consumes a stuck previous-frame signal in the rare case (downstream submit failed, or resize() ran between frames) — avoids the binary-semaphore double-signal UB.
  • New public accessor comp_vk_native_renderer_take_frame_done_semaphore(): returns the handle once per draw, then returns 0. Lets the compositor consume the signal at whichever per-frame submit site fires first.

Compositor (comp_vk_native_compositor.c)

  • Static helper vk_native_wire_renderer_wait(c, &submit_info, &sem_storage, &stage_storage) called at each per-frame vkQueueSubmit site. Wait-stage mask is TRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT — the union of stages where downstream paths read the atlas. The take-once-per-frame semantics in the accessor mean only one submit per frame actually waits; the rest get a 0 handle and are no-ops.
  • Wired into all five per-frame submit sites: shared-image path (pre-DP, post-DP, fallback) and target/window-present path (pre-DP, final).
  • The one-shot debug vk_native_capture_atlas_to_png helper keeps its vkQueueSubmit + vkQueueWaitIdle — not per-frame, not worth the state.

Correctness

The change is functionally equivalent to the prior vkQueueWaitIdle pattern:

Pre-change (waitIdle) Post-change (semaphore)
Execution ordering Submission order on a queue is guaranteed; waitIdle is over-sync Submission order + semaphore wait at correct stages
Memory visibility Implicit via full GPU drain Explicit via wait at TRANSFER/FRAGMENT_SHADER/COLOR_ATT stages
Cmd buffer lifetime Freed immediately after waitIdle Deferred to next draw() start, gated on fence
First-frame behavior Same Fence created signaled → first wait is a no-op
Zero-copy path renderer.draw() skipped, no waitIdle needed renderer.draw() skipped, accessor returns 0, no wait wired
Resize race renderer.draw() doesn't run during resize Defensive drain at next draw() handles unclaimed signal

Validation

  • Windows runtime build: clean (scripts\build_windows.bat build).
  • macOS + Android builds: running via CI.
  • Hardware perf measurement: deferred. Emulator can't reach the frame loop (Vulkan device-extension wall at xrCreateVulkanDeviceKHR). Windows test apps default to service mode which routes around in-process vk_native compositor. Real perf delta will land when Lume Pad arrives or someone runs the macOS test apps with sim_display.

Test plan

  • CI: Windows + macOS + Android builds all green.
  • On Lume Pad (post-arrival): cube_handle_vk_android renders cleanly with no validation errors and no visible regression. Perfetto/RenderDoc trace shows the renderer submit + pre-DP submit chained on a single queue with no CPU gap.
  • On macOS: cube_handle_vk_macos renders the cube as before.

Follow-up

Tier 3 #10 part 2: extend xrt_display_processor vtable with get_self_submit_wait_semaphores (or accept a wait sem on process_atlas). Plug-in side: CNSDK plug-in honors the wait. Eliminates stall B. Coordinated runtime + plug-in PR; better measured on hardware with real frame timings.

🤖 Generated with Claude Code

@leaiss leaiss requested a review from dfattal as a code owner May 28, 2026 17:08
leaiss added a commit that referenced this pull request May 28, 2026
…-fail path

Two bug-fixes surfaced during self-audit of the previous commit:

1. **Fence-reset/deadlock race.** The previous commit reset
   frame_done_fence at the *top* of draw(), then proceeded to
   vkAllocateCommandBuffers. If the alloc failed (or any early
   return between the reset and the main submit fired), the fence
   was left in the *unsignaled* state with no submit to ever
   signal it. The next draw() would block on vkWaitForFences
   forever — runtime hang on the next frame after any one-off
   transient failure.

   Fix: defer the vkResetFences call to just before the main
   vkQueueSubmit. All early-return paths now leave the fence in
   its previous (signaled) state, so the next draw() proceeds
   normally.

2. **Drain-failure double-signal risk.** The semaphore-drain block
   at the top of draw() set signal_pending = false
   unconditionally, even if the no-op drain submit returned
   non-VK_SUCCESS. With signal_pending cleared, the next main
   submit would queue a signal operation on a binary semaphore
   that may still have its previous signal pending — undefined
   behavior per Vulkan spec.

   Fix: only clear signal_pending when vkQueueSubmit returns
   VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the
   caller knows the renderer isn't in a usable state this frame.

Builds clean on Windows. The original change (#360) passed CI on
Windows + macOS + Android; this commit re-pushes for re-validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leaiss
Copy link
Copy Markdown
Collaborator Author

leaiss commented May 28, 2026

Self-audit pass surfaced two bugs in the prior commit, fixed in 8ad6de5f2:

  1. Fence-reset deadlock. vkResetFences ran at the top of draw() before vkAllocateCommandBuffers. If alloc (or any early-return) fired, the fence was left unsignaled with no submit to signal it → next draw() blocks on vkWaitForFences forever. Fix: defer fence reset to just before the main vkQueueSubmit so all early-return paths leave the fence in its prior signaled state.

  2. Drain-failure double-signal risk. The semaphore-drain block cleared signal_pending unconditionally, even on vkQueueSubmit failure. With signal_pending=false but the binary semaphore still in the signaled-pending state, the next main submit's signal operation is UB per Vulkan spec. Fix: only clear signal_pending when the drain submit succeeded; on failure log + return XRT_ERROR_VULKAN.

Builds clean on Windows; CI re-running.

leaiss and others added 2 commits June 2, 2026 08:58
Eliminates the first of two CPU stalls per frame in the in-process
vk_native compositor path (Tier 3 #10, partial).

Before this change every frame did:
  1. renderer.draw()  → vkQueueSubmit + vkQueueWaitIdle   ← CPU stall A
  2. compositor pre-DP cmd buffer → vkQueueSubmit + vkQueueWaitIdle  ← CPU stall B
  3. display_processor.process_atlas() → DP self-submits internally

Submissions 1 and 2 are on the same queue (vk->main_queue) and #2
consumes atlas_image written by #1. The waitIdle at end of (1) is doing
two jobs: (a) cross-submission memory dependency, (b) cmd-buffer free
safety. Both can be expressed correctly with a binary semaphore +
fence instead, without ever blocking the CPU.

What changed
- Renderer now owns a binary VkSemaphore (frame_done_sem) and a VkFence
  (frame_done_fence). draw()'s vkQueueSubmit signals both, defers the
  cmd-buffer free to the next call (gated on the fence), and skips the
  vkQueueWaitIdle entirely. Fence is created in the signaled state so
  the first draw() doesn't stall on a non-existent prior submit.
- New accessor comp_vk_native_renderer_take_frame_done_semaphore()
  returns the signaled semaphore handle to a single downstream consumer
  per frame, and clears the pending flag so subsequent submits in the
  same frame don't double-wait.
- comp_vk_native_compositor.c grows a static vk_native_wire_renderer_wait()
  helper that's called at every per-frame vkQueueSubmit site — the
  helper takes the pending semaphore (if any) and wires it into the
  submit info at wait-stage TRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT.
  Only one submit per frame ever sees a non-NULL handle, so the wiring
  is correct regardless of which branch (dp_self_submits / not, target
  path / shared-image path, fallback / weave) actually executes.

What didn't change
- The vkQueueWaitIdle *after* the pre-DP submit (stall B above) stays.
  Eliminating it requires the DP vtable to accept a wait-semaphore on
  its self-submit — a separate plug-in-coordinated change deferred to
  Tier 3 #10 part 2.
- The vk_native_capture_atlas_to_png debug helper still uses
  vkQueueSubmit + vkQueueWaitIdle for its one-shot CPU readback — not
  per-frame and not worth the extra state.
- Renderer's resize() path keeps vkDeviceWaitIdle. A new guard at the
  start of draw() drains any unclaimed pending signal (would-be
  binary-semaphore double-signal UB) so resize can safely interleave
  with the chain.

Validation
- Builds clean on Windows (scripts/build_windows.bat build).
- Cross-platform code: macOS + Android builds will run via CI.
- Hardware perf measurement deferred — emulator hits its Vulkan-device-
  extension wall before the frame loop runs, and Windows test apps
  default to service mode which routes around in-process vk_native.
  The change is correctness-equivalent to the prior waitIdle pattern
  (same execution ordering on the queue, same memory-visibility
  guarantees via the wait-stage mask).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fail path

Two bug-fixes surfaced during self-audit of the previous commit:

1. **Fence-reset/deadlock race.** The previous commit reset
   frame_done_fence at the *top* of draw(), then proceeded to
   vkAllocateCommandBuffers. If the alloc failed (or any early
   return between the reset and the main submit fired), the fence
   was left in the *unsignaled* state with no submit to ever
   signal it. The next draw() would block on vkWaitForFences
   forever — runtime hang on the next frame after any one-off
   transient failure.

   Fix: defer the vkResetFences call to just before the main
   vkQueueSubmit. All early-return paths now leave the fence in
   its previous (signaled) state, so the next draw() proceeds
   normally.

2. **Drain-failure double-signal risk.** The semaphore-drain block
   at the top of draw() set signal_pending = false
   unconditionally, even if the no-op drain submit returned
   non-VK_SUCCESS. With signal_pending cleared, the next main
   submit would queue a signal operation on a binary semaphore
   that may still have its previous signal pending — undefined
   behavior per Vulkan spec.

   Fix: only clear signal_pending when vkQueueSubmit returns
   VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the
   caller knows the renderer isn't in a usable state this frame.

Builds clean on Windows. The original change (#360) passed CI on
Windows + macOS + Android; this commit re-pushes for re-validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leaiss leaiss force-pushed the perf/vk-native-renderer-semaphore-chain branch from 8ad6de5 to 805d5a0 Compare June 2, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant