perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore by leaiss · Pull Request #360 · DisplayXR/displayxr-runtime

leaiss · 2026-05-28T17:08:45Z

Summary

Tier 3 #10 (partial): eliminates the first of two per-frame CPU stalls in the in-process vk_native compositor by chaining the renderer's atlas-blit submit and the compositor's pre-DP submit via a binary semaphore instead of a vkQueueWaitIdle between them. The second stall (pre-DP → DP self-submit) requires a DP vtable change and is deferred.

Before

renderer.draw():
    vkQueueSubmit(atlas blit)
    vkQueueWaitIdle             ← CPU stall A (eliminated by this PR)
    free cmd buffer

compositor frame loop:
    vkQueueSubmit(pre-DP cmds: WS-layer composite + atlas crop)
    vkQueueWaitIdle             ← CPU stall B (deferred — needs DP vtable change)
    display_processor.process_atlas(NULL cmd)  // self-submits internally

After

renderer.draw():
    vkQueueSubmit(atlas blit, signal=frame_done_sem, fence=frame_done_fence)
    return  // no waitIdle, no cmd buffer free

compositor frame loop:
    vkQueueSubmit(pre-DP cmds, wait=frame_done_sem @ TRANSFER|FRAGMENT_SHADER|COLOR_ATT)
    vkQueueWaitIdle             ← stall B still present
    display_processor.process_atlas(NULL cmd)

The CPU now skips stall A entirely. Stall B remains until the DP gets a wait-semaphore parameter on its self-submit path — that's a separate plug-in-coordinated PR.

Implementation

Renderer (comp_vk_native_renderer.{h,c})

frame_done_sem + frame_done_fence allocated at create time. Fence starts signaled so the first draw() doesn't stall waiting on a non-existent prior submit.
draw() waits on the fence at the start (back-pressure from previous frame), frees pending_cmd from previous frame, resets fence, records & submits, defers the cmd-buffer free.
A defensive drain at the top of draw() consumes a stuck previous-frame signal in the rare case (downstream submit failed, or resize() ran between frames) — avoids the binary-semaphore double-signal UB.
New public accessor comp_vk_native_renderer_take_frame_done_semaphore(): returns the handle once per draw, then returns 0. Lets the compositor consume the signal at whichever per-frame submit site fires first.

Compositor (comp_vk_native_compositor.c)

Static helper vk_native_wire_renderer_wait(c, &submit_info, &sem_storage, &stage_storage) called at each per-frame vkQueueSubmit site. Wait-stage mask is TRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT — the union of stages where downstream paths read the atlas. The take-once-per-frame semantics in the accessor mean only one submit per frame actually waits; the rest get a 0 handle and are no-ops.
Wired into all five per-frame submit sites: shared-image path (pre-DP, post-DP, fallback) and target/window-present path (pre-DP, final).
The one-shot debug vk_native_capture_atlas_to_png helper keeps its vkQueueSubmit + vkQueueWaitIdle — not per-frame, not worth the state.

Correctness

The change is functionally equivalent to the prior vkQueueWaitIdle pattern:

	Pre-change (waitIdle)	Post-change (semaphore)
Execution ordering	Submission order on a queue is guaranteed; waitIdle is over-sync	Submission order + semaphore wait at correct stages
Memory visibility	Implicit via full GPU drain	Explicit via wait at TRANSFER/FRAGMENT_SHADER/COLOR_ATT stages
Cmd buffer lifetime	Freed immediately after waitIdle	Deferred to next `draw()` start, gated on fence
First-frame behavior	Same	Fence created signaled → first wait is a no-op
Zero-copy path	renderer.draw() skipped, no waitIdle needed	renderer.draw() skipped, accessor returns 0, no wait wired
Resize race	renderer.draw() doesn't run during resize	Defensive drain at next draw() handles unclaimed signal

Validation

Windows runtime build: clean (scripts\build_windows.bat build).
macOS + Android builds: running via CI.
Hardware perf measurement: deferred. Emulator can't reach the frame loop (Vulkan device-extension wall at xrCreateVulkanDeviceKHR). Windows test apps default to service mode which routes around in-process vk_native compositor. Real perf delta will land when Lume Pad arrives or someone runs the macOS test apps with sim_display.

Test plan

CI: Windows + macOS + Android builds all green.
On Lume Pad (post-arrival): cube_handle_vk_android renders cleanly with no validation errors and no visible regression. Perfetto/RenderDoc trace shows the renderer submit + pre-DP submit chained on a single queue with no CPU gap.
On macOS: cube_handle_vk_macos renders the cube as before.

Follow-up

Tier 3 #10 part 2: extend xrt_display_processor vtable with get_self_submit_wait_semaphores (or accept a wait sem on process_atlas). Plug-in side: CNSDK plug-in honors the wait. Eliminates stall B. Coordinated runtime + plug-in PR; better measured on hardware with real frame timings.

🤖 Generated with Claude Code

…-fail path Two bug-fixes surfaced during self-audit of the previous commit: 1. **Fence-reset/deadlock race.** The previous commit reset frame_done_fence at the *top* of draw(), then proceeded to vkAllocateCommandBuffers. If the alloc failed (or any early return between the reset and the main submit fired), the fence was left in the *unsignaled* state with no submit to ever signal it. The next draw() would block on vkWaitForFences forever — runtime hang on the next frame after any one-off transient failure. Fix: defer the vkResetFences call to just before the main vkQueueSubmit. All early-return paths now leave the fence in its previous (signaled) state, so the next draw() proceeds normally. 2. **Drain-failure double-signal risk.** The semaphore-drain block at the top of draw() set signal_pending = false unconditionally, even if the no-op drain submit returned non-VK_SUCCESS. With signal_pending cleared, the next main submit would queue a signal operation on a binary semaphore that may still have its previous signal pending — undefined behavior per Vulkan spec. Fix: only clear signal_pending when vkQueueSubmit returns VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the caller knows the renderer isn't in a usable state this frame. Builds clean on Windows. The original change (#360) passed CI on Windows + macOS + Android; this commit re-pushes for re-validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

leaiss · 2026-05-28T18:27:12Z

Self-audit pass surfaced two bugs in the prior commit, fixed in 8ad6de5f2:

Fence-reset deadlock. vkResetFences ran at the top of draw() before vkAllocateCommandBuffers. If alloc (or any early-return) fired, the fence was left unsignaled with no submit to signal it → next draw() blocks on vkWaitForFences forever. Fix: defer fence reset to just before the main vkQueueSubmit so all early-return paths leave the fence in its prior signaled state.
Drain-failure double-signal risk. The semaphore-drain block cleared signal_pending unconditionally, even on vkQueueSubmit failure. With signal_pending=false but the binary semaphore still in the signaled-pending state, the next main submit's signal operation is UB per Vulkan spec. Fix: only clear signal_pending when the drain submit succeeded; on failure log + return XRT_ERROR_VULKAN.

Builds clean on Windows; CI re-running.

Eliminates the first of two CPU stalls per frame in the in-process vk_native compositor path (Tier 3 #10, partial). Before this change every frame did: 1. renderer.draw() → vkQueueSubmit + vkQueueWaitIdle ← CPU stall A 2. compositor pre-DP cmd buffer → vkQueueSubmit + vkQueueWaitIdle ← CPU stall B 3. display_processor.process_atlas() → DP self-submits internally Submissions 1 and 2 are on the same queue (vk->main_queue) and #2 consumes atlas_image written by #1. The waitIdle at end of (1) is doing two jobs: (a) cross-submission memory dependency, (b) cmd-buffer free safety. Both can be expressed correctly with a binary semaphore + fence instead, without ever blocking the CPU. What changed - Renderer now owns a binary VkSemaphore (frame_done_sem) and a VkFence (frame_done_fence). draw()'s vkQueueSubmit signals both, defers the cmd-buffer free to the next call (gated on the fence), and skips the vkQueueWaitIdle entirely. Fence is created in the signaled state so the first draw() doesn't stall on a non-existent prior submit. - New accessor comp_vk_native_renderer_take_frame_done_semaphore() returns the signaled semaphore handle to a single downstream consumer per frame, and clears the pending flag so subsequent submits in the same frame don't double-wait. - comp_vk_native_compositor.c grows a static vk_native_wire_renderer_wait() helper that's called at every per-frame vkQueueSubmit site — the helper takes the pending semaphore (if any) and wires it into the submit info at wait-stage TRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT. Only one submit per frame ever sees a non-NULL handle, so the wiring is correct regardless of which branch (dp_self_submits / not, target path / shared-image path, fallback / weave) actually executes. What didn't change - The vkQueueWaitIdle *after* the pre-DP submit (stall B above) stays. Eliminating it requires the DP vtable to accept a wait-semaphore on its self-submit — a separate plug-in-coordinated change deferred to Tier 3 #10 part 2. - The vk_native_capture_atlas_to_png debug helper still uses vkQueueSubmit + vkQueueWaitIdle for its one-shot CPU readback — not per-frame and not worth the extra state. - Renderer's resize() path keeps vkDeviceWaitIdle. A new guard at the start of draw() drains any unclaimed pending signal (would-be binary-semaphore double-signal UB) so resize can safely interleave with the chain. Validation - Builds clean on Windows (scripts/build_windows.bat build). - Cross-platform code: macOS + Android builds will run via CI. - Hardware perf measurement deferred — emulator hits its Vulkan-device- extension wall before the frame loop runs, and Windows test apps default to service mode which routes around in-process vk_native. The change is correctness-equivalent to the prior waitIdle pattern (same execution ordering on the queue, same memory-visibility guarantees via the wait-stage mask). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-fail path Two bug-fixes surfaced during self-audit of the previous commit: 1. **Fence-reset/deadlock race.** The previous commit reset frame_done_fence at the *top* of draw(), then proceeded to vkAllocateCommandBuffers. If the alloc failed (or any early return between the reset and the main submit fired), the fence was left in the *unsignaled* state with no submit to ever signal it. The next draw() would block on vkWaitForFences forever — runtime hang on the next frame after any one-off transient failure. Fix: defer the vkResetFences call to just before the main vkQueueSubmit. All early-return paths now leave the fence in its previous (signaled) state, so the next draw() proceeds normally. 2. **Drain-failure double-signal risk.** The semaphore-drain block at the top of draw() set signal_pending = false unconditionally, even if the no-op drain submit returned non-VK_SUCCESS. With signal_pending cleared, the next main submit would queue a signal operation on a binary semaphore that may still have its previous signal pending — undefined behavior per Vulkan spec. Fix: only clear signal_pending when vkQueueSubmit returns VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the caller knows the renderer isn't in a usable state this frame. Builds clean on Windows. The original change (#360) passed CI on Windows + macOS + Android; this commit re-pushes for re-validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

leaiss requested a review from dfattal as a code owner May 28, 2026 17:08

leaiss and others added 2 commits June 2, 2026 08:58

leaiss force-pushed the perf/vk-native-renderer-semaphore-chain branch from 8ad6de5 to 805d5a0 Compare June 2, 2026 15:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360

perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360
leaiss wants to merge 2 commits into
mainfrom
perf/vk-native-renderer-semaphore-chain

leaiss commented May 28, 2026

Uh oh!

leaiss commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leaiss commented May 28, 2026

Summary

Before

After

Implementation

Correctness

Validation

Test plan

Follow-up

Uh oh!

leaiss commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant