perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360
perf(vk_native): chain renderer atlas-blit → pre-DP submit via semaphore#360leaiss wants to merge 2 commits into
Conversation
…-fail path Two bug-fixes surfaced during self-audit of the previous commit: 1. **Fence-reset/deadlock race.** The previous commit reset frame_done_fence at the *top* of draw(), then proceeded to vkAllocateCommandBuffers. If the alloc failed (or any early return between the reset and the main submit fired), the fence was left in the *unsignaled* state with no submit to ever signal it. The next draw() would block on vkWaitForFences forever — runtime hang on the next frame after any one-off transient failure. Fix: defer the vkResetFences call to just before the main vkQueueSubmit. All early-return paths now leave the fence in its previous (signaled) state, so the next draw() proceeds normally. 2. **Drain-failure double-signal risk.** The semaphore-drain block at the top of draw() set signal_pending = false unconditionally, even if the no-op drain submit returned non-VK_SUCCESS. With signal_pending cleared, the next main submit would queue a signal operation on a binary semaphore that may still have its previous signal pending — undefined behavior per Vulkan spec. Fix: only clear signal_pending when vkQueueSubmit returns VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the caller knows the renderer isn't in a usable state this frame. Builds clean on Windows. The original change (#360) passed CI on Windows + macOS + Android; this commit re-pushes for re-validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Self-audit pass surfaced two bugs in the prior commit, fixed in
Builds clean on Windows; CI re-running. |
Eliminates the first of two CPU stalls per frame in the in-process vk_native compositor path (Tier 3 #10, partial). Before this change every frame did: 1. renderer.draw() → vkQueueSubmit + vkQueueWaitIdle ← CPU stall A 2. compositor pre-DP cmd buffer → vkQueueSubmit + vkQueueWaitIdle ← CPU stall B 3. display_processor.process_atlas() → DP self-submits internally Submissions 1 and 2 are on the same queue (vk->main_queue) and #2 consumes atlas_image written by #1. The waitIdle at end of (1) is doing two jobs: (a) cross-submission memory dependency, (b) cmd-buffer free safety. Both can be expressed correctly with a binary semaphore + fence instead, without ever blocking the CPU. What changed - Renderer now owns a binary VkSemaphore (frame_done_sem) and a VkFence (frame_done_fence). draw()'s vkQueueSubmit signals both, defers the cmd-buffer free to the next call (gated on the fence), and skips the vkQueueWaitIdle entirely. Fence is created in the signaled state so the first draw() doesn't stall on a non-existent prior submit. - New accessor comp_vk_native_renderer_take_frame_done_semaphore() returns the signaled semaphore handle to a single downstream consumer per frame, and clears the pending flag so subsequent submits in the same frame don't double-wait. - comp_vk_native_compositor.c grows a static vk_native_wire_renderer_wait() helper that's called at every per-frame vkQueueSubmit site — the helper takes the pending semaphore (if any) and wires it into the submit info at wait-stage TRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT. Only one submit per frame ever sees a non-NULL handle, so the wiring is correct regardless of which branch (dp_self_submits / not, target path / shared-image path, fallback / weave) actually executes. What didn't change - The vkQueueWaitIdle *after* the pre-DP submit (stall B above) stays. Eliminating it requires the DP vtable to accept a wait-semaphore on its self-submit — a separate plug-in-coordinated change deferred to Tier 3 #10 part 2. - The vk_native_capture_atlas_to_png debug helper still uses vkQueueSubmit + vkQueueWaitIdle for its one-shot CPU readback — not per-frame and not worth the extra state. - Renderer's resize() path keeps vkDeviceWaitIdle. A new guard at the start of draw() drains any unclaimed pending signal (would-be binary-semaphore double-signal UB) so resize can safely interleave with the chain. Validation - Builds clean on Windows (scripts/build_windows.bat build). - Cross-platform code: macOS + Android builds will run via CI. - Hardware perf measurement deferred — emulator hits its Vulkan-device- extension wall before the frame loop runs, and Windows test apps default to service mode which routes around in-process vk_native. The change is correctness-equivalent to the prior waitIdle pattern (same execution ordering on the queue, same memory-visibility guarantees via the wait-stage mask). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fail path Two bug-fixes surfaced during self-audit of the previous commit: 1. **Fence-reset/deadlock race.** The previous commit reset frame_done_fence at the *top* of draw(), then proceeded to vkAllocateCommandBuffers. If the alloc failed (or any early return between the reset and the main submit fired), the fence was left in the *unsignaled* state with no submit to ever signal it. The next draw() would block on vkWaitForFences forever — runtime hang on the next frame after any one-off transient failure. Fix: defer the vkResetFences call to just before the main vkQueueSubmit. All early-return paths now leave the fence in its previous (signaled) state, so the next draw() proceeds normally. 2. **Drain-failure double-signal risk.** The semaphore-drain block at the top of draw() set signal_pending = false unconditionally, even if the no-op drain submit returned non-VK_SUCCESS. With signal_pending cleared, the next main submit would queue a signal operation on a binary semaphore that may still have its previous signal pending — undefined behavior per Vulkan spec. Fix: only clear signal_pending when vkQueueSubmit returns VK_SUCCESS. On failure, log + return XRT_ERROR_VULKAN so the caller knows the renderer isn't in a usable state this frame. Builds clean on Windows. The original change (#360) passed CI on Windows + macOS + Android; this commit re-pushes for re-validation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8ad6de5 to
805d5a0
Compare
Summary
Tier 3 #10 (partial): eliminates the first of two per-frame CPU stalls in the in-process vk_native compositor by chaining the renderer's atlas-blit submit and the compositor's pre-DP submit via a binary semaphore instead of a
vkQueueWaitIdlebetween them. The second stall (pre-DP → DP self-submit) requires a DP vtable change and is deferred.Before
After
The CPU now skips stall A entirely. Stall B remains until the DP gets a wait-semaphore parameter on its self-submit path — that's a separate plug-in-coordinated PR.
Implementation
Renderer (
comp_vk_native_renderer.{h,c})frame_done_sem+frame_done_fenceallocated at create time. Fence starts signaled so the firstdraw()doesn't stall waiting on a non-existent prior submit.draw()waits on the fence at the start (back-pressure from previous frame), freespending_cmdfrom previous frame, resets fence, records & submits, defers the cmd-buffer free.draw()consumes a stuck previous-frame signal in the rare case (downstream submit failed, orresize()ran between frames) — avoids the binary-semaphore double-signal UB.comp_vk_native_renderer_take_frame_done_semaphore(): returns the handle once per draw, then returns 0. Lets the compositor consume the signal at whichever per-frame submit site fires first.Compositor (
comp_vk_native_compositor.c)vk_native_wire_renderer_wait(c, &submit_info, &sem_storage, &stage_storage)called at each per-framevkQueueSubmitsite. Wait-stage mask isTRANSFER | FRAGMENT_SHADER | COLOR_ATTACHMENT_OUTPUT— the union of stages where downstream paths read the atlas. The take-once-per-frame semantics in the accessor mean only one submit per frame actually waits; the rest get a 0 handle and are no-ops.vk_native_capture_atlas_to_pnghelper keeps itsvkQueueSubmit + vkQueueWaitIdle— not per-frame, not worth the state.Correctness
The change is functionally equivalent to the prior
vkQueueWaitIdlepattern:draw()start, gated on fenceValidation
scripts\build_windows.bat build).xrCreateVulkanDeviceKHR). Windows test apps default to service mode which routes around in-process vk_native compositor. Real perf delta will land when Lume Pad arrives or someone runs the macOS test apps with sim_display.Test plan
cube_handle_vk_androidrenders cleanly with no validation errors and no visible regression. Perfetto/RenderDoc trace shows the renderer submit + pre-DP submit chained on a single queue with no CPU gap.cube_handle_vk_macosrenders the cube as before.Follow-up
Tier 3 #10 part 2: extend
xrt_display_processorvtable withget_self_submit_wait_semaphores(or accept a wait sem onprocess_atlas). Plug-in side: CNSDK plug-in honors the wait. Eliminates stall B. Coordinated runtime + plug-in PR; better measured on hardware with real frame timings.🤖 Generated with Claude Code