renderer/directx12: surface staging-buffer upload failures#382
Merged
Conversation
When Texture.uploadRegion's createStagingBuffer call returned null
(UPLOAD-heap allocation refused, e.g. under upload-heap budget pressure
on a large image) the function logged at error level and silently
returned. Texture.init then transitioned the texture to
PIXEL_SHADER_RESOURCE state with no contents and the image upload path
in src/renderer/image.zig marked the placement .ready. The placement
rendered as a black quad with no caller-visible signal.
Surface the failure through a new Texture.Error.UploadFailed variant:
- uploadRegion now returns Error!void instead of void. Each of the four
silent-failure paths (null device, null command_list, null resource,
null staging buffer, Map() failure) returns error.UploadFailed.
- Texture.init's initial-data upload at the bottom of the function is
now `try tex.uploadRegion(...)`. The existing errdefer releases the
GPU resource on the error path. Image.upload (src/renderer/image.zig)
already maps Texture.init errors into its own error.UploadFailed and
the state machine keeps the image .pending so the failure is logged
by State.upload's catch instead of silently producing a black quad.
- replaceRegion keeps its error{}!void Metal-compat signature but now
catches uploadRegion's error and logs at warn level. Atlas and
render-target updates are recoverable on the next frame and don't
benefit from propagating the error all the way back to the renderer.
Four new pure-Zig tests construct a Texture struct directly with
sentinel device/resource pointers and assert uploadRegion returns
error.UploadFailed on each of the three null-guard paths, without
needing a real D3D12 device.
This fixes audit followup #1 from the 2026-05-19 Zig DX12 image upload
audit. Followups #2 (chunked row-band uploads) and #3 (per-frame
upload-budget guard) are intentionally deferred to separate PRs;
followup #5 (document the size ceiling near prepImage) is unblocked
but small enough to fold into either of those.
Known limitation: a persistent failure (e.g. an image larger than the
upload-heap budget) will retry every frame while it remains .pending
and log every retry. A "tried-and-failed" terminal state to break the
loop is worth a separate follow-up once we see this in production logs.
2 tasks
deblasis
added a commit
that referenced
this pull request
May 19, 2026
When Image.upload fails (wuffs decode error or DX12 UploadFailed) the image was previously left in .pending state. State.upload's next iteration saw the same pending image, retried the upload, logged the same warn, and so on every frame for the lifetime of the placement. A persistently-failing upload (image larger than the upload-heap budget, persistent wuffs decode error, missing device) produced an unbounded log stream the user had no way to stop. Mark the image for unload in State.upload's catch arm. The state machine already has unload_pending / unload_replace transitions (via Image.markForUnload), and the per-frame sweep at the top of State.upload runs the isUnloading() check before the isPending() retry branch, so the failed image is freed on the next iteration and the log fires exactly once. Two pure-Zig tests cover the new contract: - Image.markForUnload pending -> unload_pending: builds an Image in .pending state, asserts markForUnload flips the tag and isUnloading() flips true. - Image.markForUnload is idempotent on already-unloading states: the early-return arms of markForUnload don't disturb the active tag. The retry-loop concern was the known limitation called out in PR #382 (audit followup #1).
This was referenced May 20, 2026
deblasis
added a commit
that referenced
this pull request
May 20, 2026
PR #382 introduced four sentinel-pointer null-guard tests that never actually ran on Windows: zig build test failed at compile time because @ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces pointer alignment for @ptrFromInt. After the alignment fix, the tests failed again at runtime because the null-guard branches call log.err, which the test runner counts via log_err_count and the build runner reports as "logged errors". Two changes: * Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are still distinct from 0xDEAD0 (so the failing-field is grep-locatable in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt accepts it. * Downgrade the three null-guard log.err calls in uploadRegion to log.warn. Production behavior is virtually identical: the warning is still logged for diagnostic purposes, the caller still receives error.UploadFailed, and State.upload still drops the placement via markForUnload (PR #383). The previous err level was the wrong signal for a recoverable, caller-handled state. Followup to PRs #382/#383 (DX12 image-upload audit). Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis
added a commit
that referenced
this pull request
May 20, 2026
Large image uploads previously allocated a single exact-fit UPLOAD-heap staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA image after the iTerm2 multipart cap was raised to 65 MiB). Split into ~8 MiB row-bands so each CopyTextureRegion uses a smaller staging allocation, reducing per-call heap pressure under fragmentation. Bands are computed via the rowsPerBand+Bands helpers. y + start_row shifts each band's destination. All bands' staging buffers stay alive until the next replaceRegion or deinit -- the frame fence still gates release in beginFrame. No behavior change for sub-band uploads (atlas updates, render targets, small images); they still allocate a single band-sized staging buffer. Followup to #382/#383. Documented in project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis
added a commit
that referenced
this pull request
May 20, 2026
…384) PR #382 introduced four sentinel-pointer null-guard tests that never actually ran on Windows: zig build test failed at compile time because @ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces pointer alignment for @ptrFromInt. After the alignment fix, the tests failed again at runtime because the null-guard branches call log.err, which the test runner counts via log_err_count and the build runner reports as "logged errors". Two changes: * Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are still distinct from 0xDEAD0 (so the failing-field is grep-locatable in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt accepts it. * Downgrade the three null-guard log.err calls in uploadRegion to log.warn. Production behavior is virtually identical: the warning is still logged for diagnostic purposes, the caller still receives error.UploadFailed, and State.upload still drops the placement via markForUnload (PR #383). The previous err level was the wrong signal for a recoverable, caller-handled state. Followup to PRs #382/#383 (DX12 image-upload audit).
deblasis
added a commit
that referenced
this pull request
May 20, 2026
Large image uploads previously allocated a single exact-fit UPLOAD-heap staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA image after the iTerm2 multipart cap was raised to 65 MiB). Split into ~8 MiB row-bands so each CopyTextureRegion uses a smaller staging allocation, reducing per-call heap pressure under fragmentation. Bands are computed via the rowsPerBand+Bands helpers. y + start_row shifts each band's destination. All bands' staging buffers stay alive until the next replaceRegion or deinit -- the frame fence still gates release in beginFrame. No behavior change for sub-band uploads (atlas updates, render targets, small images); they still allocate a single band-sized staging buffer. Followup to #382/#383. Documented in project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis
added a commit
that referenced
this pull request
May 20, 2026
* renderer/directx12: add rowsPerBand helper for chunked uploads Pure helper that returns how many source rows fit in a single upload band, with a min-1 floor so a single very-wide row still makes progress. Used by the upcoming chunked uploadRegion to size each staging buffer at ~8 MiB instead of one buffer for the full image. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: add Bands iterator for chunked uploads Yields (start_row, row_count) tuples for splitting an upload region into row-bands. Last band is truncated. Pure data, unit-tested, used by the upcoming uploadRegion refactor. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: pending_staging holds a list of buffers Replace the single ?*ID3D12Resource with ArrayListUnmanaged backed by c_allocator so a chunked upload can keep all of its band staging buffers alive until the GPU finishes the CopyTextureRegion calls. deinit, the replaceRegion prelude, and the existing single-buffer uploadRegion line all migrate to the list-based shape; behavior is unchanged for now because uploadRegion still does a single append per call. The band-loop lands in the next commit. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: chunk uploadRegion into 8 MiB row-bands Large image uploads previously allocated a single exact-fit UPLOAD-heap staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA image after the iTerm2 multipart cap was raised to 65 MiB). Split into ~8 MiB row-bands so each CopyTextureRegion uses a smaller staging allocation, reducing per-call heap pressure under fragmentation. Bands are computed via the rowsPerBand+Bands helpers. y + start_row shifts each band's destination. All bands' staging buffers stay alive until the next replaceRegion or deinit -- the frame fence still gates release in beginFrame. No behavior change for sub-band uploads (atlas updates, render targets, small images); they still allocate a single band-sized staging buffer. Followup to #382/#383. Documented in project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/image: note the image-size ceiling at prepImage Tie the alloc.dupe to the kitty graphics / iTerm2 multipart 65 MiB payload cap and point at the DX12 row-band chunking that handles the staging-heap pressure. Closes audit followup #5 from project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: fold review feedback into chunked uploads Addresses ghostty-reviewer feedback on #385 (will be squash-merged): * Trim deinit's value-receiver comment to lead with the cross-backend signature reason; drop the lecture on Zig value-parameter mechanics. * Document the partial-write semantics on replaceRegion: if a later band's staging/Map fails, earlier bands' CopyTextureRegion calls have already been recorded and will execute. Texture.init's errdefer Releases the destination so the partial writes are harmless; replaceRegion's destination survives, so a multi-band atlas update could leave a mix of new+previous rows until the next call. * Note the log-level asymmetry near the band loop: null guards stay at warn (defensive checks under correct callers), createStagingBuffer and Map failures stay at err (real GPU/driver failures with context). * Flag the y + band.start_row arithmetic at the CopyTextureRegion call as the integration point between the pure Bands iterator and DX12 that no unit test exercises -- a bug there shows up as vertically shifted/overlapping bands in a multi-band image. * Add an exact-fit boundary test for rowsPerBand (row == budget). Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis
added a commit
that referenced
this pull request
May 21, 2026
When Texture.uploadRegion's createStagingBuffer call returned null
(UPLOAD-heap allocation refused, e.g. under upload-heap budget
pressure on a large image) the function logged at error level and
silently returned. Texture.init then transitioned the texture to
PIXEL_SHADER_RESOURCE state with no contents and the image upload
path in src/renderer/image.zig marked the placement .ready. The
placement rendered as a black quad with no caller-visible signal.
Surface the failure through a new Texture.Error.UploadFailed variant:
- uploadRegion now returns Error!void instead of void. Each of the
four silent-failure paths (null device, null command_list, null
resource, null staging buffer, Map() failure) returns
error.UploadFailed.
- Texture.init's initial-data upload at the bottom is now
`try tex.uploadRegion(...)`. The existing errdefer releases the
GPU resource on the error path. Image.upload already maps
Texture.init errors into its own error.UploadFailed and the state
machine keeps the image .pending so the failure is logged by
State.upload's catch instead of silently producing a black quad.
- replaceRegion keeps its error{}!void Metal-compat signature but
catches uploadRegion's error and logs at warn level. Atlas and
render-target updates retry on the next frame.
Four new pure-Zig tests construct a Texture struct directly with
sentinel device/resource pointers and assert uploadRegion returns
error.UploadFailed on each of the three null-guard paths, without
needing a real D3D12 device.
Followup #1 from the 2026-05-19 Zig DX12 image upload audit.
deblasis
added a commit
that referenced
this pull request
May 21, 2026
When Image.upload fails (wuffs decode error or DX12 UploadFailed) the image was previously left in .pending state. State.upload's next iteration saw the same pending image, retried the upload, logged the same warn, and so on every frame for the lifetime of the placement. A persistently-failing upload (image larger than the upload-heap budget, persistent wuffs decode error, missing device) produced an unbounded log stream the user had no way to stop. Mark the image for unload in State.upload's catch arm. The state machine already has unload_pending / unload_replace transitions (via Image.markForUnload), and the per-frame sweep at the top of State.upload runs the isUnloading() check before the isPending() retry branch, so the failed image is freed on the next iteration and the log fires exactly once. Two pure-Zig tests cover the new contract: - Image.markForUnload pending -> unload_pending: builds an Image in .pending state, asserts markForUnload flips the tag and isUnloading() flips true. - Image.markForUnload is idempotent on already-unloading states: the early-return arms of markForUnload don't disturb the active tag. The retry-loop concern was the known limitation called out in PR #382 (audit followup #1).
deblasis
added a commit
that referenced
this pull request
May 21, 2026
…384) PR #382 introduced four sentinel-pointer null-guard tests that never actually ran on Windows: zig build test failed at compile time because @ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces pointer alignment for @ptrFromInt. After the alignment fix, the tests failed again at runtime because the null-guard branches call log.err, which the test runner counts via log_err_count and the build runner reports as "logged errors". Two changes: * Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are still distinct from 0xDEAD0 (so the failing-field is grep-locatable in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt accepts it. * Downgrade the three null-guard log.err calls in uploadRegion to log.warn. Production behavior is virtually identical: the warning is still logged for diagnostic purposes, the caller still receives error.UploadFailed, and State.upload still drops the placement via markForUnload (PR #383). The previous err level was the wrong signal for a recoverable, caller-handled state. Followup to PRs #382/#383 (DX12 image-upload audit).
deblasis
added a commit
that referenced
this pull request
May 21, 2026
* renderer/directx12: add rowsPerBand helper for chunked uploads Pure helper that returns how many source rows fit in a single upload band, with a min-1 floor so a single very-wide row still makes progress. Used by the upcoming chunked uploadRegion to size each staging buffer at ~8 MiB instead of one buffer for the full image. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: add Bands iterator for chunked uploads Yields (start_row, row_count) tuples for splitting an upload region into row-bands. Last band is truncated. Pure data, unit-tested, used by the upcoming uploadRegion refactor. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: pending_staging holds a list of buffers Replace the single ?*ID3D12Resource with ArrayListUnmanaged backed by c_allocator so a chunked upload can keep all of its band staging buffers alive until the GPU finishes the CopyTextureRegion calls. deinit, the replaceRegion prelude, and the existing single-buffer uploadRegion line all migrate to the list-based shape; behavior is unchanged for now because uploadRegion still does a single append per call. The band-loop lands in the next commit. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: chunk uploadRegion into 8 MiB row-bands Large image uploads previously allocated a single exact-fit UPLOAD-heap staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA image after the iTerm2 multipart cap was raised to 65 MiB). Split into ~8 MiB row-bands so each CopyTextureRegion uses a smaller staging allocation, reducing per-call heap pressure under fragmentation. Bands are computed via the rowsPerBand+Bands helpers. y + start_row shifts each band's destination. All bands' staging buffers stay alive until the next replaceRegion or deinit -- the frame fence still gates release in beginFrame. No behavior change for sub-band uploads (atlas updates, render targets, small images); they still allocate a single band-sized staging buffer. Followup to #382/#383. Documented in project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/image: note the image-size ceiling at prepImage Tie the alloc.dupe to the kitty graphics / iTerm2 multipart 65 MiB payload cap and point at the DX12 row-band chunking that handles the staging-heap pressure. Closes audit followup #5 from project_dx12_image_upload_audit.md. Co-Authored-By: Alessandro De Blasis <alex@deblasis.net> * renderer/directx12: fold review feedback into chunked uploads Addresses ghostty-reviewer feedback on #385 (will be squash-merged): * Trim deinit's value-receiver comment to lead with the cross-backend signature reason; drop the lecture on Zig value-parameter mechanics. * Document the partial-write semantics on replaceRegion: if a later band's staging/Map fails, earlier bands' CopyTextureRegion calls have already been recorded and will execute. Texture.init's errdefer Releases the destination so the partial writes are harmless; replaceRegion's destination survives, so a multi-band atlas update could leave a mix of new+previous rows until the next call. * Note the log-level asymmetry near the band loop: null guards stay at warn (defensive checks under correct callers), createStagingBuffer and Map failures stay at err (real GPU/driver failures with context). * Flag the y + band.start_row arithmetic at the CopyTextureRegion call as the integration point between the pure Bands iterator and DX12 that no unit test exercises -- a bug there shows up as vertically shifted/overlapping bands in a multi-band image. * Add an exact-fit boundary test for rowsPerBand (row == budget). Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix a silent failure in the DX12 image upload path. When
createStagingBuffer(UPLOAD-heap allocation viaCreateCommittedResource) returned null, the texture was previously left inPIXEL_SHADER_RESOURCEstate with no contents and the image was marked.readyby src/renderer/image.zig. Placement rendered as a black quad with no caller-visible signal. The bug becomes more reachable under the new ~48 MiB per-image ceiling from the iTerm2 multipart series (PRs #378/#380/#381).This is followup #1 from the 2026-05-19 Zig DX12 image upload audit (memory file).
What changes
UploadFailedvariant to Texture.Error.uploadRegionreturnsError!void. Each of the four silent-failure paths (null device, null command_list, null resource, null staging buffer,Map()failure) now returnserror.UploadFailed.Texture.init's initial-data upload at the bottom is nowtry tex.uploadRegion(...). The existingerrdefer _ = resource.Release()already releases the GPU resource on the error path.replaceRegionkeeps itserror{}!voidMetal-compat signature but catchesuploadRegion's error and logs at warn level — atlas / render-target updates retry next frame and don't benefit from propagating all the way out.Texturestruct directly with sentinel pointers and exercise each null-guard path without needing a real D3D12 device.Effect
[default] (warn): error uploading image to GPU err=UploadFailedinstead of a phantom-success black quad..pendingso the upload retries on the next frame (until either succeeding or being unloaded).zig build testpass.Out of scope
prepImage. Small enough to fold into one of the above.InfoBarin WinUI 3 host) — would need a new FFI channel; deferred.Test plan
zig build testpasses on Windows (1990 tests)zig build testpasses on macOS (via SSH bundle)Texture.zigexercise each null guard without a real DX12 device