Skip to content

renderer/directx12: surface staging-buffer upload failures#382

Merged
deblasis merged 1 commit into
windowsfrom
kitten-into-wintty/dx12-staging-fail-loud
May 19, 2026
Merged

renderer/directx12: surface staging-buffer upload failures#382
deblasis merged 1 commit into
windowsfrom
kitten-into-wintty/dx12-staging-fail-loud

Conversation

@deblasis
Copy link
Copy Markdown
Owner

Summary

Fix a silent failure in the DX12 image upload path. When createStagingBuffer (UPLOAD-heap allocation via CreateCommittedResource) returned null, the texture was previously left in PIXEL_SHADER_RESOURCE state with no contents and the image was marked .ready by src/renderer/image.zig. Placement rendered as a black quad with no caller-visible signal. The bug becomes more reachable under the new ~48 MiB per-image ceiling from the iTerm2 multipart series (PRs #378/#380/#381).

This is followup #1 from the 2026-05-19 Zig DX12 image upload audit (memory file).

What changes

  • Added UploadFailed variant to Texture.Error.
  • uploadRegion returns Error!void. Each of the four silent-failure paths (null device, null command_list, null resource, null staging buffer, Map() failure) now returns error.UploadFailed.
  • Texture.init's initial-data upload at the bottom is now try tex.uploadRegion(...). The existing errdefer _ = resource.Release() already releases the GPU resource on the error path.
  • replaceRegion keeps its error{}!void Metal-compat signature but catches uploadRegion's error and logs at warn level — atlas / render-target updates retry next frame and don't benefit from propagating all the way out.
  • 4 new pure-Zig tests construct a Texture struct directly with sentinel pointers and exercise each null-guard path without needing a real D3D12 device.

Effect

  • A real allocation-rejected case now produces [default] (warn): error uploading image to GPU err=UploadFailed instead of a phantom-success black quad.
  • The image stays .pending so the upload retries on the next frame (until either succeeding or being unloaded).
  • Both Windows and macOS zig build test pass.

Out of scope

Test plan

  • zig build test passes on Windows (1990 tests)
  • zig build test passes on macOS (via SSH bundle)
  • 4 new failure-path tests in Texture.zig exercise each null guard without a real DX12 device
  • Manual smoke once the next image-decoder change lands that might exercise this path under real budget pressure

When Texture.uploadRegion's createStagingBuffer call returned null
(UPLOAD-heap allocation refused, e.g. under upload-heap budget pressure
on a large image) the function logged at error level and silently
returned. Texture.init then transitioned the texture to
PIXEL_SHADER_RESOURCE state with no contents and the image upload path
in src/renderer/image.zig marked the placement .ready. The placement
rendered as a black quad with no caller-visible signal.

Surface the failure through a new Texture.Error.UploadFailed variant:

- uploadRegion now returns Error!void instead of void. Each of the four
  silent-failure paths (null device, null command_list, null resource,
  null staging buffer, Map() failure) returns error.UploadFailed.
- Texture.init's initial-data upload at the bottom of the function is
  now `try tex.uploadRegion(...)`. The existing errdefer releases the
  GPU resource on the error path. Image.upload (src/renderer/image.zig)
  already maps Texture.init errors into its own error.UploadFailed and
  the state machine keeps the image .pending so the failure is logged
  by State.upload's catch instead of silently producing a black quad.
- replaceRegion keeps its error{}!void Metal-compat signature but now
  catches uploadRegion's error and logs at warn level. Atlas and
  render-target updates are recoverable on the next frame and don't
  benefit from propagating the error all the way back to the renderer.

Four new pure-Zig tests construct a Texture struct directly with
sentinel device/resource pointers and assert uploadRegion returns
error.UploadFailed on each of the three null-guard paths, without
needing a real D3D12 device.

This fixes audit followup #1 from the 2026-05-19 Zig DX12 image upload
audit. Followups #2 (chunked row-band uploads) and #3 (per-frame
upload-budget guard) are intentionally deferred to separate PRs;
followup #5 (document the size ceiling near prepImage) is unblocked
but small enough to fold into either of those.

Known limitation: a persistent failure (e.g. an image larger than the
upload-heap budget) will retry every frame while it remains .pending
and log every retry. A "tried-and-failed" terminal state to break the
loop is worth a separate follow-up once we see this in production logs.
@deblasis deblasis marked this pull request as ready for review May 19, 2026 14:55
@deblasis deblasis merged commit a004133 into windows May 19, 2026
99 checks passed
@deblasis deblasis deleted the kitten-into-wintty/dx12-staging-fail-loud branch May 19, 2026 14:55
deblasis added a commit that referenced this pull request May 19, 2026
When Image.upload fails (wuffs decode error or DX12 UploadFailed) the
image was previously left in .pending state. State.upload's next
iteration saw the same pending image, retried the upload, logged the
same warn, and so on every frame for the lifetime of the placement.
A persistently-failing upload (image larger than the upload-heap
budget, persistent wuffs decode error, missing device) produced an
unbounded log stream the user had no way to stop.

Mark the image for unload in State.upload's catch arm. The state
machine already has unload_pending / unload_replace transitions
(via Image.markForUnload), and the per-frame sweep at the top of
State.upload runs the isUnloading() check before the isPending()
retry branch, so the failed image is freed on the next iteration
and the log fires exactly once.

Two pure-Zig tests cover the new contract:

- Image.markForUnload pending -> unload_pending: builds an Image in
  .pending state, asserts markForUnload flips the tag and
  isUnloading() flips true.
- Image.markForUnload is idempotent on already-unloading states: the
  early-return arms of markForUnload don't disturb the active tag.

The retry-loop concern was the known limitation called out in
PR #382 (audit followup #1).
deblasis added a commit that referenced this pull request May 20, 2026
PR #382 introduced four sentinel-pointer null-guard tests that never
actually ran on Windows: zig build test failed at compile time because
@ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces
pointer alignment for @ptrFromInt. After the alignment fix, the tests
failed again at runtime because the null-guard branches call log.err,
which the test runner counts via log_err_count and the build runner
reports as "logged errors".

Two changes:

* Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are
  still distinct from 0xDEAD0 (so the failing-field is grep-locatable
  in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt
  accepts it.
* Downgrade the three null-guard log.err calls in uploadRegion to
  log.warn. Production behavior is virtually identical: the warning
  is still logged for diagnostic purposes, the caller still receives
  error.UploadFailed, and State.upload still drops the placement
  via markForUnload (PR #383). The previous err level was the wrong
  signal for a recoverable, caller-handled state.

Followup to PRs #382/#383 (DX12 image-upload audit).

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis added a commit that referenced this pull request May 20, 2026
Large image uploads previously allocated a single exact-fit UPLOAD-heap
staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA
image after the iTerm2 multipart cap was raised to 65 MiB). Split into
~8 MiB row-bands so each CopyTextureRegion uses a smaller staging
allocation, reducing per-call heap pressure under fragmentation.

Bands are computed via the rowsPerBand+Bands helpers. y + start_row
shifts each band's destination. All bands' staging buffers stay alive
until the next replaceRegion or deinit -- the frame fence still gates
release in beginFrame. No behavior change for sub-band uploads (atlas
updates, render targets, small images); they still allocate a single
band-sized staging buffer.

Followup to #382/#383. Documented in project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis added a commit that referenced this pull request May 20, 2026
…384)

PR #382 introduced four sentinel-pointer null-guard tests that never
actually ran on Windows: zig build test failed at compile time because
@ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces
pointer alignment for @ptrFromInt. After the alignment fix, the tests
failed again at runtime because the null-guard branches call log.err,
which the test runner counts via log_err_count and the build runner
reports as "logged errors".

Two changes:

* Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are
  still distinct from 0xDEAD0 (so the failing-field is grep-locatable
  in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt
  accepts it.
* Downgrade the three null-guard log.err calls in uploadRegion to
  log.warn. Production behavior is virtually identical: the warning
  is still logged for diagnostic purposes, the caller still receives
  error.UploadFailed, and State.upload still drops the placement
  via markForUnload (PR #383). The previous err level was the wrong
  signal for a recoverable, caller-handled state.

Followup to PRs #382/#383 (DX12 image-upload audit).
deblasis added a commit that referenced this pull request May 20, 2026
Large image uploads previously allocated a single exact-fit UPLOAD-heap
staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA
image after the iTerm2 multipart cap was raised to 65 MiB). Split into
~8 MiB row-bands so each CopyTextureRegion uses a smaller staging
allocation, reducing per-call heap pressure under fragmentation.

Bands are computed via the rowsPerBand+Bands helpers. y + start_row
shifts each band's destination. All bands' staging buffers stay alive
until the next replaceRegion or deinit -- the frame fence still gates
release in beginFrame. No behavior change for sub-band uploads (atlas
updates, render targets, small images); they still allocate a single
band-sized staging buffer.

Followup to #382/#383. Documented in project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis added a commit that referenced this pull request May 20, 2026
* renderer/directx12: add rowsPerBand helper for chunked uploads

Pure helper that returns how many source rows fit in a single upload
band, with a min-1 floor so a single very-wide row still makes progress.
Used by the upcoming chunked uploadRegion to size each staging buffer
at ~8 MiB instead of one buffer for the full image.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: add Bands iterator for chunked uploads

Yields (start_row, row_count) tuples for splitting an upload region into
row-bands. Last band is truncated. Pure data, unit-tested, used by the
upcoming uploadRegion refactor.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: pending_staging holds a list of buffers

Replace the single ?*ID3D12Resource with ArrayListUnmanaged backed by
c_allocator so a chunked upload can keep all of its band staging buffers
alive until the GPU finishes the CopyTextureRegion calls. deinit, the
replaceRegion prelude, and the existing single-buffer uploadRegion line
all migrate to the list-based shape; behavior is unchanged for now
because uploadRegion still does a single append per call. The band-loop
lands in the next commit.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: chunk uploadRegion into 8 MiB row-bands

Large image uploads previously allocated a single exact-fit UPLOAD-heap
staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA
image after the iTerm2 multipart cap was raised to 65 MiB). Split into
~8 MiB row-bands so each CopyTextureRegion uses a smaller staging
allocation, reducing per-call heap pressure under fragmentation.

Bands are computed via the rowsPerBand+Bands helpers. y + start_row
shifts each band's destination. All bands' staging buffers stay alive
until the next replaceRegion or deinit -- the frame fence still gates
release in beginFrame. No behavior change for sub-band uploads (atlas
updates, render targets, small images); they still allocate a single
band-sized staging buffer.

Followup to #382/#383. Documented in project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/image: note the image-size ceiling at prepImage

Tie the alloc.dupe to the kitty graphics / iTerm2 multipart 65 MiB
payload cap and point at the DX12 row-band chunking that handles the
staging-heap pressure. Closes audit followup #5 from
project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: fold review feedback into chunked uploads

Addresses ghostty-reviewer feedback on #385 (will be squash-merged):

* Trim deinit's value-receiver comment to lead with the cross-backend
  signature reason; drop the lecture on Zig value-parameter mechanics.
* Document the partial-write semantics on replaceRegion: if a later
  band's staging/Map fails, earlier bands' CopyTextureRegion calls have
  already been recorded and will execute. Texture.init's errdefer
  Releases the destination so the partial writes are harmless;
  replaceRegion's destination survives, so a multi-band atlas update
  could leave a mix of new+previous rows until the next call.
* Note the log-level asymmetry near the band loop: null guards stay at
  warn (defensive checks under correct callers), createStagingBuffer
  and Map failures stay at err (real GPU/driver failures with context).
* Flag the y + band.start_row arithmetic at the CopyTextureRegion call
  as the integration point between the pure Bands iterator and DX12
  that no unit test exercises -- a bug there shows up as vertically
  shifted/overlapping bands in a multi-band image.
* Add an exact-fit boundary test for rowsPerBand (row == budget).

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
deblasis added a commit that referenced this pull request May 21, 2026
When Texture.uploadRegion's createStagingBuffer call returned null
(UPLOAD-heap allocation refused, e.g. under upload-heap budget
pressure on a large image) the function logged at error level and
silently returned. Texture.init then transitioned the texture to
PIXEL_SHADER_RESOURCE state with no contents and the image upload
path in src/renderer/image.zig marked the placement .ready. The
placement rendered as a black quad with no caller-visible signal.

Surface the failure through a new Texture.Error.UploadFailed variant:

- uploadRegion now returns Error!void instead of void. Each of the
  four silent-failure paths (null device, null command_list, null
  resource, null staging buffer, Map() failure) returns
  error.UploadFailed.
- Texture.init's initial-data upload at the bottom is now
  `try tex.uploadRegion(...)`. The existing errdefer releases the
  GPU resource on the error path. Image.upload already maps
  Texture.init errors into its own error.UploadFailed and the state
  machine keeps the image .pending so the failure is logged by
  State.upload's catch instead of silently producing a black quad.
- replaceRegion keeps its error{}!void Metal-compat signature but
  catches uploadRegion's error and logs at warn level. Atlas and
  render-target updates retry on the next frame.

Four new pure-Zig tests construct a Texture struct directly with
sentinel device/resource pointers and assert uploadRegion returns
error.UploadFailed on each of the three null-guard paths, without
needing a real D3D12 device.

Followup #1 from the 2026-05-19 Zig DX12 image upload audit.
deblasis added a commit that referenced this pull request May 21, 2026
When Image.upload fails (wuffs decode error or DX12 UploadFailed) the
image was previously left in .pending state. State.upload's next
iteration saw the same pending image, retried the upload, logged the
same warn, and so on every frame for the lifetime of the placement.
A persistently-failing upload (image larger than the upload-heap
budget, persistent wuffs decode error, missing device) produced an
unbounded log stream the user had no way to stop.

Mark the image for unload in State.upload's catch arm. The state
machine already has unload_pending / unload_replace transitions
(via Image.markForUnload), and the per-frame sweep at the top of
State.upload runs the isUnloading() check before the isPending()
retry branch, so the failed image is freed on the next iteration
and the log fires exactly once.

Two pure-Zig tests cover the new contract:

- Image.markForUnload pending -> unload_pending: builds an Image in
  .pending state, asserts markForUnload flips the tag and
  isUnloading() flips true.
- Image.markForUnload is idempotent on already-unloading states: the
  early-return arms of markForUnload don't disturb the active tag.

The retry-loop concern was the known limitation called out in
PR #382 (audit followup #1).
deblasis added a commit that referenced this pull request May 21, 2026
…384)

PR #382 introduced four sentinel-pointer null-guard tests that never
actually ran on Windows: zig build test failed at compile time because
@ptrFromInt(0xDEAD1) is not 8-byte aligned and Zig 0.15.2 enforces
pointer alignment for @ptrFromInt. After the alignment fix, the tests
failed again at runtime because the null-guard branches call log.err,
which the test runner counts via log_err_count and the build runner
reports as "logged errors".

Two changes:

* Use 0xDEAD8 instead of 0xDEAD1 for the sentinel pointers. Both are
  still distinct from 0xDEAD0 (so the failing-field is grep-locatable
  in a panic dump), and 0xDEAD8 is 8-byte aligned so @ptrFromInt
  accepts it.
* Downgrade the three null-guard log.err calls in uploadRegion to
  log.warn. Production behavior is virtually identical: the warning
  is still logged for diagnostic purposes, the caller still receives
  error.UploadFailed, and State.upload still drops the placement
  via markForUnload (PR #383). The previous err level was the wrong
  signal for a recoverable, caller-handled state.

Followup to PRs #382/#383 (DX12 image-upload audit).
deblasis added a commit that referenced this pull request May 21, 2026
* renderer/directx12: add rowsPerBand helper for chunked uploads

Pure helper that returns how many source rows fit in a single upload
band, with a min-1 floor so a single very-wide row still makes progress.
Used by the upcoming chunked uploadRegion to size each staging buffer
at ~8 MiB instead of one buffer for the full image.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: add Bands iterator for chunked uploads

Yields (start_row, row_count) tuples for splitting an upload region into
row-bands. Last band is truncated. Pure data, unit-tested, used by the
upcoming uploadRegion refactor.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: pending_staging holds a list of buffers

Replace the single ?*ID3D12Resource with ArrayListUnmanaged backed by
c_allocator so a chunked upload can keep all of its band staging buffers
alive until the GPU finishes the CopyTextureRegion calls. deinit, the
replaceRegion prelude, and the existing single-buffer uploadRegion line
all migrate to the list-based shape; behavior is unchanged for now
because uploadRegion still does a single append per call. The band-loop
lands in the next commit.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: chunk uploadRegion into 8 MiB row-bands

Large image uploads previously allocated a single exact-fit UPLOAD-heap
staging buffer for the whole region (e.g. ~48 MiB for a 4096x3072 RGBA
image after the iTerm2 multipart cap was raised to 65 MiB). Split into
~8 MiB row-bands so each CopyTextureRegion uses a smaller staging
allocation, reducing per-call heap pressure under fragmentation.

Bands are computed via the rowsPerBand+Bands helpers. y + start_row
shifts each band's destination. All bands' staging buffers stay alive
until the next replaceRegion or deinit -- the frame fence still gates
release in beginFrame. No behavior change for sub-band uploads (atlas
updates, render targets, small images); they still allocate a single
band-sized staging buffer.

Followup to #382/#383. Documented in project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/image: note the image-size ceiling at prepImage

Tie the alloc.dupe to the kitty graphics / iTerm2 multipart 65 MiB
payload cap and point at the DX12 row-band chunking that handles the
staging-heap pressure. Closes audit followup #5 from
project_dx12_image_upload_audit.md.

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>

* renderer/directx12: fold review feedback into chunked uploads

Addresses ghostty-reviewer feedback on #385 (will be squash-merged):

* Trim deinit's value-receiver comment to lead with the cross-backend
  signature reason; drop the lecture on Zig value-parameter mechanics.
* Document the partial-write semantics on replaceRegion: if a later
  band's staging/Map fails, earlier bands' CopyTextureRegion calls have
  already been recorded and will execute. Texture.init's errdefer
  Releases the destination so the partial writes are harmless;
  replaceRegion's destination survives, so a multi-band atlas update
  could leave a mix of new+previous rows until the next call.
* Note the log-level asymmetry near the band loop: null guards stay at
  warn (defensive checks under correct callers), createStagingBuffer
  and Map failures stay at err (real GPU/driver failures with context).
* Flag the y + band.start_row arithmetic at the CopyTextureRegion call
  as the integration point between the pure Bands iterator and DX12
  that no unit test exercises -- a bug there shows up as vertically
  shifted/overlapping bands in a multi-band image.
* Add an exact-fit boundary test for rowsPerBand (row == budget).

Co-Authored-By: Alessandro De Blasis <alex@deblasis.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant