From 95b7ed2cd1efda00eda706189be10566b7e4920b Mon Sep 17 00:00:00 2001 From: BANANASJIM Date: Sun, 17 May 2026 09:17:10 -0700 Subject: [PATCH] fix(daemon): match D3D12 adapter via real chunk name + exact DeviceId (#225) PR #226's structured-data adapter guard never fired on real hardware. Defects fixed here: 1. Chunk-name guard: the real renderdoc 1.42 chunk is "Internal::Driver Initialisation Parameters" (with a space), which none of the legacy markers ("DriverInit"/"EnumAdapters"/ "CreateDXGIFactory") match. Add "Driver Initialisation Parameters" to d3d_markers; keep the legacy markers (commented as version-fallback) for other versions/paths. 2. Tree depth: the real Description lives at depth 3 (chunk -> InitParams -> AdapterDesc -> Description), but the old walker only inspected depth 1/2. Replace it with a depth-bounded recursive descent (cap 5) returning an AdapterMatch carrying (description, device_id); DeviceId read via AsInt(). VendorId is deliberately not parsed (PCI id vs GPUVendor enum are unrelated, so it has no consumer). 3. Disambiguation: name-substring matching is fragile for same-vendor multi-GPU. _match_capture_gpu now selects in three tiers: exact DeviceId == g.deviceID (primary), name-substring (secondary), vendor-priority (tertiary, unchanged). 4. WARP/software sentinel: DXGI_ADAPTER_DESC.DeviceId == 0 on software/WARP/virtualized captures. Tier-1 is gated on truthiness (not "is not None") and skips any candidate with deviceID == 0, so a zero id can never bind the WARP adapter. 5. Version-drift diagnostic: a distinct _log.warning fires when a multi-GPU capture has structured data but no recognized vk/d3d adapter chunk, surfacing future renderdoc chunk renames instead of silently falling back to vendor priority. Adds 14 unit tests (5 red-first defect proofs + 9 regression guards, incl. the WARP-sentinel guard). --- .../proposal.md | 235 ++++++++++++++++++ .../tasks.md | 14 ++ .../test-plan.md | 232 +++++++++++++++++ openspec/specs/daemon/spec.md | 11 +- src/rdc/daemon_server.py | 103 ++++++-- tests/unit/test_daemon_server_unit.py | 200 +++++++++++++++ 6 files changed, 767 insertions(+), 28 deletions(-) create mode 100644 openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/proposal.md create mode 100644 openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/tasks.md create mode 100644 openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/test-plan.md diff --git a/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/proposal.md b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/proposal.md new file mode 100644 index 0000000..3f47006 --- /dev/null +++ b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/proposal.md @@ -0,0 +1,235 @@ +# Fix #225 (fix-forward): D3D12 Chunk Name, Depth, and Exact-ID Matching + +## Summary + +PR #226 (squash 3735f2b) shipped the D3D12 adapter-matching logic but the +structured-data guard never fires on real hardware. Hardware verification by +issue reporter @YunHsiao (renderdoc 1.42, Windows multi-GPU D3D12 box) confirmed +that the user-facing crash is resolved by the vendor-priority fallback introduced +in #226, but the structured-data path — the primary, deterministic signal — is +silently skipped on every real capture. Three distinct defects prevent it from +working. This change is a fix-forward on the merged commit; issue #225 remains +open until hardware re-verification passes. + +## Problem + +### Canonical dump (ground truth) + +Reporter provided the structured-data tree from renderdoc 1.42's bundled Python +module on a real D3D12 capture. The ONLY chunk carrying adapter identity is +`chunk[0]`, whose name is literally: + +``` +Internal::Driver Initialisation Parameters +``` + +Its subtree: + +``` +Internal::Driver Initialisation Parameters [Chunk] children=1 + InitParams [D3D12InitParams] children=7 + MinimumFeatureLevel [D3D_FEATURE_LEVEL] + AdapterDesc [DXGI_ADAPTER_DESC] children=9 + Description [string] = 'NVIDIA RTX 4500 Ada Generation' + VendorId [uint32_t] + DeviceId [uint32_t] + SubSysId [uint32_t] + Revision [uint32_t] + DedicatedVideoMemory [uint64_t] + DedicatedSystemMemory [uint64_t] + SharedSystemMemory [uint64_t] + AdapterLuid [LUID] + usedDXIL, VendorExtensions, VendorUAV, VendorUAVSpace, SDKVersion +``` + +All remaining chunks are `ID3D12Device::*` / `ID3D12Resource::SetName` API calls +and 4834× `Internal::Initial Contents`. There is no `EnumAdapters`, no +`CreateDXGIFactory`, no `DriverInit` chunk anywhere in the capture. + +Available GPUs on reporter's machine: +- `AMD Radeon(TM) Graphics` vendor=2 deviceID=5056 +- `NVIDIA RTX 4500 Ada Generation` vendor=6 deviceID=10161 +- `WARP Rasterizer` vendor=9 deviceID=0 + +### Defect 1 — outer chunk guard never matches + +`d3d_markers = ("DriverInit", "EnumAdapters", "CreateDXGIFactory")` + +None of the three substrings occur in `"Internal::Driver Initialisation Parameters"`. +The space between "Driver" and "Initialisation" defeats `"DriverInit"`. The guard +falls through on every chunk and `_find_adapter_description` is never called. + +### Defect 2 — `_find_adapter_description` only walks 2 levels + +The real `Description` field is at depth 3 relative to the chunk root: + +``` +chunk (depth 0) + InitParams (depth 1) + AdapterDesc (depth 2) + Description (depth 3) ← required +``` + +The current walker checks direct children (depth 1) for `Description`/`DeviceName` +and direct children of `AdapterDesc`/`pAdapter`/`adapter` (depth 2 for the +container, depth 3 only if the container is at depth 1). Because `AdapterDesc` is +at depth 2 (a grandchild), the current inner loop is never reached. + +### Defect 3 — name-substring match is fragile for same-vendor multi-GPU + +The name-substring comparison `desc_l in name_l or name_l in desc_l` is the +only disambiguation signal once a Description is found. On a system with two +NVIDIA cards whose names differ only by suffix, substring matching can produce a +false-positive match. The `AdapterDesc` already exposes `DeviceId` (PCI Device +ID, uint32) and `VendorId` (PCI Vendor ID, uint32). The reporter explicitly +recommended hardening to exact numeric match. + +`GPUDevice.deviceID` in the renderdoc Python API carries the PCI Device ID, which +is what `DXGI_ADAPTER_DESC.DeviceId` encodes. An exact `DeviceId` match is +unambiguous and free of string-normalization risk. + +## Design + +### Part 1 — extend chunk name markers + +Add `"Driver Initialisation Parameters"` to `d3d_markers`. Retain the three +existing substrings (`"DriverInit"`, `"EnumAdapters"`, `"CreateDXGIFactory"`) for +compatibility with other renderdoc versions or API paths that may use them. + +```python +d3d_markers = ( + "Driver Initialisation Parameters", # renderdoc 1.42 D3D12 + "DriverInit", + "EnumAdapters", + "CreateDXGIFactory", +) +``` + +### Part 2 — bounded recursive descent in `_find_adapter_description` + +Replace the 2-level manual walk with a depth-bounded recursive descent (cap at +depth 4 or 5). At each node: + +1. If the node name is `Description` or `DeviceName`, return its string value and + a paired `DeviceId` if available at the same level. +2. If the node name is `AdapterDesc`, `pAdapter`, or `adapter`, recurse into it + with depth decremented — this finds `Description` at depth 3 regardless of + intermediate wrapper levels. +3. Otherwise recurse into all children with depth decremented. + +Return type changes to a small named tuple carrying +`(description, device_id)` so the caller can use the numeric `DeviceId` for the +primary match. `VendorId` is deliberately NOT parsed (see Risks — it cannot be +compared against the renderdoc `GPUVendor` enum, so carrying it would be dead). + +`DeviceId` is a `uint32_t` structured-data scalar. The renderdoc `SDObject` +interface used everywhere else in this code reads strings via `AsString()`; the +integer-valued counterpart is `AsInt()` (there is no `AsUInt()` on `SDObject`). +The implementation MUST read `DeviceId` via `child.AsInt()`, consistent with the +existing `AsString()` usage on the same interface. There is no top-level integer +`.data` attribute to read directly. + +### Part 3 — exact numeric DeviceId match as primary signal + +In `_match_capture_gpu`, after obtaining the result from +`_find_adapter_description`, apply a three-tier selection: + +1. **Primary**: exact `DeviceId` match — `match.device_id == g.deviceID` for + each GPU `g`. The first matching GPU wins. A `DeviceId` of `0` is treated as + a software/WARP sentinel (software, WARP and virtualized adapters report + `DXGI_ADAPTER_DESC.DeviceId == 0`): the tier-1 gate is truthiness, not + `is not None`, so a present-but-zero id falls through to tier 2; and any + candidate GPU whose `deviceID == 0` is skipped inside the exact-match loop so + a real non-zero parsed id can never coincidentally bind the WARP adapter. +2. **Secondary**: name-substring match (existing logic) — used when no GPU's + `deviceID` matches the parsed value, when `DeviceId` is absent, or when the + parsed `DeviceId` is the `0` sentinel. +3. **Tertiary**: vendor-priority fallback (existing logic, unchanged). + +### Known limitations / risks + +**Why VendorId is not parsed.** `DXGI_ADAPTER_DESC.VendorId` is the raw PCI +vendor ID (e.g., NVIDIA = 0x10DE = 4318). The renderdoc `GPUVendor` enum uses +ordinals (nVidia = 6). These values are in unrelated spaces, so `VendorId` +cannot be compared against `g.vendor` and there is no other consumer for it. +The field is therefore deliberately NOT read from the chunk (carrying it would +be dead state). `DeviceId` → `g.deviceID` is the only exact key used. + +**(RISK-A) `deviceID` vs PCI `DeviceId` same-space assumption.** Tier-1 assumes +the renderdoc `GPUDevice.deviceID` is the same integer as the PCI +`DXGI_ADAPTER_DESC.DeviceId`. This holds on the D3D12 path per the reporter +dump but is unverifiable on Linux/Vulkan from this repo. A mismatch silently +demotes tier-1 to name-substring (correctness preserved, determinism lost). +*Reporter verification question:* "On the #225 capture, print +`GetAvailableGPUs()[i].deviceID` for the NVIDIA RTX 4500 entry AND +`AdapterDesc.DeviceId` parsed from the same capture; confirm they are +bitwise-equal (expected 10161)." + +**(RISK-B) Chunk-name version pinning.** The substring +`"Driver Initialisation Parameters"` is renderdoc-1.42-specific (British +spelling). A future renderdoc renaming it makes the guard miss every chunk and +silently reverts to vendor-priority — exactly the #225 failure class. Mitigated +by the new FIX-3 diagnostic `_log.warning` ("no recognized adapter chunk … — +renderdoc version may have changed chunk naming") which fires only on the +multi-GPU + structured-data + no-marker-chunk path, making the regression +loud and diagnosable instead of silent. *Reporter verification question:* "After +a renderdoc upgrade, if multi-GPU replay picks the wrong card, grep the daemon +log for `renderdoc version may have changed chunk naming`; its presence confirms +the chunk was renamed and the new name must be added to `d3d_markers`." + +**Bounded-recursion depth.** A depth cap of 4 is sufficient for the observed tree +(Description at depth 3). Setting the cap to 5 provides one level of headroom. +Setting it higher risks traversing large subtrees (4834× `Initial Contents` chunks +are siblings at the top level, not descendants, so they are not a concern; the +guard in Part 1 prevents entering them). + +**Chunk-name variance across renderdoc versions.** The real chunk name contains +"Initialisation" (British spelling). Future renderdoc versions may rename it. +Adding the new marker as an additional entry (not replacing the old ones) hedges +against this. If the chunk name is ever confirmed to differ, the new entry should +be updated in a follow-up. + +**`GPUDevice` field names.** The field names `g.deviceID` and `g.name` are +inferred from the existing code and the reported dump. The implementer must +confirm these names against the bundled `renderdoc.pyi` or equivalent before +finalising the implementation. + +## Spec Delta + +The existing scenario "Multi-GPU capture replay" under **Requirement: Replay +lifecycle** in `openspec/specs/daemon/spec.md` (added by #226) requires the +following amendment: + +The target scenario in `openspec/specs/daemon/spec.md` (HEAD lines 38–45) uses +bold `**WHEN**` / `**AND**` / `**THEN**` markers. The delta below reproduces +those bytes verbatim so it applies cleanly: + +```diff + #### Scenario: Multi-GPU capture replay + - **WHEN** the capture was taken on a multi-GPU system + - **AND** multiple GPUs are available for replay +-- **THEN** the daemon selects the GPU whose name matches structured-data adapter +- metadata extracted from the capture ++- **THEN** the daemon walks structured-data chunks whose name contains any of ++ "Driver Initialisation Parameters", "DriverInit", "EnumAdapters", or ++ "CreateDXGIFactory" ++- **AND** within each such chunk descends recursively (depth ≤ 5) to locate an ++ AdapterDesc/pAdapter/adapter subtree ++- **AND** selects the GPU whose PCI DeviceId exactly matches ++ DXGI_ADAPTER_DESC.DeviceId when that field is present ++- **AND** falls back to case-insensitive name-substring match when no exact ++ DeviceId match is found + - **AND** if no structured-data match exists, Software/WARP adapters are excluded + and the highest-ranked discrete GPU is preferred (nVidia > AMD > Intel) + - **AND** a single available GPU is always returned directly without inspection +``` + +This delta has been applied verbatim to `openspec/specs/daemon/spec.md` as +part of this change. + +## Files Changed + +- `src/rdc/daemon_server.py` — `d3d_markers` tuple; `_find_adapter_description` + return type and walk algorithm; `_match_capture_gpu` selection logic +- `openspec/specs/daemon/spec.md` — scenario amendment (see Spec Delta above) +- `tests/unit/test_daemon_server_unit.py` — new unit tests (see `test-plan.md`) diff --git a/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/tasks.md b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/tasks.md new file mode 100644 index 0000000..a361ccc --- /dev/null +++ b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/tasks.md @@ -0,0 +1,14 @@ +# Fix #225 (fix-forward): Tasks + +- [ ] Opus review of `proposal.md` and `test-plan.md`; revise as needed +- [ ] `daemon_server.py`: add `"Driver Initialisation Parameters"` to `d3d_markers` (keep existing three entries) +- [ ] `daemon_server.py`: rewrite `_find_adapter_description` with bounded recursive descent (depth ≤ 5); return `(description, device_id)` named tuple (`VendorId` deliberately not parsed) +- [ ] `daemon_server.py`: update `_match_capture_gpu` selection logic — exact `DeviceId` match primary (with `0` treated as software/WARP sentinel: tier-1 gated on truthiness and `deviceID == 0` candidates skipped), name-substring secondary, vendor-priority tertiary +- [ ] `daemon_server.py`: add FIX-3 diagnostic `_log.warning` for chunk-name version drift (multi-GPU + sd present + no recognized vk/d3d marker chunk seen) +- [ ] `openspec/specs/daemon/spec.md`: apply Spec Delta from `proposal.md` to the "Multi-GPU capture replay" scenario +- [ ] `tests/unit/test_daemon_server_unit.py`: implement all 14 test cases from `test-plan.md` (5 red-first defect proofs + 9 regression guards, incl. the WARP-sentinel guard); confirm each red-first test fails against current `daemon_server.py` BEFORE editing source; the existing `_gpu` helper hardcodes `deviceID=0`, so DeviceId tests must build GPUs with explicit distinct `deviceID` values +- [ ] `pixi run lint && pixi run test` passes +- [ ] Fresh Opus review of implementation diff +- [ ] Adversarial review: walk the real canonical dump tree by hand against the new code path; confirm guard fires on `"Internal::Driver Initialisation Parameters"` and `DeviceId=10161` selects NVIDIA +- [ ] Base/target branch is `master` — verified, not assumed: `git -C symbolic-ref refs/remotes/origin/HEAD` resolves to `refs/remotes/origin/master`, and `git remote show origin` reports `HEAD branch: master` (rdc-cli project convention). Branch off `master`; do not assume `main` +- [ ] Open PR targeting `master`; tag @YunHsiao for hardware re-verification diff --git a/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/test-plan.md b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/test-plan.md new file mode 100644 index 0000000..d2e4f9f --- /dev/null +++ b/openspec/changes/2026-05-17-fix-225-d3d12-chunk-name/test-plan.md @@ -0,0 +1,232 @@ +# Fix #225 (fix-forward): Test Plan + +## Test classification + +Tests are split into two classes with different obligations: + +- **Red-first defect proofs** — MUST fail when run against the current buggy + `daemon_server.py` (pre-fix) and pass after the fix. Each states the exact + failure mode against current code so the implementer can confirm redness + before changing any source. +- **Regression guards** — green pre-fix; they pin behavior that already works + so the new recursive walker / DeviceId selection does not regress it. They + MUST NOT be presented as defect proofs. + +All tests live in `tests/unit/test_daemon_server_unit.py`. + +## Mock construction (real shape — verified against `tests/mocks/mock_renderdoc.py`) + +Structured data is built from the real `mock_renderdoc` dataclasses, exactly as +existing tests do (see `_d3d12_sd` / `_vulkan_sd` helpers in the current file). +There is **no** `AsUInt()` and **no** top-level integer `.data`. Numeric scalars +flow through `SDBasic.value` and are read via `AsInt()`; strings via +`AsString()`. Both accessors live on the same `SDObject`. + +String field: + +```python +rd.SDObject(name="Description", data=rd.SDData(basic=rd.SDBasic(value="NVIDIA RTX 4500 Ada Generation"))) +# .AsString() -> "NVIDIA RTX 4500 Ada Generation" +``` + +Numeric (uint32) field — `DeviceId` (the chunk also contains a `VendorId` +sibling, but it is deliberately NOT parsed by the implementation; see +`proposal.md` Risks — so tests do not build a `VendorId` child): + +```python +rd.SDObject(name="DeviceId", data=rd.SDData(basic=rd.SDBasic(value=10161))) +# .AsInt() -> 10161 +``` + +Canonical depth-3 tree from the reporter's dump (omitting the unused +`VendorId` node): + +```python +desc = rd.SDObject(name="Description", data=rd.SDData(basic=rd.SDBasic(value="NVIDIA RTX 4500 Ada Generation"))) +devid = rd.SDObject(name="DeviceId", data=rd.SDData(basic=rd.SDBasic(value=10161))) +adapter = rd.SDObject(name="AdapterDesc", children=[desc, devid]) +init = rd.SDObject(name="InitParams", children=[adapter]) +chunk = rd.SDChunk(name="Internal::Driver Initialisation Parameters", children=[init]) +sd = rd.StructuredFile(chunks=[chunk]) +``` + +The implementation reads `DeviceId` via `child.AsInt()` (consistent with the +existing `AsString()` usage on the same `SDObject` interface). No accessor +hedging is needed — `AsInt()` is the confirmed real accessor. + +GPU objects: the existing `_gpu(name, vendor)` helper hardcodes `deviceID=0`. +Tests that exercise DeviceId matching MUST construct GPUs with explicit, +distinct `deviceID` values (e.g. `SimpleNamespace(name=..., vendor=..., +deviceID=10161, driver="")`). + +## Red-first defect proofs (MUST fail against current buggy code) + +- [ ] `test_d3d12_chunk_guard_recognizes_real_chunk_name` — **isolates Defect 1 + independent of depth-3 resolution.** Build structured data with a single chunk + named exactly `"Internal::Driver Initialisation Parameters"` whose + `AdapterDesc.Description = "AMD Radeon RX 7900 XTX"` is a *direct child of the + chunk* (depth-2 layout, no `InitParams` wrapper). Two GPUs, pinned exactly: + 1. `name="AMD Radeon RX 7900 XTX" vendor=2 deviceID=29772` — the GPU the + chunk's Description identifies. + 2. `name="NVIDIA RTX 4500 Ada Generation" vendor=6 deviceID=10161`. + + Pass `rd=` as the mock module (it has **no** `GPUVendor` attribute, so the + fallback uses the hardcoded defaults `nVidia=6, AMD=2, Intel=5, Software=9` + and `priority = {6:0, 2:1, 5:2}`). Assert the AMD GPU (`deviceID=29772`) is + returned. *Failure mode against current code:* `d3d_markers = ("DriverInit", + "EnumAdapters", "CreateDXGIFactory")` — no substring occurs in `"Internal:: + Driver Initialisation Parameters"` (the space breaks `"DriverInit"`), so the + guard never fires and `_find_adapter_description` is never called. The + function falls to vendor-priority: `filtered` keeps both GPUs (neither is + Software), `filtered.sort(key=priority.get(vendor, 99))` orders NVIDIA + (vendor=6 → priority 0) before AMD (vendor=2 → priority 1), so current code + returns **NVIDIA** — the WRONG GPU vs. the asserted AMD. The wrong-GPU + pick (NVIDIA before AMD) is guaranteed by the priority map regardless of + enumeration order, so this test is provably RED pre-fix and isolates the + guard predicate: a marker-only fix makes the guard fire on the real chunk + name, the depth-2 `AdapterDesc.Description` resolves under the existing inner + loop, and AMD is returned. Provable separately from the depth-3 walker fix. + +- [ ] `test_d3d12_chunk_guard_fires_on_real_name` — end-to-end on the real tree, + redesigned so the vendor-priority fallback returns the WRONG GPU (only the + guard + depth-3 walker + DeviceId fix yields the asserted one). Single chunk + named `"Internal::Driver Initialisation Parameters"` with the full canonical + depth-3 subtree `chunk → InitParams → AdapterDesc → {Description = "AMD Radeon + RX 7900 XTX", DeviceId = 29772}`. Two GPUs, pinned exactly: + 1. `name="AMD Radeon RX 7900 XTX" vendor=2 deviceID=29772` — the card the + capture's `AdapterDesc` identifies (by Description AND DeviceId). + 2. `name="NVIDIA RTX 4500 Ada Generation" vendor=6 deviceID=10161`. + + Pass `rd=` as the mock module (no `GPUVendor` attribute → defaults + `nVidia=6, AMD=2, Intel=5, Software=9`, `priority = {6:0, 2:1, 5:2}`). Assert + `_match_capture_gpu` returns the **AMD** GPU (`deviceID=29772`). *Failure mode + against current code:* guard miss (Defect 1) — `"Internal::Driver + Initialisation Parameters"` contains none of the markers — so + `_find_adapter_description` is never called; even if the guard had matched, + `_find_adapter_description` stops at depth 2 (Defect 2) and never reaches the + depth-3 `AdapterDesc` under `InitParams`; and even if a name were extracted + there is no DeviceId logic (Defect 3). The function falls to vendor-priority: + both GPUs survive the Software filter, `filtered.sort` orders NVIDIA + (vendor=6 → priority 0) before AMD (vendor=2 → priority 1), so current code + returns **NVIDIA** — the WRONG card vs. the asserted AMD. The fallback's + wrong pick is fixed by the priority map (NVIDIA strictly before AMD) and is + independent of enumeration order, so the test is provably RED pre-fix. Only + after all three fixes — marker added → guard fires on the real name; walker + recurses to depth 3 → reaches `AdapterDesc`; exact `DeviceId=29772` matches + GPU #1 — is the asserted AMD GPU returned. + +- [ ] `test_find_adapter_description_depth3` — call `_find_adapter_description` + directly with the canonical depth-3 object tree `chunk → InitParams → + AdapterDesc → {Description, DeviceId=10161}`. Assert the returned value exposes + `description == "NVIDIA RTX 4500 Ada Generation"` and `device_id == 10161`. + *Failure mode against current code:* the current walker only inspects direct + children of the chunk and direct children of an `AdapterDesc`/`pAdapter`/ + `adapter` *direct child*; here `AdapterDesc` is a grandchild (under + `InitParams`), so the inner loop is never entered and the function returns + `None` (and has no `device_id` field at all — current return type is `str | + None`). Proves Defect 2. + +- [ ] `test_exact_deviceid_wins_over_wrong_name` — chunk named exactly + `"Internal::Driver Initialisation Parameters"` with the canonical depth-3 + subtree `chunk → InitParams → AdapterDesc → {Description = "NVIDIA RTX 4500", + DeviceId = 10161}` (same chunk name and InitParams→AdapterDesc depth as the + real reporter dump, so the implementer deterministically exercises the + guard + depth-3 walker + DeviceId path). Two GPUs, same vendor, **wrong one + placed first**: `name="NVIDIA RTX 4500 Ada Generation Laptop GPU" vendor=6 + deviceID=9999` then `name="NVIDIA RTX 4500 Ada Generation" vendor=6 + deviceID=10161`. Both names contain the substring `"NVIDIA RTX 4500"`. Pass + `rd=` as the mock module. Assert the GPU with `deviceID=10161` is returned. + *Failure mode against current code:* the guard misses the real chunk name + (no marker substring), so `_find_adapter_description` and the substring loop + never run; the function falls to vendor-priority. Both GPUs have vendor=6, so + `priority.get(6)=0` for both and the stable `filtered.sort` preserves + enumeration order, returning GPU #1 (`deviceID=9999`) — the WRONG card vs. the + asserted `deviceID=10161`. (Independently, even with the guard fixed, + pre-walker code would never reach the depth-3 `AdapterDesc`, and even with a + name extracted there is no DeviceId logic and the substring loop `desc_l in + name_l or name_l in desc_l` would still match the first GPU.) After the fix + the guard fires, the walker reaches depth-3 `AdapterDesc`, and exact + `DeviceId=10161` selects the second GPU. Proves Defect 3 (and is guarded + against the guard/walker defects masking it). Note: both GPUs share vendor=6 + so the vendor-priority tertiary path cannot accidentally yield the asserted + GPU — only the DeviceId fix can. + +- [ ] `test_exact_deviceid_same_vendor_multi_gpu` — pinned for guaranteed + redness. Chunk named exactly `"Internal::Driver Initialisation Parameters"` + with the canonical depth-3 subtree `chunk → InitParams → AdapterDesc → + {Description = "NVIDIA RTX A5000", DeviceId = 8888}` (same chunk name and + InitParams→AdapterDesc depth as the real dump). Three GPUs, all vendor + NVIDIA (vendor=6), in this order: + 1. `name="NVIDIA RTX A5000" deviceID=2204` — a *wrong* same-vendor GPU whose + name is an exact substring match of the Description. + 2. `name="NVIDIA RTX A4000" deviceID=9999`. + 3. `name="NVIDIA RTX A6000" deviceID=8888` — the wanted GPU (DeviceId points + here), placed last and whose name does NOT substring-match the Description. + + Pass `rd=` as the mock module. Assert the GPU with `deviceID=8888` (GPU #3, + "RTX A6000") is returned. *Failure mode against current code:* the guard + misses the real chunk name (no marker substring), so the substring loop never + runs and the function falls to vendor-priority. All three GPUs have vendor=6, + so `priority.get(6)=0` for all and the stable `filtered.sort` preserves + enumeration order, returning GPU #1 (`deviceID=2204`) — the WRONG card vs. the + asserted `deviceID=8888`. (Independently, even with guard + walker fixed but + no DeviceId logic, the substring loop would hit GPU #1 — `"NVIDIA RTX A5000"` + ⊆ Description — first and still return the wrong same-vendor card.) The wanted + GPU is deliberately placed after a wrong same-vendor GPU whose name + substring-matches the Description, so both pre-fix paths (fallback and, with + partial fixes, substring) are guaranteed to return the wrong GPU, and only the + fixed exact-DeviceId logic returns GPU #3. Same vendor across all three means + vendor-priority can never accidentally yield the asserted GPU. + +## Regression guards (green pre-fix; prevent regression) + +- [ ] `test_d3d12_legacy_marker_path_unbroken` — **reworked legacy-marker + guard; explicitly NOT a defect proof.** This is green pre-fix (it is the same + path as the existing passing `test_d3d12_match_via_driverinit_adapterdesc`). + Mock a chunk named `"DriverInit"` with a *direct* `AdapterDesc` child whose + `Description = "Intel HD Graphics"`; one GPU named `"Intel HD Graphics"`. + Assert that GPU is returned. Purpose: confirm the existing depth-1 + `AdapterDesc` legacy-marker path still resolves under the NEW recursive walker + and that the retained markers (`"DriverInit"` etc.) are not dropped. + +- [ ] `test_find_adapter_description_depth1_still_works` — `Description` is a + direct child of the chunk (depth 1). Assert it is returned. Green pre-fix; + guards shallow captures against the recursive rewrite. + +- [ ] `test_find_adapter_description_depth2_adapter_child` — `AdapterDesc` is a + direct child of the chunk and `Description` its child (depth 2). Assert the + value is returned. Green pre-fix (this is the exact path the current code and + existing tests already cover); guards the original depth-2 case. + +- [ ] `test_name_substring_fallback_when_no_device_id` — chunk has a + `Description` but no `DeviceId` child; two GPUs with distinct names (each with + a concrete `deviceID`). Assert the GPU whose name substring-matches the + Description is returned. Green pre-fix (substring path already works); guards + the secondary signal when DeviceId is absent. + +- [ ] `test_fallback_still_works_when_no_chunk` — `sd=None`; two GPUs: AMD + (`vendor=2`) and NVIDIA (`vendor=6`). Assert NVIDIA is returned via + vendor-priority fallback. Green pre-fix (#226 fallback); guards the tertiary + path. + +- [ ] `test_vulkan_path_unchanged` — chunk named + `"vkEnumeratePhysicalDevices"` with `physProps.deviceName = "AMD RX 7900 XT"`; + two GPUs. Assert the AMD GPU is returned. Green pre-fix; guards the Vulkan + branch from D3D12 changes. + +- [ ] `test_single_gpu_short_circuit` — `len(gpus) == 1`, no structured data. + Assert that GPU is returned immediately without tree traversal. Green pre-fix; + guards the single-GPU short-circuit from #226. + +- [ ] `test_empty_gpu_list_returns_none` — `gpus=[]`. Assert return value is + `None` with no exception raised. Green pre-fix; guards the empty-list edge. + +- [ ] `test_zero_device_id_does_not_bind_warp` — **WARP-sentinel guard.** Real + mock SD shapes: GPUs `[WARP Rasterizer vendor=9 deviceID=0, + NVIDIA RTX 4500 Ada Generation vendor=6 deviceID=10161]`; canonical + `Internal::Driver Initialisation Parameters` → InitParams → AdapterDesc with + `Description="NVIDIA RTX 4500 Ada Generation"` and `DeviceId=0`. Assert the + NVIDIA GPU is returned (NOT WARP): a present-but-zero `DeviceId` is the + software/WARP sentinel and must fall through tier-1 to name-substring, which + resolves NVIDIA. Pins the FIX-1 truthiness gate + `deviceID == 0` candidate + skip so a real non-zero id can never coincidentally bind the WARP adapter. diff --git a/openspec/specs/daemon/spec.md b/openspec/specs/daemon/spec.md index e04d6f0..2bf1277 100644 --- a/openspec/specs/daemon/spec.md +++ b/openspec/specs/daemon/spec.md @@ -38,8 +38,15 @@ ReplayController for the duration of the session. #### Scenario: Multi-GPU capture replay - **WHEN** the capture was taken on a multi-GPU system - **AND** multiple GPUs are available for replay -- **THEN** the daemon selects the GPU whose name matches structured-data adapter - metadata extracted from the capture +- **THEN** the daemon walks structured-data chunks whose name contains any of + "Driver Initialisation Parameters", "DriverInit", "EnumAdapters", or + "CreateDXGIFactory" +- **AND** within each such chunk descends recursively (depth ≤ 5) to locate an + AdapterDesc/pAdapter/adapter subtree +- **AND** selects the GPU whose PCI DeviceId exactly matches + DXGI_ADAPTER_DESC.DeviceId when that field is present +- **AND** falls back to case-insensitive name-substring match when no exact + DeviceId match is found - **AND** if no structured-data match exists, Software/WARP adapters are excluded and the highest-ranked discrete GPU is preferred (nVidia > AMD > Intel) - **AND** a single available GPU is always returned directly without inspection diff --git a/src/rdc/daemon_server.py b/src/rdc/daemon_server.py index 91e2530..ddd5cd9 100644 --- a/src/rdc/daemon_server.py +++ b/src/rdc/daemon_server.py @@ -12,7 +12,7 @@ import time from dataclasses import dataclass, field from pathlib import Path -from typing import TYPE_CHECKING, Any +from typing import TYPE_CHECKING, Any, NamedTuple from rdc import _platform from rdc._progress import make_progress_cb @@ -167,37 +167,63 @@ def _cleanup_temp(state: DaemonState) -> None: _log = logging.getLogger("rdc.daemon") -def _find_adapter_description(chunk: Any) -> str | None: - """Extract D3D12 adapter description string from a structured-data chunk. +class AdapterMatch(NamedTuple): + """Adapter identity parsed from a D3D12 structured-data chunk.""" + + description: str + device_id: int | None + + +def _scalar_int(node: Any, name: str) -> int | None: + """Return the AsInt() value of a direct child ``name``, or None if absent.""" + for i in range(node.NumChildren()): + child = node.GetChild(i) + if child.name == name: + return int(child.AsInt()) + return None + + +def _walk_adapter(node: Any, depth: int) -> AdapterMatch | None: + """Depth-bounded descent for a Description/DeviceName plus paired DeviceId.""" + if depth < 0: + return None + for i in range(node.NumChildren()): + child = node.GetChild(i) + cname = child.name + if cname in ("Description", "DeviceName"): + value = child.AsString() + if value: + return AdapterMatch(str(value), _scalar_int(node, "DeviceId")) + if cname in ("AdapterDesc", "pAdapter", "adapter"): + found = _walk_adapter(child, depth - 1) + if found is not None: + return found + for i in range(node.NumChildren()): + found = _walk_adapter(node.GetChild(i), depth - 1) + if found is not None: + return found + return None + + +def _find_adapter_description(chunk: Any) -> AdapterMatch | None: + """Extract D3D12 adapter identity from a structured-data chunk subtree. + + Recurses up to 5 levels for a ``Description``/``DeviceName`` field, pairing + it with a sibling ``DeviceId`` PCI scalar when present. Args: chunk: A renderdoc SDChunk (or SDObject) whose subtree may contain an - ``AdapterDesc``/``pAdapter``/``adapter`` child with a ``Description`` - or ``DeviceName`` field, or a direct ``Description``/``DeviceName`` - child. + ``AdapterDesc``/``pAdapter``/``adapter`` node with a ``Description`` + or ``DeviceName`` field at any depth, or a direct one. Returns: - The adapter description string when found, otherwise ``None``. + An :class:`AdapterMatch` when a description is found, otherwise ``None``. """ try: - for i in range(chunk.NumChildren()): - child = chunk.GetChild(i) - cname = child.name - if cname in ("Description", "DeviceName"): - value = child.AsString() - if value: - return str(value) - if cname in ("AdapterDesc", "pAdapter", "adapter"): - for j in range(child.NumChildren()): - grand = child.GetChild(j) - if grand.name in ("Description", "DeviceName"): - value = grand.AsString() - if value: - return str(value) + return _walk_adapter(chunk, 5) except Exception as exc: # noqa: BLE001 _log.debug("adapter-desc parse failed: %s: %s", type(exc).__name__, exc) return None - return None def _match_capture_gpu(cap: Any, sd: Any = None, rd: Any = None) -> Any | None: @@ -223,12 +249,22 @@ def _match_capture_gpu(cap: Any, sd: Any = None, rd: Any = None) -> Any | None: if len(gpus) == 1: return gpus[0] + saw_adapter_chunk = False if sd is not None: - d3d_markers = ("DriverInit", "EnumAdapters", "CreateDXGIFactory") + d3d_markers = ( + "Driver Initialisation Parameters", # renderdoc 1.42 D3D12 name + # Legacy markers: provably dead on renderdoc-1.42 D3D12 (the + # real chunk name has spaces). Retained for older/other + # renderdoc versions and API paths that still emit them. + "DriverInit", + "EnumAdapters", + "CreateDXGIFactory", + ) for i in range(len(sd.chunks)): c = sd.chunks[i] cname = c.name if cname == "vkEnumeratePhysicalDevices": + saw_adapter_chunk = True for j in range(c.NumChildren()): child = c.GetChild(j) if child.name == "physProps": @@ -241,14 +277,29 @@ def _match_capture_gpu(cap: Any, sd: Any = None, rd: Any = None) -> Any | None: return g break if any(marker in cname for marker in d3d_markers): - desc = _find_adapter_description(c) - if desc: - desc_l = desc.lower() + saw_adapter_chunk = True + match = _find_adapter_description(c) + if match: + # device_id 0 is the WARP/software sentinel: a present + # but zero DXGI_ADAPTER_DESC.DeviceId must fall through + # to name-substring, never bind the WARP adapter. + if match.device_id: + for g in gpus: + if g.deviceID and g.deviceID == match.device_id: + return g + desc_l = match.description.lower() for g in gpus: name_l = g.name.lower() if desc_l in name_l or name_l in desc_l: return g + if sd is not None and not saw_adapter_chunk: + _log.warning( + "no recognized adapter chunk (vk/d3d markers) in structured " + "data -- renderdoc version may have changed chunk naming; " + "falling back to vendor priority" + ) + vendor_enum = getattr(rd, "GPUVendor", None) if rd is not None else None nvidia = getattr(vendor_enum, "nVidia", 6) amd = getattr(vendor_enum, "AMD", 2) diff --git a/tests/unit/test_daemon_server_unit.py b/tests/unit/test_daemon_server_unit.py index 2d63c94..3aeb140 100644 --- a/tests/unit/test_daemon_server_unit.py +++ b/tests/unit/test_daemon_server_unit.py @@ -16,6 +16,7 @@ _NO_REPLAY_METHODS, DaemonState, _cleanup_temp_capture, + _find_adapter_description, _handle_request, _load_replay, _match_capture_gpu, @@ -342,6 +343,205 @@ def _create_remote(url: str) -> tuple[Any, Any]: assert captured.get("rd") is mock_rd +class TestFix225D3D12ChunkNameAndDeviceId: + """#225 fix-forward: real chunk name, depth-3 walk, exact DeviceId match. + + 5 red-first defect proofs + 9 regression guards. Mock shapes per + tests/mocks/mock_renderdoc.py: SDObject has only AsString()/AsInt(); + numeric scalars via SDBasic.value read with AsInt(); strings with + AsString(). The mock module has no GPUVendor attribute, so the production + fallback uses hardcoded nVidia=6/AMD=2/Intel=5/Software=9 and + priority={6:0, 2:1, 5:2}. + """ + + _CHUNK = "Internal::Driver Initialisation Parameters" + + @staticmethod + def _gpu(name: str, vendor: int, device_id: int) -> SimpleNamespace: + return SimpleNamespace(name=name, vendor=vendor, deviceID=device_id, driver="") + + @staticmethod + def _cap(gpus: list[Any]) -> SimpleNamespace: + return SimpleNamespace(GetAvailableGPUs=lambda: gpus) + + @staticmethod + def _str_obj(name: str, value: str) -> Any: + import mock_renderdoc as rd + + return rd.SDObject(name=name, data=rd.SDData(basic=rd.SDBasic(value=value))) + + @staticmethod + def _int_obj(name: str, value: int) -> Any: + import mock_renderdoc as rd + + return rd.SDObject(name=name, data=rd.SDData(basic=rd.SDBasic(value=value))) + + def _depth3_sd( + self, + description: str, + device_id: int | None, + chunk_name: str, + ) -> Any: + """Canonical chunk -> InitParams -> AdapterDesc -> {Description[, DeviceId]}.""" + import mock_renderdoc as rd + + children: list[Any] = [self._str_obj("Description", description)] + if device_id is not None: + children.append(self._int_obj("DeviceId", device_id)) + adapter = rd.SDObject(name="AdapterDesc", children=children) + init = rd.SDObject(name="InitParams", children=[adapter]) + chunk = rd.SDChunk(name=chunk_name, children=[init]) + return rd.StructuredFile(chunks=[chunk]) + + def _depth2_sd(self, description: str, chunk_name: str) -> Any: + """chunk -> AdapterDesc -> Description (no InitParams wrapper).""" + import mock_renderdoc as rd + + desc = self._str_obj("Description", description) + adapter = rd.SDObject(name="AdapterDesc", children=[desc]) + chunk = rd.SDChunk(name=chunk_name, children=[adapter]) + return rd.StructuredFile(chunks=[chunk]) + + # --- Red-first defect proofs ------------------------------------------ + + def test_d3d12_chunk_guard_recognizes_real_chunk_name(self) -> None: + """Defect 1 in isolation: depth-2 layout, only the marker is missing.""" + import mock_renderdoc as rd + + amd = self._gpu("AMD Radeon RX 7900 XTX", 2, 29772) + nvidia = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([amd, nvidia]) + sd = self._depth2_sd("AMD Radeon RX 7900 XTX", self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is amd + + def test_d3d12_chunk_guard_fires_on_real_name(self) -> None: + """End-to-end real tree: all 3 fixes needed to return AMD over priority.""" + import mock_renderdoc as rd + + amd = self._gpu("AMD Radeon RX 7900 XTX", 2, 29772) + nvidia = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([amd, nvidia]) + sd = self._depth3_sd("AMD Radeon RX 7900 XTX", 29772, self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is amd + + def test_find_adapter_description_depth3(self) -> None: + """Defect 2: walker must reach AdapterDesc under InitParams (grandchild).""" + import mock_renderdoc as rd + + desc = self._str_obj("Description", "NVIDIA RTX 4500 Ada Generation") + devid = self._int_obj("DeviceId", 10161) + adapter = rd.SDObject(name="AdapterDesc", children=[desc, devid]) + init = rd.SDObject(name="InitParams", children=[adapter]) + chunk = rd.SDChunk(name=self._CHUNK, children=[init]) + result = _find_adapter_description(chunk) + assert result is not None + assert result.description == "NVIDIA RTX 4500 Ada Generation" + assert result.device_id == 10161 + + def test_exact_deviceid_wins_over_wrong_name(self) -> None: + """Defect 3: same-vendor, wrong-name-first; DeviceId must disambiguate.""" + import mock_renderdoc as rd + + wrong = self._gpu("NVIDIA RTX 4500 Ada Generation Laptop GPU", 6, 9999) + right = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([wrong, right]) + sd = self._depth3_sd("NVIDIA RTX 4500", 10161, self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is right + + def test_exact_deviceid_same_vendor_multi_gpu(self) -> None: + """Defect 3 pinned: 3 NVIDIA GPUs, wanted one last and not name-matching.""" + import mock_renderdoc as rd + + g1 = self._gpu("NVIDIA RTX A5000", 6, 2204) + g2 = self._gpu("NVIDIA RTX A4000", 6, 9999) + g3 = self._gpu("NVIDIA RTX A6000", 6, 8888) + cap = self._cap([g1, g2, g3]) + sd = self._depth3_sd("NVIDIA RTX A5000", 8888, self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is g3 + + # --- Regression guards (green pre-fix) --------------------------------- + + def test_d3d12_legacy_marker_path_unbroken(self) -> None: + import mock_renderdoc as rd + + intel = self._gpu("Intel HD Graphics", 5, 1234) + cap = self._cap([intel, self._gpu("NVIDIA RTX 4500", 6, 10161)]) + sd = self._depth2_sd("Intel HD Graphics", "DriverInit") + assert _match_capture_gpu(cap, sd, rd) is intel + + def test_find_adapter_description_depth1_still_works(self) -> None: + import mock_renderdoc as rd + + desc = self._str_obj("Description", "AMD Radeon RX 7900 XTX") + chunk = rd.SDChunk(name="DriverInit", children=[desc]) + result = _find_adapter_description(chunk) + assert result is not None + assert result.description == "AMD Radeon RX 7900 XTX" + + def test_find_adapter_description_depth2_adapter_child(self) -> None: + import mock_renderdoc as rd + + desc = self._str_obj("Description", "NVIDIA RTX 4500 Ada Generation") + adapter = rd.SDObject(name="AdapterDesc", children=[desc]) + chunk = rd.SDChunk(name="DriverInit", children=[adapter]) + result = _find_adapter_description(chunk) + assert result is not None + assert result.description == "NVIDIA RTX 4500 Ada Generation" + + def test_name_substring_fallback_when_no_device_id(self) -> None: + import mock_renderdoc as rd + + amd = self._gpu("AMD Radeon RX 7900 XTX", 2, 29772) + nvidia = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([amd, nvidia]) + sd = self._depth3_sd("NVIDIA RTX 4500 Ada Generation", None, self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is nvidia + + def test_fallback_still_works_when_no_chunk(self) -> None: + import mock_renderdoc as rd + + amd = self._gpu("AMD Radeon Graphics", 2, 5056) + nvidia = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([amd, nvidia]) + assert _match_capture_gpu(cap, None, rd) is nvidia + + def test_vulkan_path_unchanged(self) -> None: + import mock_renderdoc as rd + + amd = self._gpu("AMD RX 7900 XT", 2, 29772) + nvidia = self._gpu("NVIDIA RTX 4500", 6, 10161) + cap = self._cap([amd, nvidia]) + prop = self._str_obj("deviceName", "AMD RX 7900 XT") + phys = rd.SDObject(name="physProps", children=[prop]) + chunk = rd.SDChunk(name="vkEnumeratePhysicalDevices", children=[phys]) + sd = rd.StructuredFile(chunks=[chunk]) + assert _match_capture_gpu(cap, sd, rd) is amd + + def test_single_gpu_short_circuit(self) -> None: + import mock_renderdoc as rd + + only = self._gpu("Mock GPU", 2, 1) + cap = self._cap([only]) + assert _match_capture_gpu(cap, None, rd) is only + + def test_empty_gpu_list_returns_none(self) -> None: + import mock_renderdoc as rd + + cap = self._cap([]) + assert _match_capture_gpu(cap, None, rd) is None + + def test_zero_device_id_does_not_bind_warp(self) -> None: + """WARP-sentinel guard: a present-but-zero DeviceId must fall through + to name-substring, never exact-match the WARP adapter (deviceID=0).""" + import mock_renderdoc as rd + + warp = self._gpu("WARP Rasterizer", 9, 0) + nvidia = self._gpu("NVIDIA RTX 4500 Ada Generation", 6, 10161) + cap = self._cap([warp, nvidia]) + sd = self._depth3_sd("NVIDIA RTX 4500 Ada Generation", 0, self._CHUNK) + assert _match_capture_gpu(cap, sd, rd) is nvidia + + class TestShutdownExceptionStops: def test_shutdown_exception_returns_not_running(self, monkeypatch: Any) -> None: """If shutdown handler raises, _process_request returns running=False."""