Allow RAM device regions to skip IOMMU mapping #24
Conversation
|
General question: Does this same issue also need to be addressed in NV QEMU 11.0? Codex finding... High: memory_region_skip_iommu_map() is only honored on region add, not on region delete. 2e963ff adds the generic skip-map API and returns before vfio_container_dma_map(), and e227a6b marks CXL mmap subregions with that flag, but vfio_listener_region_del() still leaves try_unmap=true for aligned RAM-device sections and calls vfio_container_dma_unmap(). With iommufd, that reaches IOMMU_IOAS_UNMAP, which is the operation the CXL change is trying to avoid. Please address this in 2e963ff, where the generic memory_region_skip_iommu_map() contract is introduced. The delete path should be symmetric with the add path, e.g. have vfio_listener_region_del() skip vfio_container_dma_unmap() when memory_region_skip_iommu_map(section->mr) is true, while still running the existing memory_region_unref() and section-window cleanup. 2e963ff NVIDIA: SAUCE: memory: Allow RAM device regions to skip IOMMU mapping It looks like you picked this from nvidia_stable-11.0, is that correct? If so, can you pick with -x -s and then add "nvidia_stable-11.0” after the SHA in the pick line that is added by the -s flag? e..g
|
e227a6b to
dfa2b6c
Compare
|
Yes, this change will be applicable for 11.0 as well. I have cherry-picked the patch with some changes from 11.0 branch. I have added the commit message reflecting the same. I have run a codex review on these patches and now pushed new patches after verification. Please take a look at new patches. |
shamiali2008
left a comment
There was a problem hiding this comment.
Why do we need skip_iommu_map() in vfio_iommu_map_notify path? If that's from testing or codex review please include the justification in commit log. Otherwise looks good to me.
dfa2b6c to
b252f04
Compare
…wned RAM-device regions" This reverts commit d814a45. The commit made vfio_container_region_add() take an early return for any RAM-device section owned by a VFIO device (vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL), skipping vfio_container_dma_map() for it. In practice this excludes every VFIO mmap subregion -- PCI BAR windows and, importantly, the CXL.mem / DPA coherent device-memory region of a CXL Type-2 device -- from the IOMMU IOAS (the SMMU Stage-2 page tables). Why it was originally added --------------------------- The commit rested on two stated premises: 1. "this mapping always fails": the backing VMAs carry VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, so IOMMU_IOAS_MAP returns -EFAULT; therefore the map is pointless. 2. "no IOMMU entry is required": CPU access to these regions goes through KVM Stage-2 faults independently of the SMMU, and device DMA to system RAM uses separate per-RAM-section IOMMU entries. Both premises are incorrect, and the second is the more damaging one. In accelerated/nested SMMUv3 mode the GPU translates shared virtual addresses through the hardware SMMU (Stage-1 = guest page tables, Stage-2 = host iommufd). When UVM migrates a managed buffer into the device's coherent memory, the page's guest-physical address lies in the CXL DPA window. A GPU access to it is issued as an ATS request, and to answer that request the SMMU must complete the Stage-1 + Stage-2 walk. With the DPA region skipped, there is no Stage-2 entry for that guest-physical address, so the translation faults. The GPU posts a replayable fault; UVM services it, invalidates the TLB, and replays; the access faults again because the Stage-2 entry still does not exist. This becomes an unbounded fault -> service -> replay livelock: the test makes no forward progress (it "hangs"), the host SMMU logs nothing (an ATS request with no translation returns an unsuccessful completion, not a fault event), and on cancellation the GPU reports: NVRM: Xid 31 ... MMU Fault ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE Observed behaviour matches this exactly: - UVM/ATS tests that keep their working set in system RAM pass: guest RAM is mapped into Stage-2 normally, so the ATS access resolves. - UVM/ATS tests that migrate the buffer into DPA coherent memory and then access it via ATS hang. The failure is the intersection of two conditions (buffer is DPA-resident AND reached via ATS); meeting only one is fine. - Non-ATS CUDA workloads (e.g. explicit cudaMalloc + cudaMemcpy) pass: they reach device memory through the GPU's own page tables, never the SMMU. Device-memory residency at hang time was confirmed independently: nvidia-smi device memory usage rises during the failing tests, and the QEMU trace shows the DPA region being skipped before this revert and mapped successfully after it. Why reverting is the correct fix for ATS ---------------------------------------- The correct behaviour is precisely what every other VFIO-owned RAM-device region already gets, and what nvgrace-gpu relies on: the region is mapped into the Stage-2 IOAS via the ordinary vaddr IOMMU_IOAS_MAP path. That makes the GPU's ATS accesses to its own coherent memory translatable, which removes the fault livelock. iommufd already maps these regions correctly. Reverting this commit restores that behaviour. With it reverted, the CXL DPA region is mapped into Stage-2 and the UVM/ATS hang no longer reproduces on CXL Type-2 device passthrough. Note on the boot-time ATC_INV concern -------------------------------------- A secondary motivation for withholding the mapping was to avoid a fatal CMDQ_OP_ATC_INV timeout seen on a CXL Type-2 device during early init: that error is triggered by an IOMMU_IOAS_UNMAP of the region while the device cannot service the resulting ATC invalidation in time. Note that the trigger is the *unmap*, not the *map*. Withholding the map to dodge the unmap is incorrect because the map is mandatory for ATS. If that ATC_INV timeout resurfaces on the unmap paths (guest reboot, FLR/reset, VM shutdown), the correct fix is to guard the region_del / unmap path while the device is in its init/reset state (keep the mapping, defer only the unmap), not to skip the mapping. Signed-off-by: Manish Honap <mhonap@nvidia.com>
b252f04 to
cdf5e75
Compare
|
I see skipping dma mapping for cxl mode is causing some ATS based CUDA tests to fail. So I have modified this review to revert the change NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions |
Commit d814a45 skipped vfio_container_dma_map for all VFIO-owned
ram-device regions to avoid a fatal hardware error on VR180 (CXL Type-2).
The skip was implemented in vfio_container_region_add() by checking the
region owner, which was too broad: it also silently skipped non-CXL
coherent regions such as NVLink apertures on GB200, breaking ATS because
the nested SMMU S2 HWPT never receives the IOMMU IOAS entry those
regions need.
CXL mmap-backed regions must not be mapped into the IOMMU IOAS.
Fix by marking CXL mmap subregions at creation time in vfio_region_mmap()
using memory_region_set_skip_iommu_map().
Non-CXL VFIO-owned regions are unaffected: their subregions keep the
default skip_iommu_map=false and fall through to vfio_container_dma_map().