From cdf5e75bf9c54173993ecb1ca73c9026d6593ab8 Mon Sep 17 00:00:00 2001
From: Manish Honap
Date: Thu, 18 Jun 2026 17:43:28 +0530
Subject: [PATCH] Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping
 for VFIO-owned RAM-device regions"

This reverts commit d814a45f763979f5f985886abad8a170d68e4eac.

The commit made vfio_container_region_add() take an early return for
any RAM-device section owned by a VFIO device
(vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL),
skipping vfio_container_dma_map() for it. In practice this excludes
every VFIO mmap subregion -- PCI BAR windows and, importantly, the
CXL.mem / DPA coherent device-memory region of a CXL Type-2 device --
from the IOMMU IOAS (the SMMU Stage-2 page tables).

Why it was originally added
---------------------------

The commit rested on two stated premises:

  1. "this mapping always fails": the backing VMAs carry
     VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, so
     IOMMU_IOAS_MAP returns -EFAULT; therefore the map is pointless.

  2. "no IOMMU entry is required": CPU access to these regions goes
     through KVM Stage-2 faults independently of the SMMU, and device
     DMA to system RAM uses separate per-RAM-section IOMMU entries.

Both premises are incorrect, and the second is the more damaging one.

In accelerated/nested SMMUv3 mode the GPU translates shared virtual
addresses through the hardware SMMU (Stage-1 = guest page tables,
Stage-2 = host iommufd). When UVM migrates a managed buffer into the
device's coherent memory, the page's guest-physical address lies in the
CXL DPA window. A GPU access to it is issued as an ATS request, and to
answer that request the SMMU must complete the Stage-1 + Stage-2 walk.

With the DPA region skipped, there is no Stage-2 entry for that
guest-physical address, so the translation faults. The GPU posts a
replayable fault; UVM services it, invalidates the TLB, and replays;
the access faults again because the Stage-2 entry still does not exist.
This becomes an unbounded fault -> service -> replay livelock: the test
makes no forward progress (it "hangs"), the host SMMU logs nothing (an
ATS request with no translation returns an unsuccessful completion, not
a fault event), and on cancellation the GPU reports:

  NVRM: Xid 31 ... MMU Fault ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE

Observed behaviour matches this exactly:

  - UVM/ATS tests that keep their working set in system RAM pass:
    guest RAM is mapped into Stage-2 normally, so the ATS access
    resolves.

  - UVM/ATS tests that migrate the buffer into DPA coherent memory and
    then access it via ATS hang. The failure is the intersection of
    two conditions (buffer is DPA-resident AND reached via ATS);
    meeting only one is fine.

  - Non-ATS CUDA workloads (e.g. explicit cudaMalloc + cudaMemcpy)
    pass: they reach device memory through the GPU's own page tables,
    never the SMMU.

Device-memory residency at hang time was confirmed independently:
nvidia-smi device memory usage rises during the failing tests, and the
QEMU trace shows the DPA region being skipped before this revert and
mapped successfully after it.

Why reverting is the correct fix for ATS
----------------------------------------

The correct behaviour is precisely what every other VFIO-owned
RAM-device region already gets, and what nvgrace-gpu relies on: the
region is mapped into the Stage-2 IOAS via the ordinary vaddr
IOMMU_IOAS_MAP path. That makes the GPU's ATS accesses to its own
coherent memory translatable, which removes the fault livelock.
iommufd already maps these regions correctly.

Reverting this commit restores that behaviour. With it reverted, the
CXL DPA region is mapped into Stage-2 and the UVM/ATS hang no longer
reproduces on CXL Type-2 device passthrough.
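
For illustration only, the failure mode can be reduced to a toy model.
The sketch below is not QEMU, iommufd or UVM code: every name in it
(Stage2, model_region_add, model_ats_access) and the replay budget are
hypothetical, and any addresses passed to it are arbitrary. It assumes
only what is described above: an ATS access resolves when a Stage-2
entry covers the guest-physical address, and servicing a replayable
fault never creates such an entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in QEMU, iommufd or UVM. */

#define S2_MAX_RANGES 8

typedef struct {
    uint64_t gpa;   /* guest-physical start of a mapped range */
    uint64_t size;  /* length of the range in bytes */
} S2Range;

typedef struct {
    S2Range ranges[S2_MAX_RANGES];
    size_t count;
} Stage2;

/* Record a guest-physical range as mapped in the modelled Stage-2. */
static void s2_map(Stage2 *s2, uint64_t gpa, uint64_t size)
{
    assert(s2->count < S2_MAX_RANGES);
    s2->ranges[s2->count].gpa = gpa;
    s2->ranges[s2->count].size = size;
    s2->count++;
}

/* True if some mapped range covers this guest-physical address. */
static bool s2_translates(const Stage2 *s2, uint64_t gpa)
{
    for (size_t i = 0; i < s2->count; i++) {
        const S2Range *r = &s2->ranges[i];

        if (gpa >= r->gpa && gpa - r->gpa < r->size) {
            return true;
        }
    }
    return false;
}

/*
 * Model of the region_add decision. skip_vfio_owned == true stands for
 * the reverted commit: VFIO-owned regions never reach Stage-2.
 */
static void model_region_add(Stage2 *s2, uint64_t gpa, uint64_t size,
                             bool vfio_owned, bool skip_vfio_owned)
{
    if (vfio_owned && skip_vfio_owned) {
        return;
    }
    s2_map(s2, gpa, size);
}

/*
 * Model one ATS access with replayable faults. Returns the number of
 * replays needed before the access resolved, or -1 once the budget is
 * exhausted. The budget exists only so the model terminates; the loop
 * described above has none.
 */
static int model_ats_access(const Stage2 *s2, uint64_t gpa,
                            int replay_budget)
{
    for (int replays = 0; replays <= replay_budget; replays++) {
        if (s2_translates(s2, gpa)) {
            return replays;
        }
        /* Fault serviced and replayed: Stage-2 is left unchanged. */
    }
    return -1;
}
```

In this model an address inside a skipped VFIO-owned region exhausts any
replay budget, while the same address resolves on the first attempt once
the region is added without the skip. An address in ordinary guest RAM
resolves in both cases, matching the observations listed above.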

Note on the boot-time ATC_INV concern
--------------------------------------

A secondary motivation for withholding the mapping was to avoid a fatal
CMDQ_OP_ATC_INV timeout seen on a CXL Type-2 device during early init:
that error is triggered by an IOMMU_IOAS_UNMAP of the region while the
device cannot service the resulting ATC invalidation in time. Note that
the trigger is the *unmap*, not the *map*. Withholding the map to dodge
the unmap is incorrect because the map is mandatory for ATS.

If that ATC_INV timeout resurfaces on the unmap paths (guest reboot,
FLR/reset, VM shutdown), the correct fix is to guard the region_del /
unmap path while the device is in its init/reset state (keep the
mapping, defer only the unmap), not to skip the mapping.

Signed-off-by: Manish Honap
---
 hw/vfio/listener.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index e050e3c2f69..f498e23a937 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -598,20 +598,6 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
                 pgmask + 1);
             return;
         }
-
-        /*
-         * VFIO MMAP backed regions (CXL.mem) uses VM_IO | VM_PFNMAP VMAs
-         * backed by physical device addresses. Skip vfio_container_dma_map
-         * as mapping is not needed for this region.
-         */
-        if (vfio_get_vfio_device(memory_region_owner(section->mr))) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
     }
 
     ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),