From cdf5e75bf9c54173993ecb1ca73c9026d6593ab8 Mon Sep 17 00:00:00 2001
From: Manish Honap <mhonap@nvidia.com>
Date: Thu, 18 Jun 2026 17:43:28 +0530
Subject: [PATCH] Revert "NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping
 for VFIO-owned RAM-device regions"

This reverts commit d814a45f763979f5f985886abad8a170d68e4eac.

The commit made vfio_container_region_add() take an early return for
any RAM-device section owned by a VFIO device
(vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL),
skipping vfio_container_dma_map() for it. In practice this excludes
every VFIO mmap subregion -- PCI BAR windows and, importantly, the
CXL.mem / DPA coherent device-memory region of a CXL Type-2 device --
from the IOMMU IOAS (the SMMU Stage-2 page tables).

Why it was originally added
---------------------------
The commit rested on two stated premises:

  1. "this mapping always fails": the backing VMAs carry
     VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, so
     IOMMU_IOAS_MAP returns -EFAULT; therefore the map is pointless.

  2. "no IOMMU entry is required": CPU access to these regions goes
     through KVM Stage-2 faults independently of the SMMU, and device
     DMA to system RAM uses separate per-RAM-section IOMMU entries.

Both premises are incorrect, and the second is the more damaging one.

In accelerated/nested SMMUv3 mode the GPU translates shared virtual
addresses through the hardware SMMU (Stage-1 = guest page tables,
Stage-2 = host iommufd). When UVM migrates a managed buffer into the
device's coherent memory, the page's guest-physical address lies in the
CXL DPA window. A GPU access to it is issued as an ATS request, and to
answer that request the SMMU must complete the Stage-1 + Stage-2 walk.
With the DPA region skipped, there is no Stage-2 entry for that
guest-physical address, so the translation faults.
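
The dependency on Stage-2 can be illustrated with a toy model. This is
a sketch under stated assumptions, not SMMUv3 or QEMU code: ToyRange,
toy_lookup() and toy_ats_translate() are hypothetical names, and each
stage is reduced to a flat list of ranges instead of real page tables.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: one flat range per entry, not real SMMUv3 tables. */
typedef struct {
    uint64_t base;  /* first input address covered by the range */
    uint64_t size;  /* length of the range in bytes */
    uint64_t out;   /* output address that 'base' translates to */
} ToyRange;

/* Translate addr through one stage; false means "no entry". */
static bool toy_lookup(const ToyRange *tbl, size_t n,
                       uint64_t addr, uint64_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tbl[i].base && addr - tbl[i].base < tbl[i].size) {
            *out = tbl[i].out + (addr - tbl[i].base);
            return true;
        }
    }
    return false;
}

/* Nested walk: Stage-1 (VA -> GPA), then Stage-2 (GPA -> host PA). */
static bool toy_ats_translate(const ToyRange *s1, size_t n1,
                              const ToyRange *s2, size_t n2,
                              uint64_t va, uint64_t *pa)
{
    uint64_t gpa;

    if (!toy_lookup(s1, n1, va, &gpa)) {
        return false;               /* no guest mapping at all */
    }
    return toy_lookup(s2, n2, gpa, pa); /* fails if the GPA was skipped */
}
```

If the Stage-2 table holds only guest RAM and the Stage-1 output falls
in the DPA window, toy_ats_translate() returns false even though the
guest page tables are valid, which is the situation described above.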

The GPU posts a replayable fault; UVM services it, invalidates the TLB,
and replays; the access faults again because the Stage-2 entry still
does not exist. This becomes an unbounded fault -> service -> replay
livelock: the test makes no forward progress (it "hangs"), the host
SMMU logs nothing (an ATS request with no translation returns an
unsuccessful completion, not a fault event), and on cancellation the
GPU reports:

  NVRM: Xid 31 ... MMU Fault ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE
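
A minimal sketch of why the replay loop cannot terminate, assuming a
hypothetical ToyPage flag in place of the real Stage-2 state. The
replay budget exists only so the sketch returns; the loop described
above has no such bound.

```c
#include <stdbool.h>

/* Hypothetical state: does a Stage-2 entry exist for the page? */
typedef struct {
    bool stage2_mapped;
} ToyPage;

/*
 * Retry an access up to max_replays times. Servicing the fault only
 * touches GPU-side state, so stage2_mapped never changes in the loop.
 * Returns the number of faults taken before success, or -1 when the
 * budget runs out (the unbounded real loop would simply hang).
 */
static int toy_replay_access(const ToyPage *page, int max_replays)
{
    int faults = 0;

    for (int attempt = 0; attempt <= max_replays; attempt++) {
        if (page->stage2_mapped) {
            return faults;      /* translation resolves, access done */
        }
        faults++;               /* fault -> service -> invalidate -> replay */
    }
    return -1;                  /* no forward progress */
}
```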

Observed behaviour matches this exactly:

  - UVM/ATS tests that keep their working set in system RAM pass: guest
    RAM is mapped into Stage-2 normally, so the ATS access resolves.
  - UVM/ATS tests that migrate the buffer into DPA coherent memory and
    then access it via ATS hang. The failure needs both conditions at
    once (buffer is DPA-resident AND reached via ATS); either one alone
    does not hang.
  - Non-ATS CUDA workloads (e.g. explicit cudaMalloc + cudaMemcpy) pass:
    they reach device memory through the GPU's own page tables, never
    the SMMU.

Device-memory residency at hang time was confirmed independently:
nvidia-smi device memory usage rises during the failing tests, and the
QEMU trace shows the DPA region being skipped before this revert and
mapped successfully after it.

Why reverting is the correct fix for ATS
----------------------------------------
The correct behaviour is precisely what every other VFIO-owned
RAM-device region already gets, and what nvgrace-gpu relies on: the
region is mapped into the Stage-2 IOAS via the ordinary vaddr
IOMMU_IOAS_MAP path. That makes the GPU's ATS accesses to its own
coherent memory translatable, which removes the fault livelock.
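iommufd already maps these regions correctly, which also disproves
premise 1: the QEMU trace shows the DPA region mapped successfully
once the skip is removed.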

Reverting this commit restores that behaviour. With it reverted, the
CXL DPA region is mapped into Stage-2 and the UVM/ATS hang no longer
reproduces on CXL Type-2 device passthrough.

Note on the boot-time ATC_INV concern
--------------------------------------
A secondary motivation for withholding the mapping was to avoid a
fatal CMDQ_OP_ATC_INV timeout seen on a CXL Type-2 device during early
init. That error is triggered by an IOMMU_IOAS_UNMAP of the region
while the device cannot service the resulting ATC invalidation in time.
The trigger is the *unmap*, not the *map*, so withholding the map to
dodge the unmap is incorrect: the map is mandatory for ATS. If that
ATC_INV timeout resurfaces on the unmap paths (guest reboot, FLR/reset,
VM shutdown), the correct fix is to guard the region_del / unmap path
while the device is in its init/reset state (keep the mapping, defer
only the unmap), not to skip the mapping.
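
One possible shape for such a guard, as a toy state machine only. The
ToyRegion type and the toy_* functions are hypothetical and are not
existing QEMU interfaces; a real change would have to hook the VFIO
listener and the device reset path.

```c
#include <stdbool.h>

/* Hypothetical per-region state for the sketch. */
typedef struct {
    bool mapped;         /* region is present in Stage-2 */
    bool unmap_pending;  /* unmap was requested while not ready */
    bool device_ready;   /* device can service ATC invalidations */
} ToyRegion;

/* The map is never withheld: ATS depends on it. */
static void toy_region_add(ToyRegion *r)
{
    r->mapped = true;
    r->unmap_pending = false;
}

/* Defer the unmap while the device is in its init/reset state. */
static void toy_region_del(ToyRegion *r)
{
    if (!r->device_ready) {
        r->unmap_pending = true;    /* keep the mapping for now */
        return;
    }
    r->mapped = false;
    r->unmap_pending = false;
}

/* Called once the device leaves init/reset: flush a deferred unmap. */
static void toy_device_ready(ToyRegion *r)
{
    r->device_ready = true;
    if (r->unmap_pending) {
        toy_region_del(r);
    }
}
```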

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 hw/vfio/listener.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index e050e3c2f69..f498e23a937 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -598,20 +598,6 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
                 pgmask + 1);
             return;
         }
-
-        /*
-         * VFIO MMAP backed regions (CXL.mem) uses VM_IO | VM_PFNMAP VMAs
-         * backed by physical device addresses. Skip vfio_container_dma_map
-         * as mapping is not needed for this region.
-         */
-        if (vfio_get_vfio_device(memory_region_owner(section->mr))) {
-            trace_vfio_listener_region_add_no_dma_map(
-                memory_region_name(section->mr),
-                section->offset_within_address_space,
-                int128_getlo(section->size),
-                pgmask + 1);
-            return;
-        }
     }
 
     ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),