Skip to content

Allow RAM device regions to skip IOMMU mapping #24

Open
mmhonap wants to merge 1 commit into
NVIDIA:nvidia_unstable-10.1from
mmhonap:dma-map-skip-fix
Open

Allow RAM device regions to skip IOMMU mapping #24
mmhonap wants to merge 1 commit into
NVIDIA:nvidia_unstable-10.1from
mmhonap:dma-map-skip-fix

Conversation

@mmhonap

@mmhonap mmhonap commented Jun 16, 2026

Copy link
Copy Markdown

Commit d814a45 skipped vfio_container_dma_map for all VFIO-owned
ram-device regions to avoid a fatal hardware error on VR180 (CXL Type-2).
The skip was implemented in vfio_container_region_add() by checking the
region owner, which was too broad: it also silently skipped non-CXL
coherent regions such as NVLink apertures on GB200, breaking ATS because
the nested SMMU S2 HWPT never receives the IOMMU IOAS entry those
regions need.

CXL mmap-backed regions must not be mapped into the IOMMU IOAS.

Fix by marking CXL mmap subregions at creation time in vfio_region_mmap()
using memory_region_set_skip_iommu_map().

Non-CXL VFIO-owned regions are unaffected: their subregions keep the
default skip_iommu_map=false and fall through to vfio_container_dma_map().

@mmhonap mmhonap marked this pull request as ready for review June 16, 2026 16:49
@nvmochs

nvmochs commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

@mmhonap

General question: Does this same issue also need to be addressed in NV QEMU 11.0?


Codex finding...

High: memory_region_skip_iommu_map() is only honored on region add, not on region delete. 2e963ff adds the generic skip-map API and returns before vfio_container_dma_map(), and e227a6b marks CXL mmap subregions with that flag, but vfio_listener_region_del() still leaves try_unmap=true for aligned RAM-device sections and calls vfio_container_dma_unmap(). With iommufd, that reaches IOMMU_IOAS_UNMAP, which is the operation the CXL change is trying to avoid.

Please address this in 2e963ff, where the generic memory_region_skip_iommu_map() contract is introduced. The delete path should be symmetric with the add path, e.g. have vfio_listener_region_del() skip vfio_container_dma_unmap() when memory_region_skip_iommu_map(section->mr) is true, while still running the existing memory_region_unref() and section-window cleanup.


2e963ff NVIDIA: SAUCE: memory: Allow RAM device regions to skip IOMMU mapping

It looks like you picked this from nvidia_stable-11.0, is that correct?

If so, can you pick with -x -s and then add "nvidia_stable-11.0” after the SHA in the pick line that is added by the -s flag?

e..g

(cherry picked from commit XXXX nvidia_stable-11.0)

@mmhonap mmhonap force-pushed the dma-map-skip-fix branch from e227a6b to dfa2b6c Compare June 17, 2026 11:28
@mmhonap

mmhonap commented Jun 17, 2026

Copy link
Copy Markdown
Author

Yes, this change will be applicable for 11.0 as well.

I have cherry-picked the patch with some changes from 11.0 branch. I have added the commit message reflecting the same.

I have run a codex review on these patches and now pushed new patches after verification. Please take a look at new patches.

@shamiali2008 shamiali2008 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need skip_iommu_map() in vfio_iommu_map_notify path? If that's from testing or codex review please include the justification in commit log. Otherwise looks good to me.

@mmhonap mmhonap force-pushed the dma-map-skip-fix branch from dfa2b6c to b252f04 Compare June 17, 2026 19:34
…wned RAM-device regions"

This reverts commit d814a45.

The commit made vfio_container_region_add() take an early return for
any RAM-device section owned by a VFIO device
(vfio_get_vfio_device(memory_region_owner(section->mr)) != NULL),
skipping vfio_container_dma_map() for it. In practice this excludes
every VFIO mmap subregion -- PCI BAR windows and, importantly, the
CXL.mem / DPA coherent device-memory region of a CXL Type-2 device --
from the IOMMU IOAS (the SMMU Stage-2 page tables).

Why it was originally added
---------------------------
The commit rested on two stated premises:

  1. "this mapping always fails": the backing VMAs carry
     VM_IO | VM_PFNMAP, pin_user_pages() refuses VM_IO pages, so
     IOMMU_IOAS_MAP returns -EFAULT; therefore the map is pointless.

  2. "no IOMMU entry is required": CPU access to these regions goes
     through KVM Stage-2 faults independently of the SMMU, and device
     DMA to system RAM uses separate per-RAM-section IOMMU entries.

Both premises are incorrect, and the second is the more damaging one.

In accelerated/nested SMMUv3 mode the GPU translates shared virtual
addresses through the hardware SMMU (Stage-1 = guest page tables,
Stage-2 = host iommufd). When UVM migrates a managed buffer into the
device's coherent memory, the page's guest-physical address lies in the
CXL DPA window. A GPU access to it is issued as an ATS request, and to
answer that request the SMMU must complete the Stage-1 + Stage-2 walk.
With the DPA region skipped, there is no Stage-2 entry for that
guest-physical address, so the translation faults.

The GPU posts a replayable fault; UVM services it, invalidates the TLB,
and replays; the access faults again because the Stage-2 entry still
does not exist. This becomes an unbounded fault -> service -> replay
livelock: the test makes no forward progress (it "hangs"), the host
SMMU logs nothing (an ATS request with no translation returns an
unsuccessful completion, not a fault event), and on cancellation the
GPU reports:

  NVRM: Xid 31 ... MMU Fault ... FAULT_PTE ACCESS_TYPE_VIRT_WRITE

Observed behaviour matches this exactly:

  - UVM/ATS tests that keep their working set in system RAM pass: guest
    RAM is mapped into Stage-2 normally, so the ATS access resolves.
  - UVM/ATS tests that migrate the buffer into DPA coherent memory and
    then access it via ATS hang. The failure is the intersection of two
    conditions (buffer is DPA-resident AND reached via ATS); meeting
    only one is fine.
  - Non-ATS CUDA workloads (e.g. explicit cudaMalloc + cudaMemcpy) pass:
    they reach device memory through the GPU's own page tables, never
    the SMMU.

Device-memory residency at hang time was confirmed independently:
nvidia-smi device memory usage rises during the failing tests, and the
QEMU trace shows the DPA region being skipped before this revert and
mapped successfully after it.

Why reverting is the correct fix for ATS
----------------------------------------
The correct behaviour is precisely what every other VFIO-owned
RAM-device region already gets, and what nvgrace-gpu relies on: the
region is mapped into the Stage-2 IOAS via the ordinary vaddr
IOMMU_IOAS_MAP path. That makes the GPU's ATS accesses to its own
coherent memory translatable, which removes the fault livelock.
iommufd already maps these regions correctly.

Reverting this commit restores that behaviour. With it reverted, the
CXL DPA region is mapped into Stage-2 and the UVM/ATS hang no longer
reproduces on CXL Type-2 device passthrough.

Note on the boot-time ATC_INV concern
--------------------------------------
A secondary motivation for withholding the mapping was to avoid a
fatal CMDQ_OP_ATC_INV timeout seen on a CXL Type-2 device during early
init: that error is triggered by an IOMMU_IOAS_UNMAP of the region
while the device cannot service the resulting ATC invalidation in time.
Note that the trigger is the *unmap*, not the *map*. Withholding the
map to dodge the unmap is incorrect because the map is mandatory for
ATS. If that ATC_INV timeout resurfaces on the unmap paths (guest
reboot, FLR/reset, VM shutdown), the correct fix is to guard the
region_del / unmap path while the device is in its init/reset state
(keep the mapping, defer only the unmap), not to skip the mapping.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
@mmhonap mmhonap force-pushed the dma-map-skip-fix branch from b252f04 to cdf5e75 Compare June 18, 2026 14:08
@mmhonap

mmhonap commented Jun 18, 2026

Copy link
Copy Markdown
Author

I see skipping dma mapping for cxl mode is causing some ATS based CUDA tests to fail. So I have modified this review to revert the change NVIDIA: VR: SAUCE: vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions
This way both pcie and cxl mode will not skip dma mapping for ram-device regions.
Please check the top commit in dma-map-skip-fix branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants