@mresvanis mresvanis commented Jan 15, 2026

This PR adds Fabric Manager (FM) Shared NVSwitch virtualization model support when NVSwitch devices are detected and the newly introduced FABRIC_MANAGER_FABRIC_MODE env var is set to 1 (shared-nvswitch).

Nothing changes when FABRIC_MANAGER_FABRIC_MODE=0 (the default FM mode, full-passthrough); this remains the current flow when NVSwitch devices are detected.
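
As a rough illustration of the mode selection described above, the following Python sketch reads the env var with its default of 0 and rewrites a `FABRIC_MODE` line in fabric manager config text. The helper names (`read_fabric_mode`, `set_fabric_mode`) are hypothetical and are not taken from this PR, whose actual implementation is not shown here.

```python
import os
import re

# Mode values as described in this PR: 0 = full-passthrough, 1 = shared-nvswitch.
FABRIC_MODES = {0: "full-passthrough", 1: "shared-nvswitch"}


def read_fabric_mode(environ=os.environ):
    """Return the requested fabric mode, defaulting to 0 when the var is unset."""
    raw = environ.get("FABRIC_MANAGER_FABRIC_MODE", "0")
    mode = int(raw)  # raises ValueError for non-numeric input
    if mode not in FABRIC_MODES:
        raise ValueError(f"unsupported FABRIC_MANAGER_FABRIC_MODE: {raw!r}")
    return mode


def set_fabric_mode(config_text, mode):
    """Return config_text with its FABRIC_MODE line set to the given mode.

    Hypothetical helper: replaces an existing FABRIC_MODE=... line, or
    appends one if the config text does not contain it.
    """
    line = f"FABRIC_MODE={mode}"
    new_text, count = re.subn(
        r"(?m)^[ \t]*FABRIC_MODE[ \t]*=.*$", line, config_text
    )
    if count == 0:
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += line + "\n"
    return new_text
```

Rejecting unknown values early, rather than passing them through to the config file, is a design choice of this sketch and may differ from what the PR does.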

Changes

  • Add the FABRIC_MANAGER_FABRIC_MODE env var to control the fabric manager FABRIC_MODE (0, the default, for full-passthrough; 1 for shared-nvswitch).
  • Blacklist GPU devices from the NVIDIA driver when FABRIC_MODE is set to Shared NVSwitch.
    • Bind them to vfio-pci instead.
  • Create a JSON file mapping GPU physical module IDs to PCIe addresses.
  • Do not run nvidia-persistenced, since GPU devices are blacklisted from the NVIDIA driver.
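
The module-ID-to-PCIe-address mapping file could look roughly like the output of the sketch below. It assumes the module IDs and PCIe addresses have already been collected (the PR obtains them via nvidia-smi; the exact invocation is not shown here) as `module_id, pci_address` lines. That input format, the JSON layout, and the function names are all assumptions made for illustration.

```python
import json


def build_module_map(csv_text):
    """Parse 'module_id, pci_address' lines into {module_id: pci_address}.

    The input format is an assumption for this sketch, not the PR's format.
    """
    mapping = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        module_id, pci_address = (field.strip() for field in line.split(",", 1))
        # JSON object keys are strings; sysfs uses lowercase hex in addresses.
        mapping[str(int(module_id))] = pci_address.lower()
    return mapping


def write_module_map(mapping, path):
    """Persist the mapping as a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, indent=2, sort_keys=True)
        f.write("\n")
```

Persisting the mapping before the GPUs are taken away from the NVIDIA driver matters because nvidia-smi can no longer report on devices that the driver does not manage.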

Flow when FABRIC_MANAGER_FABRIC_MODE=1 (shared-nvswitch)

  1. Update fabric manager config to shared-nvswitch mode.
  2. Configure UNIX socket communication instead of TCP.
  3. Capture and persist the GPU physical module ID to PCIe address mapping via nvidia-smi.
  4. Blacklist GPU devices from the NVIDIA driver using driver_override.
  5. Bind GPU devices to vfio-pci for passthrough scenarios.
  6. Skip nvidia-persistenced startup since GPU devices are no longer managed by the NVIDIA driver.
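
Steps 4 and 5 above can be sketched with the standard Linux PCI sysfs files `driver_override`, `driver/unbind`, and `drivers_probe`. This is a minimal sketch, assuming the vfio-pci module is already loaded and the caller has root privileges; the function name and the `sysfs_root` parameter (added so the logic can be exercised against a fake directory tree) are hypothetical, and the PR's own script may order or implement these steps differently.

```python
import os


def bind_to_vfio(pci_address, sysfs_root="/sys"):
    """Steer one PCI device to vfio-pci using the kernel's sysfs interface."""
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", pci_address)

    # Step 4: with driver_override set, only vfio-pci may claim this device,
    # which keeps the NVIDIA driver from binding to it.
    with open(os.path.join(dev, "driver_override"), "w") as f:
        f.write("vfio-pci")

    # If some driver currently owns the device, release it first.
    unbind = os.path.join(dev, "driver", "unbind")
    if os.path.exists(unbind):
        with open(unbind, "w") as f:
            f.write(pci_address)

    # Step 5: ask the kernel to re-probe; the override selects vfio-pci.
    with open(os.path.join(sysfs_root, "bus", "pci", "drivers_probe"), "w") as f:
        f.write(pci_address)
```

A caller would typically loop over the PCIe addresses recorded in the mapping file from step 3 and call `bind_to_vfio` for each one.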

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mresvanis mresvanis force-pushed the fabric-manager-configuration branch 2 times, most recently from 7182624 to d59cb29 on January 15, 2026 18:58

The changes include:

- add the `FABRIC_MANAGER_FABRIC_MODE` env var that configures FM with
  either full-passthrough (0) or shared-nvswitch (1) mode. It defaults
  to 0.
- when fabric manager mode is set to 0, the flow is unchanged, i.e. the
  fabric manager daemon is executed with its default configuration.
- when fabric manager mode is set to 1:
  - edit the fabric manager configuration file and set `FABRIC_MODE=1`.
  - persist the mapping of physical GPU module IDs to their PCIe
    addresses by creating a JSON file on disk (the physical GPU module
    IDs are available through nvidia-smi).
  - blacklist GPU devices from the NVIDIA driver.
  - disable `nvidia-persistenced`, as the GPU devices are now
    blacklisted from the NVIDIA driver and bound to VFIO.

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis mresvanis force-pushed the fabric-manager-configuration branch from d59cb29 to 01dca1b on January 15, 2026 18:59