NAS-139155 / None / nvme-pci: use NSSR to avoid CAP.TO and optimize timeout handling #238

ixhamza · 2026-01-30T20:51:22Z

Reduce hung device recovery time using NSSR and optimized timeouts:

- Skip abort commands and go directly to reset (following FreeBSD)
- Trigger NSSR on hotplug-capable slots for fast device removal
- Skip explicit queue deletion during reset (saves admin_timeout)
- Detect hung devices after first CAP.TO by checking controller state
- Skip re-enable in reset_work for hung devices (saves second CAP.TO)
- Remove namespaces immediately for hung devices after DEAD state

NSSR path (NSSR supported + hotplug-capable):
After io_timeout (30s), trigger NSSR, which brings the PCIe link down. pciehp detects the link state change and calls nvme_remove() for immediate device removal. Total recovery time: ~30s.

Non-NSSR path (NSSR not supported or non-hotplug slot):
After io_timeout (30s), proceed with controller reset. Detecting the hung device after the first CAP.TO avoids a second CAP.TO wait in reset_work(). With default CAP.TO (~45s), total recovery time: ~75s.
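The choice between the two paths can be sketched in plain C as below. This is an illustration only, not the code from the patch: the function names, the enum, and the simulated register array are hypothetical. The bit positions (CAP.NSSRS at bit 36, CAP.TO at bits 31:24 in 500 ms units), the NSSR register offset (0x20), and the reset value (0x4E564D65, ASCII "NVMe") are taken from the NVMe base specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the NVMe base specification. */
#define NVME_REG_NSSR   0x20u        /* NVM Subsystem Reset register offset */
#define NVME_NSSR_MAGIC 0x4E564D65u  /* ASCII "NVMe": writing it requests NSSR */

/* Hypothetical helper: CAP.NSSRS (bit 36) says whether NSSR is supported. */
static bool cap_nssr_supported(uint64_t cap)
{
    return (cap >> 36) & 0x1;
}

/* Hypothetical helper: CAP.TO (bits 31:24) is expressed in 500 ms units. */
static unsigned cap_timeout_ms(uint64_t cap)
{
    return (unsigned)((cap >> 24) & 0xff) * 500u;
}

enum recovery_path { PATH_NSSR, PATH_CONTROLLER_RESET };

/* NSSR is only useful when the slot can report the resulting link-down
 * event; otherwise fall back to a normal controller reset. */
static enum recovery_path choose_recovery_path(uint64_t cap, bool hotplug_capable)
{
    if (cap_nssr_supported(cap) && hotplug_capable)
        return PATH_NSSR;
    return PATH_CONTROLLER_RESET;
}

/* Assumption: 'bar' points at the controller register space (here it can
 * be any array of 32-bit words used as a stand-in). */
static void trigger_nssr(volatile uint32_t *bar)
{
    bar[NVME_REG_NSSR / 4] = NVME_NSSR_MAGIC;
}
```

In the real driver the hotplug capability would come from the PCIe slot rather than a boolean argument, and the register write would go through the kernel's MMIO accessors.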

Previously, hung device recovery took io_timeout×2 + admin_timeout + CAP.TO×2. With defaults (io_timeout=30s, admin_timeout=60s, CAP.TO=45s), this was ~210s.
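The recovery-time figures quoted above follow from simple sums. The sketch below reproduces them. The struct and function names are hypothetical and exist only for this illustration; the formulas and default values are the ones stated in this description.

```c
/* Timeouts in seconds; names are illustrative, not driver identifiers. */
struct nvme_timeouts {
    unsigned io_timeout_s;
    unsigned admin_timeout_s;
    unsigned cap_to_s;
};

/* Before the change: io_timeout x2 + admin_timeout + CAP.TO x2. */
static unsigned recovery_before_s(const struct nvme_timeouts *t)
{
    return 2 * t->io_timeout_s + t->admin_timeout_s + 2 * t->cap_to_s;
}

/* NSSR path: only the initial io_timeout elapses before removal. */
static unsigned recovery_nssr_s(const struct nvme_timeouts *t)
{
    return t->io_timeout_s;
}

/* Non-NSSR path: io_timeout plus a single CAP.TO wait. */
static unsigned recovery_reset_s(const struct nvme_timeouts *t)
{
    return t->io_timeout_s + t->cap_to_s;
}
```

With the defaults (30, 60, 45) these evaluate to 210s, 30s, and 75s respectively, matching the numbers in the description.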

Removing namespaces for hung devices cleans up stale block devices that previously remained visible even though the controller was dead, enabling proper drive replacement workflows.

Testing

Manually tested with hung device simulation patch.
Scale Build.

bugclerk · 2026-01-30T20:51:46Z

Jira URL: https://ixsystems.atlassian.net/browse/NAS-139155

Reduce hung device recovery time using NSSR and optimized timeouts:

- Skip abort commands and go directly to reset (following FreeBSD)
- Trigger NSSR on hotplug-capable slots for fast device removal
- Skip explicit queue deletion during reset (saves admin_timeout)
- Detect hung devices after first CAP.TO by checking controller state
- Skip re-enable in reset_work for hung devices (saves second CAP.TO)
- Remove namespaces immediately for hung devices after DEAD state

NSSR path (NSSR supported + hotplug-capable):
After io_timeout (30s), trigger NSSR which causes PCIe link down. pciehp detects link state change and calls nvme_remove() for immediate device removal. Total recovery time: ~30s.

Non-NSSR path (NSSR not supported or non-hotplug slot):
After io_timeout (30s), proceed with controller reset. Hung device detection after first CAP.TO prevents second CAP.TO wait in reset_work. With default CAP.TO (~45s), total recovery time: ~75s.

Previously, hung device recovery took io_timeout×2 + admin_timeout + CAP.TO×2. With defaults (io_timeout=30s, admin_timeout=60s, CAP.TO=45s), this was ~210s.

Namespace removal for hung devices cleans up stale block devices that previously remained visible despite dead controller, enabling proper drive replacement workflows.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

bugclerk · 2026-02-04T17:22:27Z

This PR has been merged and conversations have been locked.
If you would like to discuss more about this issue please use our forums or raise a Jira ticket.

ixhamza requested a review from amotin January 30, 2026 20:51

ixhamza mentioned this pull request Jan 30, 2026

NAS-139155 / None / nvme-pci: optimize hung device timeouts and remove stale namespaces #236

Closed

bugclerk changed the title ~~nvme-pci: use NSSR to avoid CAP.TO and optimize timeout handling~~ NAS-139155 / None / nvme-pci: use NSSR to avoid CAP.TO and optimize timeout handling Jan 30, 2026

ixhamza force-pushed the NAS-139155-nssr-reset branch 2 times, most recently from 995ffa8 to cbcfe20 February 3, 2026 20:01

amotin approved these changes Feb 4, 2026

ixhamza force-pushed the NAS-139155-nssr-reset branch from cbcfe20 to 72b5f00 February 4, 2026 15:05

ixhamza merged commit dfe61fb into truenas/linux-6.18 Feb 4, 2026
6 checks passed

bugclerk added the no-time-tracked label Feb 4, 2026

truenas locked as resolved and limited conversation to collaborators Feb 4, 2026

ixhamza deleted the NAS-139155-nssr-reset branch February 4, 2026 17:22
