@ixhamza ixhamza commented Jan 30, 2026

Reduce hung device recovery time using NVM Subsystem Reset (NSSR) and optimized timeouts:

  • Skip abort commands and go directly to reset (following FreeBSD)
  • Trigger NSSR on hotplug-capable slots for fast device removal
  • Skip explicit queue deletion during reset (saves admin_timeout)
  • Detect hung devices after the first CAP.TO expires by checking controller state
  • Skip re-enable in reset_work() for hung devices (saves the second CAP.TO)
  • Remove namespaces immediately for hung devices once the controller reaches the DEAD state
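
For readers unfamiliar with the register fields named above, here is a minimal sketch of how CAP.TO and the NSSR capability bit are encoded in the 64-bit controller CAP register. The field positions follow the NVMe base specification; the helper names and the sample register value are illustrative assumptions, not code from this PR.

```python
# Sketch only: decode two fields of the 64-bit NVMe CAP register.
# Field positions follow the NVMe base specification; the helper names
# and the sample register value are made up for illustration.

CAP_TO_SHIFT = 24       # CAP.TO occupies bits 31:24
CAP_TO_MASK = 0xFF
CAP_TO_UNIT_MS = 500    # CAP.TO counts in units of 500 ms
CAP_NSSRS_BIT = 36      # CAP.NSSRS: NVM Subsystem Reset supported


def cap_timeout_seconds(cap: int) -> float:
    """Return the worst-case controller ready timeout in seconds."""
    to_units = (cap >> CAP_TO_SHIFT) & CAP_TO_MASK
    return to_units * CAP_TO_UNIT_MS / 1000.0


def cap_supports_nssr(cap: int) -> bool:
    """Return True if the controller advertises NSSR support."""
    return bool((cap >> CAP_NSSRS_BIT) & 1)


# Hypothetical CAP value: TO = 90 units (45 s) with NSSRS set.
example_cap = (1 << CAP_NSSRS_BIT) | (90 << CAP_TO_SHIFT)
print(cap_timeout_seconds(example_cap))  # 45.0
print(cap_supports_nssr(example_cap))    # True
```

A CAP.TO value of 90 corresponds to the ~45s figure used in the timing estimates in this description.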

NSSR path (NSSR supported + hotplug-capable):
After io_timeout (30s) expires, the driver triggers NSSR, which brings the PCIe link down. pciehp detects the link state change and calls nvme_remove(), which removes the device immediately. Total recovery time: ~30s.

Non-NSSR path (NSSR not supported or non-hotplug slot):
After io_timeout (30s) expires, the driver proceeds with a controller reset. Detecting the hung device after the first CAP.TO avoids a second CAP.TO wait in reset_work(). With the default CAP.TO (~45s), total recovery time is ~75s.
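
The choice between the two paths can be sketched as a small decision function. This is a simplified model of the behaviour described above, not the driver code: `Slot`, `recovery_path`, and `estimated_recovery_s` are hypothetical names, and the default timeouts are the values quoted in this description.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """Hypothetical summary of what the driver knows about a device."""
    nssr_supported: bool   # controller advertises NSSR support
    hotplug_capable: bool  # PCIe slot is handled by pciehp


def recovery_path(slot: Slot) -> str:
    """NSSR is used only when both conditions hold; otherwise plain reset."""
    if slot.nssr_supported and slot.hotplug_capable:
        return "nssr"
    return "reset"


def estimated_recovery_s(slot: Slot, io_timeout_s: int = 30,
                         cap_to_s: int = 45) -> int:
    """Rough recovery time for a hung device on each path."""
    if recovery_path(slot) == "nssr":
        # Link goes down right after io_timeout; pciehp removes the device.
        return io_timeout_s
    # Controller reset waits one CAP.TO before the device is declared hung.
    return io_timeout_s + cap_to_s


print(estimated_recovery_s(Slot(True, True)))   # 30
print(estimated_recovery_s(Slot(True, False)))  # 75
```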

Previously, hung device recovery took io_timeout×2 + admin_timeout + CAP.TO×2. With defaults (io_timeout=30s, admin_timeout=60s, CAP.TO=45s), this was ~210s.
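
As a quick arithmetic check, the sketch below recomputes the previous total and subtracts the waits this change eliminates on the non-NSSR path. Attributing one io_timeout to the skipped abort step is an inference from the formula above rather than something stated explicitly; the other two savings are named in the change list.

```python
# Default timeouts quoted in this description, in seconds.
IO_TIMEOUT_S = 30
ADMIN_TIMEOUT_S = 60
CAP_TO_S = 45

# Previous behaviour: io_timeout x2 + admin_timeout + CAP.TO x2.
previous_total = 2 * IO_TIMEOUT_S + ADMIN_TIMEOUT_S + 2 * CAP_TO_S

# Waits removed on the non-NSSR path.
savings = {
    "skip abort, go straight to reset": IO_TIMEOUT_S,  # assumed mapping
    "skip explicit queue deletion": ADMIN_TIMEOUT_S,
    "skip re-enable for hung device": CAP_TO_S,
}

non_nssr_total = previous_total - sum(savings.values())

print(previous_total)  # 210
print(non_nssr_total)  # 75
```

The remaining 75s is one io_timeout plus one CAP.TO, which matches the non-NSSR estimate.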

Removing namespaces for hung devices cleans up stale block devices that previously remained visible even though the controller was dead, which enables proper drive replacement workflows.

Testing

@ixhamza ixhamza requested a review from amotin January 30, 2026 20:51
@bugclerk bugclerk changed the title from "nvme-pci: use NSSR to avoid CAP.TO and optimize timeout handling" to "NAS-139155 / None / nvme-pci: use NSSR to avoid CAP.TO and optimize timeout handling" Jan 30, 2026
@ixhamza ixhamza force-pushed the NAS-139155-nssr-reset branch 2 times, most recently from 995ffa8 to cbcfe20 February 3, 2026 20:01
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
@ixhamza ixhamza force-pushed the NAS-139155-nssr-reset branch from cbcfe20 to 72b5f00 February 4, 2026 15:05
@ixhamza ixhamza merged commit dfe61fb into truenas/linux-6.18 Feb 4, 2026
6 checks passed
bugclerk commented Feb 4, 2026

This PR has been merged and conversations have been locked.
If you would like to discuss this issue further, please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Feb 4, 2026
@ixhamza ixhamza deleted the NAS-139155-nssr-reset branch February 4, 2026 17:22