NAS-139155 / None / nvme-pci: optimize hung device timeouts and remove stale namespaces#236

Closed
ixhamza wants to merge 1 commit into truenas/linux-6.18 from NAS-139155

@ixhamza ixhamza commented Jan 26, 2026

Reduce hung device detection time by eliminating redundant waits:

  • Skip explicit queue deletion during reset (saves admin_timeout)
  • Detect hung device after first CAP.TO by checking CSTS.RDY
  • Skip re-enable in reset_work for hung devices (saves CAP.TO)
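
As a rough illustration of the CSTS.RDY check in the list above, here is a minimal sketch in plain C. The bit positions follow the NVMe specification (CAP.TO is bits 31:24 of CAP in 500 ms units, CSTS.RDY is bit 0 of CSTS). The helper names `cap_to_ms`, `csts_rdy`, and `ready_state_stuck` are hypothetical and are not taken from this patch, and the "stuck" rule is only one plausible reading of the description: after one CAP.TO wait, RDY still not matching the requested state (or an all-ones register read) is treated as a hung device.

```c
#include <stdbool.h>
#include <stdint.h>

/* CAP.TO: bits 31:24 of the 64-bit CAP register, in units of 500 ms. */
static inline uint32_t cap_to_ms(uint64_t cap)
{
    return (uint32_t)((cap >> 24) & 0xff) * 500;
}

/* CSTS.RDY: bit 0 of the CSTS register. */
static inline bool csts_rdy(uint32_t csts)
{
    return (csts & 0x1) != 0;
}

/*
 * Hypothetical helper (an assumption, not the patch's code): call it with
 * the CSTS value read after waiting one CAP.TO interval. Returns true when
 * the controller still has not reached the requested ready state, or when
 * the register reads as all ones, which usually means the device is no
 * longer responding on the bus.
 */
static inline bool ready_state_stuck(uint32_t csts, bool want_ready)
{
    if (csts == UINT32_MAX)
        return true;
    return csts_rdy(csts) != want_ready;
}
```

With a check like this, a caller could give up after a single CAP.TO interval instead of issuing further commands that would each have to time out on their own.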

Total savings for a hung device: admin_timeout + CAP.TO. For example, with the default admin_timeout (60s) and a CAP.TO of 45s (the customer's scenario), this saves 105 seconds.
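
The arithmetic can be checked with a small sketch. The helper name `hung_reset_savings_s` is hypothetical and only restates the sum described above. It assumes the CAP.TO value is given as the raw register field, which the NVMe specification defines in 500 ms units, so a 45s timeout corresponds to a field value of 90.

```c
#include <stdint.h>

/*
 * Illustrative only: seconds saved for a hung device when one
 * admin_timeout wait and one CAP.TO wait are skipped.
 * cap_to_field is the raw CAP.TO field (500 ms units).
 */
static inline uint32_t hung_reset_savings_s(uint32_t admin_timeout_s,
                                            uint8_t cap_to_field)
{
    uint32_t cap_to_ms = (uint32_t)cap_to_field * 500;

    return admin_timeout_s + cap_to_ms / 1000;
}
```

For the numbers quoted here, `hung_reset_savings_s(60, 90)` evaluates to 105.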

For hung devices, namespaces are removed after the controller transitions to the DEAD state. This cleans up stale block devices that previously remained visible to userspace despite the dead controller, and it generates hotplug events that enable proper drive replacement workflows.

Testing

@ixhamza ixhamza requested review from amotin and yocalebo January 26, 2026 13:21
@bugclerk bugclerk changed the title nvme-pci: optimize hung device timeouts and remove stale namespaces NAS-139155 / None / nvme-pci: optimize hung device timeouts and remove stale namespaces Jan 26, 2026

Reduce hung device detection time by eliminating redundant waits:
- Skip explicit queue deletion during reset (saves admin_timeout)
- Detect hung device after first CAP.TO by checking CSTS.RDY
- Skip re-enable in reset_work for hung devices (saves CAP.TO)

Total savings: admin_timeout + CAP.TO for hung devices. For example,
with default admin_timeout (60s) and CAP.TO of 45s, this saves 105
seconds.

For hung devices, namespaces are removed after transitioning to DEAD
state. This cleans up stale block devices that previously remained
visible to userspace despite the dead controller, and generates
hotplug events enabling proper drive replacement workflows.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
ixhamza commented Jan 30, 2026

Superseded by #238.

@ixhamza ixhamza closed this Jan 30, 2026
@ixhamza ixhamza deleted the NAS-139155 branch January 30, 2026 20:51