Skip to content

bf: pass aiohttp timeouts to fsspec to fix test hang#1795

Merged
yarikoptic merged 2 commits intomasterfrom
bf-test-nwb2asset-hang
Feb 14, 2026
Merged

bf: pass aiohttp timeouts to fsspec to fix test hang#1795
yarikoptic merged 2 commits intomasterfrom
bf-test-nwb2asset-hang

Conversation

@yarikoptic
Copy link
Member

Summary

Fix the indefinite hang in test_nwb2asset_remote_asset that has been blocking dandi-archive CLI integration tests since November 24, 2025.

Root cause

RemoteReadableAsset.open() calls fsspec.open(url) with no socket-level timeouts. When h5py reads an NWB file through fsspec's HTTP backend, the call chain is:

nwb2asset() → get_metadata() → _get_pynwb_metadata()
  → open_readable() → RemoteReadableAsset.open()
    → fsspec.open(url).open()          # aiohttp, no timeout
      → h5py.File(fp) + NWBHDF5IO     # holds HDF5 global lock

This creates conditions for a deadlock documented in h5py/h5py#2019:

Why it surfaced on November 24

Analysis of tinuous CI logs for dandi-archive showed:

Date dandi-cli dandischema test result
Nov 21 0.73.2+5 (8492260) >= 0.11.1, < 0.12.0 XFAIL in 0.4s
Nov 24 0.73.2+11 (fa3ce1e) ~= 0.12.0 hung 6 hours

With dandischema 0.11.x, the test hit a quick model validation mismatch (XFAIL before reaching fsspec). PR #1744 updated to dandischema 0.12.0 (vendor-configurable models, schema 0.7.0), which fixed that mismatch — so the test now reaches the fsspec HTTP read, which hangs.

Additionally, dandi-cli's tox.ini global --timeout=300 is not read when tests run from the dandi-archive rootdir via pytest --pyargs dandi.

Changes

  • a6109be bf: pass aiohttp timeouts to fsspec in RemoteReadableAsset.open()
    • Passes ClientTimeout(total=120, sock_read=60, sock_connect=30) so stalled connections raise TimeoutError instead of blocking forever
  • 166324a bf(test): add timeout to test_nwb2asset_remote_asset to prevent CI hang
    • Adds @pytest.mark.timeout(120) as belt-and-suspenders (works regardless of rootdir config)

Companion change

dandi-archive cli-integration.yml will also add --timeout=300 to the pytest command line and remove the -k "not test_nwb2asset_remote_asset" exclusion — but that needs a new dandi-cli release containing these fixes for the release matrix variant to pick them up.

References

🤖 Generated with Claude Code

yarikoptic and others added 2 commits February 12, 2026 13:20
When dandi-cli tests are run from dandi-archive CI via
`pytest --pyargs dandi`, the rootdir is dandi-archive, so
dandi-cli's tox.ini [pytest] config (which has --timeout=300)
is NOT read. This means the test has no timeout protection.

With dandischema 0.12.0, the test no longer fails quickly
(the model validation path changed), so it now reaches the
fsspec HTTP read which can hang indefinitely. The xfail marker
alone doesn't help -- xfail only catches exceptions, not hangs.

Adding @pytest.mark.timeout(120) ensures the test is killed
after 2 minutes regardless of which project runs it. The
xfail marker then catches the timeout as an expected failure.

Fixes: #1762

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dandi-archive CLI integration tests started hanging for 6 hours
on November 24, 2025 (dandi/dandi-archive#1762).

Investigation of tinuous CI logs showed:

- Nov 21 (last success): test_nwb2asset_remote_asset XFAIL'd in 0.4s
- Nov 24 (first hang): same test hung for 6 hours until GitHub killed it
- ALL test runs on Nov 24-25 hung, across unrelated PRs

The code flow that hangs is:

  nwb2asset() → get_metadata() → _get_pynwb_metadata()
    → open_readable() → RemoteReadableAsset.open()
      → fsspec.open(url).open()
        → aiohttp HTTP read from minio in Docker
          → fsspec sync() blocks in threading.Event.wait()

The key environmental change between Nov 21 and Nov 24 was dandi-cli
PR #1744 updating dandischema from <0.12.0 to ~=0.12.0.  With
dandischema 0.11.x, the test hit a quick model validation mismatch
(completing as XFAIL in 0.4s before reaching the fsspec read).  With
dandischema 0.12.0 (vendor-configurable models, schema 0.7.0), that
mismatch no longer occurs, so the test now proceeds to the actual
fsspec HTTP read — which hangs.

The hang itself is a known interaction between h5py, fsspec, and GC:

- h5py holds a global lock while reading from Python file objects
- fsspec's sync() runs async aiohttp coroutines on a background thread
  and blocks the calling thread in threading.Event.wait()
- Without socket-level timeouts, aiohttp blocks forever on stalled
  connections (aio-libs/aiohttp#11740)
- GC running during this window can deadlock with h5py's lock
  (h5py/h5py#2019)

The fix: pass explicit ClientTimeout to aiohttp via fsspec's
client_kwargs so that stalled connections raise TimeoutError instead
of blocking indefinitely.

Additionally, the dandi-archive CI never had a pytest --timeout because
dandi-cli's tox.ini [pytest] addopts (--timeout=300) are not read when
pytest runs from the dandi-archive rootdir via `pytest --pyargs dandi`.

References:
- fsspec/filesystem_spec#1666
- h5py/h5py#2019
- aio-libs/aiohttp#11740
- #1762
- #1450

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yarikoptic yarikoptic added the patch Increment the patch version when merged label Feb 13, 2026
@codecov
Copy link

codecov bot commented Feb 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.11%. Comparing base (a3dd9f9) to head (a6109be).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1795   +/-   ##
=======================================
  Coverage   75.11%   75.11%           
=======================================
  Files          84       84           
  Lines       11921    11923    +2     
=======================================
+ Hits         8954     8956    +2     
  Misses       2967     2967           
Flag Coverage Δ
unittests 75.11% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yarikoptic yarikoptic added the release Create a release when this pr is merged label Feb 13, 2026
@yarikoptic
Copy link
Member Author

I will keep it to update and rerun after we merge other minor PRs and to release

@yarikoptic
Copy link
Member Author

other PRs might need more time, let's proceed!

@yarikoptic yarikoptic merged commit 830debb into master Feb 14, 2026
39 of 40 checks passed
@yarikoptic yarikoptic deleted the bf-test-nwb2asset-hang branch February 14, 2026 03:30
@github-actions
Copy link

🚀 PR was released in 0.74.3 🚀

yarikoptic added a commit to dandi/dandi-archive that referenced this pull request Feb 14, 2026
… here"

This reverts commit c88d5c3.

Should not be needed with 0.74.3 after

- dandi/dandi-cli#1795

but who knows?!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Increment the patch version when merged release Create a release when this pr is merged released

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants