bf: pass aiohttp timeouts to fsspec to fix test hang#1795
Merged
yarikoptic merged 2 commits intomasterfrom Feb 14, 2026
Merged
Conversation
When dandi-cli tests are run from dandi-archive CI via `pytest --pyargs dandi`, the rootdir is dandi-archive, so dandi-cli's tox.ini [pytest] config (which has --timeout=300) is NOT read. This means the test has no timeout protection. With dandischema 0.12.0, the test no longer fails quickly (the model validation path changed), so it now reaches the fsspec HTTP read which can hang indefinitely. The xfail marker alone doesn't help -- xfail only catches exceptions, not hangs. Adding @pytest.mark.timeout(120) ensures the test is killed after 2 minutes regardless of which project runs it. The xfail marker then catches the timeout as an expected failure. Fixes: #1762 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dandi-archive CLI integration tests started hanging for 6 hours on November 24, 2025 (dandi/dandi-archive#1762). Investigation of tinuous CI logs showed: - Nov 21 (last success): test_nwb2asset_remote_asset XFAIL'd in 0.4s - Nov 24 (first hang): same test hung for 6 hours until GitHub killed it - ALL test runs on Nov 24-25 hung, across unrelated PRs The code flow that hangs is: nwb2asset() → get_metadata() → _get_pynwb_metadata() → open_readable() → RemoteReadableAsset.open() → fsspec.open(url).open() → aiohttp HTTP read from minio in Docker → fsspec sync() blocks in threading.Event.wait() The key environmental change between Nov 21 and Nov 24 was dandi-cli PR #1744 updating dandischema from <0.12.0 to ~=0.12.0. With dandischema 0.11.x, the test hit a quick model validation mismatch (completing as XFAIL in 0.4s before reaching the fsspec read). With dandischema 0.12.0 (vendor-configurable models, schema 0.7.0), that mismatch no longer occurs, so the test now proceeds to the actual fsspec HTTP read — which hangs. The hang itself is a known interaction between h5py, fsspec, and GC: - h5py holds a global lock while reading from Python file objects - fsspec's sync() runs async aiohttp coroutines on a background thread and blocks the calling thread in threading.Event.wait() - Without socket-level timeouts, aiohttp blocks forever on stalled connections (aio-libs/aiohttp#11740) - GC running during this window can deadlock with h5py's lock (h5py/h5py#2019) The fix: pass explicit ClientTimeout to aiohttp via fsspec's client_kwargs so that stalled connections raise TimeoutError instead of blocking indefinitely. Additionally, the dandi-archive CI never had a pytest --timeout because dandi-cli's tox.ini [pytest] addopts (--timeout=300) are not read when pytest runs from the dandi-archive rootdir via `pytest --pyargs dandi`. References: - fsspec/filesystem_spec#1666 - h5py/h5py#2019 - aio-libs/aiohttp#11740 - #1762 - #1450 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1795 +/- ##
=======================================
Coverage 75.11% 75.11%
=======================================
Files 84 84
Lines 11921 11923 +2
=======================================
+ Hits 8954 8956 +2
Misses 2967 2967
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Member
Author
|
I will keep it to update and rerun after we merge other minor PRs and to release |
Member
Author
|
other PRs might need more time, let's proceed! |
yarikoptic-gitmate
approved these changes
Feb 14, 2026
|
🚀 PR was released in |
yarikoptic
added a commit
to dandi/dandi-archive
that referenced
this pull request
Feb 14, 2026
… here" This reverts commit c88d5c3. Should not be needed with 0.74.3 after - dandi/dandi-cli#1795 but who knows?!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix the indefinite hang in
test_nwb2asset_remote_assetthat has been blocking dandi-archive CLI integration tests since November 24, 2025.Root cause
RemoteReadableAsset.open()callsfsspec.open(url)with no socket-level timeouts. When h5py reads an NWB file through fsspec's HTTP backend, the call chain is:This creates conditions for a deadlock documented in h5py/h5py#2019:
sync()blocks the calling thread inthreading.Event.wait()waiting for the aiohttp coroutineWhy it surfaced on November 24
Analysis of tinuous CI logs for dandi-archive showed:
>= 0.11.1, < 0.12.0~= 0.12.0With dandischema 0.11.x, the test hit a quick model validation mismatch (XFAIL before reaching fsspec). PR #1744 updated to dandischema 0.12.0 (vendor-configurable models, schema 0.7.0), which fixed that mismatch — so the test now reaches the fsspec HTTP read, which hangs.
Additionally, dandi-cli's
tox.iniglobal--timeout=300is not read when tests run from the dandi-archive rootdir viapytest --pyargs dandi.Changes
bf: pass aiohttp timeouts to fsspec in RemoteReadableAsset.open()ClientTimeout(total=120, sock_read=60, sock_connect=30)so stalled connections raiseTimeoutErrorinstead of blocking foreverbf(test): add timeout to test_nwb2asset_remote_asset to prevent CI hang@pytest.mark.timeout(120)as belt-and-suspenders (works regardless of rootdir config)Companion change
dandi-archive
cli-integration.ymlwill also add--timeout=300to the pytest command line and remove the-k "not test_nwb2asset_remote_asset"exclusion — but that needs a new dandi-cli release containing these fixes for thereleasematrix variant to pick them up.References
h5py,gc, andfsspech5py/h5py#2019 — Deadlock from h5py + GC + file-like objects🤖 Generated with Claude Code