Skip to content

fix: close stale S3BucketRegionCache on session refresh; guard _sessions=None; serialize set_session() under asyncio.Lock#1028

Closed
josh-ag2 wants to merge 5 commits into
fsspec:mainfrom
josh-ag2:fix/close-stale-region-cache-on-refresh
Closed

fix: close stale S3BucketRegionCache on session refresh; guard _sessions=None; serialize set_session() under asyncio.Lock#1028
josh-ag2 wants to merge 5 commits into
fsspec:mainfrom
josh-ag2:fix/close-stale-region-cache-on-refresh

Conversation

@josh-ag2

Copy link
Copy Markdown

Problem

Three related issues in S3FileSystem.set_session() that together cause aiohttp Unclosed client session / Unclosed connector warnings when the server closes connections mid-session:

  1. Stale _s3creator is silently abandoned on refresh. When set_session(refresh=True) replaces self._s3creator with a new S3BucketRegionCache, the old cache is never closed. It holds aiobotocore-created aiohttp.ClientSession objects (1 general + up to N regional). GC'd without an explicit close() → aiohttp emits warnings.

  2. Vacuous-truth refresh on empty _sessions. all(_.closed for _ in []) is always True, so a freshly-entered AIOHTTPSession with _sessions = {} (no requests sent yet) incorrectly triggers refresh=True. This causes the next issue to fire spuriously.

  3. Concurrent callers (e.g. asyncio.gather) each create their own creator. When self._s3 is None, multiple concurrent coroutines all fall through to the setup path. Without serialization, Task B's Fix 1 (close stale creator) tears down Task A's creator while Task A is mid-request — setting _sessions = None on an active AIOHTTPSession, producing:

    AttributeError: 'NoneType' object has no attribute 'get'
    

    at aiobotocore/httpsession.py.

The stale-check also needs to guard against newer aiobotocore setting _sessions = None (rather than {}) after __aexit__.

Changes

s3fs/core.py

  • Fix 1 — Close stale _s3creator on refresh: Before self._s3creator = s3creator, call await old_creator.__aexit__(None, None, None) so all regional aiobotocore clients and their aiohttp sessions are closed cleanly.

  • Fix 2 — Guard stale-session check: Change the stale-session refresh condition to:

    if sessions is not None and sessions and all(_.closed for _ in sessions.values()):
        refresh = True
    • sessions is not None — newer aiobotocore sets _sessions = None after __aexit__; treat None as "not stale, just closed"
    • sessions (non-empty) — an empty dict means no requests have been sent yet; not stale
  • Fix 3 — Serialize creator setup under asyncio.Lock: Wrap the entire creator-setup path in async with self._setup_lock. Concurrent tasks wait at the lock, then re-check self._s3 is not None and reuse the already-created session rather than creating their own. This makes Fix 1 safe: the old creator is only closed once, by the task that wins the lock, before any new sessions are open.

s3fs/tests/test_s3fs.py

  • test_stale_creator_closed_on_refresh — patches __aexit__ on the old creator and asserts it was called during set_session(refresh=True)

docs/source/changelog.rst — Unreleased section added.

Motivation

Discovered while integrating zarr v3's FsspecStore.close() (zarr-developers/zarr-python#4003). zarr's asyncio.gather fires 4 concurrent FsspecStore.get() calls on the same S3FileSystem instance, triggering issue 3. Issues 1 and 2 produce residual unclosed-session warnings once the render completes.

Related: aiobotocore/aiobotocore#866

Test

All existing tests pass. New test confirms __aexit__ is called on the old creator during refresh.

josh-ag2 and others added 3 commits May 24, 2026 08:55
When set_session(refresh=True) replaces _s3creator, the old cache is
now closed via __aexit__ before being discarded.  Without this, the
aiobotocore ClientSessions it holds are garbage-collected without an
explicit close(), producing aiohttp "Unclosed client session" warnings
at process exit.

Also track the weakref finalizer as self._finalizer so that an explicit
_close() / close() call can detach it, preventing a double-close when
the S3FileSystem is subsequently garbage-collected.

Closes #NNNN.  See also: aio-libs/aiobotocore#866

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essions=None

The _close() / close = sync_wrapper(_close) addition was reverted.
zarr's FsspecStore.close() calls _close_fs(fs) which uses
fs.set_session() + session.close() — it never calls fs.close() directly,
so exposing a sync close() wrapper introduced new behaviour (potential
premature shutdown) with no caller.

Also guard the stale-session check at set_session() against the newer
aiobotocore behaviour where AIOHTTPSession.__aexit__() sets _sessions=None
(older versions left it as an empty dict). Without this guard,
_sessions.values() raises AttributeError after an explicit session.close().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… on empty _sessions

Two related bugs caused concurrent callers (e.g. zarr's asyncio.gather in
AsyncGroup.open) to tear down each other's in-flight S3 sessions:

1. The stale-session check treated an empty _sessions dict ({}  — no
   requests sent yet, not genuinely stale) the same as all-closed sessions
   because all([]) is vacuously True.  Added `sessions and` guard so an
   empty dict never triggers a refresh.

2. When multiple coroutines each saw self._s3 is None simultaneously they
   each created their own creator, and Fix 1 (close stale creator on
   refresh) then closed a creator whose client was still in use by a peer,
   setting _sessions=None mid-request.  Serialise the entire creator-setup
   path under a per-instance asyncio.Lock (created lazily; the check +
   assignment between two non-await lines is atomic in asyncio).  Waiting
   tasks re-check self._s3 under the lock and return early if a peer
   already did the setup.

Also guard against _sessions=None (newer aiobotocore sets it to None after
__aexit__ rather than leaving an empty dict) in the stale-session check to
prevent AttributeError on .values().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@martindurant martindurant left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I have wrapped my head around it now. It's a pity that indenting such a big block messes us the look of the diff so badly; it's only the bit with awaits that needs to go inside the lock block correct? Maybe the kwargs can be worked out beforehand with the same indentation as before.

Comment thread s3fs/tests/test_s3fs.py
Comment thread s3fs/core.py
if not hasattr(self, "_setup_lock"):
self._setup_lock = asyncio.Lock()
async with self._setup_lock:
# Re-check under the lock: a concurrent task may have set up the

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there was no await between creation of the lock and acquiring it, no other coroutine could have run. The only way the lock might have been acquired is in another thread - but asyncio threads are not thread-safe anyway (and all async code should be running in the one event loop on the one thread)

OK, so the model is, that one coroutine might have run to the awaits below (get_client, __aenter__) while this coroutine is waiting for the lock?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly right. A coroutine can yield control at the await get_client(...) and await __aenter__() calls below the lock, allowing a sibling coroutine (e.g. launched via asyncio.gather) to also enter set_session() while the first is suspended. Without the lock they'd each create a separate _s3creator, and when the first resumes and replaces self._s3creator, it calls __aexit__ on the stale one — tearing down a session the sibling is still actively using.

Comment thread s3fs/core.py Outdated
Comment on lines +618 to +619
if not hasattr(self, "_setup_lock"):
self._setup_lock = asyncio.Lock()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to use self._setup_lock = None in __init__.
Note that we don't remove the lock - should it be made in __init__ instead?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on declaring it in __init__ — happy to add self._setup_lock = None there for clarity. The reason the actual asyncio.Lock() creation stays lazy rather than moving entirely to __init__ is that AbstractFileSystem uses instance caching via __new__, so an S3FileSystem instance may be constructed outside any async context and potentially reused across different event loops. Creating the Lock lazily inside set_session() (which is always called from within an async context) ensures it's bound to the right loop. On Python 3.10+ this is less of a concern since locks no longer bind at construction time, but keeping it lazy is the safer cross-version choice.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Python 3.10+ this is less of a concern

We don't support 3.9 any more

josh-ag2 and others added 2 commits June 2, 2026 16:15
- Move asyncio.Lock() creation from lazy hasattr check into __init__,
  valid now that Python 3.9 is no longer supported and locks no longer
  bind to the event loop at construction time.
- Extend the stale-session check to treat sessions=None as stale.
  aiobotocore 3.x sets _sessions=None after __aexit__, whereas 2.x
  leaves a dict of closed sessions; without this fix test_session_close
  fails on aiobotocore 3.x because the second asyncio.run reuses a
  closed creator that is never refreshed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Only the awaits, session creation, and state mutations need to be
inside the lock block.  The kwargs computation is pure synchronous
reads that can sit at the original indentation level, keeping the
diff readable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@martindurant

Copy link
Copy Markdown
Member

Can you check with #1020 - does that also offer the right guard now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants