fix(nfs): upgrade read-only handle on WRITE instead of returning STALE#178
Merged
Conversation
If macOS (or any other NFSv3 client) issues a READ before a WRITE on the same file — which is the common case via stat/`ls` populating the attribute cache — `read()` opens a Lazy/read-only handle and pools it. The subsequent WRITE handler peeks the pool, finds the read-only handle, calls `virtual_fs.write()` which returns EBADF, and `errno_to_nfs` maps that to NFS3ERR_STALE. macOS NFS treats STALE on WRITE as a hard failure and silently discards the pending writes from its buffer — `dd` reports success but the bytes never reach the server. `fsync(2)` on the file later returns ESTALE. Fix: on EBADF, evict the read-only handle and open writable, then retry the write. Mirrors the existing EBADF retry already in `read()`.
Three tests that exercise the bug fix: * `write_after_read_upgrades_handle_instead_of_returning_stale` — main regression test: prior READ pools a Lazy handle, then WRITE must upgrade rather than surface NFS3ERR_STALE. * `second_write_reuses_writable_handle` — fast path remains the fast path after the upgrade (no churn on subsequent writes). * `write_without_prior_read_opens_writable_directly` — slow path works standalone when nothing is pooled. Verified by reverting `nfs.rs::write()` to the pre-fix version: tests 1 and 3 fail with NFS3ERR_STALE (test 2 panics before it can run).
Contributor
POSIX Compliance (pjdfstest) |
Contributor
Benchmark Results |
e78465c to
9f6943b
Compare
Codex review pointed out a concrete race in the slow path of the EBADF upgrade: two concurrent NFS WRITE RPCs on the same ino can both peek a read-only handle, both get EBADF, and both fall through to the slow path. They open distinct writable handles fh_A and fh_B (serialized by virtual_fs's per-ino staging lock). Writer A inserts fh_A into the pool; writer B's `insert_handle` then releases fh_A as `replaced`. Writer A's subsequent pwrite hits EBADF on the now-released fh_A and maps back to NFS3ERR_STALE — the very silent-data-loss this code path exists to prevent. Reorder the slow path so the fresh fh is used PRIVATELY for the write, then published to the pool only on success. Nothing else can see the fh until its pwrite has committed to the staging file, so no other task can release it out from under us. Removes the race without introducing a per-inode mutex (the simpler-than-locking alternative suggested by Codex). Also: in the fast-path EBADF branch, remove the stale pool entry before re-entering, mirroring the analogous EBADF retry in `read()`. Pre-fix the entry survived `evict_handle` (which only releases the VFS handle), so a concurrent caller could peek and use a freshly-released fh. Guarded by a `peek == Some(fh)` check to avoid removing a different fh installed by a concurrent successful upgrader.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns
EBADFonvirtual_fs.write, whicherrno_to_nfsmaps toNFS3ERR_STALE. macOS NFS treats STALE on WRITE as fatal and silently flushes its write buffer —ddreports success but the bytes never reach the server.fsync(2)later returns ESTALE.Fix
nfs.rs::write():peek == Some(fh)so a concurrent successful upgrader isn't clobbered), evict, fall through.open(writable=true)→pwrite→ only THENinsert_handle. The freshly-opened fh stays private to this task until its write commits. No other task can release it viainsert_handle'sreplacedeviction while we're between insert and pwrite — because we don't insert until after the pwrite is done.Mirrors the existing EBADF retry pattern in
read()(nfs.rs:181-200).Reproducer (pre-fix)
Server log shows:
End-to-end validation (post-fix, macOS NFS mount)
Sequential read+write:
8 concurrent dd at distinct offsets on same file (was previously hangbait):
Python fsync:
Concurrent-write race (worth calling out)
An earlier draft of this PR added a per-inode
tokio::sync::Mutexaround the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: publish to the pool only after the first write succeeds. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window — no mutex needed, race closed by invariant.Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF:
virtual_fs.open(writable=true)is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B.replaced=fh_A→ evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is fixing.Relation to other PRs
Tests
write_after_read_upgrades_handle_instead_of_returning_stale— main regression test (READ then WRITE).second_write_reuses_writable_handle— fast path is reused after upgrade.write_without_prior_read_opens_writable_directly— slow path standalone.write()body to main's version: tests fail withNFS3ERR_STALE. Tests are load-bearing for the regression.cargo test --features fuse,nfs --lib→ 342/342 pass.cargo clippy --features nfs --all-targets -- -D warnings→ clean.XciD/hf-mount-testbucket on macOS NFS as described above.