fix(nfs): stable filehandles across server restarts#177
Closed
XciD wants to merge 1 commit into
Closed
Conversation
The default nfsserve `id_to_fh` embeds `SystemTime::now()` at startup as the generation number, so every restart invalidates every previously- issued handle. macOS NFS client caches filehandles across `umount` and remount of the same mount point, so after `umount → kill nfs server → restart → mount`, the kernel keeps using the cached handles. The server returns NFS3ERR_STALE, and the macOS client *silently drops the write* (WRITE RPC never reaches the server, dd reports success, an explicit fsync(2) on the file returns ESTALE). Symptom in practice: opening an existing file for write after a remount on macOS appears to succeed but the data never lands in the bucket. Override `id_to_fh` / `fh_to_id` on NFSAdapter to use a generation number derived from the mount source identifier (`bucket/<id>` or `<type>/<repo>/<rev>`) via FNV-1a. Same source → same gen across restarts, so cached client handles stay valid. Different sources → different gens, so a handle from bucket A is properly rejected when mounting bucket B. Std's `DefaultHasher` is per-process randomized, which would defeat the whole point — hence the inline FNV-1a (no new deps). Tests cover: determinism, cross-source rejection, malformed handles, and the simulated-restart round-trip that locks in the regression fix.
Contributor
POSIX Compliance (pjdfstest) |
Contributor
Benchmark Results |
Member
Author
|
Closing — root cause analysis was wrong. I attributed the symptom to stale filehandles cached by macOS NFS client across umount. After actually testing this fix on macOS, the bug still reproduces. Reading the log carefully revealed the real mechanism:
The bug is server-side: `nfs.rs::write()` on main doesn't upgrade a read-only pool handle when WRITE arrives. PR #41 already adds the upgrade path (`Err(EBADF) => evict + open(writable=true) + retry`). So this issue is fixed once #41 lands. The generation-number-from-source change in this PR is orthogonal and doesn't address anything currently observable. Could be useful defensively, but not as a fix for this bug — leaving it out. |
XciD
added a commit
that referenced
this pull request
May 22, 2026
#178) ## Summary NFS WRITEs on macOS were silently dropping bytes when the file had been READ first. Root cause is server-side: a read-only pooled handle from a prior NFS READ returns `EBADF` on `virtual_fs.write`, which `errno_to_nfs` maps to `NFS3ERR_STALE`. macOS NFS treats STALE on WRITE as fatal and silently flushes its write buffer — `dd` reports success but the bytes never reach the server. `fsync(2)` later returns ESTALE. ## Fix `nfs.rs::write()`: 1. **Fast path EBADF**: handle was opened read-only. Remove the now-stale entry from the pool (guarded by `peek == Some(fh)` so a concurrent successful upgrader isn't clobbered), evict, fall through. 2. **Slow path**: `open(writable=true)` → `pwrite` → only THEN `insert_handle`. The freshly-opened fh stays private to this task until its write commits. No other task can release it via `insert_handle`'s `replaced` eviction while we're between insert and pwrite — because we don't insert until after the pwrite is done. Mirrors the existing EBADF retry pattern in `read()` (nfs.rs:181-200). ## Reproducer (pre-fix) ```bash hf-mount-nfs bucket X/y /mnt ls /mnt # triggers READ → pools Lazy handle dd if=/dev/urandom of=/mnt/existing-file bs=1M count=1 seek=100 conv=notrunc,fsync # dd reports success; file in bucket is unchanged. python3 -c "import os; fd=os.open('/mnt/existing-file', os.O_RDWR); os.fsync(fd)" # OSError: [Errno 70] Stale NFS file handle ``` Server log shows: ``` open: ino=2, writable=false ← from the READ read: fh=1, offset=... write: ino=2, fh=1, offset=104857600, len=1048576 ← EBADF returned, mapped to STALE (no further server activity; macOS dropped the write buffer) ``` ## End-to-end validation (post-fix, macOS NFS mount) Sequential read+write: ``` write: ino=4, fh=1 → EBADF release: fh=1 ← pool entry removed open: writable=true ← slow path write: ino=4, fh=2, offset=0 ← succeeds on fresh fh ``` 8 concurrent dd at distinct offsets on same file (was previously hangbait): ``` all 8 dd done in .009s 8 writes server-side on fh=2 (fast path reused after initial upgrade) zero release intempestif, zero NFS3ERR_STALE ``` Python fsync: ``` fsync OK ✓ # before fix: [Errno 70] Stale NFS file handle ``` ## Concurrent-write race (worth calling out) An earlier draft of this PR added a per-inode `tokio::sync::Mutex` around the slow path. Adversarial review (Codex) pointed out the correct, lighter defense: **publish to the pool only after the first write succeeds**. With insert_handle deferred until after pwrite, the fh is unreachable by other tasks during its critical window — no mutex needed, race closed by invariant. Concretely, if two NFS WRITE RPCs on the same ino both peek a Lazy handle and both hit EBADF: - Both reach the slow path; `virtual_fs.open(writable=true)` is internally serialized by VirtualFs's per-ino staging lock, so they produce distinct fh_A and fh_B. - Pre-fix (insert-then-write): writer A inserts fh_A → writer B inserts fh_B, `replaced=fh_A` → evict_handle releases fh_A. Writer A then pwrites against fh_A → EBADF → STALE. Same silent-data-loss this PR is fixing. - Post-fix (write-then-insert): writer A pwrites to fh_A in private (no one else knows about it), then inserts. Writer B does the same with fh_B. Both pwrites succeed; the loser of the insert race has its fh released by the winner's insert, but its bytes are already in the staging file. ## Relation to other PRs - **PR #41 (sparse writes)** contains the same EBADF→upgrade fix in its nfs.rs. This PR isolates the change so it can land independently. When #41 merges, this PR becomes a no-op merge. The write-before-publish reordering here will need to be ported into #41's nfs.rs (the same race exists there). - **PR #177 (stable filehandles)** was an earlier misdiagnosis attributing the symptom to macOS client-side filehandle caching across umount. Closed. ## Tests - `write_after_read_upgrades_handle_instead_of_returning_stale` — main regression test (READ then WRITE). - `second_write_reuses_writable_handle` — fast path is reused after upgrade. - `write_without_prior_read_opens_writable_directly` — slow path standalone. - Verified by reverting just the `write()` body to main's version: tests fail with `NFS3ERR_STALE`. Tests are load-bearing for the regression. - `cargo test --features fuse,nfs --lib` → 342/342 pass. - `cargo clippy --features nfs --all-targets -- -D warnings` → clean. - End-to-end validated against `XciD/hf-mount-test` bucket on macOS NFS as described above.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Override
id_to_fh/fh_to_idinNFSAdapterso the embedded generation number derives from the mount source identifier (bucket/<id>or<type>/<repo>/<rev>) instead ofnfsserve's defaultSystemTime::now()at startup. Same source = same generation across restarts, so cached client handles stay valid afterumount+remount.Bug
On macOS, this sequence silently loses writes:
```
hf-mount-nfs bucket X/y /tmp/m # session #1, write a file
umount /tmp/m
^C the hf-mount-nfs process
hf-mount-nfs bucket X/y /tmp/m # session #2, same mount point
dd of=/tmp/m/existing-file ... # reports success
```
But:
Root cause
`nfsserve`'s default `id_to_fh` embeds `SystemTime::now()` at server startup as a generation number. Every restart of the server invalidates every previously-issued filehandle.
The macOS NFS kernel client caches filehandles across `umount` of the same mount point (as an attribute-cache optimization). When the server comes back with a new generation number, the cached handles get `NFS3ERR_STALE` on the next operation. The kernel then silently discards pending writes without surfacing an error to userspace.
By design, each end is defensible in isolation:
Their combination produces silent data loss.
Fix
Derive the generation number from the source identifier rather than the startup time:
```rust
let fh_gen = fnv1a_64(virtual_fs.source_identifier().as_bytes());
```
FNV-1a is used inline (6 lines, no new dep) because std's `DefaultHasher` is randomized per-process, which would defeat the purpose.
Trade-off
Inode numbers within a source are not guaranteed stable across restarts (hf-mount allocates them in tree-listing order). If the listing changes between mounts, a client cached handle may resolve to a different file. The risk is bounded: clients re-LOOKUP on miss, and the issue only matters for handles held across a restart. We accept this slight reduction vs upstream's strict "every handle expires on restart" because the alternative is silent write loss.
Tests
`cargo test --lib --features fuse,nfs` → 344/344 pass.