nixos-tests: disable SQLite WAL to prevent SIGBUS in CI #1615
Open
amaanq wants to merge 1 commit into NixOS:master
SQLite in WAL mode mmaps a shared memory file that can fault under concurrent access, which kills nix with SIGBUS. This has occasionally been causing spurious CI failures with no useful logs. Disabling WAL eliminates the shared memory file entirely.
Problem
SQLite in WAL mode mmaps a shared memory file that can fault under concurrent
access, which kills nix with SIGBUS.
Solution
Disabling WAL eliminates the shared memory file entirely, so the fault can no longer occur.
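To make the effect concrete, here is a minimal sketch using Python's standard `sqlite3` module rather than Nix itself. It only illustrates the SQLite behaviour described above: a connection in WAL mode creates a `-shm` file next to the database, while a rollback-journal mode such as DELETE does not. The helper name `shm_exists_after_write` and the file name `demo.sqlite` are made up for the illustration.

```python
import os
import sqlite3
import tempfile


def shm_exists_after_write(journal_mode: str) -> bool:
    """Write one row using the given journal mode and report whether
    SQLite created a '<db>-shm' shared memory file while the
    connection was still open."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical file name
        conn = sqlite3.connect(path)
        try:
            # The journal mode must be chosen outside of a transaction.
            conn.execute(f"PRAGMA journal_mode = {journal_mode}")
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.execute("INSERT INTO t VALUES (1)")
            conn.commit()
            return os.path.exists(path + "-shm")
        finally:
            conn.close()


if __name__ == "__main__":
    print("WAL    creates -shm:", shm_exists_after_write("WAL"))
    print("DELETE creates -shm:", shm_exists_after_write("DELETE"))
```

With no `-shm` file there is nothing for the process to mmap, which is the property this PR relies on.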
Additional Context
This was the cause of the spurious CI failures we've been seeing recently (1 2)
I spammed CI in my fork with a SIGBUS handler to actually find the root cause of this, see here :) https://github.com/amaanq/hydra/actions/runs/23701271798/job/69045258619?pr=1#step:5:8309
Relevant snippet
Note that it is impossible to debug these without such a handler, but I don't think this should be installed by default here.
Alternatives Considered
I'd considered setting exclusive locking mode for SQLite in Nix upstream via PRAGMA locking_mode = EXCLUSIVE, as it keeps the WAL-index in heap memory rather than in an mmapped shared memory file. The huge downside is that each process then locks the database, since the index cannot be shared across multiple processes. In practice I doubt it would have much of an effect, as I imagine most users aren't running many concurrent Nix processes that need database access, but in Hydra we ran into this SIGBUS due to our concurrent VM tests spawning multiple nix processes.

I'm not particularly happy with this solution. I'd also thought of maybe just increasing the VM's memory to 2GB, but that isn't guaranteed to fix it, depending on whether the faults were caused by the file being truncated or by the kernel evicting the backing page.
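The trade-off of the exclusive locking alternative can be sketched the same way. This is again plain Python `sqlite3`, not Nix code, and the names `exclusive_wal_demo` and `demo.sqlite` are hypothetical. It assumes the documented SQLite behaviour that setting exclusive locking before the first WAL access avoids the shared memory file, at the cost of locking other connections out.

```python
import os
import sqlite3
import tempfile


def exclusive_wal_demo() -> tuple[bool, bool]:
    """Return (shm_file_exists, second_connection_blocked) for a WAL
    database opened with locking_mode = EXCLUSIVE."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")  # hypothetical file name
        first = sqlite3.connect(path)
        # timeout=0 makes the second connection fail immediately
        # instead of waiting for the lock to be released.
        second = sqlite3.connect(path, timeout=0)
        try:
            # Exclusive mode has to be set before WAL is first used.
            first.execute("PRAGMA locking_mode = EXCLUSIVE")
            first.execute("PRAGMA journal_mode = WAL")
            first.execute("CREATE TABLE t (x INTEGER)")
            first.execute("INSERT INTO t VALUES (1)")
            first.commit()
            shm_exists = os.path.exists(path + "-shm")
            try:
                second.execute("SELECT count(*) FROM t").fetchone()
                blocked = False
            except sqlite3.OperationalError:
                # Typically reported as "database is locked".
                blocked = True
            return shm_exists, blocked
        finally:
            second.close()
            first.close()


if __name__ == "__main__":
    shm_exists, blocked = exclusive_wal_demo()
    print("-shm file present:", shm_exists)
    print("second connection blocked:", blocked)
```

The second connection here plays the role of another nix process: it cannot even read while the first one holds the database, which is why this option is a poor fit for concurrent VM tests.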