Distributed-systems stability test suite + single-voter re-election fix #12
Open
carlhoerberg wants to merge 4 commits into
Single-voter Raft clusters could not re-elect themselves after restart:
recover_state restores term/vote/peers but @role defaults to Follower,
and start_pre_vote/become_candidate did not short-circuit on self-vote
quorum, so a recovered node with peers={self} stayed Follower forever
(subsequent propose() calls returned false).

Add a quorum check after the self-vote in both functions. It mirrors
the existing pattern in handle_pre_vote_response /
handle_request_vote_response and only fires when quorum_size==1, so
multi-voter clusters are unaffected.

The bug was discovered while executing a new project-wide stability
test plan (docs/testing-plans/raft-cr-project-stability.md: 40 claims,
35 hypotheses, 17 scenarios with adequacy/confidence sections). Six
in-process scenarios are included as auto-generated specs:

- S02 spec/raft/simulation/at_most_one_leader_spec.cr: deterministic
  sim, 200 seeds × 200 steps; structural invariant "at most one leader
  per term"; 40k invariant checks, zero violations.
- S05 spec/raft/crash_recovery/persist_state_durability_spec.cr
  (+ durability_helper.cr): 30 SIGKILL iterations at random offsets;
  asserts no leftover raft_meta.tmp, commit_index ≤ log.last_index,
  term ≥ 1 post-bootstrap, self in peers, voted_for ∈ {nil, 1_u64}.
- S10 spec/raft/apply_invariants_spec.cr: apply-once + in-order across
  a snapshot+restart cycle; includes a regression test for the
  single-voter re-election fix above.
- S12 spec/fuzz/message_from_io_fuzz_spec.cr: 5,000 random-byte fuzz
  iters against Message.from_io / LogEntry.from_io / Peer.from_io;
  asserts only typed errors, never panics; explicit C31 bound test.
- S15 spec/raft/server_fairness_spec.cr: documents the
  apply()-blocks-group operational constraint (Node calls
  StateMachine#apply synchronously on the driver fiber).
- S16 spec/raft/perf/replication_slo_spec.cr: establishes a
  single-voter propose-latency baseline (p50=25.7µs, p99=354.9µs at
  ~32B payload).

All 108 examples pass (96 prior + 11 new + 1 F1 regression).

The remaining 11 chaos-class scenarios
(S01/S03/S04/S06/S07/S08/S09/S11/S13/S14/S17) remain INCONCLUSIVE-env
pending Jepsen / dm-flakey / multi-node Docker infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
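
The self-vote quorum check described in this commit can be pictured with a minimal Python sketch. It is not the project's Crystal code: the Node class, the check_self_quorum flag, and the omission of the pre-vote phase are all illustrative assumptions, and only the names quorum_size, become_candidate and propose echo the commit message.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class Node:
    """Hypothetical, heavily simplified model of a Raft voter."""
    node_id: int
    peers: Set[int]                      # voter ids, including this node
    term: int = 0
    voted_for: Optional[int] = None
    role: Role = Role.FOLLOWER           # recovery restores term/vote/peers, not the role
    votes: Set[int] = field(default_factory=set)

    def quorum_size(self) -> int:
        return len(self.peers) // 2 + 1

    def become_candidate(self, check_self_quorum: bool = True) -> None:
        self.role = Role.CANDIDATE
        self.term += 1
        self.voted_for = self.node_id
        self.votes = {self.node_id}
        # A lone voter already holds a quorum after its self-vote, and no
        # vote response will ever arrive to trigger the usual quorum check.
        if check_self_quorum and len(self.votes) >= self.quorum_size():
            self.role = Role.LEADER

    def propose(self, _entry: bytes) -> bool:
        # Only a leader accepts proposals.
        return self.role is Role.LEADER
```

With peers={1} the node becomes leader as soon as it votes for itself. With the check disabled (the assumed pre-fix behaviour) it never leaves the candidate role in this model, so propose() keeps returning False. A three-voter node is unaffected because one vote is below its quorum of two.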
Config#random_seed (UInt64?, default nil) lets callers fix the seed used
by Node#random_election_timeout. When nil, current behaviour applies:
Random.new (OS entropy). When set, Node uses Random.new(seed), making
election-timeout choices reproducible.

Closes F2 from the project-wide stability test plan: the S02
deterministic-simulation arm (200 seeds × 200 steps) drove the protocol
via a seeded RNG but Node's own RNG was unseeded, so consecutive runs
reported different sanity counters (199/200 vs 198/200
seeds-saw-a-leader). The structural at-most-one-leader-per-term
invariant still held in both runs; the variation was a sign that the
"deterministic" framing was incomplete.

After this change, two back-to-back runs produce bit-identical S02
sanity counters (verified: seeds_that_saw_a_leader=198/200,
max_term_observed=3, invariant_checks=40000, both runs). Per-node seeds
derive from the test seed as seed * 1000 + node_id.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
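
The seeding scheme can be sketched in Python as follows. This is an illustration rather than the Crystal implementation: ElectionTimer, the 150-300 ms range, and the helper names are assumptions, and only the derivation seed * 1000 + node_id comes from the commit message.

```python
import random
from typing import Dict, Iterable, List, Optional

# Illustrative timeout range in milliseconds; not taken from the project.
ELECTION_TIMEOUT_MS = (150, 300)


def node_seed(test_seed: int, node_id: int) -> int:
    """Per-node seed derivation quoted in the commit message."""
    return test_seed * 1000 + node_id


class ElectionTimer:
    """Hypothetical stand-in for a node's election-timeout RNG."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # No seed: fall back to an RNG seeded from OS entropy.
        # With a seed: every run draws the same sequence.
        self._rng = random.Random(seed) if seed is not None else random.Random()

    def next_timeout_ms(self) -> int:
        low, high = ELECTION_TIMEOUT_MS
        return self._rng.randint(low, high)


def timeouts(test_seed: int, node_ids: Iterable[int], draws: int = 5) -> Dict[int, List[int]]:
    """Draw a few timeouts per node from timers seeded off one test seed."""
    result = {}
    for node_id in node_ids:
        timer = ElectionTimer(node_seed(test_seed, node_id))
        result[node_id] = [timer.next_timeout_ms() for _ in range(draws)]
    return result
```

Calling timeouts(42, [1, 2, 3]) twice returns identical dictionaries, which is the property the S02 sanity counters rely on. Deriving a distinct seed per node keeps the nodes from drawing the same timeout sequence as each other.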
The existing workload uses jepsen.independent/concurrent-generator with
an infinite range of keys, so even a 15-second test produces ~370-390
independent per-key histories. Knossos's :linear algorithm is
exponential and OOMs on every key when many are analyzed in parallel
under a 12 GB JVM heap (verified: 4 consecutive runs all reported
:valid? :unknown :cause :out-of-memory on virtually every key, with
empty :failures lists — no anomaly found, but no verdict either).
:competition runs Knossos's :linear and :wgl searches side by side and
returns whichever finishes first, so a key that exhausts the heap under
:linear alone can still get a verdict. Both searches are exact
linearizability checks; neither is an approximation. Switching this one
word turns the chaos arm of the project test plan (S01
linearizable_writes_under_partition_and_crash) from
INCONCLUSIVE-oracle-too-weak into a verifiable PASS-hardening:
  --time-limit 15 --concurrency 5 -e JAVA_TOOL_OPTIONS=-Xmx12g
  → 371/371 keys :valid? true
  → :failures []
  → exit 0, "Everything looks good!"
Nemesis (partition-random-halves, 5-second window isolating
{n1,n2} | {n3,n4,n5}) landed cleanly per the Jepsen log; no anomaly
observed under or after the partition.
Caveat: the pass covers one 15-second run with a single partition
window. The reviewer should treat this as hardening evidence for "no
per-key linearizability violation found under partition-random-halves
at this scale", not as proof of correctness. To get a verdict from
:linear alone, the workload would need rework to use a single shared
register (one or a few keys) so :linear is tractable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
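
To see why an exhaustive per-key search grows so quickly, here is a naive single-register linearizability check in Python. It is not Knossos's algorithm: the Op type and the linearizable function are hypothetical, every operation is assumed to have completed, and timestamps are assumed distinct.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Op:
    kind: str               # "write" or "read"
    value: Optional[int]    # value written, or value the read returned
    invoke: int             # invocation timestamp
    complete: int           # completion timestamp


def linearizable(history: Sequence[Op], initial: Optional[int] = None) -> bool:
    """Try every order that respects real time and register semantics."""
    ops = tuple(history)

    def search(remaining: FrozenSet[int], state: Optional[int]) -> bool:
        if not remaining:
            return True
        earliest_complete = min(ops[i].complete for i in remaining)
        for i in remaining:
            op = ops[i]
            # Real-time order: an op cannot go next if another pending op
            # completed before this one was even invoked.
            if op.invoke > earliest_complete:
                continue
            if op.kind == "read":
                if op.value != state:
                    continue        # the read would observe the wrong value
                next_state = state
            else:
                next_state = op.value
            if search(remaining - {i}, next_state):
                return True
        return False

    return search(frozenset(range(len(ops))), initial)
```

The search branches on every operation that could legally go next, so the worst case is factorial in the number of concurrent operations per key. Running a few hundred such searches in parallel is one plausible way to exhaust a heap even when no single history is anomalous.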
In-process variant of S13 from the project test plan. Drives the queue
example through a synthetic partition that isolates one follower
(deliver_all_except skips messages to/from the partitioned node),
publishes a stream of producer-tagged messages, heals the partition,
and verifies:
- per-producer FIFO: each producer's tag sequence appears in
  publish order in the drained stream (C33)
- cross-replica state equivalence: all three replicas' snapshot bytes
  are byte-equal after heal, meaning they converged to the same queue
  state (C34)
- exactly-once consume bridge: every published tag drains exactly
  once via the bridge on the leader (C35)
- no-commit-without-quorum: a minority-partitioned leader can append
  entries but commit_index cannot advance until heal
Workload: 3 producers × 5 messages × 2 phases = 30 messages total.
Phase 1 with full delivery, phase 2 with node 3 partitioned, then heal
and catch up. After catch-up, depth on all 3 replicas = 30 and snapshot
bytes match exactly.
This is the in-process arm of S13 (synthetic partition; no TCP-layer
behavior tested). The Jepsen-driven arm would still need a
queue-specific Clojure workload plus the existing podman/SELinux/checker
fixes applied to a new compose stack, which is out of scope for this
session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
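
The first three checks listed in this commit can be sketched as small Python predicates over a drained stream. This is an illustration only: the function names, the tag format, and the round-robin drain order are assumptions and do not come from the spec file.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Tagged = Tuple[int, str]   # (producer id, message tag)


def build_workload(producers: int = 3, per_phase: int = 5, phases: int = 2) -> Dict[int, List[str]]:
    """Tags per producer in publish order: 3 x 5 x 2 = 30 messages by default."""
    return {
        p: [f"p{p}-ph{ph}-m{m}" for ph in range(1, phases + 1) for m in range(per_phase)]
        for p in range(1, producers + 1)
    }


def per_producer_fifo(published: Dict[int, List[str]], drained: Sequence[Tagged]) -> bool:
    """Each producer's tags must never appear out of publish order."""
    position = {(p, tag): i for p, tags in published.items() for i, tag in enumerate(tags)}
    last_seen: Dict[int, int] = {}
    for producer, tag in drained:
        idx = position.get((producer, tag))
        if idx is None or idx < last_seen.get(producer, -1):
            return False
        last_seen[producer] = idx
    return True


def exactly_once(published: Dict[int, List[str]], drained: Sequence[Tagged]) -> bool:
    """Every published tag drains once: no loss, no duplicate."""
    expected = Counter((p, tag) for p, tags in published.items() for tag in tags)
    return Counter(drained) == expected


def replicas_converged(snapshots: Sequence[bytes]) -> bool:
    """All replicas hold byte-identical snapshots."""
    return len(set(snapshots)) == 1


published = build_workload()
# Interleave producers round-robin, keeping each producer's own order.
drained = [(p, published[p][i]) for i in range(10) for p in published]
```

Keeping FIFO and exactly-once as separate predicates means a duplicate delivery fails only the exactly-once check, while a reordering fails only the FIFO check, so a failure points at one property.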
Follow-up to #6, split out so that PR stays focused on thread-safety. Stacked on thread-safety; review/merge that first.

What's here
Distributed-systems test coverage plus one src bug fix discovered while running it.
- src/raft/node.cr: a single-node cluster could not re-elect itself after restart. recover_state leaves role=Follower and start_pre_vote/become_candidate never short-circuited on the self-vote quorum, so propose() stayed stuck returning false. Added a quorum_size check after the self-vote in both paths (only fires when quorum_size==1; multi-voter clusters unaffected). Regression test in S10.
- docs/testing-plans/raft-cr-project-stability.md: 40 claims, 35 hypotheses, 17 scenarios.
- Message.from_io/LogEntry.from_io/Peer.from_io: 5,000 iters, typed errors only
- Config.random_seed to seed the election RNG for deterministic simulation.
- :linear → :competition.

Testing
crystal spec: 108 examples, 0 failures (96 baseline + 12 new).

🤖 Generated with Claude Code