Distributed-systems stability test suite + single-voter re-election fix #12

Open: carlhoerberg wants to merge 4 commits into thread-safety from dist-testing

@carlhoerberg

Contributor

Follow-up to #6, split out so that PR stays focused on thread-safety. Stacked on thread-safety — review/merge that first.

What's here

Distributed-systems test coverage plus one src bug fix discovered while running it.

  • Single-voter re-election fix (src/raft/node.cr): a single-node cluster could not re-elect itself after restart. recover_state leaves role=Follower, and neither start_pre_vote nor become_candidate short-circuited on the self-vote quorum, so propose() kept returning false. Added a quorum_size check after the self-vote in both paths (it only fires when quorum_size==1, so multi-voter clusters are unaffected). Regression test in S10.
  • Stability test plan (docs/testing-plans/raft-cr-project-stability.md): 40 claims, 35 hypotheses, 17 scenarios.
  • Six in-process scenarios as specs:
    • S02 deterministic sim — at-most-one-leader-per-term, 200 seeds × 200 steps, 40k invariant checks
    • S05 crash-recovery durability — 30 SIGKILL iterations at random offsets
    • S10 apply-once + in-order across snapshot+restart (incl. single-voter regression)
    • S12 fuzz Message.from_io / LogEntry.from_io / Peer.from_io — 5,000 iters, typed errors only
    • S15 server fairness — documents apply()-blocks-group constraint
    • S16 replication SLO baseline
  • Config.random_seed to seed the election RNG for deterministic simulation.
  • S13 queue per-producer FIFO under partition + heal.
  • Jepsen: register checker switched from :linear to :competition.

Testing

crystal spec: 108 examples, 0 failures (96 baseline + 12 new).

🤖 Generated with Claude Code

carlhoerberg and others added 4 commits May 27, 2026 21:11
Single-voter Raft clusters could not re-elect themselves after restart:
recover_state restores term/vote/peers but @role defaults to Follower, and
start_pre_vote/become_candidate did not short-circuit on self-vote quorum,
so a recovered node with peers={self} stayed Follower forever (subsequent
propose() calls returned false). Add a quorum check after the self-vote in
both functions — mirrors the existing pattern in handle_pre_vote_response /
handle_request_vote_response, only fires when quorum_size==1, so multi-voter
clusters are unaffected.
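
The short-circuit described above can be sketched as follows. This is a minimal Python illustration, not the Crystal code in src/raft/node.cr: the `Node` and `Role` types and their fields are hypothetical, and only the names `quorum_size` and `become_candidate` are taken from the description.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class Node:
    """Hypothetical stand-in for a Raft node; field names are assumptions."""
    node_id: int
    peers: set                      # voter ids, including this node
    term: int = 0
    role: Role = Role.FOLLOWER      # the role a recovered node starts in
    votes: set = field(default_factory=set)

    @property
    def quorum_size(self) -> int:
        # Strict majority of the voter set.
        return len(self.peers) // 2 + 1

    def become_candidate(self) -> None:
        self.term += 1
        self.role = Role.CANDIDATE
        self.votes = {self.node_id}  # the self-vote
        # Without this check a single-voter node would wait forever for
        # vote responses that no other peer exists to send.
        if len(self.votes) >= self.quorum_size:
            self.role = Role.LEADER


solo = Node(node_id=1, peers={1})
solo.become_candidate()          # self-vote alone reaches quorum

trio = Node(node_id=1, peers={1, 2, 3})
trio.become_candidate()          # still needs one more vote
```

With one voter the self-vote already meets the quorum of 1, so the node becomes leader immediately; with three voters the quorum is 2 and the check does not fire.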

The bug was discovered while executing a new project-wide stability test
plan (docs/testing-plans/raft-cr-project-stability.md — 40 claims, 35
hypotheses, 17 scenarios with adequacy/confidence sections). Six in-process
scenarios are included as auto-generated specs:

  S02  spec/raft/simulation/at_most_one_leader_spec.cr   — deterministic
       sim, 200 seeds × 200 steps; structural invariant "at most one leader
       per term"; 40k invariant checks, zero violations.
  S05  spec/raft/crash_recovery/persist_state_durability_spec.cr
       (+ durability_helper.cr) — 30 SIGKILL iterations at random offsets;
       asserts no leftover raft_meta.tmp, commit_index ≤ log.last_index,
       term ≥ 1 post-bootstrap, self in peers, voted_for ∈ {nil, 1_u64}.
  S10  spec/raft/apply_invariants_spec.cr — apply-once + in-order across
       a snapshot+restart cycle; includes a regression test for the
       single-voter re-election fix above.
  S12  spec/fuzz/message_from_io_fuzz_spec.cr — 5,000 random-byte fuzz
       iters against Message.from_io / LogEntry.from_io / Peer.from_io;
       asserts only typed errors, never panics; explicit C31 bound test.
  S15  spec/raft/server_fairness_spec.cr — documents the apply()-blocks-
       group operational constraint (Node calls StateMachine#apply
       synchronously on the driver fiber).
  S16  spec/raft/perf/replication_slo_spec.cr — establishes a single-voter
       propose-latency baseline (p50=25.7µs, p99=354.9µs at ~32B payload).
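
The S02 invariant, "at most one leader per term", is simple to state as a check over observed (term, leader) pairs. The sketch below is a Python illustration under assumed names (`check_at_most_one_leader` and the observation format are hypothetical), not the code in at_most_one_leader_spec.cr.

```python
from collections import defaultdict


def check_at_most_one_leader(observations):
    """Return the terms in which more than one node was seen as leader.

    `observations` is an iterable of (term, node_id) pairs, one pair for
    each node that reported itself leader at some simulation step. The
    same pair may appear many times; only distinct node ids per term count.
    """
    leaders_by_term = defaultdict(set)
    for term, node_id in observations:
        leaders_by_term[term].add(node_id)
    # A violation is any term with two or more distinct leaders.
    return {term: ids for term, ids in leaders_by_term.items() if len(ids) > 1}


clean = check_at_most_one_leader([(1, 1), (1, 1), (2, 3)])
split_brain = check_at_most_one_leader([(1, 1), (1, 2), (2, 3)])
```

Running such a check after every simulation step is what makes the number of invariant checks the product of seeds and steps (200 × 200 = 40,000 in the description above).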

All 108 examples pass (96 prior + 11 new + 1 F1 regression). The other
11 chaos-class scenarios (S01/S03/S04/S06/S07/S08/S09/S11/S13/S14/S17)
remain INCONCLUSIVE-env pending Jepsen / dm-flakey / multi-node Docker
infrastructure.
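
The S12 property listed above (random bytes may be rejected, but only through a typed error) can be sketched as a small harness. This is a Python illustration with a made-up wire format: `decode_frame`, `DecodeError` and `MAX_PAYLOAD` are hypothetical and do not reflect the actual Message.from_io encoding.

```python
import random
import struct


class DecodeError(Exception):
    """The only exception type the decoder is allowed to raise."""


MAX_PAYLOAD = 1024  # assumed upper bound on payload length


def decode_frame(data: bytes):
    """Decode a toy frame: 1-byte kind, 4-byte big-endian length, payload."""
    if len(data) < 5:
        raise DecodeError("truncated header")
    kind, length = struct.unpack(">BI", data[:5])
    if kind not in (1, 2, 3):
        raise DecodeError(f"unknown kind {kind}")
    if length > MAX_PAYLOAD:
        raise DecodeError(f"length {length} exceeds bound")
    payload = data[5:5 + length]
    if len(payload) != length:
        raise DecodeError("truncated payload")
    return kind, payload


def fuzz(iterations=5000, seed=0):
    """Feed random byte strings to the decoder and count the outcomes.

    Any exception other than DecodeError propagates and fails the run.
    """
    rng = random.Random(seed)
    outcomes = {"decoded": 0, "rejected": 0}
    for _ in range(iterations):
        blob = bytes(rng.getrandbits(8) for _ in range(rng.randint(0, 64)))
        try:
            decode_frame(blob)
            outcomes["decoded"] += 1
        except DecodeError:
            outcomes["rejected"] += 1
    return outcomes


outcomes = fuzz()
```

The harness deliberately catches only the typed error, so an unexpected exception from the decoder surfaces as a test failure rather than being counted as a rejection.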

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Config#random_seed (UInt64?, default nil) lets callers fix the seed used
by Node#random_election_timeout. When nil, behaviour is unchanged: Node
uses Random.new (OS entropy). When set, Node uses Random.new(seed),
making election-timeout choices reproducible.

Closes F2 from the project-wide stability test plan: the S02 deterministic-
simulation arm (200 seeds × 200 steps) drove the protocol via a seeded
RNG but Node's own RNG was unseeded, so consecutive runs reported
different sanity counters (199/200 vs 198/200 seeds-saw-a-leader).
The structural at-most-one-leader-per-term invariant still held in both
runs; the variation was a sign that the "deterministic" framing was
incomplete.

After this change, two back-to-back runs produce bit-identical S02
sanity counters (verified: seeds_that_saw_a_leader=198/200,
max_term_observed=3, invariant_checks=40000, both runs). Per-node
seeds derive from the test seed as seed * 1000 + node_id.
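
The per-node seed derivation above can be sketched as follows. This is a Python illustration only; the helper names and the 150-300 ms timeout range are assumptions, and the real change lives in Config#random_seed and Node#random_election_timeout.

```python
import random


def node_rng(test_seed: int, node_id: int) -> random.Random:
    # Derivation quoted from the commit message: seed * 1000 + node_id.
    return random.Random(test_seed * 1000 + node_id)


def election_timeout_ms(rng: random.Random, low: int = 150, high: int = 300) -> int:
    # The range is an assumed example, not the project's configuration.
    return rng.randint(low, high)


def run(test_seed: int, node_ids=(1, 2, 3), draws: int = 5):
    """Draw a few election timeouts per node from per-node seeded RNGs."""
    rngs = {n: node_rng(test_seed, n) for n in node_ids}
    return {n: [election_timeout_ms(rngs[n]) for _ in range(draws)]
            for n in node_ids}


first = run(7)
second = run(7)
```

Two runs with the same test seed draw identical timeouts for every node, which is the property the back-to-back S02 runs rely on. Note that seed * 1000 + node_id only yields distinct per-node seeds while node ids stay below 1000.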

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing workload uses jepsen.independent/concurrent-generator with
an infinite range of keys, so even a 15-second test produces ~370-390
independent per-key histories. Knossos's :linear algorithm is
exponential and OOMs on every key when many are analyzed in parallel
under a 12 GB JVM heap (verified: 4 consecutive runs all reported
:valid? :unknown :cause :out-of-memory on virtually every key, with
empty :failures lists — no anomaly found, but no verdict either).

:competition is Knossos's competition mode: it runs the :linear and :wgl
searches in parallel and returns whichever finishes first. Both searches
are exhaustive, so a :valid? true verdict is as strong as one from
:linear alone; the gain is that :wgl often completes, within the same
heap, on histories where :linear runs out of memory. Switching this one
word turns the chaos arm of the project test plan (S01
linearizable_writes_under_partition_and_crash) from
INCONCLUSIVE-oracle-too-weak into a verifiable PASS-hardening:

  --time-limit 15 --concurrency 5 -e JAVA_TOOL_OPTIONS=-Xmx12g
  → 371/371 keys :valid? true
  → :failures []
  → exit 0, "Everything looks good!"

Nemesis (partition-random-halves, 5-second window isolating
{n1,n2} | {n3,n4,n5}) landed cleanly per the Jepsen log; no anomaly
observed under or after the partition.

Caveat: this is a single 15-second run with one partition window. The
reviewer should treat it as hardening evidence for "no per-key
linearizability violation found under partition-random-halves at this
scale", not as proof of correctness. To get verdicts from :linear
itself, the workload would need rework to use a single shared register
(one or a few keys) so that search is tractable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-process variant of S13 from the project test plan. Drives the queue
example through a synthetic partition that isolates one follower
(deliver_all_except skips messages to/from the partitioned node),
publishes a stream of producer-tagged messages, heals the partition,
and verifies:

  - per-producer FIFO: each producer's tag sequence appears in
    publish order in the drained stream (C33)
  - cross-replica state equivalence: all three replicas' snapshot bytes
    are byte-equal after heal — they converged to the same queue
    state (C34)
  - exactly-once consume bridge: every published tag drains exactly
    once via the bridge on the leader (C35)
  - no-commit-without-quorum: a minority-partitioned leader can append
    entries but commit_index cannot advance until heal

Workload: 3 producers × 5 messages × 2 phases = 30 messages total.
Phase 1 with full delivery, phase 2 with node 3 partitioned, then heal
and catch up. After catch-up, depth on all 3 replicas = 30 and snapshot
bytes match exactly.
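
The per-producer FIFO and exactly-once checks listed above can be sketched as follows. This is a Python illustration: representing tags as (producer, seq) pairs and the helper names are assumptions, not the spec's actual code. The workload shape (3 producers × 5 messages × 2 phases) is taken from the description.

```python
from collections import Counter, defaultdict


def per_producer(tags):
    """Group (producer, seq) tags into one ordered seq list per producer."""
    grouped = defaultdict(list)
    for producer, seq in tags:
        grouped[producer].append(seq)
    return dict(grouped)


def check_drained(published, drained):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    # Exactly-once: every published tag is drained exactly one time.
    counts = Counter(drained)
    for tag in published:
        if counts[tag] != 1:
            problems.append(("drain-count", tag, counts[tag]))
    # Per-producer FIFO: each producer's tags drain in publish order.
    # Interleaving between different producers is allowed.
    sent, seen = per_producer(published), per_producer(drained)
    for producer, seqs in sent.items():
        if seen.get(producer, []) != seqs:
            problems.append(("fifo", producer))
    return problems


# 3 producers x 5 messages x 2 phases = 30 tags.
published = [(p, phase * 5 + i)
             for phase in range(2)
             for i in range(5)
             for p in ("p1", "p2", "p3")]

swapped = list(published)
swapped[0], swapped[3] = swapped[3], swapped[0]  # reorder two p1 messages
```

The check compares order only within each producer, so a drained stream that regroups messages by producer still passes, while swapping two messages from the same producer is reported.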

This is the in-process arm of S13 (synthetic partition; no TCP-layer
behavior tested). The Jepsen-driven arm would still need a queue-
specific Clojure workload + the existing podman/SELinux/checker fixes
applied to a new compose stack — out of scope this session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>