Skip to content

fix(partitions): stop holding partition refs across .await#3557

Open
hubcio wants to merge 6 commits into
masterfrom
partitions-async-ub
Open

fix(partitions): stop holding partition refs across .await#3557
hubcio wants to merge 6 commits into
masterfrom
partitions-async-ub

Conversation

@hubcio

@hubcio hubcio commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

The poll-read handler and the consensus tick ran as separate
tasks but each held a &IggyPartition (from get_by_ns) across
an .await. The pump task could meanwhile reallocate the
partitions vec (ReconcileOp::InsertOwned) or take a &mut for
the same namespace, dangling the held reference: a
use-after-free on partition create/delete during a parked
read, and &/&mut aliasing UB on ordinary concurrent
produce+consume. The single-threaded runtime prevents neither.

The read handler now builds an owned poll plan synchronously
on the pump, then runs the disk read, straddle and offset
persist+apply off it on owned data alone (consumer offsets are
already Arc, the journal tail is snapshotted), so no task holds
a partition ref off the pump. The consensus tick is folded into
the pump select! arm, and the view-change and commit handlers
address one partition by namespace instead of scanning all. A
with_partition closure plus a debug tripwire enforce it.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 58.69906% with 527 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.57%. Comparing base (307fdb1) to head (ec86d06).

Files with missing lines Patch % Lines
core/partitions/src/iggy_partition.rs 52.30% 152 Missing and 3 partials ⚠️
core/partitions/src/poll_plan.rs 60.05% 146 Missing and 5 partials ⚠️
core/server-ng/src/dispatch.rs 0.00% 56 Missing ⚠️
core/partitions/src/journal.rs 53.44% 53 Missing and 1 partial ⚠️
core/server-ng/src/bootstrap.rs 61.81% 41 Missing and 1 partial ⚠️
core/partitions/src/iggy_partitions.rs 80.00% 38 Missing ⚠️
core/shard/src/router.rs 0.00% 15 Missing ⚠️
core/simulator/src/lib.rs 0.00% 11 Missing ⚠️
core/shard/src/lib.rs 96.87% 2 Missing ⚠️
core/simulator/src/main.rs 0.00% 2 Missing ⚠️
... and 1 more
Additional details and impacted files
@@              Coverage Diff              @@
##             master    #3557       +/-   ##
=============================================
- Coverage     74.07%   53.57%   -20.50%     
  Complexity      937      937               
=============================================
  Files          1249     1247        -2     
  Lines        128248   112909    -15339     
  Branches     104116    88820    -15296     
=============================================
- Hits          94994    60489    -34505     
- Misses        30219    49424    +19205     
+ Partials       3035     2996       -39     
Components Coverage Δ
Rust Core 48.51% <58.69%> (-26.20%) ⬇️
Java SDK 62.44% <ø> (ø)
C# SDK 71.40% <ø> (-0.66%) ⬇️
Python SDK 88.88% <ø> (ø)
PHP SDK 84.29% <ø> (ø)
Node SDK 91.26% <ø> (-0.10%) ⬇️
Go SDK 40.14% <ø> (ø)
Files with missing lines Coverage Δ
core/partitions/src/lib.rs 0.00% <ø> (ø)
core/partitions/src/offset_storage.rs 12.12% <ø> (ø)
core/server-ng/src/server_error.rs 92.10% <ø> (ø)
core/server-ng/src/partition_reconciler.rs 86.89% <88.88%> (-0.04%) ⬇️
core/shard/src/lib.rs 74.84% <96.87%> (+0.87%) ⬆️
core/simulator/src/main.rs 0.00% <0.00%> (ø)
core/simulator/src/lib.rs 93.04% <0.00%> (ø)
core/shard/src/router.rs 0.00% <0.00%> (ø)
core/partitions/src/iggy_partitions.rs 73.35% <80.00%> (+7.94%) ⬆️
core/server-ng/src/bootstrap.rs 10.86% <61.81%> (+3.55%) ⬆️
... and 4 more

... and 328 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@hubcio hubcio force-pushed the partitions-async-ub branch 2 times, most recently from 6064f97 to 3e075d6 Compare June 25, 2026 12:56

@ryankert01 ryankert01 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lg, also unit test passes locally.

hubcio added 2 commits June 26, 2026 11:27
The poll-read handler and the consensus tick ran as separate
tasks but each held a &IggyPartition (from get_by_ns) across
an .await. The pump task could meanwhile reallocate the
partitions vec (ReconcileOp::InsertOwned) or take a &mut for
the same namespace, dangling the held reference: a
use-after-free on partition create/delete during a parked
read, and &/&mut aliasing UB on ordinary concurrent
produce+consume. The single-threaded runtime prevents neither.

The read handler now builds an owned poll plan synchronously
on the pump, then runs the disk read, straddle and offset
persist+apply off it on owned data alone, so no task holds a
partition ref off the pump. The consensus tick is folded into
the pump select! arm, and the view-change and commit handlers
address one partition by namespace. The owned poll types move
to poll_plan.rs (the borrow contract documented once), and the
resident decode-walk and VSR-handler prologue are de-duplicated.
Resident polls walked the journal's whole tail, but that tail holds
replicated-but-uncommitted prepares ahead of the commit point, and a
view change can roll them back. Serving them is a dirty read.
MessageLookup now carries an inclusive `ceiling` (the commit offset);
select_batch_slice stops at the first offset past it, so no poll,
straddle continuation, or disk walk returns rollbackable data.

Two more off-pump correctness fixes ride along, all preserving the
"no partition ref across .await" contract from the previous commit:

- Off-pump auto-commit upserts via fetch_max (upsert_offset_max) so a
  stale auto-commit racing a newer explicit StoreConsumerOffset cannot
  rewind the stored offset. The explicit pump path keeps store(), which
  may legitimately rewind.

- delete_consumer_group_offset held &self across the unlink .await. It
  becomes reclaim_dead_group_offsets: a synchronous in-memory papaya
  remove that returns the owned file paths for the reconciler to unlink
  off-borrow, so no partition reference survives the await.

Also: a resident auto-commit that needs no disk persist (no offset_path,
sim/dev) applies inline on the pump instead of spawning a detached task;
resident entries are cloned only when a resident tail exists; and
push_selected_batch_fragments is shared by the resident and disk walks,
taking the batch base as a byte offset.
@hubcio hubcio force-pushed the partitions-async-ub branch from 3e075d6 to dd8bbe5 Compare June 26, 2026 09:37
hubcio added 4 commits June 26, 2026 13:34
The partition-ref-across-await fix made the consume poll path and the
produce/commit pump run as sibling tasks over the same partition, but
nothing exercised that aliasing window under real-runtime load across
topologies. This adds a data-plane hammer: many producers and consumers
all pound a single partition concurrently for a fixed duration, then the
consumers drain and assert no message loss and a strictly contiguous
offset log (every reader sees 0..total, count equals the sum of sends).

Runs the four topologies via the harness matrix: single-node and 3-node
cluster, each with one and two shards per node. Cluster variants are
#[ignore] (heavy) and run on demand.

Kept strictly data plane (poll by explicit offset, auto_commit = false,
no mid-run topic or partition mutation) so it never drives the metadata
consensus plane concurrently. That sidesteps a separate, still-open
on_ack journal-durability race that panics the primary under concurrent
metadata ops and currently gates concurrent_produce_consume under vsr.
Shard 0 opened its client listeners the instant
broadcast_metadata_bundle returned, which only proves peers
received the metadata bundle - not that they finished loading
their on-disk partitions. Peers still scan live shared metadata
in build_shard_for_thread and load each partition's segments via
walk_dir. With more than one shard, a partition created by the
first client lands in metadata while a peer's load scan is still
running; the peer then tries to load a partition whose segment
dir was never materialized and aborts the whole node with
CannotReadPartitions. Deterministic in release with 2+ shards,
hidden in debug by slower shard startup. Pre-existing on master.

Add a reverse barrier symmetric to the metadata handoff: each
peer signals once its load completes, and shard 0 drains one
signal per peer before binding listeners. A partition created
after the fence takes the runtime reconciler path, which creates
its dir, instead of the bootstrap load path. Both sides poll the
shutdown flag so a sibling failure mid-boot aborts cleanly rather
than hanging.
The 2s/debug hammer passed even against a server with a boot-time
shard race because the window was too short and debug timing hid
it. Extend the hammer to 20s and exercise it in release so the
data plane is driven hard enough to surface boot and steady-state
faults across all four single-node and cluster topologies.

A dead server made this dangerous: the harness root client retries
connect with no cap and parked polls never re-check the consumer
deadline, so a boot crash wedged the whole suite indefinitely (one
run sat 47 minutes). Wrap the test body in a wall-clock timeout
above MAX_TEST_DURATION so a slow-but-progressing consumer still
trips its own informative deadline first, while a wedge fails fast
instead of hanging CI.
The Cell import backs the borrow_active borrow tripwire (BorrowGuard),
which is #[cfg(debug_assertions)] only. In release builds Cell is
unused, so a plain cargo build --release warned. Gate the import to
match its uses; no behavior change in either profile.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review PR is waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants