Context
The current transport (see ARCHITECTURE.md §4 "TCPTransport") uses one TCP connection per peer pair, with all Raft groups multiplexed by group_id. This keeps connection count O(N²) in cluster size rather than O(N² × groups), which is essential for multi-raft scaling.
The trade-off is head-of-line blocking on the shared connection: a slow large message on group A delays a small heartbeat on group B. The 1 MB max_append_entries_size cap mitigates the worst case but doesn't eliminate the underlying issue.
For workloads where individual user messages are large (e.g. AMQP brokers where messages can be many MB), this trade-off becomes a real concern.
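To put rough numbers on the delay: a heartbeat queued behind a large message waits at least as long as it takes to write that message's bytes to the wire. The sketch below estimates that lower bound from message size and link bandwidth alone. The function name, the 1 Gbit/s link speed, the reading of "1 MB" as 1024 × 1024 bytes, and the 16 MB comparison size are illustrative assumptions. It ignores congestion control, retransmits, and receiver-side processing, all of which make the real delay larger.

```python
def serialization_delay_ms(message_bytes: int, link_bits_per_sec: float) -> float:
    """Lower bound on the time (in ms) to put message_bytes on the wire."""
    return message_bytes * 8 / link_bits_per_sec * 1000.0


ONE_GBIT = 1e9            # assumed link speed, for illustration only
CAP_BYTES = 1024 * 1024   # max_append_entries_size, assuming 1 MB = 1024 * 1024 bytes

# Delay a heartbeat sees when queued behind one message at the cap.
delay_at_cap = serialization_delay_ms(CAP_BYTES, ONE_GBIT)

# Same estimate for a hypothetical 16 MB transfer on the shared connection.
delay_16mb = serialization_delay_ms(16 * 1024 * 1024, ONE_GBIT)

print(f"at cap: {delay_at_cap:.2f} ms, 16 MB: {delay_16mb:.2f} ms")
```

Whether a delay of this size matters depends on the heartbeat interval and election timeout in use, which this issue does not specify.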
Mitigation paths
ARCHITECTURE.md §6 "What is intentionally not optimized (yet)" lays out four options for the HOL problem:
1. Don't put bodies in the Raft log. Keep metadata + reference in Raft; replicate bodies out-of-band. This is what RabbitMQ quorum queues do via the shared message store. Recommended for the AMQP integration. Application-level decision; out of scope for the transport.
2. Multiple TCP connections per peer pair: typically a "control" connection (heartbeats, votes, small AppendEntries) plus a "bulk" connection. Cheap to implement, retains most of the multi-raft connection-count benefit. This is the option this issue is about.
3. gRPC / HTTP/2 streams: partial benefit; TCP-layer HOL persists; significant complexity.
4. QUIC: eliminates TCP HOL but immature for this workload.
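The connection-count side of this trade-off can be sketched with simple arithmetic. The example below compares the current single connection per peer pair, a control + bulk pair, and a fully per-group layout. It assumes one connection per unordered peer pair and, for the per-group case, that every group spans every pair (an upper bound). The function names and the 5-node, 1000-group figures are hypothetical.

```python
def peer_pairs(nodes: int) -> int:
    """Number of unordered peer pairs in a fully connected cluster."""
    return nodes * (nodes - 1) // 2


def total_connections(nodes: int, conns_per_pair: int = 1) -> int:
    """Total TCP connections when each peer pair holds conns_per_pair connections."""
    return peer_pairs(nodes) * conns_per_pair


nodes, groups = 5, 1000  # illustrative cluster, not taken from the project

single = total_connections(nodes)             # current design: one per peer pair
dual = total_connections(nodes, 2)            # control + bulk per peer pair
per_group = total_connections(nodes, groups)  # upper bound: one per pair per group

print(single, dual, per_group)
```

Doubling the per-pair count keeps the total independent of the number of groups, which is the property the single-connection design was chosen for.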
Open questions for discussion
- Is option 2 the right fix, or should we lean on option 1 instead for AMQP integrations and accept the current single-connection design for everything else?
- If we add a second connection, what's the routing rule?
  - Size-based (entries > N bytes go on the bulk connection)?
  - Type-based (heartbeats + votes always on control)?
  - Priority-based (a priority field on Message)?
- Per-group or per-peer split? Per-peer is simpler; per-group affinity might give better isolation but multiplies connection count.
- Connection lifecycle. Both connections need symmetric lifecycle, error handling, and reconnect logic, so the existing per-peer fiber model is effectively doubled.
- Interaction with heartbeat aggregation (see related issue) — if heartbeats are batched into one small message per peer per interval, the HOL pressure on the control plane drops significantly, possibly making this less urgent.
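To make the routing-rule question concrete, here is a minimal sketch of one possible rule that combines the type-based and size-based candidates. Everything in it is hypothetical: the `Message` dataclass is a stand-in rather than the project's actual type, and the `Kind` and `Channel` enums and the 64 KiB threshold are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    CONTROL = "control"
    BULK = "bulk"


class Kind(Enum):
    HEARTBEAT = "heartbeat"
    VOTE = "vote"
    APPEND_ENTRIES = "append_entries"


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for the transport's message type."""
    group_id: int
    kind: Kind
    payload_size: int  # bytes


BULK_THRESHOLD = 64 * 1024  # hypothetical cut-off, would need tuning


def route(msg: Message, bulk_threshold: int = BULK_THRESHOLD) -> Channel:
    """Pick a connection for msg using type first, then size."""
    # Type-based: heartbeats and votes always use the control connection.
    if msg.kind in (Kind.HEARTBEAT, Kind.VOTE):
        return Channel.CONTROL
    # Size-based: anything above the threshold goes to the bulk connection.
    if msg.payload_size > bulk_threshold:
        return Channel.BULK
    return Channel.CONTROL


small = route(Message(group_id=1, kind=Kind.APPEND_ENTRIES, payload_size=512))
large = route(Message(group_id=1, kind=Kind.APPEND_ENTRIES, payload_size=1024 * 1024))
beat = route(Message(group_id=2, kind=Kind.HEARTBEAT, payload_size=32))
```

One consequence worth discussing: with a rule like this, AppendEntries for the same group can travel over both connections, so TCP no longer guarantees their relative order. Raft tolerates reordered messages, but whether this implementation handles them without extra retries would need checking.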
Why this is a discussion, not a fix
The right answer depends on:
- The expected message-size distribution for the target workload.
- Whether option 1 (bodies outside the log) is being pursued in parallel for AMQP.
- The complexity budget for the transport layer.
This is opening the conversation, not committing to a design.