Skip to content

Conversation

@kim
Copy link
Contributor

@kim kim commented Jan 27, 2026

Changes the commitlog (and durability) write API, such that the caller decides how many transactions are in a single commit, and has to supply the transaction offsets.

This simplifies commitlog-side buffering logic to essentially a BufWriter (which, of course, we must not forget to flush). This will help throughput, but offers less opportunity to retry failed writes. This is probably a good thing, as disks can fail in erratic ways, and we should rather crash and re-verify the commitlog (suffix) than continue writing.

To that end, this patch liberally raises panics when there is a chance that internal state could be "poisoned" by partial writes, which may be debatable.

Motivation

The main motivation is to avoid maintaining the transaction offset in two places in such a way that they could diverge. As ordering commits is the responsibility of the datastore, we make it authoritative on this matter -- the commitlog will still check that offsets are contiguous, and refuse to commit if that's not the case.

A secondary, related motivation is the following:

A "commit" is an atomic unit of storage, meaning that a torn (partial) write of a commit will render the entire commit corrupt. There hasn't been a compelling case where we would want this, and have always configured the server to write exactly one transaction per commit.
The code to handle buffering of transactions is, however, rather complex, as it tries hard to allow the caller to retry writes at commit boundaries. An unfortunate consequence of this is that we'd flush to the OS very often, leaving throughput performance on the table.

So, if there is a compelling case for batching multiple transactions in a commit, it should be the datastore's responsibility.

API and ABI breaking changes

Breaks internal APIs only.

Expected complexity level and risk

5 - Mostly for the risk

Testing

Existing tests.

@kim kim linked an issue Jan 28, 2026 that may be closed by this pull request
@kim
Copy link
Contributor Author

kim commented Jan 28, 2026

Nominating @gefjon and @Centril because they appeared in the reviewer suggestions.
@Centril specifically for hints on how the get rid of Box<[_]> allocations.
@Shubham8287 because of previous work on the commitlog.
@joshua-spacetime because of suggesting the change initially.

@kim kim marked this pull request as ready for review January 28, 2026 16:25
/// Tests that, when a partial write occurs, we can read all flushed commits
/// up until the faulty one.
#[test]
fn reopen() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this was supposed to test originally, so I removed it.

/// up until the faulty one.
#[test]
fn reopen() {
fn read_log_up_to_partial_write() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically the test previously named reopen.

let writer = &mut self.head;
let committed = writer.commit(transactions)?;
if writer.len() >= self.opts.max_segment_size {
self.flush().expect("failed to flush segment");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed a bit surprising to me at first -- but the BufWriter has no way of knowing how many bytes did make it. So if flush fails, the buffer is basically garbage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make datastore responsible for maintaining the transaction offset

2 participants