diff --git a/README.md b/README.md
index bd03b1e..fb8245b 100644
--- a/README.md
+++ b/README.md
@@ -74,7 +74,7 @@ Noah Tigner's [Portfolio Website](https://noahtigner.com)
 - [x] [Chapter 10 - Leader Election](https://noahtigner.com/articles/database-internals-chapter-10/)
 - [x] [Chapter 11 - Replication & Consistency](https://noahtigner.com/articles/database-internals-chapter-11/)
 - [x] [Chapter 12 - Anti-Entropy & Dissemination](https://noahtigner.com/articles/database-internals-chapter-12/)
-- [ ] [Chapter 13 - Distributed Transactions](https://noahtigner.com/articles/database-internals-chapter-13/)
+- [x] [Chapter 13 - Distributed Transactions](https://noahtigner.com/articles/database-internals-chapter-13/)
 - [ ] [Chapter 14 - Consensus](https://noahtigner.com/articles/database-internals-chapter-14/)
 - [ ] [Summary & Thoughts](https://noahtigner.com/articles//)
diff --git a/src/assets/articles/databaseInternals.md b/src/assets/articles/databaseInternals.md
index a40c1fc..e20f911 100644
--- a/src/assets/articles/databaseInternals.md
+++ b/src/assets/articles/databaseInternals.md
@@ -45,7 +45,7 @@ This is a collection of my notes on
 - [x] Chapter 10 - Leader Election
 - [x] Chapter 11 - Replication & Consistency
 - [x] Chapter 12 - Anti-Entropy & Dissemination
-- [ ] Chapter 13 - Distributed Transactions
+- [x] Chapter 13 - Distributed Transactions
 - [ ] Chapter 14 - Consensus
 ---
diff --git a/src/assets/articles/databaseInternalsChapter13.md b/src/assets/articles/databaseInternalsChapter13.md
new file mode 100644
index 0000000..aa4d8b7
--- /dev/null
+++ b/src/assets/articles/databaseInternalsChapter13.md
@@ -0,0 +1,224 @@
+---
+title: Database Internals Ch. 13 - Distributed Transactions
+description: Notes on Chapter 13 of Database Internals by Alex Petrov. Distributed Transactions, including two-phase commit, Spanner, partitioning, sharding, consistent hashing, and coordination avoidance.
+published: March 26, 2026
+updated: March 26, 2026
+minutesToRead: 9
+path: /articles/database-internals-chapter-13/
+image: /images/database-internals.jpg
+tags:
+  - 'reading notes'
+  - 'databases'
+  - 'distributed systems'
+collection:
+  slug: database-internals
+  title: Database Internals
+  shortTitle: Ch. 13 - Distributed Transactions
+  shortDescription: Distributed Transactions, including two-phase commit, Spanner, partitioning, sharding, consistent hashing, and coordination avoidance.
+  order: 13
+---
+
+## Database Internals - Ch. 13 - Distributed Transactions
+
+
+9 minute read • March 26, 2026
+
+
+This post contains my notes on Chapter 13 of _Database Internals_ by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found Timilehin Adeniran's notes on _Designing Data-Intensive Applications_ extremely helpful while reading that book, so I thought I'd try to do the same here.
+
+---
+
+### Introduction
+
+In Chapter 11, we discussed single-object, single-operation consistency models.
+We now shift to discussing models where multiple ops are executed atomically and concurrently.
+The main focus of transaction processing is to determine permissible histories and interleaving scenarios.
+A history can be represented as a dependency graph.
+Histories are serializable if they are equivalent to some sequential execution of transactions.
+
+None of the single-partition transaction processing approaches we have discussed so far solves the problem of multi-partition transactions, which require coordination between servers, distributed commits, and rollback protocols.
+
+To ensure atomicity, transactions must be recoverable, meaning that they can be completely rolled back if they were aborted or timed out.
+Changes also have to be propagated either to all of the nodes involved in the transaction or to none of them.
+
+---
+
+### Making Operations Appear Atomic
+
+To make multiple (possibly remote) operations appear atomic, we need to use a class of algorithms called "atomic commitment".
+These algorithms disallow disagreements between participants by not committing if even one participant voted against it.
+This also means that failed processes have to reach the same conclusion as the rest of the cohort.
+These algorithms don't work in the presence of Byzantine failures.
+
+These algorithms only tell the system whether the proposed transaction should be executed.
+They don't set strict requirements around prepare, commit, or rollback ops.
+Implementations have to decide when data is ready to commit, how to perform the commit itself, and how rollbacks should work.
+
+---
+
+### Two-Phase Commit
+
+Two-phase commit (2PC) is the most straightforward protocol for distributed commitment.
+It allows multi-partition atomic updates.
+During the first phase, "prepare", the value is distributed to the nodes and votes are collected on whether they can commit the change.
+In the second phase, "commit/abort", nodes just flip the switch and make the change visible.
+2PC requires a leader/coordinator that holds state, organizes votes, and decides whether to commit or abort.
+The leader can be fixed for the lifetime of the system or picked with a leader election algorithm.
+
+#### Cohort Failures in 2PC
+
+If one of the cohorts (participants) is unavailable during the prepare phase, the coordinator will abort the transaction.
+This negatively impacts availability.
+If a node fails during the commit/abort phase, it is left in an inconsistent state and cannot respond to requests until it has caught up.
+
+#### Coordinator Failures in 2PC
+
+If a cohort does not receive a commit or abort command during the second phase, it must find out which decision was made by contacting either the coordinator or a peer.
+Because of the possibility of a permanent coordinator failure, 2PC is a blocking atomic commitment algorithm.
+It is often chosen due to its simplicity and low overhead, but it requires proper recovery mechanisms and backup coordinator nodes.
+
+---
+
+### Three-Phase Commit
+
+Three-phase commit (3PC) is an improvement over 2PC that makes the protocol robust against coordinator failure and avoids undecided states.
+3PC changes the first phase to "propose" and adds a middle "prepare" phase, where cohorts are notified about the vote results and are sent either a prepare or an abort message.
+This allows cohorts to carry on even if the coordinator fails.
+Timeouts are also introduced on the cohort side.
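The vote-then-decide flow that both 2PC and 3PC build on can be sketched in a few lines. This is a toy, single-process illustration; the `Cohort` class and its method names are my own, not from the book:

```python
# Toy sketch of two-phase commit (2PC). All names here are illustrative
# simplifications: real implementations exchange messages over the network
# and persist their decisions for recovery.

class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"
        self.value = None

    def prepare(self, value):
        # Phase 1 ("prepare"): stage the value and vote on whether
        # this cohort is able to commit the change.
        self.staged = value
        return self.can_commit

    def finalize(self, commit):
        # Phase 2 ("commit/abort"): flip the switch and make the staged
        # change visible, or discard it.
        if commit:
            self.value = self.staged
            self.state = "committed"
        else:
            self.state = "aborted"


def two_phase_commit(value, cohorts):
    # The coordinator collects votes; a single "no" vote (or an
    # unreachable cohort) forces the whole transaction to abort.
    votes = [c.prepare(value) for c in cohorts]
    decision = all(votes)
    for c in cohorts:
        c.finalize(decision)
    return decision


nodes = [Cohort("a"), Cohort("b"), Cohort("c")]
print(two_phase_commit(42, nodes))  # True: unanimous vote

nodes = [Cohort("a"), Cohort("b"), Cohort("c", can_commit=False)]
print(two_phase_commit(42, nodes))  # False: one dissenter aborts all
```

A real coordinator also has to persist its decision before notifying cohorts, which is exactly why a coordinator crash between the two phases can leave cohorts blocked.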
+
+#### Coordinator Failures in 3PC
+
+All state transitions are coordinated, and cohorts can't move to the next phase until everyone is done with the previous one.
+If the coordinator fails, 3PC avoids blocking and allows cohorts to proceed deterministically.
+The biggest issue with 3PC is the possibility of a "split brain" due to a network partition.
+While it mostly solves the blocking problem of 2PC, it comes with a larger message overhead and does not work well in the presence of network partitions.
+It is therefore not widely used.
+
+---
+
+### Distributed Transactions with Calvin
+
+Traditional database systems execute transactions using two-phase locking or optimistic concurrency control and have no deterministic transaction order.
+This means that nodes have to be coordinated to preserve order.
+Deterministic transaction ordering removes coordination overhead during the execution phase, and since all replicas get the same inputs, they also produce the same outputs.
+The best-known example of this approach is "Calvin", a fast distributed transaction protocol.
+
+To achieve deterministic order, Calvin uses a "sequencer" - an entry point for all transactions that determines their execution order and establishes a global transaction input sequence.
+Batches, called "epochs", are used to reduce contention.
+Once replicated, these batches get forwarded to the scheduler, which orchestrates transaction execution.
+Worker threads then proceed with the execution itself.
+Calvin uses the Paxos consensus algorithm to determine which transactions make it into the current epoch.
+
+---
+
+### Distributed Transactions with Spanner
+
+Unlike Calvin, Spanner uses 2PC over consensus groups per partition (shard).
+It uses a high-precision wall-clock API called "TrueTime" to achieve consistency and impose a transaction order.
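To make the clock-based ordering idea concrete, here is a toy sketch of a "commit wait" over an uncertainty-interval clock. The names `tt_now` and `EPSILON` are my own, and this is not the actual TrueTime API; it only illustrates the principle of waiting out clock uncertainty before making a commit visible:

```python
import time

# Toy illustration of TrueTime-style commit wait (my own sketch, not the
# Spanner API). The clock exposes its uncertainty as an interval
# (earliest, latest); by waiting until the chosen commit timestamp is
# definitely in the past, timestamp order matches real-time order.

EPSILON = 0.005  # assumed clock uncertainty bound, in seconds


def tt_now():
    now = time.time()
    return (now - EPSILON, now + EPSILON)  # (earliest, latest)


def commit_wait():
    # Pick the commit timestamp at the upper bound of the interval,
    # then spin until even the pessimistic "earliest" has passed it.
    _, commit_ts = tt_now()
    while tt_now()[0] <= commit_ts:
        time.sleep(EPSILON / 5)
    return commit_ts


ts = commit_wait()
print(f"commit timestamp {ts:.6f} is now safely in the past")
```

The cost of this scheme is that every commit waits out roughly twice the uncertainty bound, which is why TrueTime invests so heavily in keeping that bound small.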
+
+Spanner offers three main operations:
+
+- Read-write transactions - require locks, pessimistic concurrency control, and the presence of a leader replica
+- Read-only transactions - lock-free, can be executed on any replica
+- Snapshot reads - consistent, lock-free view of the data at the given timestamp. A leader is only required when requesting the latest timestamp
+
+Spanner uses Paxos for consistent transaction log replication, 2PC for cross-shard transactions, and TrueTime for deterministic transaction ordering.
+This means that multi-partition transactions have a higher cost compared to Calvin, but Spanner usually wins in terms of availability.
+
+---
+
+### Database Partitioning
+
+Partitioning is simply a logical division of data into smaller, more manageable segments.
+The most straightforward approach is to split the data into ranges.
+Clients then route requests based on the routing key.
+This is typically called "sharding", where every replica set acts as the single source for a subset of data.
+
+We want to distribute reads and writes as evenly as possible, sizing partitions appropriately.
+In order to maintain balance, the DB also has to repartition the data when nodes are added or removed.
+In order to reduce range hot-spotting, some DBs use a hash of the value as the routing key.
+A naive approach is to map keys to nodes with something like `hash(v) % N`, where `N` is the number of nodes.
+The downside of this is that if the number of nodes changes, most keys map to different nodes, and the data has to be almost entirely repartitioned.
+
+#### Consistent Hashing
+
+In order to mitigate this problem, some DBs employ consistent hashing methods.
+Hashed values are mapped to a ring, so that after the largest possible value, it wraps around to the smallest value.
+Each node is responsible for the range of values between its own position and that of its predecessor on the ring.
+Consistent hashing helps to reduce the number of relocations required for maintaining balance.
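The ring described above can be sketched in a few lines of Python. This toy `HashRing` (one position per node, no virtual nodes, MD5 only for a stable hash) is my own illustration, not code from the book:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable hash; Python's built-in hash() is randomized per process.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    # Toy consistent-hashing ring: one position per node, no virtual nodes.
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key: str) -> str:
        # Route to the first node clockwise from the key's position,
        # wrapping past the largest hash back to the smallest.
        i = bisect.bisect(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"key-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Only the keys that fall into the new node's arc relocate; with the
# naive hash(v) % N scheme, roughly (N-1)/N of all keys would move.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys relocated, all to the new node")
```

Production systems place many virtual positions per node on the ring, which smooths out the otherwise uneven arc sizes that a single position per node produces.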
+
+---
+
+### Distributed Transactions with Percolator
+
+If serializability is not required by the application, one way to avoid write anomalies is with a transaction model called "snapshot isolation" (SI).
+SI guarantees that all reads made within the transaction are consistent with a snapshot of the database.
+The snapshot contains all values that were committed before the transaction started.
+If there's a write conflict between two transactions, only one of them will commit (usually called "first committer wins").
+
+SI has several convenient properties:
+
+- It prevents read skew
+- It allows only repeatable reads of committed data
+- Values are consistent
+- Conflicting writes are aborted and retried to prevent inconsistency
+
+Despite this, histories under SI are not serializable, and we can end up with write skew.
+
+"Percolator" is a library that implements a transactional API on top of the wide-column store Bigtable.
+SI is an important abstraction for Percolator.
+Many other DBMSs and MVCC systems offer the SI isolation level.
+
+---
+
+### Coordination Avoidance
+
+Invariant Confluence (I-Confluence) is a property that ensures that two invariant-valid but diverged DB states can be merged into a single valid DB state.
+Because any two valid states can be merged into a valid state, I-Confluent ops can be executed without any additional coordination, which significantly improves performance and scalability potential.
+
+A system model that allows coordination avoidance has to guarantee the following properties:
+
+- Global validity - required invariants are always satisfied
+- Availability - if all nodes holding state are reachable by the client, the transaction has to reach a commit or abort decision
+- Convergence - nodes can maintain their local state independently, but they have to be able to reach the same state
+- Coordination freedom - local transaction execution is independent from the ops executed against the local state on behalf of other nodes
+
+One example of coordination avoidance is Read-Atomic Multi-Partition (RAMP) transactions.
+RAMP transactions use MVCC and metadata of in-flight ops to fetch any missing state updates from other nodes.
+This allows read and write ops to be executed concurrently.
+
+RAMP provides two properties:
+
+- Synchronization independence - transactions are independent of each other
+- Partition independence - clients don't have to contact partitions whose values aren't involved in their transactions
+
+RAMP provides the read-atomic isolation level.
+It also offers atomic write visibility without requiring mutual exclusion.
+To allow readers and writers to proceed without blocking other concurrent readers and writers, writes in RAMP are made visible using a non-blocking variant of 2PC.
+
+---
+
+### Other Resources
+
+Yugabyte provided a great talk comparing and contrasting Calvin and Spanner. ByteByteGo has a great video, article, and chapter in _System Design Interview_ about consistent hashing.
+
+
+
+
+---
+

+Database Internals by Alex Petrov (O'Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7