From 8e322564c82e2e7275fc1426e71872bef61a8e48 Mon Sep 17 00:00:00 2001
From: Noah Tigner
5 minute read • March 11, 2026
+4 minute read • March 11, 2026
This post contains my notes on Chapter 10 of _Database Internals_ by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found Timilehin Adeniran's notes on _Designing Data-Intensive Applications_ extremely helpful while reading that book, so I thought I'd try to do the same here. diff --git a/src/assets/articles/databaseInternalsChapter11.md b/src/assets/articles/databaseInternalsChapter11.md index 229f9ff..4f232c7 100644 --- a/src/assets/articles/databaseInternalsChapter11.md +++ b/src/assets/articles/databaseInternalsChapter11.md @@ -1,8 +1,8 @@ --- title: Database Internals Ch. 11 - Replication and Consistency -description: Notes on Chapter 11 of Database Internals by Alex Petrov. Replication and consistency in distributed systems, CAP, and CDRTs. +description: Notes on Chapter 11 of Database Internals by Alex Petrov. Replication and consistency in distributed systems, CAP, and CRDTs. published: March 18, 2026 -updated: March 18, 2026 +updated: March 29, 2026 minutesToRead: 10 path: /articles/database-internals-chapter-11/ image: /images/database-internals.jpg @@ -14,7 +14,7 @@ collection: slug: database-internals title: Database Internals shortTitle: Ch. 11 - Replication and Consistency - shortDescription: Replication and consistency in distributed systems, CAP, and CDRTs. + shortDescription: Replication and consistency in distributed systems, CAP, and CRDTs. order: 11 --- @@ -40,7 +40,7 @@ To make the system highly available, we need to design it in a way that allows h ### Infamous CAP -Availability measures the system's ability to respond to every request successfully. We would also like each operation to be (atomically / linearizably) consistent. Ideally, we would like to achieve both availability and consistency while tolerating network partitions. The CAP conjecture describes the tradeoffs between consistency \(C), availability (A), and partition tolerance (P). 
The conjecture states that at most two of the three can be achieved. +Availability measures the system's ability to respond to every request successfully. We would also like each operation to be (atomically / linearizably) consistent. Ideally, we would like to achieve both availability and consistency while tolerating network partitions. The CAP conjecture describes the tradeoffs between consistency C, availability A, and partition tolerance P. The conjecture states that a system can only choose between consistency and availability when a partition occurs. The two most common approaches are "AP" and "CP". CP systems prefer failing requests to serving potentially inconsistent data. AP systems loosen the C requirements and allow serving potentially inconsistent values during the request. @@ -64,7 +64,7 @@ From the client's perspective, distributed systems act as if storage is shared, Registers can be accessed by multiple readers and writers simultaneously. When it comes to concurrent ops, there are three types of registers: - Safe - reads to the safe registers may return arbitrary values within the range of the register during a concurrent write op -- Regular - read ops return the value of the most recently completed write, or the value of the write that overlaps with the current reade op +- Regular - read ops return the value of the most recently completed write, or the value of the write that overlaps with the current read op - Atomic - every write op has a single moment before which every read returns an old value and after which every read returns a new value. This guarantees linearizability.
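To make the atomic register's guarantee concrete, here's a minimal single-process sketch (my own illustration, not from the book) in which a lock gives every read and write a single atomic point in time:

```python
import threading

class AtomicRegister:
    """Toy register: the lock gives each read/write a single atomic
    moment, so every read sees either the old or the new value, never
    a partial one - the linearizable behavior described above."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def write(self, value):
        with self._lock:
            self._value = value

    def read(self):
        with self._lock:
            return self._value

r = AtomicRegister(0)
r.write(42)
assert r.read() == 42  # a completed write is visible to every later read
```

A real distributed register needs quorum reads and writes to get the same effect across replicas; the lock only models the single-copy semantics.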
--- @@ -123,7 +123,7 @@ Following CAP principles, we can tune our eventual consistency with three parame - Write consistency W - the number of nodes that have to acknowledge a write for it to succeed - Read consistency R - the number of nodes that have to respond to a read operation for it to succeed -Choosing levels where R + W > N gaurantees that the most recently written value is returned. Write-heavy systems sometimes pick W = 1 and R = N, which allows writes to be acknowledged by just one node, but requires all replicas to be available for reads. Increasing W or R increases latency and raises requirements for node availability. Decreasing them improves system availability while sacrificing consistency. +Choosing levels where R + W > N helps reduce the chance of stale reads by forcing read and write quorums to overlap. Write-heavy systems sometimes pick W = 1 and R = N, which allows writes to be acknowledged by just one node, but requires all replicas to be available for reads. Increasing W or R increases latency and raises requirements for node availability. Decreasing them improves system availability while sacrificing consistency. A level of floor(N / 2) + 1 is called a "quorum", or majority of votes. In a system with 2f + 1 nodes, the system can keep responding even when up to f become unavailable. This does not, however, guarantee monotonicity in cases of incomplete writes. @@ -140,11 +140,11 @@ Witness replicas help reduce storage costs while preserving consistency. --- -### Strong Eventual Consistency and CDRTs +### Strong Eventual Consistency and CRDTs -Under strong eventual consistency, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state. 
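As a quick sanity check on the R + W > N overlap rule above, a tiny sketch (parameter names are mine):

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum must intersect every write quorum,
    i.e. a read is guaranteed to contact at least one replica that
    acknowledged the latest successful write."""
    return r + w > n

n = 5
majority = n // 2 + 1          # floor(N / 2) + 1 = 3, the "quorum" level
assert quorums_overlap(n, w=majority, r=majority)   # 3 + 3 > 5
assert quorums_overlap(n, w=1, r=n)                 # the write-heavy W=1, R=N setup
assert not quorums_overlap(n, w=2, r=2)             # 2 + 2 <= 5: quorums can miss each other
```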
Under some conditions, we can relax our consistency requirements by allowing operations to preserve additional state that allows the diverged states to be reconciled (merged) after execution. This is often implemented with Conflict-Free Replicated Data Types (CDRTs), as in the case of Redis. CDRTs are specialized data structures that preclude the existence of conflicts and allow ops to be applied in any order without changing the results. They are extremely useful in distributed systems and are often used in eventually consistent systems. +Under strong eventual consistency, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state. Under some conditions, we can relax our consistency requirements by allowing operations to preserve additional state that allows the diverged states to be reconciled (merged) after execution. This is often implemented with Conflict-Free Replicated Data Types (CRDTs), as in the case of Redis. CRDTs are specialized data structures that preclude the existence of conflicts and allow ops to be applied in any order without changing the results. They are extremely useful in distributed systems and are often used in eventually consistent systems. -The simplest CDRTs are operations-based Commutative Replicated Data Types (CmRDTs), which require ops to be side-effect free, commutative, and causally ordered. Another example is the unordered Grow-Only Set (G-Set), which supports additions, removals, merges, etc. A more complex example is Martin Kleppmann's conflict-free replicated JSON data type, which allows modifications on deeply-nested JSON documents. +The simplest CRDTs are operations-based Commutative Replicated Data Types (CmRDTs), which require ops to be side-effect free, commutative, and causally ordered. 
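As a loose illustration of why conflict-free merges work, here is a state-based grow-only counter (my own sketch, a state-based cousin of the op-based CmRDTs discussed here, not an example from the book):

```python
class GCounter:
    """Grow-only counter: one slot per replica. Merge is an element-wise
    max, which is commutative, associative, and idempotent, so replicas
    converge no matter the order in which updates arrive."""

    def __init__(self, replica_id: int, n_replicas: int):
        self.replica_id = replica_id
        self.slots = [0] * n_replicas

    def increment(self):
        self.slots[self.replica_id] += 1

    def merge(self, other: "GCounter"):
        self.slots = [max(a, b) for a, b in zip(self.slots, other.slots)]

    def value(self) -> int:
        return sum(self.slots)

a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # two increments on replica 0
b.increment()                  # one increment on replica 1
a.merge(b); b.merge(a)         # merge in either order...
assert a.value() == b.value() == 3   # ...and both replicas converge
```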
Another example is the unordered Grow-Only Set (G-Set), which supports additions and merges. A more complex example is Martin Kleppmann's conflict-free replicated JSON data type, which allows modifications on deeply nested JSON documents. --- diff --git a/src/assets/articles/databaseInternalsChapter12.md b/src/assets/articles/databaseInternalsChapter12.md index c1b6233..25d09b6 100644 --- a/src/assets/articles/databaseInternalsChapter12.md +++ b/src/assets/articles/databaseInternalsChapter12.md @@ -1,8 +1,8 @@ --- -title: Database Internals Ch. 12 - Anti-Entropy and Dissemination +title: Database Internals Ch. 12 - Anti-Entropy & Dissemination description: Notes on Chapter 12 of Database Internals by Alex Petrov. Anti-Entropy and Dissemination in distributed systems, including read repair, hinted handoff, Merkle Trees, and gossip dissemination. published: March 21, 2026 -updated: March 21, 2026 +updated: March 29, 2026 minutesToRead: 7 path: /articles/database-internals-chapter-12/ image: /images/database-internals.jpg @@ -13,7 +13,7 @@ tags: collection: slug: database-internals title: Database Internals - shortTitle: Ch. 12 - Anti-Entropy and Dissemination + shortTitle: Ch. 12 - Anti-Entropy & Dissemination shortDescription: Anti-Entropy and Dissemination in distributed systems, including read repair, hinted handoff, Merkle Trees, and gossip dissemination. order: 12 --- diff --git a/src/assets/articles/databaseInternalsChapter13.md b/src/assets/articles/databaseInternalsChapter13.md index aa4d8b7..9d789b3 100644 --- a/src/assets/articles/databaseInternalsChapter13.md +++ b/src/assets/articles/databaseInternalsChapter13.md @@ -2,7 +2,7 @@ title: Database Internals Ch. 13 - Distributed Transactions description: Notes on Chapter 13 of Database Internals by Alex Petrov. Distributed Transactions, including two-phase commit, Spanner, partitioning, sharding, consistent hashing, and coordination avoidance. 
published: March 26, 2026 -updated: March 26, 2026 +updated: March 29, 2026 minutesToRead: 9 path: /articles/database-internals-chapter-13/ image: /images/database-internals.jpg @@ -133,7 +133,7 @@ Clients then route requests based on the routing key. This is typically called "sharding", where every replica set acts as the single source for a subset of data. We want to distribute reads and writes as evenly as possible, sizing partitions appropriately. -In order to maintain balance, the DB also has to repartition the data when nodes are added or removed. +In order to maintain balance, the database also has to repartition the data when nodes are added or removed. In order to reduce range hot-spotting, some DBs use a hash of the value as the routing key. A naive approach is to map keys to nodes with something like `hash(v) % N`, where N is the number of nodes. The downside of this is that if the number of nodes changes, the system is immediately unbalanced and needs to be repartitioned. @@ -171,7 +171,7 @@ Many other DBMSs and FLP Impossibility shows that it is impossible to guarantee consensus in a completely asynchronous system in unbounded time. +FLP Impossibility shows that deterministic consensus cannot guarantee both safety and termination in a completely asynchronous system if even one process may fail. We've discussed the tradeoffs between failure detection accuracy and speed. Consensus algorithms assume an async model and guarantee safety while using an external failure detection algorithm to guarantee liveness. Because failure detection is not always fully accurate, there will be some situations where the algorithm waits for a process that is incorrectly accused of being faulty. @@ -80,7 +80,7 @@ It uses a hierarchical distributed key-value store, which is used to ensure a to Processes in ZAB are either a follower or a (temporary) leader. The leader executes algorithm steps, broadcasts messages to followers, and establishes the event order. 
-All writes and reads of the most recent values are routed to the leader. +All writes, and reads that require the most recent values, are routed to the leader. The protocol timeline is split into epochs, with one leader per epoch. The process starts by using leader election to find a prospective leader. diff --git a/src/assets/articles/databaseInternalsChapter5.md b/src/assets/articles/databaseInternalsChapter5.md index e647504..f870569 100644 --- a/src/assets/articles/databaseInternalsChapter5.md +++ b/src/assets/articles/databaseInternalsChapter5.md @@ -1,8 +1,8 @@ --- -title: Database Internals Ch. 5 - Transaction Processing & Recovery +title: Database Internals Ch. 5 - Transaction Processing and Recovery description: Notes on Chapter 5 of Database Internals by Alex Petrov. Transaction Processing and Recovery in Database Management Systems. published: February 27, 2026 -updated: March 18, 2026 +updated: March 29, 2026 minutesToRead: 12 path: /articles/database-internals-chapter-5/ image: /images/database-internals.jpg @@ -13,7 +13,7 @@ tags: collection: slug: database-internals title: Database Internals - shortTitle: Ch. 5 - Transaction Processing & Recovery + shortTitle: Ch. 5 - Transaction Processing and Recovery shortDescription: Transaction Processing and Recovery in Database Management Systems. order: 5 --- @@ -31,14 +31,14 @@ This post contains my notes on Chapter 5 of Martin Kleppmann and others have raised concerns over the assumptions we make with ACID, it is still an important concept to learn. In short, ACID means: 1. Atomicity - transactions are indivisible, meaning all-or-nothing. All steps within a transaction are either committed (applied) or aborted (rolled back and possibly retried). -2. Consistency - an app-specific guarantee (controlled by the app, not the DBMS); each transaction brings the DB from one valid state to another with all constraints and rules intact. +2. 
Consistency - an app-specific guarantee (controlled by the app, not the DBMS); each transaction brings the database from one valid state to another with all constraints and rules intact. 3. Isolation - concurrent transactions can execute without interference. -4. Durability - once a transaction has been committed, all db state must be persisted to disk in order to survive system failures, restarts, etc. +4. Durability - once a transaction has been committed, all database state must be persisted to disk in order to survive system failures, restarts, etc. There are several components required to manage transactions: - Lock manager - guards access to resources and prevents concurrent accesses that would violate data integrity -- Page cache - serves as an intermediary between persistent storage and the rest of the storage engine. All changes to the DB state are applied here first. +- Page cache - serves as an intermediary between persistent storage and the rest of the storage engine. All changes to the database state are applied here first. - Log manager - holds a history of the operations applied to cached pages that are not yet synced with persistent storage. This guarantees that operations won't be lost in case of crashes. It is also referenced when aborting transactions. --- @@ -114,7 +114,7 @@ Concurrency control is a set of techniques for handling interactions between con #### Serializability -A "schedule" is a list of ops required to execute a set of transactions from the db's perspective. A schedule is "complete" if it contains all ops from every transaction executed in it. It is "serial" when transactions are executed independently and in serial (one after the other). "Serializable" schedules allow us to execute transactions concurrently while maintaining the correctness of a serial schedule. +A "schedule" is a list of ops required to execute a set of transactions from the database's perspective. 
A schedule is "complete" if it contains all ops from every transaction executed in it. It is "serial" when transactions are executed independently and in serial (one after the other). "Serializable" schedules allow us to execute transactions concurrently while maintaining the correctness of a serial schedule. #### Transaction Isolation @@ -165,11 +165,11 @@ With Pessimistic Concurrency Control (PCC), transaction conflicts are determined #### Lock-Based Concurrency Control -Lock-based concurrency control schemes are a form of PCC that use locks on db objects instead of using concurrency control to resolve schedules. Downsides include contention and scalability issues. Two-phase locking (2PL) is a common approach. +Lock-based concurrency control schemes are a form of PCC that use locks on database objects instead of using concurrency control to resolve schedules. Downsides include contention and scalability issues. Two-phase locking (2PL) is a common approach. When locks are introduced into the system we must consider and handle deadlocks. Strategies exist such as timeouts and "Conservative 2PL", but they limit concurrency. Typically, DBMS use a transaction manager to detect and avoid deadlocks. This is usually done with a "waits-for" graph. Cycles in the graph represent deadlocks. Detection can be done periodically or continuously. Transaction managers typically prioritize older transactions. -Locks are used to isolate and schedule overlapping transactions and manage DB contents, but not internal storage structures. They can guard either a single key or a set of keys, and are stored outside of the tree and managed by the DB lock manager. Latches, on the other hand, guard physical representations - tree structure and page contents. Since a modification on a leaf level might propagate up to higher levels, latches might have to be obtained on multiple levels. To increase concurrency, latches should be held for the smallest possible duration. 
Readers-Writes Locks (RWLs) allow multiple concurrent readers access to an object, with only writers needing to obtain exclusive access. "Latch crabbing" is a simple and optimistic method that allows holding latches for less time and releasing them as soon as it's clear that the executing operation doesn't need them anymore. +Locks are used to isolate and schedule overlapping transactions and manage database contents, but not internal storage structures. They can guard either a single key or a set of keys, and are stored outside of the tree and managed by the database lock manager. Latches, on the other hand, guard physical representations - tree structure and page contents. Since a modification on a leaf level might propagate up to higher levels, latches might have to be obtained on multiple levels. To increase concurrency, latches should be held for the smallest possible duration. Readers-Writer Locks (RWLs) allow multiple concurrent readers access to an object, with only writers needing to obtain exclusive access. "Latch crabbing" is a simple and optimistic method that allows holding latches for less time and releasing them as soon as it's clear that the executing operation doesn't need them anymore. Blink-Trees, which use high keys and sibling links, allow a state called a "half-split". This approach can reduce contention and simplify concurrent access while reducing the number of locks held during tree state modifications. More importantly, it allows reads concurrent to structural tree changes and helps prevent deadlocks. diff --git a/src/assets/articles/databaseInternalsChapter6.md b/src/assets/articles/databaseInternalsChapter6.md index 03da94a..5f01d04 100644 --- a/src/assets/articles/databaseInternalsChapter6.md +++ b/src/assets/articles/databaseInternalsChapter6.md @@ -2,7 +2,7 @@ title: Database Internals Ch. 6 - B-Tree Variants description: Notes on Chapter 6 of Database Internals by Alex Petrov.
B-Tree implementation techniques, optimizations, and real-world variants. published: March 1, 2026 -updated: March 1, 2026 +updated: March 29, 2026 minutesToRead: 6 path: /articles/database-internals-chapter-6/ image: /images/database-internals.jpg @@ -26,8 +26,12 @@ This post contains my notes on Chapter 6 of [!NOTE] > The book claims that the subtrees will have size sqr(N), but I believe they are actually sqrt(N). diff --git a/src/assets/articles/databaseInternalsChapter7.md b/src/assets/articles/databaseInternalsChapter7.md index 7ec27dd..14eb519 100644 --- a/src/assets/articles/databaseInternalsChapter7.md +++ b/src/assets/articles/databaseInternalsChapter7.md @@ -26,8 +26,12 @@ This post contains my notes on Chapter 7 of 11 minute read • March 8, 2026 +10 minute read • March 8, 2026
This post contains my notes on Chapter 8 of _Database Internals_ by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found Timilehin Adeniran's notes on _Designing Data-Intensive Applications_ extremely helpful while reading that book, so I thought I'd try to do the same here. @@ -43,7 +43,7 @@ Every concurrency problem has some properties of a distributed system. Threads a #### Shared State in a Distributed System -We can try to introduce some notion of shared memory to a distributed system, such as a database. Even if we solve the problems with concurrent access to it, we still cannot guarantee that all processes are in sync. To access this db, process can send messages over the communication medium. We'll therefore have to describe the system in terms of "synchrony" - whether the system is async, or if we can make some assumptions about timing. These assumptions give us options like timeouts and retries. +We can try to introduce some notion of shared memory to a distributed system, such as a database. Even if we solve the problems with concurrent access to it, we still cannot guarantee that all processes are in sync. To access this database, a process can send messages over the communication medium. We'll therefore have to describe the system in terms of "synchrony" - whether the system is async, or if we can make some assumptions about timing. These assumptions give us options like timeouts and retries. We don't always know the "nature" of an issue - if we haven't received a response because of a network issue, because the resource is overloaded, or because of a system crash. "Failure models" describe the ways in which failures can occur and how we decide to handle them. "Fault tolerance" describes the degree to which our system keeps operating correctly even when failures occur. 
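Those timing assumptions are what make a pattern like timeout-plus-retry meaningful. A hedged sketch (the names and the failure simulation are mine, not from the book):

```python
class FlakyService:
    """Simulates a peer that times out a few times before answering."""

    def __init__(self, failures: int):
        self.failures = failures

    def call(self) -> str:
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("no response within the deadline")
        return "ok"

def call_with_retries(op, max_attempts: int = 3):
    """Retry on timeout. Note that we still can't tell a slow or
    partitioned node from a crashed one - the timeout only bounds
    how long we are willing to wait before deciding."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller

assert call_with_retries(FlakyService(2).call) == "ok"
```

This is the "nature of the issue" problem in miniature: the retry loop treats every timeout the same way, whatever its actual cause.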
diff --git a/src/assets/articles/databaseInternalsSummary.md b/src/assets/articles/databaseInternalsSummary.md new file mode 100644 index 0000000..be8aaae --- /dev/null +++ b/src/assets/articles/databaseInternalsSummary.md @@ -0,0 +1,227 @@ +--- +title: Database Internals - Summary & Review +description: Summary and review of Database Internals by Alex Petrov. +published: March 29, 2026 +updated: March 29, 2026 +minutesToRead: 10 +path: /articles/database-internals-summary/ +image: /images/database-internals.jpg +tags: + - 'reading notes' + - 'databases' + - 'distributed systems' +collection: + slug: database-internals + title: Database Internals + shortTitle: Summary & Review + shortDescription: Summary and review of Database Internals by Alex Petrov. + order: 15 +--- + +## Database Internals - Summary & Review + +10 minute read • March 29, 2026
+ +This post contains my summary and review of _Database Internals_ by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found Timilehin Adeniran's notes on _Designing Data-Intensive Applications_ extremely helpful while reading that book, so I thought I'd try to do the same here. + +--- + +### Part I - Storage Engines + +#### B-Trees and LSM Trees + +Storage engines are shaped less by asymptotic complexity than by hardware behavior, access patterns, and operational tradeoffs. +B-Trees and LSM Trees are the clearest example of this. +Both are ordered structures optimized for disk-backed storage, but they make very different choices around buffering, mutability, and maintenance. + +B-Trees are the most commonly used example of a read-oriented structure. +They use wide nodes, high fanout, and low height to reduce seeks while preserving efficient point lookups and range scans. +The real complexity is not just the tree itself, but everything required to make it practical: slotted pages, separator keys, sibling links, overflow handling, rebalancing, compression, bulk loading, and variants such as copy-on-write trees, buffered trees, and Bw-Trees. +Storage-engine design is about preserving ordered access while balancing read performance, write cost, space usage, and concurrency. + +LSM Trees start from the opposite side of the tradeoff space. +Instead of optimizing in-place updates, they buffer writes in memory, flush sorted immutable structures to disk, and use compaction to merge and reconcile data over time. +This reduces the cost of small writes and takes advantage of sequential I/O, but it pushes work into later maintenance and introduces read, write, and space amplification tradeoffs. +That is why components such as memtables, SSTables, tombstones, bloom filters, and leveled or size-tiered compaction matter so much. 
+One especially useful connection is that B-Trees and related structures often still appear inside LSM-based systems, whether for indexing or for comparison. + +| | Buffered | Mutable | Ordered | +| ------------ | -------- | ------- | ------- | +| B+Trees | | ✓ | ✓ | +| WiredTiger | ✓ | ✓ | ✓ | +| LA-Trees | ✓ | ✓ | ✓ | +| CoW Trees | | | ✓ | +| 2C LSM Trees | ✓ | | ✓ | +| MC LSM Trees | ✓ | | ✓ | +| FD-Trees | ✓ | | ✓ | +| BitCask | | | | +| WiscKey | ✓\* | | ✓\* | +| BW-Trees | | | \* | + +Buffering, immutability, and ordering properties of discussed storage structures
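The LSM write path described above can be sketched in a few lines (a toy of my own; real engines add a WAL, bloom filters, and background compaction):

```python
class TinyLSM:
    """Toy LSM tree: buffer writes in an in-memory memtable, flush it
    as an immutable sorted run when full, and answer reads
    newest-run-first so recent values shadow older ones."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.runs = []                 # newest first; each run is immutable
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: sort and freeze the memtable as an on-"disk" run
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # freshest data lives in memory
            return self.memtable[key]
        for run in self.runs:          # then search runs newest to oldest
            if key in run:
                return run[key]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # the second put triggers a flush
db.put("a", 3)                   # the newer value shadows the flushed one
assert db.get("a") == 3 and db.get("b") == 2
```

Compaction would merge the runs list back down; that deferred maintenance is exactly where the read, write, and space amplification tradeoffs come from.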
+ +#### Transactions + +Transactions are the indivisible logical unit of work in database management systems. +They allow us to represent multiple operations in a single step. +ACID (atomicity, consistency, isolation, durability) is one of the most important concepts related to databases. +Transaction processing usually involves components such as the lock manager, page cache, and log manager. + +Most databases use a 2-level memory hierarchy: slower persistent storage (disk) and faster main memory (RAM). +Pages are cached in memory to reduce the number of disk accesses. +Page replacement algorithms use eviction policies such as FIFO, LRU, CLOCK, and LFU. +These policies have various tradeoffs surrounding precision (hit rate), overhead, and complexity. + +#### Recovery + +The WAL is an append-only auxiliary on-disk structure used for crash and transaction recovery. It has several functions: + +- It allows the page cache to buffer updates to disk-resident pages while ensuring durability +- It persists all ops on disk until cached copies of pages affected by these ops are synced on disk +- It allows lost in-memory changes to be reconstructed from the operation log in case of crash + +The WAL is usually coupled with a primary storage structure by the interface that allows trimming it whenever a checkpoint is reached. +Checkpoints tell the log system that log records up to a certain point aren’t required anymore. +“Fuzzy checkpointing” allows this to happen asynchronously and is a more practical approach. + +#### Concurrency Control + +Concurrency control is a set of techniques for handling interactions between concurrently executing transactions. They can be grouped into three buckets: + +- Optimistic Concurrency Control (OCC) +- Pessimistic Concurrency Control (PCC) +- Multiversion Concurrency Control (MVCC) + +A schedule is a list of ops required to execute a set of transactions from the database’s perspective.
+A schedule is “complete” if it contains all ops from every transaction executed in it. +It is “serial” when transactions are executed independently and in serial (one after the other). +Serializable schedules allow us to execute transactions concurrently while maintaining the correctness of a serial schedule. + +Isolation levels specify how and when parts of the transaction should become visible to other concurrent transactions. + +#### Read & Write Anomalies + +Read anomalies include: + +- “Dirty” reads - when a transaction reads uncommitted changes from other transactions +- Non-repeatable “fuzzy” reads - when a transaction queries the same row twice and gets different results +- “Phantom” reads - when a transaction queries a set of rows twice and gets different results (the range-query equivalent of a fuzzy read) + +Write anomalies include: + +- “Lost” updates - when two transactions attempt to update the same value and the second transaction, unaware of the first, overwrites the first transaction’s updates without taking them into account +- “Dirty” writes - when a transaction takes an uncommitted value (dirty read) and modifies and saves it +- Write “skew” - when each individual transaction in a set respects the invariants, but the combination of the transactions does not + +| | Dirty | Non-Repeatable | Phantom | +| ---------------- | ------- | -------------- | ------- | +| Read Uncommitted | Allowed | Allowed | Allowed | +| Read Committed | - | Allowed | Allowed | +| Repeatable Read | - | - | Allowed | +| Serializable | - | - | - | + +Isolation levels and allowed anomalies
+ +--- + +### Part II - Distributed Systems + +#### Distributed Algorithms + +Distributed algorithms serve many purposes, such as: + +- Coordination - a process that supervises the actions and behaviors of several workers +- Cooperation - multiple participants relying on one another for finishing their task +- Dissemination - processes cooperating in spreading information to all interested parties +- Consensus - achieving agreement among multiple processes + +#### Two Generals, FLP Impossibility, and Byzantine Failures + +The Two Generals problem is a thought experiment that shows that it is impossible to achieve an agreement between two parties if communication is asynchronous and links fail. +FLP Impossibility shows that deterministic consensus cannot guarantee both safety and termination in a completely asynchronous system if even one process may fail. +Arbitrary (a.k.a. “Byzantine”) faults are where a process continues executing algorithm steps, but in a way that contradicts the algorithm. +These can be caused by software bugs, malicious actors, etc. + +#### Failure Detection + +Failures can occur at the link level or at the process level. +There are always tradeoffs between wrongly suspecting alive processes of being dead (false-positives) and giving dead processes the benefit of the doubt (false-negatives). + +We can query the state of a remote process by triggering one of two periodic processes: + +- Ping - checks if the process is still alive by sending it a message and asserting that it responds within a specified timeframe +- Heartbeat - the process actively notifies its peers that it’s still running by sending messages to them + +Gossip provides another approach that avoids relying on a single-node view to make the decision. +Gossip collects and distributes the state of neighboring processes, with unresponsive nodes eventually being considered failed. +It increases the number of messages in the system, but allows info to spread more reliably.
+In addition to failure detection, gossip is used for information propagation and dissemination. + +#### Leader Election + +To reduce synchronous overhead and the number of message round-trips required to reach a decision, some algorithms elect a leader process. +The leader is responsible for executing and coordinating steps of a distributed algorithm. +Possible solutions include the Bully algorithm, Invitation algorithm, and Ring algorithm. + +#### Replication & Consistency + +The CAP conjecture describes the tradeoffs between consistency C, availability A, and partition tolerance P. +The conjecture states that a system can only choose between consistency and availability when a partition occurs. +The two most common approaches are “AP” and “CP”. +CP systems prefer failing requests to serving potentially inconsistent data. +AP systems loosen the C requirements and allow serving potentially inconsistent values during the request. + +A consistency model can be thought of as a contract between participants. +It describes what expectations clients might have about returned values in the presence of replication and concurrent accesses. +Consistency models include strict consistency, linearizability, sequential consistency, causal consistency, and eventual consistency. + +Some systems opt for eventual consistency and use tunable parameters that follow the CAP conjecture. +Strong eventual consistency is gaining traction with Conflict-Free Replicated Data Types (CRDTs). + +#### Distributed Transactions + +To make multiple (possibly remote) operations appear atomic, we need to use a class of algorithms called “atomic commitment”. +These algorithms disallow disagreements between participants by not committing if even one participant voted against it. +Two-phase commit (2PC) is the most straightforward protocol for distributed commitment, allowing multi-partition atomic updates. + +The two most common approaches for distributed transactions are Calvin and Spanner.
+Calvin sequences and batches transactions, and uses Paxos for determining which transactions make it into the current epoch (batch). +Unlike Calvin, Spanner uses 2PC over consensus groups per partition (shard). +It uses Paxos for consistent transaction log replication, 2PC for cross-shard transactions, and TrueTime for deterministic transaction ordering. +This means that multi-partition transactions have a higher cost compared to Calvin, but Spanner usually wins in terms of availability. + +#### Consensus + +Consensus algorithms in distributed systems allow multiple processes to reach an agreement on a value. +Atomic broadcast algorithms such as ZooKeeper's ZAB ensure a total order of events and the atomic delivery necessary to maintain consistency between replica states. +The two most widespread consensus algorithms are Paxos and Raft, with the latter being considered easier to reason about and implement. +In adversarial environments, Byzantine fault-tolerant algorithms like PBFT must be employed. + +--- + +### Review & Thoughts + +#### Overall Review + +I found this book to be a great deep-dive into database internals, storage engines and building blocks, and distributed systems. +The first half of the book offered unique depth into structures like B-Trees and LSM Trees. +I found the second half more interesting (and more applicable to my work), but it seems to overlap heavily with books like Designing Data-Intensive Applications. + +#### Who Would I Recommend This To? + +Naturally, I would recommend this book to anyone interested in building or modifying their own storage engines. +I would also recommend it to any software engineers tasked with tuning existing systems, or picking the right tool for the job when building from the ground up. +I would not recommend this for engineers early in their career, and/or those studying for system design interviews. +Those readers would be better served by something much higher-level like Alex Xu's System Design Interview.
+ +#### Useful Tidbits + +This book introduced me to several data structures and algorithms that I would like to study further: + +- Merkle Trees, which can be used to build trees of content hashes (picture file-change detection in a system like Git) +- Bloom Filters, which efficiently (but probabilistically) check for the inclusion of an item in a set +- CRDTs, specifically the Automerge project + +--- + +Database Internals by Alex Petrov (O'Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7
From 481bf0dd3ef730838518899868876fe0eef26833 Mon Sep 17 00:00:00 2001 From: Noah Tigner