Merged
2 changes: 1 addition & 1 deletion README.md
@@ -76,7 +76,7 @@ Noah Tigner's [Portfolio Website](https://noahtigner.com)
- [x] [Chapter 12 - Anti-Entropy & Dissemination](https://noahtigner.com/articles/database-internals-chapter-12/)
- [x] [Chapter 13 - Distributed Transactions](https://noahtigner.com/articles/database-internals-chapter-13/)
- [x] [Chapter 14 - Consensus](https://noahtigner.com/articles/database-internals-chapter-14/)
-- [ ] [Summary & Thoughts](https://noahtigner.com/articles/<TODO>/)
+- [x] [Summary & Thoughts](https://noahtigner.com/articles/database-internals-summary/)

## Available Scripts:

4 changes: 4 additions & 0 deletions src/assets/articles/databaseInternals.md
@@ -48,6 +48,10 @@ This is a collection of my notes on <a href="https://www.oreilly.com/library/vie
- [x] <a href="https://noahtigner.com/articles/database-internals-chapter-13/" target="_blank" rel="noopener">Chapter 13 - Distributed Transactions</a>
- [x] <a href="https://noahtigner.com/articles/database-internals-chapter-14/" target="_blank" rel="noopener">Chapter 14 - Consensus</a>

+#### Summary, Review, and Flash Cards
+
+- [x] <a href="https://noahtigner.com/articles/database-internals-summary/" target="_blank" rel="noopener">Summary & Review</a>
+
---

### Motivation
12 changes: 10 additions & 2 deletions src/assets/articles/databaseInternalsChapter1.md
@@ -2,7 +2,7 @@
title: Database Internals Ch. 1 - Storage Engines Intro & Overview
description: Notes on Chapter 1 of Database Internals by Alex Petrov. OLTP vs. OLAP, Memory vs. Disk-Based Storage, Row vs. Column Orientation, Indexing, etc.
published: January 31, 2026
-updated: March 6, 2026
+updated: March 29, 2026
minutesToRead: 5
path: /articles/database-internals-chapter-1/
image: /images/database-internals.jpg
@@ -26,15 +26,19 @@ This post contains my notes on Chapter 1 of <a href="https://www.oreilly.com/lib

---

+### Introduction
+
Database Management Systems typically fall into one of three buckets:

- Online Transaction Processing (OLTP), which handles a high volume of user-facing requests. Queries are often predefined and short-lived.
- Online Analytical Processing (OLAP), which handles complex aggregations used for analytics, data warehousing, etc. Best for complex, long-running, ad-hoc queries.
- Hybrid Transactional and Analytical Processing (HTAP), which are unified systems that mix OLTP and OLAP techniques.

+---
+
### DBMS Architecture

-DBMS use client/server architectures where applications are clients and nodes (db instances) are the servers. Concerns are typically separated as follows:
+DBMS use client/server architectures where applications are clients and nodes (database instances) are the servers. Concerns are typically separated as follows:

- Client requests (queries) arrive through the transport subsystem
- The transport subsystem hands queries to the query processor, which parses, interprets, and validates them. Later, access control checks are performed
@@ -47,6 +51,8 @@ DBMS use client/server architectures where applications are clients and nodes (d
- A buffer manager, which caches data pages in-memory
- A recovery manager, which maintains the operations logs and handles recoveries

+---
+
### Memory vs. Disk-Based DBMS

In-memory, or "main memory," systems store data primarily in memory and use disks for recovery and logging. Disk-based systems hold most data on disk and use memory for caching. Memory is much faster than disk, and although it is getting cheaper, it is still much more expensive. Memory is also volatile (less durable).
@@ -117,4 +123,6 @@ Data immutability means that records must be append-only or copy-on-write (repla

The final decision is whether or not records should be stored in keyed order on disk, with tradeoffs in both cases.

+---
+
<p class="subtitle"><i>Database Internals</i> by Alex Petrov (O'Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7</p>
6 changes: 3 additions & 3 deletions src/assets/articles/databaseInternalsChapter10.md
@@ -2,8 +2,8 @@
title: Database Internals Ch. 10 - Leader Election
description: Notes on Chapter 10 of Database Internals by Alex Petrov. Leader election strategies like the Bully Algorithm, Invitation Algorithm, and Ring Algorithm.
published: March 11, 2026
-updated: March 11, 2026
-minutesToRead: 5
+updated: March 29, 2026
+minutesToRead: 4
path: /articles/database-internals-chapter-10/
image: /images/database-internals.jpg
tags:
@@ -20,7 +20,7 @@ collection:

## Database Internals - Ch. 10 - Leader Election

-<p class="subtitle">5 minute read • March 11, 2026</p>
+<p class="subtitle">4 minute read • March 11, 2026</p>

This post contains my notes on Chapter 10 of <a href="https://www.oreilly.com/library/view/database-internals/9781492040330/" target="_blank" rel="noopener">_Database Internals_</a> by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found <a href="https://timilearning.com/posts/ddia/notes/" target="_blank" rel="noopener">Timilehin Adeniran's notes</a> on <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/" target="_blank" rel="noopener">_Designing Data-Intensive Applications_</a> extremely helpful while reading that book, so I thought I'd try to do the same here.

18 changes: 9 additions & 9 deletions src/assets/articles/databaseInternalsChapter11.md
@@ -1,8 +1,8 @@
---
title: Database Internals Ch. 11 - Replication and Consistency
-description: Notes on Chapter 11 of Database Internals by Alex Petrov. Replication and consistency in distributed systems, CAP, and CDRTs.
+description: Notes on Chapter 11 of Database Internals by Alex Petrov. Replication and consistency in distributed systems, CAP, and CRDTs.
published: March 18, 2026
-updated: March 18, 2026
+updated: March 29, 2026
minutesToRead: 10
path: /articles/database-internals-chapter-11/
image: /images/database-internals.jpg
@@ -14,7 +14,7 @@ collection:
slug: database-internals
title: Database Internals
shortTitle: Ch. 11 - Replication and Consistency
-shortDescription: Replication and consistency in distributed systems, CAP, and CDRTs.
+shortDescription: Replication and consistency in distributed systems, CAP, and CRDTs.
order: 11
---

@@ -40,7 +40,7 @@ To make the system highly available, we need to design it in a way that allows h

### Infamous CAP

-Availability measures the system's ability to respond to every request successfully. We would also like each operation to be (<a href="https://noahtigner.com/articles/database-internals-chapter-5/#introduction" target="_blank" rel="noopener">atomically</a> / <a href="https://noahtigner.com/articles/database-internals-chapter-11/#linearizability" target="_blank" rel="noopener">linearizably</a>) consistent. Ideally, we would like to achieve both availability and consistency while tolerating network partitions. The CAP conjecture describes the tradeoffs between consistency \(C), availability (A), and partition tolerance (P). The conjecture states that at most two of the three can be achieved.
+Availability measures the system's ability to respond to every request successfully. We would also like each operation to be (<a href="https://noahtigner.com/articles/database-internals-chapter-5/#introduction" target="_blank" rel="noopener">atomically</a> / <a href="https://noahtigner.com/articles/database-internals-chapter-11/#linearizability" target="_blank" rel="noopener">linearizably</a>) consistent. Ideally, we would like to achieve both availability and consistency while tolerating network partitions. The CAP conjecture describes the tradeoffs between consistency <em>C</em>, availability <em>A</em>, and partition tolerance <em>P</em>. The conjecture states that a system can only choose between consistency and availability when a partition occurs.

The two most common approaches are "AP" and "CP". CP systems prefer failing requests to serving potentially inconsistent data. AP systems loosen the C requirements and allow serving potentially inconsistent values during the request.

Expand All @@ -64,7 +64,7 @@ From the client's perspective, distributed systems act as if storage is shared,
Registers can be accessed by multiple readers and writers simultaneously. When it comes to concurrent ops, there are three types of registers:

- Safe - reads to a safe register may return arbitrary values within the register's range during a concurrent write op
-- Regular - read ops return the value of the most recently completed write, or the value of the write that overlaps with the current reade op
+- Regular - read ops return the value of the most recently completed write, or the value of the write that overlaps with the current read op
- Atomic - every write op has a single moment before which every read returns an old value and after which every read returns a new value. This guarantees linearizability.

---
@@ -123,7 +123,7 @@ Following CAP principles, we can tune our eventual consistency with three parame
- Write consistency <em>W</em> - the number of nodes that have to acknowledge a write for it to succeed
- Read consistency <em>R</em> - the number of nodes that have to respond to a read operation for it to succeed

-Choosing levels where <em>R + W > N</em> gaurantees that the most recently written value is returned. Write-heavy systems sometimes pick <em>W = 1</em> and <em>R = N</em>, which allows writes to be acknowledged by just one node, but requires all replicas to be available for reads. Increasing <em>W</em> or <em>R</em> increases latency and raises requirements for node availability. Decreasing them improves system availability while sacrificing consistency.
+Choosing levels where <em>R + W > N</em> helps reduce the chance of stale reads by forcing read and write quorums to overlap. Write-heavy systems sometimes pick <em>W = 1</em> and <em>R = N</em>, which allows writes to be acknowledged by just one node, but requires all replicas to be available for reads. Increasing <em>W</em> or <em>R</em> increases latency and raises requirements for node availability. Decreasing them improves system availability while sacrificing consistency.

A level of <em>floor(N / 2) + 1</em> is called a "quorum", or majority of votes. In a system with <em>2f + 1</em> nodes, the system can keep responding even when up to <em>f</em> become unavailable. This does not, however, guarantee monotonicity in cases of incomplete writes.
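The overlap guarantee from <em>R + W > N</em> can be checked by brute force. The sketch below is my own illustration, not from the book (the function name `quorums_overlap` and the replica counts are assumptions): it enumerates every possible write quorum and read quorum and verifies that each pair shares at least one replica.

```python
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every write quorum of size w intersects
    every read quorum of size r among n replicas."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)  # shared replica between the two quorums
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

print(quorums_overlap(5, 3, 3))  # True:  R + W = 6 > N = 5
print(quorums_overlap(5, 2, 3))  # False: R + W = 5 <= N, quorums can be disjoint
```

With <em>N = 5</em>, the majority quorum is <em>floor(5 / 2) + 1 = 3</em>, which is exactly why <em>W = R = 3</em> always overlaps.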

Expand All @@ -140,11 +140,11 @@ Witness replicas help reduce storage costs while preserving consistency.

---

-### Strong Eventual Consistency and CDRTs
+### Strong Eventual Consistency and CRDTs

-Under strong eventual consistency, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state. Under some conditions, we can relax our consistency requirements by allowing operations to preserve additional state that allows the diverged states to be reconciled (merged) after execution. This is often implemented with Conflict-Free Replicated Data Types (CDRTs), as in the case of Redis. CDRTs are specialized data structures that preclude the existence of conflicts and allow ops to be applied in any order without changing the results. They are extremely useful in distributed systems and are often used in eventually consistent systems.
+Under strong eventual consistency, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state. Under some conditions, we can relax our consistency requirements by allowing operations to preserve additional state that allows the diverged states to be reconciled (merged) after execution. This is often implemented with Conflict-Free Replicated Data Types (CRDTs), as in the case of Redis. CRDTs are specialized data structures that preclude the existence of conflicts and allow ops to be applied in any order without changing the results. They are extremely useful in distributed systems and are often used in eventually consistent systems.

-The simplest CDRTs are operations-based Commutative Replicated Data Types (CmRDTs), which require ops to be side-effect free, commutative, and causally ordered. Another example is the unordered Grow-Only Set (G-Set), which supports additions, removals, merges, etc. A more complex example is Martin Kleppmann's conflict-free replicated JSON data type, which allows modifications on deeply-nested JSON documents.
+The simplest CRDTs are operations-based Commutative Replicated Data Types (CmRDTs), which require ops to be side-effect free, commutative, and causally ordered. Another example is the unordered Grow-Only Set (G-Set), which supports additions and merges. A more complex example is Martin Kleppmann's conflict-free replicated JSON data type, which allows modifications on deeply nested JSON documents.
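To make the convergence property concrete, here is a minimal, hypothetical sketch of a G-Set (usually described as a state-based CRDT whose merge is set union; the class and method names are my own):

```python
class GSet:
    """Grow-Only Set: the only update is add; merge is set union,
    which is commutative, associative, and idempotent."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)  # local update, no coordination needed

    def merge(self, other: "GSet") -> "GSet":
        merged = GSet()
        merged.items = self.items | other.items  # union of both replicas
        return merged

# Two replicas accept writes independently, then converge on merge,
# regardless of merge order.
a, b = GSet(), GSet()
a.add("x"); a.add("y")
b.add("y"); b.add("z")
print(a.merge(b).items == b.merge(a).items)  # True
```

Because union commutes and is idempotent, replicas reach the same state no matter when or in what order updates arrive, which is exactly the strong eventual consistency property described above.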

---

6 changes: 3 additions & 3 deletions src/assets/articles/databaseInternalsChapter12.md
@@ -1,8 +1,8 @@
---
-title: Database Internals Ch. 12 - Anti-Entropy and Dissemination
+title: Database Internals Ch. 12 - Anti-Entropy & Dissemination
description: Notes on Chapter 12 of Database Internals by Alex Petrov. Anti-Entropy and Dissemination in distributed systems, including read repair, hinted handoff, Merkle Trees, and gossip dissemination.
published: March 21, 2026
-updated: March 21, 2026
+updated: March 29, 2026
minutesToRead: 7
path: /articles/database-internals-chapter-12/
image: /images/database-internals.jpg
@@ -13,7 +13,7 @@ tags:
collection:
slug: database-internals
title: Database Internals
-shortTitle: Ch. 12 - Anti-Entropy and Dissemination
+shortTitle: Ch. 12 - Anti-Entropy & Dissemination
shortDescription: Anti-Entropy and Dissemination in distributed systems, including read repair, hinted handoff, Merkle Trees, and gossip dissemination.
order: 12
---
6 changes: 3 additions & 3 deletions src/assets/articles/databaseInternalsChapter13.md
@@ -2,7 +2,7 @@
title: Database Internals Ch. 13 - Distributed Transactions
description: Notes on Chapter 13 of Database Internals by Alex Petrov. Distributed Transactions, including two-phase commit, Spanner, partitioning, sharding, consistent hashing, and coordination avoidance.
published: March 26, 2026
-updated: March 26, 2026
+updated: March 29, 2026
minutesToRead: 9
path: /articles/database-internals-chapter-13/
image: /images/database-internals.jpg
@@ -133,7 +133,7 @@ Clients then route requests based on the routing key.
This is typically called "sharding", where every replica set acts as the single source for a subset of data.

We want to distribute reads and writes as evenly as possible, sizing partitions appropriately.
-In order to maintain balance, the DB also has to repartition the data when nodes are added or removed.
+In order to maintain balance, the database also has to repartition the data when nodes are added or removed.
In order to reduce range hot-spotting, some DBs use a hash of the value as the routing key.
A naive approach is to map keys to nodes with something like `hash(v) % N`, where N is the number of nodes.
The downside of this is that if the number of nodes changes, the system is immediately unbalanced and needs to be repartitioned.
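The scale of that imbalance is easy to demonstrate. In this hypothetical sketch (the key names and node counts are arbitrary, and `crc32` stands in for whatever hash the system actually uses), adding a single node remaps the vast majority of keys:

```python
import zlib

def node_for(key: str, n_nodes: int) -> int:
    # Naive placement: hash(v) % N
    return zlib.crc32(key.encode()) % n_nodes

keys = [f"user:{i}" for i in range(10_000)]

# Grow the cluster from 4 to 5 nodes and count the keys that move.
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed nodes")  # roughly 80%
```

A key stays put only when its hash yields the same remainder mod 4 and mod 5, which happens for roughly 4 out of every 20 hash values; consistent hashing exists precisely to shrink this movement to about <em>1 / N</em> of the keys.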
@@ -171,7 +171,7 @@ Many other DBMSs and <a href="https://noahtigner.com/articles/database-internals

### Coordination Avoidance

-Invariant Confluence (I-Confluence) is a property that ensures that two invariant-valid but diverged DB states can be merged into a single valid DB state.
+Invariant Confluence (I-Confluence) is a property that ensures that two invariant-valid but diverged database states can be merged into a single valid database state.
Because any two valid states can be merged into a valid state, I-Confluent ops can be executed without any additional coordination, which significantly improves performance and scalability potential.

A system model that allows coordinator avoidance has to guarantee the following properties:
4 changes: 2 additions & 2 deletions src/assets/articles/databaseInternalsChapter14.md
@@ -29,7 +29,7 @@ This post contains my notes on Chapter 14 of <a href="https://www.oreilly.com/li
### Introduction

Consensus algorithms in distributed systems allow multiple processes to reach an agreement on a value.
-<a href="https://noahtigner.com/articles/database-internals-chapter-8/#flp-impossibility" target="_blank" rel="noopener">FLP Impossibility</a> shows that it is impossible to guarantee consensus in a completely asynchronous system in unbounded time.
+<a href="https://noahtigner.com/articles/database-internals-chapter-8/#flp-impossibility" target="_blank" rel="noopener">FLP Impossibility</a> shows that deterministic consensus cannot guarantee both safety and termination in a completely asynchronous system if even one process may fail.
We've discussed the <a href="https://noahtigner.com/articles/database-internals-chapter-9/#introduction" target="_blank" rel="noopener">tradeoffs between failure detection accuracy and speed</a>.
Consensus algorithms assume an async model and guarantee safety while using an external failure detection algorithm to guarantee liveness.
Because failure detection is not always fully accurate, there will be some situations where the algorithm waits for a process that is incorrectly accused of being faulty.
@@ -80,7 +80,7 @@ It uses a hierarchical distributed key-value store, which is used to ensure a to

Processes in ZAB are either a follower or a (temporary) leader.
The leader executes algorithm steps, broadcasts messages to followers, and establishes the event order.
-All writes and reads of the most recent values are routed to the leader.
+All writes, and reads that require the most recent values, are routed to the leader.

The protocol timeline is split into epochs, with one leader per epoch.
The process starts by using <a href="https://noahtigner.com/articles/database-internals-chapter-10/" target="_blank" rel="noopener">leader election</a> to find a <em>prospective</em> leader.