From 2351ceee680bd04e1ed080579989321f6a4727e7 Mon Sep 17 00:00:00 2001 From: Noah Tigner Date: Mon, 9 Mar 2026 21:13:49 -0600 Subject: [PATCH] Article: DBI ch9 Notes --- README.md | 7 ++ src/assets/articles/databaseInternals.md | 2 +- .../articles/databaseInternalsChapter9.md | 102 ++++++++++++++++++ 3 files changed, 110 insertions(+), 1 deletion(-) create mode 100644 src/assets/articles/databaseInternalsChapter9.md diff --git a/README.md b/README.md index d53e30d..39c9878 100644 --- a/README.md +++ b/README.md @@ -70,6 +70,13 @@ Noah Tigner's [Portfolio Website](https://noahtigner.com) - [x] [Chapter 6 - B-Tree Variants](https://noahtigner.com/articles/database-internals-chapter-6/) - [x] [Chapter 7 - Log-Structured Storage](https://noahtigner.com/articles/database-internals-chapter-7/) - [x] [Chapter 8 - Distributed Systems Intro & Overview](https://noahtigner.com/articles/database-internals-chapter-8/) + - [x] [Chapter 9 - Failure Detection](https://noahtigner.com/articles/database-internals-chapter-9/) + - [ ] [Chapter 10 - Leader Election](https://noahtigner.com/articles/database-internals-chapter-10/) + - [ ] [Chapter 11 - Replication & Consistency](https://noahtigner.com/articles/database-internals-chapter-11/) + - [ ] [Chapter 12 - Anti-Entropy & Dissemination](https://noahtigner.com/articles/database-internals-chapter-12/) + - [ ] [Chapter 13 - Distributed Transactions](https://noahtigner.com/articles/database-internals-chapter-13/) + - [ ] [Chapter 14 - Consensus](https://noahtigner.com/articles/database-internals-chapter-14/) + - [ ] [Summary & Thoughts](https://noahtigner.com/articles//) ## Available Scripts: diff --git a/src/assets/articles/databaseInternals.md b/src/assets/articles/databaseInternals.md index 768e973..ddaa186 100644 --- a/src/assets/articles/databaseInternals.md +++ b/src/assets/articles/databaseInternals.md @@ -41,7 +41,7 @@ This is a collection of my notes on Chapter 8 - Distributed Systems Intro & Overview -- [ ] Chapter 9 - Failure Detection +- 
[x] Chapter 9 - Failure Detection - [ ] Chapter 10 - Leader Election - [ ] Chapter 11 - Replication & Consistency - [ ] Chapter 12 - Anti-Entropy & Dissemination diff --git a/src/assets/articles/databaseInternalsChapter9.md b/src/assets/articles/databaseInternalsChapter9.md new file mode 100644 index 0000000..6e8d87d --- /dev/null +++ b/src/assets/articles/databaseInternalsChapter9.md @@ -0,0 +1,102 @@ +--- +title: Database Internals Ch. 9 - Failure Detection +description: Notes on Chapter 9 of Database Internals by Alex Petrov. Failure detection in distributed systems with heartbeats, pings, and gossip. +published: March 9, 2026 +updated: March 9, 2026 +minutesToRead: 5 +path: /articles/database-internals-chapter-9/ +image: /images/database-internals.jpg +tags: + - 'reading notes' + - 'databases' + - 'distributed systems' +collection: + slug: database-internals + title: Database Internals + shortTitle: Ch. 9 - Failure Detection + shortDescription: Failure detection in distributed systems with heartbeats, pings, and gossip. + order: 9 +--- + +## Database Internals - Ch. 9 - Failure Detection + +

5 minute read • March 9, 2026

+ This post contains my notes on Chapter 9 of _Database Internals_ by Alex Petrov. These notes are intended as a reference and are not meant as a substitute for the original text. I found Timilehin Adeniran's notes on _Designing Data-Intensive Applications_ extremely helpful while reading that book, so I thought I'd try to do the same here. + +--- + +Detecting failures in asynchronous systems is difficult, since it is impossible to tell whether a process has crashed or is just running slowly (see the FLP impossibility result). Failures can occur at the link level or at the process level. There are always tradeoffs between wrongly suspecting live processes of being dead (false positives) and giving dead processes the benefit of the doubt (false negatives). + +A failure detector is a local subsystem responsible for identifying failed processes and excluding them from the algorithm. Failure detectors guarantee properties such as: + +- Liveness - a specific, intended event must occur, requiring the failure detector not to produce false negatives +- Safety - unintended events must not occur, requiring the failure detector not to produce false positives +- Completeness - the algorithm must be able to reach and complete its final step +- Efficiency - how quickly process failures are detected +- Accuracy - the measure of how many false positives and false negatives are produced (inversely related to efficiency) + +--- + +### Heartbeats and Pings + +We can query the state of a remote process by triggering one of two periodic processes: + +- Ping - checks whether the process is still alive by sending it a message and asserting that it responds within a specified timeframe +- Heartbeat - the process actively notifies its peers that it's still running by sending messages to them + +Many failure detection algorithms use heartbeats and timeouts. 
This approach can have issues with precision, and it doesn't capture how a process is seen from the perspective of other processes (e.g., it won't spot issues caused by unintended network partitions). + +#### Timeout-Free Failure Detector + +Some algorithms avoid relying on timeouts for detecting failures. For example, Heartbeat is a timeout-free failure detector that operates under asynchronous system assumptions. The algorithm assumes that all connections are "fair paths" (containing only fair-loss links), and that each process is aware of all other processes in the network. + +Each process maintains a list of neighbors and the counters associated with them. Processes start up by transmitting heartbeat messages to all neighbors. These messages contain the path that the heartbeat has traveled so far. When a process receives a heartbeat message, it appends itself to the message's path and forwards it to all of its neighbors that aren't already on the path. Heartbeat counters represent a global and normalized view of the system. If we're not careful, though, the algorithm can produce false positives. + +#### Outsourced Heartbeats + +Alternatively, outsourced heartbeats can improve reliability by using info about a process's liveness from the perspective of its neighbors. This does not require all processes to be aware of all other processes, only of a subset of connected peers. This accounts for both direct and indirect reachability and allows us to make more accurate decisions. + +--- + +### Phi-Accrual Failure Detector + +Phi-accrual produces a continuous scale instead of the binary alive/dead measurement of other detection methods. It captures the probability of a process's crash by maintaining a sliding window that collects the arrival times of recent heartbeats from peers. This info is used to approximate the arrival time of the next heartbeat, which is then compared against the actual arrival time to compute the "suspicion level". 
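To make the suspicion level concrete, here's a minimal sketch in Python (my own illustration, not code from the book): it approximates the observed inter-arrival times with a normal distribution and reports phi as the negative log-probability that a heartbeat this overdue would still arrive. The class name, window size, and threshold are all hypothetical choices.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Minimal phi-accrual sketch: suspicion grows continuously as a
    heartbeat becomes overdue, instead of flipping from alive to dead."""

    def __init__(self, window_size=100, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
        self.last_arrival = None
        self.threshold = threshold  # phi above this => mark the process as suspected

    def heartbeat(self, now):
        """Record a heartbeat arrival at local time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability that a heartbeat this
        overdue (or more) would still arrive, under a normal approximation
        of the observed inter-arrival times."""
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)  # avoid dividing by zero on a flat window
        z = ((now - self.last_arrival) - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))  # P(interval >= elapsed)
        return -math.log10(max(p_later, 1e-12))

    def suspect(self, now):
        return self.phi(now) > self.threshold
```

With heartbeats arriving roughly every second, phi stays near zero shortly after the last arrival and climbs the longer the silence lasts; the interpretation step is then just a comparison against the threshold.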
+ The phi-accrual detector can be viewed as a collection of three subsystems: + +- Monitoring - collects liveness info through pings, heartbeats, or other request-response sampling +- Interpretation - decides whether a process should be marked as suspicious +- Action - executes a callback whenever a process is marked as suspicious + +--- + +### Gossip and Failure Detection + +Gossip provides another approach that avoids relying on a single node's view to make the decision. Gossip collects and distributes the state of neighboring processes. Each member stores a list of other members, their heartbeat counters, and the timestamp when each counter was last updated. Periodically, each member increments its own counter and distributes its list to a random neighbor. Upon receipt, the lists are merged and the counters are updated. Nodes also periodically check the list of states and heartbeat counters; if any node hasn't updated its counter for long enough, it is considered failed. This allows us to detect crashed nodes as well as nodes that are unreachable by any other cluster member. Gossip increases the number of messages in the system, but allows info to spread more reliably. + +--- + +### Reversing Failure Detection Problem Statement + +Since propagating info about failures is not always possible, and propagating it by notifying every member is expensive, one approach called FUSE (failure notification service) focuses on reliable and cheap failure propagation that works even in cases of network partitions. To detect process failures, processes are organized into groups. If one group member becomes unavailable, all participants within the group fail. Every time a process failure is detected, it is propagated to the whole group. This allows detection in the presence of arbitrary partitions or node failures. This approach uses the absence of communication as a means of propagating info (when one process in the group fails, all processes within the group stop sending heartbeats). 
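The gossip-style state exchange described in the Gossip section above - per-peer heartbeat counters merged by keeping the larger value, with staleness judged against a local timestamp - can be sketched as follows. This is my own illustration under simplified assumptions: the class name, the `fail_after` window, and the explicit clock parameter are all hypothetical.

```python
class GossipMember:
    """Sketch of gossip-based failure detection state. Each member keeps,
    for every known peer, that peer's heartbeat counter and the local
    time at which the counter last increased."""

    def __init__(self, name, fail_after=10.0, now=0.0):
        self.name = name
        self.fail_after = fail_after  # seconds without counter growth => failed
        self.table = {name: (0, now)}  # peer -> (heartbeat counter, last-update time)

    def tick(self, now):
        """Periodic step: increment our own heartbeat counter."""
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, now)

    def merge(self, remote_table, now):
        """Merge a table received from a random neighbor: an entry is
        refreshed only if the remote counter is strictly higher."""
        for peer, (counter, _) in remote_table.items():
            local_counter, _ts = self.table.get(peer, (-1, 0.0))
            if counter > local_counter:
                self.table[peer] = (counter, now)

    def failed_peers(self, now):
        """Peers whose counter hasn't grown for `fail_after` seconds."""
        return [peer for peer, (_, ts) in self.table.items()
                if now - ts > self.fail_after]
```

Because counters travel along every gossip path, a member can learn that a peer is alive even when it can't reach that peer directly; only a peer whose counter stops growing everywhere eventually times out.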
+ +--- + +### Other Resources + +Ben Dicken has a video covering pinging, gossip, and their associated tradeoffs. + + + +--- + +

Database Internals by Alex Petrov (O'Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7