A disk-backed B-Tree implementation in Python featuring fixed-size binary block storage, persistent metadata, an LRU buffer pool, deletion with rebalancing, structural validation, unit tests, and search-performance benchmarking.
This project was built as a real-world data structures and storage systems project. It demonstrates how database-style indexes can persist tree nodes on disk while reducing repeated disk access through buffer pool caching.
- Persistent B-Tree stored in a binary file
- Fixed-size block I/O through
DiskManager - Serializable B-Tree nodes with block-based child references
- Insert, search, update, delete, and sorted traversal
- Standard B-Tree split, borrow, merge, and root-shrink operations
- LRU
BufferPoolManagerwith dirty-page flushing - Structural validation for B-Tree invariants
- Unit test suite with 39 passing tests
- Benchmark comparing search with and without buffer pool
Benchmark configuration:
Items inserted: 20,000
Search operations: 50,000
Minimum degree: 32
Block size: 4096 bytes
Buffer capacity: 256 blocks
Measured result on local machine:
| Configuration | Total Search Time | Avg/Search | Found Keys |
|---|---|---|---|
| Without Buffer Pool | 11.176365 s | 223.527 µs | 25,000 |
| With Buffer Pool | 2.634216 s | 52.684 µs | 25,000 |
Buffer pool statistics:
Cache hits: 138,791
Cache misses: 10,644
Disk reads: 10,644
Disk writes during search: 0
Hit rate: 92.88%
Speedup: 4.24x
The benchmark shows that the buffer pool significantly improves repeated search performance by caching frequently accessed root and internal nodes.
persistent_btree_project/
├── README.md
├── requirements.txt
├── benchmark.py
├── btree/
│ ├── __init__.py
│ ├── disk_manager.py
│ ├── buffer_pool.py
│ ├── btree_node.py
│ └── persistent_btree.py
└── tests/
├── test_disk_manager.py
├── test_buffer_pool.py
├── test_btree_node.py
└── test_persistent_btree.py
This project uses only Python standard library modules.
Recommended version:
Python 3.10+
No external dependencies are required.
Clone the repository:
git clone https://github.com/FaizarM/persistent-btree-project.git
cd persistent-btree-projectRun all tests:
python -m unittest discover testsRun the benchmark:
python benchmark.pyOn Windows, if python is not recognized, use:
py -m unittest discover tests
py benchmark.pyDiskManager handles low-level fixed-size block storage in a binary file.
Each block is addressed by a block ID:
offset = block_id * block_size
Responsibilities:
- Allocate new disk blocks
- Read fixed-size blocks
- Write fixed-size blocks
- Pad small writes with zero bytes
- Reject oversized writes
- Check block existence
- Preserve data after reopening storage
This class simulates the page/block layer used in databases and file systems.
BTreeNode represents a single B-Tree node.
Each node stores:
block_id: physical location in disk storageis_leaf: whether the node has childrenkeys: sorted integer keysvalues: string values associated with keyschildren: child block IDs for internal nodes
Nodes are serialized to bytes using JSON with a 4-byte length prefix. The length prefix allows the deserializer to ignore zero padding in fixed-size disk blocks.
BufferPoolManager caches disk blocks in memory using an LRU replacement policy.
Features:
- Cache hits and misses
- Dirty block tracking
- Flush on eviction
- Flush all pending writes on close
- Bounded capacity
- Disk read/write counters for benchmarking
The buffer pool reduces repeated disk reads, especially for root and upper-level B-Tree nodes that are accessed frequently during search.
PersistentBTree is the main disk-backed B-Tree implementation.
Supported operations:
insert(key, value)search(key)delete(key)traverse()validate()close()
The tree persists its root block ID and configuration in a metadata file. This allows the B-Tree to be closed and reopened while preserving all data.
The implementation uses a standard B-Tree with minimum degree t.
For minimum degree t:
- Maximum keys per node:
2t - 1 - Minimum keys per non-root node:
t - 1 - Internal node with
kkeys hask + 1children - Keys in each node remain sorted
- All leaf nodes remain at the same depth
Search starts from the root and descends to the appropriate child block until the key is found or a leaf is reached.
Steps:
- Scan the current node for the first key greater than or equal to the target.
- Return the value if the key matches.
- Return
Noneif the current node is a leaf. - Otherwise, recursively search the correct child block.
Insertion follows the standard non-full insertion strategy.
Steps:
- Update the value if the key already exists.
- If the root is full, allocate a new root and split the old root.
- Descend into a non-full child.
- Split full children before descending.
- Insert the key-value pair into a leaf.
- Persist modified nodes to disk.
Deletion supports all major B-Tree deletion cases:
- Delete directly from a leaf
- Replace internal keys with predecessor
- Replace internal keys with successor
- Borrow from the left sibling
- Borrow from the right sibling
- Merge siblings when borrowing is impossible
- Shrink the root when it becomes empty
This preserves B-Tree balance after deletion.
The implementation stores data in two files:
btree_storage.bin # fixed-size binary block storage
btree_storage.meta # root block ID and configuration metadata
Each B-Tree node is stored as a block in btree_storage.bin.
The metadata file stores:
{
"root_block_id": 0,
"minimum_degree": 3,
"block_size": 4096
}Because internal nodes store child block IDs, the full tree can be reconstructed by loading the root block and following child pointers.
The validate() method checks whether the B-Tree satisfies key invariants.
Validation includes:
- Keys are sorted in every node
- Key count stays within B-Tree bounds
- Leaf nodes have no children
- Internal nodes have
len(keys) + 1children - Subtree key ranges are valid
- All leaves are at the same depth
This is especially useful after complex insertion and deletion workloads.
The project uses Python's built-in unittest framework.
Run tests:
python -m unittest discover testsCurrent result:
Ran 39 tests
OK
Test coverage includes:
- Fixed-size block allocation, reads, and writes
- Persistence after reopening disk storage
- Buffer pool cache hits, dirty writes, flushing, and LRU eviction
- Node serialization and validation
- Insert and search correctness
- Root splitting
- Updating existing keys
- Deletion from leaf and internal nodes
- Borrowing, merging, and root shrinking
- Persistence after reopening the B-Tree
- Structural validation after large insert/delete workloads
- Running with and without buffer pool
Let:
n = number of keys
t = minimum degree
h = B-Tree height
The height of the B-Tree is:
h = O(log_t n)
Each node contains at most 2t - 1 keys.
| Operation | Complexity |
|---|---|
| Search | O(t log_t n) |
| Insert | O(t log_t n) |
| Delete | O(t log_t n) |
| Traversal | O(n) |
| Space | O(n) |
When t is treated as a fixed configuration value, search, insert, and delete are commonly expressed as:
O(log n)
Buffer pool memory usage:
O(buffer_capacity * block_size)
A normal Python tree would use object references, but persistent storage requires stable disk addresses.
Solution:
- Store each node in a fixed-size binary block
- Reference child nodes by block ID
- Maintain root location in metadata
Fixed-size blocks contain padding after the actual serialized node data.
Solution:
- Prefix serialized node payloads with a 4-byte length field
- Decode only the actual payload length
- Ignore trailing zero padding
Deletion can cause node underflow and requires careful rebalancing.
Solution:
- Implement predecessor and successor replacement
- Borrow from siblings when possible
- Merge children when necessary
- Shrink the root after deletion
Repeated searches access the same upper-level nodes many times.
Solution:
- Add an LRU buffer pool
- Keep frequently used blocks in memory
- Flush dirty blocks on eviction and close
- Track cache statistics for benchmarking
This implementation is educational and optimized for clarity rather than production database performance.
Current limitations:
- Integer keys only
- String values only
- JSON serialization is readable but not space-optimal
- Deleted disk blocks are not reused through a free list
- No concurrency control
- No write-ahead log or crash recovery
- No range-query API beyond full sorted traversal
- Duplicate keys are not stored; inserting an existing key updates its value
Possible extensions:
- Compact binary serialization
- Free-list management for deleted blocks
- Range query API
- Bulk loading
- Configurable key and value types
- Write-ahead logging
- Concurrency-safe operations
- More detailed I/O profiling
This project implements a persistent B-Tree storage engine in Python with fixed-size block storage and an LRU buffer pool. It supports insertion, search, deletion, persistence, validation, and benchmarking.
The benchmark demonstrates that buffer pool caching improves repeated search workloads by reducing disk reads. The test suite verifies correctness across storage, caching, serialization, B-Tree operations, persistence, and structural invariants.