Persistent B-Tree Storage Engine

A disk-backed B-Tree implementation in Python featuring fixed-size binary block storage, persistent metadata, an LRU buffer pool, deletion with rebalancing, structural validation, unit tests, and search-performance benchmarking.

This project was built as a real-world data structures and storage systems project. It demonstrates how database-style indexes can persist tree nodes on disk while reducing repeated disk access through buffer pool caching.

Highlights

Persistent B-Tree stored in a binary file
Fixed-size block I/O through DiskManager
Serializable B-Tree nodes with block-based child references
Insert, search, update, delete, and sorted traversal
Standard B-Tree split, borrow, merge, and root-shrink operations
LRU BufferPoolManager with dirty-page flushing
Structural validation for B-Tree invariants
Unit test suite with 39 passing tests
Benchmark comparing search with and without buffer pool

Benchmark Result

Benchmark configuration:

Items inserted: 20,000
Search operations: 50,000
Minimum degree: 32
Block size: 4096 bytes
Buffer capacity: 256 blocks

Measured result on local machine:

Configuration	Total Search Time	Avg/Search	Found Keys
Without Buffer Pool	11.176365 s	223.527 µs	25,000
With Buffer Pool	2.634216 s	52.684 µs	25,000

Buffer pool statistics:

Cache hits: 138,791
Cache misses: 10,644
Disk reads: 10,644
Disk writes during search: 0
Hit rate: 92.88%
Speedup: 4.24x

The benchmark shows that the buffer pool significantly improves repeated search performance by caching frequently accessed root and internal nodes.

Project Structure

persistent_btree_project/
├── README.md
├── requirements.txt
├── benchmark.py
├── btree/
│   ├── __init__.py
│   ├── disk_manager.py
│   ├── buffer_pool.py
│   ├── btree_node.py
│   └── persistent_btree.py
└── tests/
    ├── test_disk_manager.py
    ├── test_buffer_pool.py
    ├── test_btree_node.py
    └── test_persistent_btree.py

Requirements

This project uses only Python standard library modules.

Recommended version:

Python 3.10+

No external dependencies are required.

Quick Start

Clone the repository:

git clone https://github.com/FaizarM/persistent-btree-project.git
cd persistent-btree-project

Run all tests:

python -m unittest discover tests

Run the benchmark:

python benchmark.py

On Windows, if python is not recognized, use:

py -m unittest discover tests
py benchmark.py

Core Components

DiskManager

DiskManager handles low-level fixed-size block storage in a binary file.

Each block is addressed by a block ID:

offset = block_id * block_size

Responsibilities:

Allocate new disk blocks
Read fixed-size blocks
Write fixed-size blocks
Pad small writes with zero bytes
Reject oversized writes
Check block existence
Preserve data after reopening storage

This class simulates the page/block layer used in databases and file systems.

BTreeNode

BTreeNode represents a single B-Tree node.

Each node stores:

block_id: physical location in disk storage
is_leaf: whether the node has children
keys: sorted integer keys
values: string values associated with keys
children: child block IDs for internal nodes

Nodes are serialized to bytes using JSON with a 4-byte length prefix. The length prefix allows the deserializer to ignore zero padding in fixed-size disk blocks.

BufferPoolManager

BufferPoolManager caches disk blocks in memory using an LRU replacement policy.

Features:

Cache hits and misses
Dirty block tracking
Flush on eviction
Flush all pending writes on close
Bounded capacity
Disk read/write counters for benchmarking

The buffer pool reduces repeated disk reads, especially for root and upper-level B-Tree nodes that are accessed frequently during search.

PersistentBTree

PersistentBTree is the main disk-backed B-Tree implementation.

Supported operations:

insert(key, value)
search(key)
delete(key)
traverse()
validate()
close()

The tree persists its root block ID and configuration in a metadata file. This allows the B-Tree to be closed and reopened while preserving all data.

B-Tree Algorithm

The implementation uses a standard B-Tree with minimum degree t.

For minimum degree t:

Maximum keys per node: 2t - 1
Minimum keys per non-root node: t - 1
Internal node with k keys has k + 1 children
Keys in each node remain sorted
All leaf nodes remain at the same depth

Search

Search starts from the root and descends to the appropriate child block until the key is found or a leaf is reached.

Steps:

Scan the current node for the first key greater than or equal to the target.
Return the value if the key matches.
Return None if the current node is a leaf.
Otherwise, recursively search the correct child block.

Insert

Insertion follows the standard non-full insertion strategy.

Steps:

Update the value if the key already exists.
If the root is full, allocate a new root and split the old root.
Descend into a non-full child.
Split full children before descending.
Insert the key-value pair into a leaf.
Persist modified nodes to disk.

Delete

Deletion supports all major B-Tree deletion cases:

Delete directly from a leaf
Replace internal keys with predecessor
Replace internal keys with successor
Borrow from the left sibling
Borrow from the right sibling
Merge siblings when borrowing is impossible
Shrink the root when it becomes empty

This preserves B-Tree balance after deletion.

Persistence Design

The implementation stores data in two files:

btree_storage.bin   # fixed-size binary block storage
btree_storage.meta  # root block ID and configuration metadata

Each B-Tree node is stored as a block in btree_storage.bin.

The metadata file stores:

{
  "root_block_id": 0,
  "minimum_degree": 3,
  "block_size": 4096
}

Because internal nodes store child block IDs, the full tree can be reconstructed by loading the root block and following child pointers.

Structural Validation

The validate() method checks whether the B-Tree satisfies key invariants.

Validation includes:

Keys are sorted in every node
Key count stays within B-Tree bounds
Leaf nodes have no children
Internal nodes have len(keys) + 1 children
Subtree key ranges are valid
All leaves are at the same depth

This is especially useful after complex insertion and deletion workloads.

Testing

The project uses Python's built-in unittest framework.

Run tests:

python -m unittest discover tests

Current result:

Ran 39 tests
OK

Test coverage includes:

Fixed-size block allocation, reads, and writes
Persistence after reopening disk storage
Buffer pool cache hits, dirty writes, flushing, and LRU eviction
Node serialization and validation
Insert and search correctness
Root splitting
Updating existing keys
Deletion from leaf and internal nodes
Borrowing, merging, and root shrinking
Persistence after reopening the B-Tree
Structural validation after large insert/delete workloads
Running with and without buffer pool

Complexity Analysis

Let:

n = number of keys
t = minimum degree
h = B-Tree height

The height of the B-Tree is:

h = O(log_t n)

Each node contains at most 2t - 1 keys.

Operation	Complexity
Search	`O(t log_t n)`
Insert	`O(t log_t n)`
Delete	`O(t log_t n)`
Traversal	`O(n)`
Space	`O(n)`

When t is treated as a fixed configuration value, search, insert, and delete are commonly expressed as:

O(log n)

Buffer pool memory usage:

O(buffer_capacity * block_size)

Challenges and Solutions

Disk-based storage instead of in-memory references

A normal Python tree would use object references, but persistent storage requires stable disk addresses.

Solution:

Store each node in a fixed-size binary block
Reference child nodes by block ID
Maintain root location in metadata

Safe decoding from padded blocks

Fixed-size blocks contain padding after the actual serialized node data.

Solution:

Prefix serialized node payloads with a 4-byte length field
Decode only the actual payload length
Ignore trailing zero padding

B-Tree deletion complexity

Deletion can cause node underflow and requires careful rebalancing.

Solution:

Implement predecessor and successor replacement
Borrow from siblings when possible
Merge children when necessary
Shrink the root after deletion

Reducing disk I/O

Repeated searches access the same upper-level nodes many times.

Solution:

Add an LRU buffer pool
Keep frequently used blocks in memory
Flush dirty blocks on eviction and close
Track cache statistics for benchmarking

Limitations

This implementation is educational and optimized for clarity rather than production database performance.

Current limitations:

Integer keys only
String values only
JSON serialization is readable but not space-optimal
Deleted disk blocks are not reused through a free list
No concurrency control
No write-ahead log or crash recovery
No range-query API beyond full sorted traversal
Duplicate keys are not stored; inserting an existing key updates its value

Future Improvements

Possible extensions:

Compact binary serialization
Free-list management for deleted blocks
Range query API
Bulk loading
Configurable key and value types
Write-ahead logging
Concurrency-safe operations
More detailed I/O profiling

Conclusion

This project implements a persistent B-Tree storage engine in Python with fixed-size block storage and an LRU buffer pool. It supports insertion, search, deletion, persistence, validation, and benchmarking.

The benchmark demonstrates that buffer pool caching improves repeated search workloads by reducing disk reads. The test suite verifies correctness across storage, caching, serialization, B-Tree operations, persistence, and structural invariants.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Persistent B-Tree Storage Engine

Highlights

Benchmark Result

Project Structure

Requirements

Quick Start

Core Components

DiskManager

BTreeNode

BufferPoolManager

PersistentBTree

B-Tree Algorithm

Search

Insert

Delete

Persistence Design

Structural Validation

Testing

Complexity Analysis

Challenges and Solutions

Disk-based storage instead of in-memory references

Safe decoding from padded blocks

B-Tree deletion complexity

Reducing disk I/O

Limitations

Future Improvements

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
btree		btree
tests		tests
.gitignore		.gitignore
README.md		README.md
benchmark.py		benchmark.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Persistent B-Tree Storage Engine

Highlights

Benchmark Result

Project Structure

Requirements

Quick Start

Core Components

DiskManager

BTreeNode

BufferPoolManager

PersistentBTree

B-Tree Algorithm

Search

Insert

Delete

Persistence Design

Structural Validation

Testing

Complexity Analysis

Challenges and Solutions

Disk-based storage instead of in-memory references

Safe decoding from padded blocks

B-Tree deletion complexity

Reducing disk I/O

Limitations

Future Improvements

Conclusion

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages