Skip to content

FaizarM/persistent-btree-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Persistent B-Tree Storage Engine

A disk-backed B-Tree implementation in Python featuring fixed-size binary block storage, persistent metadata, an LRU buffer pool, deletion with rebalancing, structural validation, unit tests, and search-performance benchmarking.

This project was built as a real-world data structures and storage systems project. It demonstrates how database-style indexes can persist tree nodes on disk while reducing repeated disk access through buffer pool caching.


Highlights

  • Persistent B-Tree stored in a binary file
  • Fixed-size block I/O through DiskManager
  • Serializable B-Tree nodes with block-based child references
  • Insert, search, update, delete, and sorted traversal
  • Standard B-Tree split, borrow, merge, and root-shrink operations
  • LRU BufferPoolManager with dirty-page flushing
  • Structural validation for B-Tree invariants
  • Unit test suite with 39 passing tests
  • Benchmark comparing search with and without buffer pool

Benchmark Result

Benchmark configuration:

Items inserted: 20,000
Search operations: 50,000
Minimum degree: 32
Block size: 4096 bytes
Buffer capacity: 256 blocks

Measured result on local machine:

Configuration Total Search Time Avg/Search Found Keys
Without Buffer Pool 11.176365 s 223.527 µs 25,000
With Buffer Pool 2.634216 s 52.684 µs 25,000

Buffer pool statistics:

Cache hits: 138,791
Cache misses: 10,644
Disk reads: 10,644
Disk writes during search: 0
Hit rate: 92.88%
Speedup: 4.24x

The benchmark shows that the buffer pool significantly improves repeated search performance by caching frequently accessed root and internal nodes.


Project Structure

persistent_btree_project/
├── README.md
├── requirements.txt
├── benchmark.py
├── btree/
│   ├── __init__.py
│   ├── disk_manager.py
│   ├── buffer_pool.py
│   ├── btree_node.py
│   └── persistent_btree.py
└── tests/
    ├── test_disk_manager.py
    ├── test_buffer_pool.py
    ├── test_btree_node.py
    └── test_persistent_btree.py

Requirements

This project uses only Python standard library modules.

Recommended version:

Python 3.10+

No external dependencies are required.


Quick Start

Clone the repository:

git clone https://github.com/FaizarM/persistent-btree-project.git
cd persistent-btree-project

Run all tests:

python -m unittest discover tests

Run the benchmark:

python benchmark.py

On Windows, if python is not recognized, use:

py -m unittest discover tests
py benchmark.py

Core Components

DiskManager

DiskManager handles low-level fixed-size block storage in a binary file.

Each block is addressed by a block ID:

offset = block_id * block_size

Responsibilities:

  • Allocate new disk blocks
  • Read fixed-size blocks
  • Write fixed-size blocks
  • Pad small writes with zero bytes
  • Reject oversized writes
  • Check block existence
  • Preserve data after reopening storage

This class simulates the page/block layer used in databases and file systems.

BTreeNode

BTreeNode represents a single B-Tree node.

Each node stores:

  • block_id: physical location in disk storage
  • is_leaf: whether the node has children
  • keys: sorted integer keys
  • values: string values associated with keys
  • children: child block IDs for internal nodes

Nodes are serialized to bytes using JSON with a 4-byte length prefix. The length prefix allows the deserializer to ignore zero padding in fixed-size disk blocks.

BufferPoolManager

BufferPoolManager caches disk blocks in memory using an LRU replacement policy.

Features:

  • Cache hits and misses
  • Dirty block tracking
  • Flush on eviction
  • Flush all pending writes on close
  • Bounded capacity
  • Disk read/write counters for benchmarking

The buffer pool reduces repeated disk reads, especially for root and upper-level B-Tree nodes that are accessed frequently during search.

PersistentBTree

PersistentBTree is the main disk-backed B-Tree implementation.

Supported operations:

  • insert(key, value)
  • search(key)
  • delete(key)
  • traverse()
  • validate()
  • close()

The tree persists its root block ID and configuration in a metadata file. This allows the B-Tree to be closed and reopened while preserving all data.


B-Tree Algorithm

The implementation uses a standard B-Tree with minimum degree t.

For minimum degree t:

  • Maximum keys per node: 2t - 1
  • Minimum keys per non-root node: t - 1
  • Internal node with k keys has k + 1 children
  • Keys in each node remain sorted
  • All leaf nodes remain at the same depth

Search

Search starts from the root and descends to the appropriate child block until the key is found or a leaf is reached.

Steps:

  1. Scan the current node for the first key greater than or equal to the target.
  2. Return the value if the key matches.
  3. Return None if the current node is a leaf.
  4. Otherwise, recursively search the correct child block.

Insert

Insertion follows the standard non-full insertion strategy.

Steps:

  1. Update the value if the key already exists.
  2. If the root is full, allocate a new root and split the old root.
  3. Descend into a non-full child.
  4. Split full children before descending.
  5. Insert the key-value pair into a leaf.
  6. Persist modified nodes to disk.

Delete

Deletion supports all major B-Tree deletion cases:

  • Delete directly from a leaf
  • Replace internal keys with predecessor
  • Replace internal keys with successor
  • Borrow from the left sibling
  • Borrow from the right sibling
  • Merge siblings when borrowing is impossible
  • Shrink the root when it becomes empty

This preserves B-Tree balance after deletion.


Persistence Design

The implementation stores data in two files:

btree_storage.bin   # fixed-size binary block storage
btree_storage.meta  # root block ID and configuration metadata

Each B-Tree node is stored as a block in btree_storage.bin.

The metadata file stores:

{
  "root_block_id": 0,
  "minimum_degree": 3,
  "block_size": 4096
}

Because internal nodes store child block IDs, the full tree can be reconstructed by loading the root block and following child pointers.


Structural Validation

The validate() method checks whether the B-Tree satisfies key invariants.

Validation includes:

  • Keys are sorted in every node
  • Key count stays within B-Tree bounds
  • Leaf nodes have no children
  • Internal nodes have len(keys) + 1 children
  • Subtree key ranges are valid
  • All leaves are at the same depth

This is especially useful after complex insertion and deletion workloads.


Testing

The project uses Python's built-in unittest framework.

Run tests:

python -m unittest discover tests

Current result:

Ran 39 tests
OK

Test coverage includes:

  • Fixed-size block allocation, reads, and writes
  • Persistence after reopening disk storage
  • Buffer pool cache hits, dirty writes, flushing, and LRU eviction
  • Node serialization and validation
  • Insert and search correctness
  • Root splitting
  • Updating existing keys
  • Deletion from leaf and internal nodes
  • Borrowing, merging, and root shrinking
  • Persistence after reopening the B-Tree
  • Structural validation after large insert/delete workloads
  • Running with and without buffer pool

Complexity Analysis

Let:

n = number of keys
t = minimum degree
h = B-Tree height

The height of the B-Tree is:

h = O(log_t n)

Each node contains at most 2t - 1 keys.

Operation Complexity
Search O(t log_t n)
Insert O(t log_t n)
Delete O(t log_t n)
Traversal O(n)
Space O(n)

When t is treated as a fixed configuration value, search, insert, and delete are commonly expressed as:

O(log n)

Buffer pool memory usage:

O(buffer_capacity * block_size)

Challenges and Solutions

Disk-based storage instead of in-memory references

A normal Python tree would use object references, but persistent storage requires stable disk addresses.

Solution:

  • Store each node in a fixed-size binary block
  • Reference child nodes by block ID
  • Maintain root location in metadata

Safe decoding from padded blocks

Fixed-size blocks contain padding after the actual serialized node data.

Solution:

  • Prefix serialized node payloads with a 4-byte length field
  • Decode only the actual payload length
  • Ignore trailing zero padding

B-Tree deletion complexity

Deletion can cause node underflow and requires careful rebalancing.

Solution:

  • Implement predecessor and successor replacement
  • Borrow from siblings when possible
  • Merge children when necessary
  • Shrink the root after deletion

Reducing disk I/O

Repeated searches access the same upper-level nodes many times.

Solution:

  • Add an LRU buffer pool
  • Keep frequently used blocks in memory
  • Flush dirty blocks on eviction and close
  • Track cache statistics for benchmarking

Limitations

This implementation is educational and optimized for clarity rather than production database performance.

Current limitations:

  • Integer keys only
  • String values only
  • JSON serialization is readable but not space-optimal
  • Deleted disk blocks are not reused through a free list
  • No concurrency control
  • No write-ahead log or crash recovery
  • No range-query API beyond full sorted traversal
  • Duplicate keys are not stored; inserting an existing key updates its value

Future Improvements

Possible extensions:

  • Compact binary serialization
  • Free-list management for deleted blocks
  • Range query API
  • Bulk loading
  • Configurable key and value types
  • Write-ahead logging
  • Concurrency-safe operations
  • More detailed I/O profiling

Conclusion

This project implements a persistent B-Tree storage engine in Python with fixed-size block storage and an LRU buffer pool. It supports insertion, search, deletion, persistence, validation, and benchmarking.

The benchmark demonstrates that buffer pool caching improves repeated search workloads by reducing disk reads. The test suite verifies correctness across storage, caching, serialization, B-Tree operations, persistence, and structural invariants.

About

Disk-backed persistent B-Tree implementation in Python with fixed-size block storage, LRU buffer pool, deletion, validation, tests, and benchmark.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages