Skip to content

sagar0/RDBMS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RDBMS

A modular, minimal relational database management system written in modern C++. RDBMS implements a minimal but end-to-end SQL pipeline—from lexical analysis through semantic binding, volcano-style query execution, and pluggable storage—designed as baseline infrastructure for researching and extending database internals state-of-the art in the future.

The storage engine is deliberately decoupled from the SQL front end, enabling engine swaps (in-memory, heap-file disk, or future B-Trees / LSM-Trees) without rewriting the parser or execution layer.

Almost all existing RDBMS out there are too heavy for most of the projects and use-case that independent developers and small teams work on, on a daily basis. They are heavily bloated. So I implemented my own and it wasn't hard as I knew most of the functionality already. It might miss many important things, and get some things wrong. I also woulnd't expect this to be run in production in its current state or ever.


Table of Contents


Design Goals

Goal Implementation
Modularity Strict boundaries between parser, catalog, binder, executor, and storage
Pluggable storage IStorageEngine / IScanner abstract raw byte records; SQL layers never touch disk layout directly
Modern C++ C++20, RAII, smart pointers, std::optional, std::variant, std::string_view
Cross-platform CMake build verified on macOS (Apple Clang) and Linux (GCC 11+ / Clang)
Test-driven Google Test cases covering every major subsystem
Readable codebase Headers separated from implementations; interfaces isolated under include/rdbms/

Architecture

RDBMS follows a classic layered, monolithic architecture. Each layer depends only on the abstractions below it.

flowchart TB
    subgraph client [Client Layer]
        REPL["rdbms_repl"]
    end

    subgraph frontend [SQL Front End]
        Lexer["Lexer"]
        Parser["Recursive Descent Parser"]
        AST["AST"]
        Binder["Semantic Binder"]
    end

    subgraph metadata [Metadata]
        Catalog["CatalogManager"]
    end

    subgraph execution [Execution Engine — Volcano Model]
        Planner["Plan Builder"]
        Scan["SequentialScanNode"]
        Filter["FilterNode"]
    end

    subgraph storage [Storage Layer]
        Codec["Tuple Codec"]
        ISE["IStorageEngine"]
        InMem["InMemoryStorageEngine"]
        Disk["DiskStorageEngine"]
        BPM["BufferPoolManager"]
        DM["DiskManager"]
    end

    REPL --> Lexer
    Lexer --> Parser
    Parser --> AST
    AST --> Binder
    Binder --> Catalog
    Binder --> Planner
    Planner --> Filter
    Filter --> Scan
    Scan --> ISE
    Scan --> Codec
    ISE --> InMem
    ISE --> Disk
    Disk --> BPM
    BPM --> DM
    InMem --> Codec
    Disk --> Codec
Loading

The REPL selects either InMemoryStorageEngine or DiskStorageEngine at startup via --storage. Both implement the same IStorageEngine contract.

Request lifecycle (SELECT)

SQL string
  → Lexer (tokens)
  → Parser (AST)
  → Binder (validates against CatalogManager, produces BoundStatement)
  → Plan tree: FilterNode → SequentialScanNode
  → IStorageEngine::GetScanner (raw bytes)
  → Tuple codec (deserialize)
  → Predicate evaluation (WHERE)
  → ASCII table formatting (stdout)

Request lifecycle (INSERT)

SQL string → Parser → Binder → Tuple codec (serialize) → IStorageEngine::InsertRecord

DDL (CREATE TABLE) updates both the catalog (schema metadata) and the storage engine (physical table slot).


Supported SQL (v1)

v1 intentionally supports a minimal ANSI subset. Keywords are case-insensitive. Trailing semicolons are optional.

CREATE TABLE

CREATE TABLE users (id INT, name VARCHAR);
Rule Detail
Types INT, VARCHAR only
Constraints Table must not already exist; column names must be unique
Output OK

INSERT

INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');
Rule Detail
Literals Integers (42) or single-quoted strings ('Alice')
Arity Value count must match column count
Types Values must match column types (INT ↔ integer, VARCHAR ↔ string)
Output OK

SELECT

SELECT id, name FROM users;
SELECT * FROM users;
SELECT id, name FROM users WHERE id = 1;
SELECT * FROM users WHERE name = 'Alice';
Rule Detail
Projection Explicit column list or *
WHERE Optional; supports column = literal equality only
Output ASCII table + row count, or (0 rows)

REPL controls

Command / Flag Action
exit / quit Exit the REPL (persists disk state on shutdown when using --storage=disk)
--storage=disk Use disk-backed storage (default)
--storage=memory Use in-memory storage (also accepts --storage=in-memory)
--db=<path> Database file path for disk mode (default: rdbms.db)
--help / -h Print usage

Not supported in v0.2

UPDATE, DELETE, DROP, joins, aggregates, ORDER BY, LIMIT, transactions, indexes, network protocol, and multi-statement scripts.


Quick Start

Prerequisites

Requirement Version
CMake ≥ 3.20
C++ compiler C++20 (Apple Clang, GCC 11+, or Clang)
Git Required for FetchContent (Google Test)

Build

git clone <repository-url>
cd RDBMS

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

Run the REPL

Disk-backed (default) — data and schemas persist across restarts:

./build/rdbms_repl
./build/rdbms_repl --storage=disk
./build/rdbms_repl --storage=disk --db=mydata.db

In-memory — fast, non-persistent (useful for testing and ephemeral sessions):

./build/rdbms_repl --storage=memory

Example session (disk mode):

RDBMS v0.2 (disk-backed). Database: rdbms.db
Type SQL, or 'exit' to quit.
rdbms> CREATE TABLE users (id INT, name VARCHAR);
OK
rdbms> INSERT INTO users VALUES (1, 'Alice');
OK
rdbms> SELECT * FROM users;
+----+-------+
| id | name  |
+----+-------+
| 1  | Alice |
+----+-------+
(1 row)
rdbms> exit

On exit, disk mode writes:

  • rdbms.db — heap pages and record data
  • rdbms.db.catalog — table schemas

Restarting the REPL with the same --db path restores both schema and rows.

Run tests

ctest --test-dir build --output-on-failure

Project Layout

RDBMS/
├── CMakeLists.txt              # Root build; static lib + rdbms_repl + tests
├── cmake/
│   └── CompilerWarnings.cmake  # -Wall -Wextra -Wpedantic (cross-platform)
├── include/rdbms/              # Public headers
│   ├── ast/                    # Abstract syntax tree + visitor
│   ├── binder/                 # Semantic binder + bound AST
│   ├── catalog/                # Schema metadata (+ binary persistence)
│   ├── common/                 # Shared utilities (Expected<T,E>)
│   ├── execution/              # Volcano plan nodes + predicate eval
│   ├── parser/                 # Lexer + recursive descent parser
│   ├── repl/                   # QueryExecutor + result formatting
│   ├── storage/                # Storage interface + engines
│   │   └── disk/               # DiskManager, BufferPoolManager, DiskStorageEngine
│   └── types/                  # Tuple, Value, RecordID, DataType
├── src/                        # Implementations (mirrors include/)
├── test/                       # Google Test suites
└── src/main.cpp                # REPL entry point

Module Reference

Module Key Types Responsibility
Types Tuple, Value, RecordID, DataType Runtime value representation
Storage (interface) IStorageEngine, IScanner Physical record I/O boundary (table name + raw bytes only)
Storage (in-memory) InMemoryStorageEngine In-memory vector<uint8_t> record store
Storage (disk) DiskStorageEngine, DiskScanner Heap-file IStorageEngine with slotted pages
Storage (disk I/O) DiskManager, BufferPoolManager Page-aligned file I/O and LRU buffer pool
Tuple codec encode_tuple, decode_tuple Schema-aware serialization between Tuple and byte payloads
Catalog CatalogManager, TableSchema Schema registry; binary persistence for disk REPL mode
AST Statement, Expr, AstVisitor Parsed SQL structure (std::variant-based)
Binder Binder, BoundStatement Resolves identifiers, validates types, produces bound AST
Execution PlanNode, SequentialScanNode, FilterNode Iterator-model (Volcano) query execution
Parser Lexer, Parser Hand-rolled tokenizer + recursive descent grammar
REPL QueryExecutor, format_table End-to-end orchestration, storage backend selection, output

Storage engine contract

The storage layer is intentionally ignorant of SQL. It operates on:

  • Table names (std::string)
  • Record IDs (RecordID{page_id, slot_id})
  • Opaque byte payloads (std::vector<uint8_t>)
class IStorageEngine {
public:
    virtual void Init() = 0;
    virtual void Shutdown() = 0;
    virtual void CreateTable(const std::string& table_name) = 0;
    virtual RecordID InsertRecord(const std::string& table_name,
                                const std::vector<uint8_t>& payload) = 0;
    virtual bool UpdateRecord(const std::string& table_name, RecordID rid,
                              const std::vector<uint8_t>& payload) = 0;
    virtual std::unique_ptr<IScanner> GetScanner(const std::string& table_name) = 0;
};

InMemoryStorageEngine and DiskStorageEngine are the two concrete implementations. UpdateRecord is implemented at the storage layer but not yet exposed via SQL.

Execution model

Every plan node implements the Volcano iterator interface:

class PlanNode {
public:
    virtual void Init() = 0;
    virtual std::optional<Tuple> Next() = 0;  // std::nullopt = EOF
};

v1 builds FilterNode → SequentialScanNode trees for SELECT statements with an optional WHERE clause.


Physical Data Model

Column byte offsets are assigned at table creation time and stored in the catalog.

Type On-disk / in-memory layout
INT 8 bytes (int64_t) at column offset
VARCHAR 4-byte length prefix (uint32_t) + variable string data

Example schema users(id INT, name VARCHAR):

Column Offset Size (fixed prefix)
id 0 8 bytes
name 8 4-byte length + N bytes of string data

Disk Persistence

Disk-backed storage uses a three-layer stack behind DiskStorageEngine:

DiskStorageEngine
  └── BufferPoolManager (LRU, pin/unpin, dirty flush)
        └── DiskManager (4 KB page-aligned file I/O)
              └── <db_path>   (e.g. rdbms.db)

On-disk layout

Page / File Role
Page 0 (meta) Magic, version, serialized table directory (name → first heap page)
Pages 1+ Slotted heap pages linked by next_page_id; records stored as opaque byte payloads
<db>.catalog Binary catalog file (table schemas) written by the REPL on shutdown

REPL persistence behavior

Mode Records Schemas Survives restart?
disk DiskStorageEngine.db file CatalogManager.catalog file Yes
memory InMemoryStorageEngine (RAM) CatalogManager (RAM only) No

Build & Test

CMake targets

Target Description
rdbms Static library containing all engine components
rdbms_repl Interactive REPL executable
rdbms_tests Google Test runner

Compiler flags

Non-MSVC builds enable -Wall -Wextra -Wpedantic via cmake/CompilerWarnings.cmake.

Test coverage by area

Test file Area
build_pipeline_test.cpp Build / C++20 smoke
types_test.cpp Core types + storage interface
catalog_manager_test.cpp Schema CRUD + error paths
ast_test.cpp AST construction + visitor
binder_test.cpp Semantic validation
plan_node_test.cpp PlanNode iterator contract
execution_operators_test.cpp Scan + filter pipeline (mock storage)
disk_manager_test.cpp Page-aligned file I/O
buffer_pool_manager_test.cpp LRU buffer pool
disk_storage_engine_test.cpp Heap-file storage + restart
in_memory_storage_engine_test.cpp In-memory storage engine + codec
parser_test.cpp Lexer + parser (valid + invalid SQL)
repl_test.cpp End-to-end QueryExecutor (disk + memory)

Debug build

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build

compile_commands.json is generated for clangd / IDE integration.


Contributing

Workflow

  1. Fork the repository and create a feature branch from main.
  2. Implement changes with corresponding Google Tests.
  3. Ensure the build is clean on your platform:
 cmake -B build -DCMAKE_BUILD_TYPE=Debug
 cmake --build build
 ctest --test-dir build --output-on-failure
  1. Open a pull request with a clear description of the architectural impact.

Code standards

  • C++20 with RAII; no owning raw pointers.
  • Prefer std::optional, std::variant, and the project-local Expected<T, E> for error propagation.
  • Keep headers minimal; use forward declarations where possible.
  • Match existing module boundaries—do not couple the parser or executor to a specific storage implementation.
  • One logical concern per commit; write descriptive commit messages (no step-number prefixes required).

Adding a new storage engine

  1. Subclass IStorageEngine and IScanner.
  2. Operate exclusively on std::vector<uint8_t> payloads.
  3. Use tuple_codec for serialization—do not duplicate layout logic.
  4. Add a dedicated test file under test/.
  5. Wire into QueryExecutor / REPL if it should be user-selectable.

License

MIT — see LICENSE.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors