RDBMS

A modular, minimal relational database management system written in modern C++. RDBMS implements a minimal but end-to-end SQL pipeline—from lexical analysis through semantic binding, volcano-style query execution, and pluggable storage—designed as baseline infrastructure for researching and extending database internals state-of-the art in the future.

The storage engine is deliberately decoupled from the SQL front end, enabling engine swaps (in-memory, heap-file disk, or future B-Trees / LSM-Trees) without rewriting the parser or execution layer.

Almost all existing RDBMS out there are too heavy for most of the projects and use-case that independent developers and small teams work on, on a daily basis. They are heavily bloated. So I implemented my own and it wasn't hard as I knew most of the functionality already. It might miss many important things, and get some things wrong. I also woulnd't expect this to be run in production in its current state or ever.

Design Goals

Goal	Implementation
Modularity	Strict boundaries between parser, catalog, binder, executor, and storage
Pluggable storage	`IStorageEngine` / `IScanner` abstract raw byte records; SQL layers never touch disk layout directly
Modern C++	C++20, RAII, smart pointers, `std::optional`, `std::variant`, `std::string_view`
Cross-platform	CMake build verified on macOS (Apple Clang) and Linux (GCC 11+ / Clang)
Test-driven	Google Test cases covering every major subsystem
Readable codebase	Headers separated from implementations; interfaces isolated under `include/rdbms/`

Architecture

RDBMS follows a classic layered, monolithic architecture. Each layer depends only on the abstractions below it.

flowchart TB
    subgraph client [Client Layer]
        REPL["rdbms_repl"]
    end

    subgraph frontend [SQL Front End]
        Lexer["Lexer"]
        Parser["Recursive Descent Parser"]
        AST["AST"]
        Binder["Semantic Binder"]
    end

    subgraph metadata [Metadata]
        Catalog["CatalogManager"]
    end

    subgraph execution [Execution Engine — Volcano Model]
        Planner["Plan Builder"]
        Scan["SequentialScanNode"]
        Filter["FilterNode"]
    end

    subgraph storage [Storage Layer]
        Codec["Tuple Codec"]
        ISE["IStorageEngine"]
        InMem["InMemoryStorageEngine"]
        Disk["DiskStorageEngine"]
        BPM["BufferPoolManager"]
        DM["DiskManager"]
    end

    REPL --> Lexer
    Lexer --> Parser
    Parser --> AST
    AST --> Binder
    Binder --> Catalog
    Binder --> Planner
    Planner --> Filter
    Filter --> Scan
    Scan --> ISE
    Scan --> Codec
    ISE --> InMem
    ISE --> Disk
    Disk --> BPM
    BPM --> DM
    InMem --> Codec
    Disk --> Codec

The REPL selects either InMemoryStorageEngine or DiskStorageEngine at startup via --storage. Both implement the same IStorageEngine contract.

Request lifecycle (SELECT)

SQL string
  → Lexer (tokens)
  → Parser (AST)
  → Binder (validates against CatalogManager, produces BoundStatement)
  → Plan tree: FilterNode → SequentialScanNode
  → IStorageEngine::GetScanner (raw bytes)
  → Tuple codec (deserialize)
  → Predicate evaluation (WHERE)
  → ASCII table formatting (stdout)

Request lifecycle (INSERT)

SQL string → Parser → Binder → Tuple codec (serialize) → IStorageEngine::InsertRecord

DDL (CREATE TABLE) updates both the catalog (schema metadata) and the storage engine (physical table slot).

Supported SQL (v1)

v1 intentionally supports a minimal ANSI subset. Keywords are case-insensitive. Trailing semicolons are optional.

CREATE TABLE

CREATE TABLE users (id INT, name VARCHAR);

Rule	Detail
Types	`INT`, `VARCHAR` only
Constraints	Table must not already exist; column names must be unique
Output	`OK`

INSERT

INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');

Rule	Detail
Literals	Integers (`42`) or single-quoted strings (`'Alice'`)
Arity	Value count must match column count
Types	Values must match column types (`INT` ↔ integer, `VARCHAR` ↔ string)
Output	`OK`

SELECT

SELECT id, name FROM users;
SELECT * FROM users;
SELECT id, name FROM users WHERE id = 1;
SELECT * FROM users WHERE name = 'Alice';

Rule	Detail
Projection	Explicit column list or `*`
WHERE	Optional; supports `column = literal` equality only
Output	ASCII table + row count, or `(0 rows)`

REPL controls

Command / Flag	Action
`exit` / `quit`	Exit the REPL (persists disk state on shutdown when using `--storage=disk`)
`--storage=disk`	Use disk-backed storage (default)
`--storage=memory`	Use in-memory storage (also accepts `--storage=in-memory`)
`--db=<path>`	Database file path for disk mode (default: `rdbms.db`)
`--help` / `-h`	Print usage

Not supported in v0.2

UPDATE, DELETE, DROP, joins, aggregates, ORDER BY, LIMIT, transactions, indexes, network protocol, and multi-statement scripts.

Quick Start

Prerequisites

Requirement	Version
CMake	≥ 3.20
C++ compiler	C++20 (Apple Clang, GCC 11+, or Clang)
Git	Required for FetchContent (Google Test)

Build

git clone <repository-url>
cd RDBMS

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

Run the REPL

Disk-backed (default) — data and schemas persist across restarts:

./build/rdbms_repl
./build/rdbms_repl --storage=disk
./build/rdbms_repl --storage=disk --db=mydata.db

In-memory — fast, non-persistent (useful for testing and ephemeral sessions):

./build/rdbms_repl --storage=memory

Example session (disk mode):

RDBMS v0.2 (disk-backed). Database: rdbms.db
Type SQL, or 'exit' to quit.
rdbms> CREATE TABLE users (id INT, name VARCHAR);
OK
rdbms> INSERT INTO users VALUES (1, 'Alice');
OK
rdbms> SELECT * FROM users;
+----+-------+
| id | name  |
+----+-------+
| 1  | Alice |
+----+-------+
(1 row)
rdbms> exit

On exit, disk mode writes:

rdbms.db — heap pages and record data
rdbms.db.catalog — table schemas

Restarting the REPL with the same --db path restores both schema and rows.

Run tests

ctest --test-dir build --output-on-failure

Project Layout

RDBMS/
├── CMakeLists.txt              # Root build; static lib + rdbms_repl + tests
├── cmake/
│   └── CompilerWarnings.cmake  # -Wall -Wextra -Wpedantic (cross-platform)
├── include/rdbms/              # Public headers
│   ├── ast/                    # Abstract syntax tree + visitor
│   ├── binder/                 # Semantic binder + bound AST
│   ├── catalog/                # Schema metadata (+ binary persistence)
│   ├── common/                 # Shared utilities (Expected<T,E>)
│   ├── execution/              # Volcano plan nodes + predicate eval
│   ├── parser/                 # Lexer + recursive descent parser
│   ├── repl/                   # QueryExecutor + result formatting
│   ├── storage/                # Storage interface + engines
│   │   └── disk/               # DiskManager, BufferPoolManager, DiskStorageEngine
│   └── types/                  # Tuple, Value, RecordID, DataType
├── src/                        # Implementations (mirrors include/)
├── test/                       # Google Test suites
└── src/main.cpp                # REPL entry point

Module Reference

Module	Key Types	Responsibility
Types	`Tuple`, `Value`, `RecordID`, `DataType`	Runtime value representation
Storage (interface)	`IStorageEngine`, `IScanner`	Physical record I/O boundary (table name + raw bytes only)
Storage (in-memory)	`InMemoryStorageEngine`	In-memory `vector<uint8_t>` record store
Storage (disk)	`DiskStorageEngine`, `DiskScanner`	Heap-file `IStorageEngine` with slotted pages
Storage (disk I/O)	`DiskManager`, `BufferPoolManager`	Page-aligned file I/O and LRU buffer pool
Tuple codec	`encode_tuple`, `decode_tuple`	Schema-aware serialization between `Tuple` and byte payloads
Catalog	`CatalogManager`, `TableSchema`	Schema registry; binary persistence for disk REPL mode
AST	`Statement`, `Expr`, `AstVisitor`	Parsed SQL structure (`std::variant`-based)
Binder	`Binder`, `BoundStatement`	Resolves identifiers, validates types, produces bound AST
Execution	`PlanNode`, `SequentialScanNode`, `FilterNode`	Iterator-model (Volcano) query execution
Parser	`Lexer`, `Parser`	Hand-rolled tokenizer + recursive descent grammar
REPL	`QueryExecutor`, `format_table`	End-to-end orchestration, storage backend selection, output

Storage engine contract

The storage layer is intentionally ignorant of SQL. It operates on:

Table names (std::string)
Record IDs (RecordID{page_id, slot_id})
Opaque byte payloads (std::vector<uint8_t>)

class IStorageEngine {
public:
    virtual void Init() = 0;
    virtual void Shutdown() = 0;
    virtual void CreateTable(const std::string& table_name) = 0;
    virtual RecordID InsertRecord(const std::string& table_name,
                                const std::vector<uint8_t>& payload) = 0;
    virtual bool UpdateRecord(const std::string& table_name, RecordID rid,
                              const std::vector<uint8_t>& payload) = 0;
    virtual std::unique_ptr<IScanner> GetScanner(const std::string& table_name) = 0;
};

InMemoryStorageEngine and DiskStorageEngine are the two concrete implementations. UpdateRecord is implemented at the storage layer but not yet exposed via SQL.

Execution model

Every plan node implements the Volcano iterator interface:

class PlanNode {
public:
    virtual void Init() = 0;
    virtual std::optional<Tuple> Next() = 0;  // std::nullopt = EOF
};

v1 builds FilterNode → SequentialScanNode trees for SELECT statements with an optional WHERE clause.

Physical Data Model

Column byte offsets are assigned at table creation time and stored in the catalog.

Type	On-disk / in-memory layout
`INT`	8 bytes (`int64_t`) at column offset
`VARCHAR`	4-byte length prefix (`uint32_t`) + variable string data

Example schema users(id INT, name VARCHAR):

Column	Offset	Size (fixed prefix)
`id`	0	8 bytes
`name`	8	4-byte length + N bytes of string data

Disk Persistence

Disk-backed storage uses a three-layer stack behind DiskStorageEngine:

DiskStorageEngine
  └── BufferPoolManager (LRU, pin/unpin, dirty flush)
        └── DiskManager (4 KB page-aligned file I/O)
              └── <db_path>   (e.g. rdbms.db)

On-disk layout

Page / File	Role
Page 0 (meta)	Magic, version, serialized table directory (`name → first heap page`)
Pages 1+	Slotted heap pages linked by `next_page_id`; records stored as opaque byte payloads
`<db>.catalog`	Binary catalog file (table schemas) written by the REPL on shutdown

REPL persistence behavior

Mode	Records	Schemas	Survives restart?
`disk`	`DiskStorageEngine` → `.db` file	`CatalogManager` → `.catalog` file	Yes
`memory`	`InMemoryStorageEngine` (RAM)	`CatalogManager` (RAM only)	No

Build & Test

CMake targets

Target	Description
`rdbms`	Static library containing all engine components
`rdbms_repl`	Interactive REPL executable
`rdbms_tests`	Google Test runner

Compiler flags

Non-MSVC builds enable -Wall -Wextra -Wpedantic via cmake/CompilerWarnings.cmake.

Test coverage by area

Test file	Area
`build_pipeline_test.cpp`	Build / C++20 smoke
`types_test.cpp`	Core types + storage interface
`catalog_manager_test.cpp`	Schema CRUD + error paths
`ast_test.cpp`	AST construction + visitor
`binder_test.cpp`	Semantic validation
`plan_node_test.cpp`	PlanNode iterator contract
`execution_operators_test.cpp`	Scan + filter pipeline (mock storage)
`disk_manager_test.cpp`	Page-aligned file I/O
`buffer_pool_manager_test.cpp`	LRU buffer pool
`disk_storage_engine_test.cpp`	Heap-file storage + restart
`in_memory_storage_engine_test.cpp`	In-memory storage engine + codec
`parser_test.cpp`	Lexer + parser (valid + invalid SQL)
`repl_test.cpp`	End-to-end QueryExecutor (disk + memory)

Debug build

cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build

compile_commands.json is generated for clangd / IDE integration.

Contributing

Workflow

Fork the repository and create a feature branch from main.
Implement changes with corresponding Google Tests.
Ensure the build is clean on your platform:

 cmake -B build -DCMAKE_BUILD_TYPE=Debug
 cmake --build build
 ctest --test-dir build --output-on-failure

Open a pull request with a clear description of the architectural impact.

Code standards

C++20 with RAII; no owning raw pointers.
Prefer std::optional, std::variant, and the project-local Expected<T, E> for error propagation.
Keep headers minimal; use forward declarations where possible.
Match existing module boundaries—do not couple the parser or executor to a specific storage implementation.
One logical concern per commit; write descriptive commit messages (no step-number prefixes required).

Adding a new storage engine

Subclass IStorageEngine and IScanner.
Operate exclusively on std::vector<uint8_t> payloads.
Use tuple_codec for serialization—do not duplicate layout logic.
Add a dedicated test file under test/.
Wire into QueryExecutor / REPL if it should be user-selectable.

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
cmake		cmake
include/rdbms		include/rdbms
src		src
test		test
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

RDBMS

Table of Contents

Design Goals

Architecture

Request lifecycle (SELECT)

Request lifecycle (INSERT)

Supported SQL (v1)

CREATE TABLE

INSERT

SELECT

REPL controls

Not supported in v0.2

Quick Start

Prerequisites

Build

Run the REPL

Run tests

Project Layout

Module Reference

Storage engine contract

Execution model

Physical Data Model

Disk Persistence

On-disk layout

REPL persistence behavior

Build & Test

CMake targets

Compiler flags

Test coverage by area

Debug build

Contributing

Workflow

Code standards

Adding a new storage engine

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages