A modular, minimal relational database management system written in modern C++. RDBMS implements a minimal but end-to-end SQL pipeline—from lexical analysis through semantic binding, volcano-style query execution, and pluggable storage—designed as baseline infrastructure for researching and extending database internals state-of-the art in the future.
The storage engine is deliberately decoupled from the SQL front end, enabling engine swaps (in-memory, heap-file disk, or future B-Trees / LSM-Trees) without rewriting the parser or execution layer.
Almost all existing RDBMS out there are too heavy for most of the projects and use-case that independent developers and small teams work on, on a daily basis. They are heavily bloated. So I implemented my own and it wasn't hard as I knew most of the functionality already. It might miss many important things, and get some things wrong. I also woulnd't expect this to be run in production in its current state or ever.
- Design Goals
- Architecture
- Supported SQL (v1)
- Quick Start
- Project Layout
- Module Reference
- Physical Data Model
- Disk Persistence
- Build & Test
- Contributing
- License
| Goal | Implementation |
|---|---|
| Modularity | Strict boundaries between parser, catalog, binder, executor, and storage |
| Pluggable storage | IStorageEngine / IScanner abstract raw byte records; SQL layers never touch disk layout directly |
| Modern C++ | C++20, RAII, smart pointers, std::optional, std::variant, std::string_view |
| Cross-platform | CMake build verified on macOS (Apple Clang) and Linux (GCC 11+ / Clang) |
| Test-driven | Google Test cases covering every major subsystem |
| Readable codebase | Headers separated from implementations; interfaces isolated under include/rdbms/ |
RDBMS follows a classic layered, monolithic architecture. Each layer depends only on the abstractions below it.
flowchart TB
subgraph client [Client Layer]
REPL["rdbms_repl"]
end
subgraph frontend [SQL Front End]
Lexer["Lexer"]
Parser["Recursive Descent Parser"]
AST["AST"]
Binder["Semantic Binder"]
end
subgraph metadata [Metadata]
Catalog["CatalogManager"]
end
subgraph execution [Execution Engine — Volcano Model]
Planner["Plan Builder"]
Scan["SequentialScanNode"]
Filter["FilterNode"]
end
subgraph storage [Storage Layer]
Codec["Tuple Codec"]
ISE["IStorageEngine"]
InMem["InMemoryStorageEngine"]
Disk["DiskStorageEngine"]
BPM["BufferPoolManager"]
DM["DiskManager"]
end
REPL --> Lexer
Lexer --> Parser
Parser --> AST
AST --> Binder
Binder --> Catalog
Binder --> Planner
Planner --> Filter
Filter --> Scan
Scan --> ISE
Scan --> Codec
ISE --> InMem
ISE --> Disk
Disk --> BPM
BPM --> DM
InMem --> Codec
Disk --> Codec
The REPL selects either InMemoryStorageEngine or DiskStorageEngine at startup via --storage. Both implement the same IStorageEngine contract.
SQL string
→ Lexer (tokens)
→ Parser (AST)
→ Binder (validates against CatalogManager, produces BoundStatement)
→ Plan tree: FilterNode → SequentialScanNode
→ IStorageEngine::GetScanner (raw bytes)
→ Tuple codec (deserialize)
→ Predicate evaluation (WHERE)
→ ASCII table formatting (stdout)
SQL string → Parser → Binder → Tuple codec (serialize) → IStorageEngine::InsertRecord
DDL (CREATE TABLE) updates both the catalog (schema metadata) and the storage engine (physical table slot).
v1 intentionally supports a minimal ANSI subset. Keywords are case-insensitive. Trailing semicolons are optional.
CREATE TABLE users (id INT, name VARCHAR);| Rule | Detail |
|---|---|
| Types | INT, VARCHAR only |
| Constraints | Table must not already exist; column names must be unique |
| Output | OK |
INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');| Rule | Detail |
|---|---|
| Literals | Integers (42) or single-quoted strings ('Alice') |
| Arity | Value count must match column count |
| Types | Values must match column types (INT ↔ integer, VARCHAR ↔ string) |
| Output | OK |
SELECT id, name FROM users;
SELECT * FROM users;
SELECT id, name FROM users WHERE id = 1;
SELECT * FROM users WHERE name = 'Alice';| Rule | Detail |
|---|---|
| Projection | Explicit column list or * |
| WHERE | Optional; supports column = literal equality only |
| Output | ASCII table + row count, or (0 rows) |
| Command / Flag | Action |
|---|---|
exit / quit |
Exit the REPL (persists disk state on shutdown when using --storage=disk) |
--storage=disk |
Use disk-backed storage (default) |
--storage=memory |
Use in-memory storage (also accepts --storage=in-memory) |
--db=<path> |
Database file path for disk mode (default: rdbms.db) |
--help / -h |
Print usage |
UPDATE, DELETE, DROP, joins, aggregates, ORDER BY, LIMIT, transactions, indexes, network protocol, and multi-statement scripts.
| Requirement | Version |
|---|---|
| CMake | ≥ 3.20 |
| C++ compiler | C++20 (Apple Clang, GCC 11+, or Clang) |
| Git | Required for FetchContent (Google Test) |
git clone <repository-url>
cd RDBMS
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build buildDisk-backed (default) — data and schemas persist across restarts:
./build/rdbms_repl
./build/rdbms_repl --storage=disk
./build/rdbms_repl --storage=disk --db=mydata.dbIn-memory — fast, non-persistent (useful for testing and ephemeral sessions):
./build/rdbms_repl --storage=memoryExample session (disk mode):
RDBMS v0.2 (disk-backed). Database: rdbms.db
Type SQL, or 'exit' to quit.
rdbms> CREATE TABLE users (id INT, name VARCHAR);
OK
rdbms> INSERT INTO users VALUES (1, 'Alice');
OK
rdbms> SELECT * FROM users;
+----+-------+
| id | name |
+----+-------+
| 1 | Alice |
+----+-------+
(1 row)
rdbms> exit
On exit, disk mode writes:
rdbms.db— heap pages and record datardbms.db.catalog— table schemas
Restarting the REPL with the same --db path restores both schema and rows.
ctest --test-dir build --output-on-failureRDBMS/
├── CMakeLists.txt # Root build; static lib + rdbms_repl + tests
├── cmake/
│ └── CompilerWarnings.cmake # -Wall -Wextra -Wpedantic (cross-platform)
├── include/rdbms/ # Public headers
│ ├── ast/ # Abstract syntax tree + visitor
│ ├── binder/ # Semantic binder + bound AST
│ ├── catalog/ # Schema metadata (+ binary persistence)
│ ├── common/ # Shared utilities (Expected<T,E>)
│ ├── execution/ # Volcano plan nodes + predicate eval
│ ├── parser/ # Lexer + recursive descent parser
│ ├── repl/ # QueryExecutor + result formatting
│ ├── storage/ # Storage interface + engines
│ │ └── disk/ # DiskManager, BufferPoolManager, DiskStorageEngine
│ └── types/ # Tuple, Value, RecordID, DataType
├── src/ # Implementations (mirrors include/)
├── test/ # Google Test suites
└── src/main.cpp # REPL entry point
| Module | Key Types | Responsibility |
|---|---|---|
| Types | Tuple, Value, RecordID, DataType |
Runtime value representation |
| Storage (interface) | IStorageEngine, IScanner |
Physical record I/O boundary (table name + raw bytes only) |
| Storage (in-memory) | InMemoryStorageEngine |
In-memory vector<uint8_t> record store |
| Storage (disk) | DiskStorageEngine, DiskScanner |
Heap-file IStorageEngine with slotted pages |
| Storage (disk I/O) | DiskManager, BufferPoolManager |
Page-aligned file I/O and LRU buffer pool |
| Tuple codec | encode_tuple, decode_tuple |
Schema-aware serialization between Tuple and byte payloads |
| Catalog | CatalogManager, TableSchema |
Schema registry; binary persistence for disk REPL mode |
| AST | Statement, Expr, AstVisitor |
Parsed SQL structure (std::variant-based) |
| Binder | Binder, BoundStatement |
Resolves identifiers, validates types, produces bound AST |
| Execution | PlanNode, SequentialScanNode, FilterNode |
Iterator-model (Volcano) query execution |
| Parser | Lexer, Parser |
Hand-rolled tokenizer + recursive descent grammar |
| REPL | QueryExecutor, format_table |
End-to-end orchestration, storage backend selection, output |
The storage layer is intentionally ignorant of SQL. It operates on:
- Table names (
std::string) - Record IDs (
RecordID{page_id, slot_id}) - Opaque byte payloads (
std::vector<uint8_t>)
class IStorageEngine {
public:
virtual void Init() = 0;
virtual void Shutdown() = 0;
virtual void CreateTable(const std::string& table_name) = 0;
virtual RecordID InsertRecord(const std::string& table_name,
const std::vector<uint8_t>& payload) = 0;
virtual bool UpdateRecord(const std::string& table_name, RecordID rid,
const std::vector<uint8_t>& payload) = 0;
virtual std::unique_ptr<IScanner> GetScanner(const std::string& table_name) = 0;
};InMemoryStorageEngine and DiskStorageEngine are the two concrete implementations. UpdateRecord is implemented at the storage layer but not yet exposed via SQL.
Every plan node implements the Volcano iterator interface:
class PlanNode {
public:
virtual void Init() = 0;
virtual std::optional<Tuple> Next() = 0; // std::nullopt = EOF
};v1 builds FilterNode → SequentialScanNode trees for SELECT statements with an optional WHERE clause.
Column byte offsets are assigned at table creation time and stored in the catalog.
| Type | On-disk / in-memory layout |
|---|---|
INT |
8 bytes (int64_t) at column offset |
VARCHAR |
4-byte length prefix (uint32_t) + variable string data |
Example schema users(id INT, name VARCHAR):
| Column | Offset | Size (fixed prefix) |
|---|---|---|
id |
0 | 8 bytes |
name |
8 | 4-byte length + N bytes of string data |
Disk-backed storage uses a three-layer stack behind DiskStorageEngine:
DiskStorageEngine
└── BufferPoolManager (LRU, pin/unpin, dirty flush)
└── DiskManager (4 KB page-aligned file I/O)
└── <db_path> (e.g. rdbms.db)
| Page / File | Role |
|---|---|
| Page 0 (meta) | Magic, version, serialized table directory (name → first heap page) |
| Pages 1+ | Slotted heap pages linked by next_page_id; records stored as opaque byte payloads |
<db>.catalog |
Binary catalog file (table schemas) written by the REPL on shutdown |
| Mode | Records | Schemas | Survives restart? |
|---|---|---|---|
disk |
DiskStorageEngine → .db file |
CatalogManager → .catalog file |
Yes |
memory |
InMemoryStorageEngine (RAM) |
CatalogManager (RAM only) |
No |
| Target | Description |
|---|---|
rdbms |
Static library containing all engine components |
rdbms_repl |
Interactive REPL executable |
rdbms_tests |
Google Test runner |
Non-MSVC builds enable -Wall -Wextra -Wpedantic via cmake/CompilerWarnings.cmake.
| Test file | Area |
|---|---|
build_pipeline_test.cpp |
Build / C++20 smoke |
types_test.cpp |
Core types + storage interface |
catalog_manager_test.cpp |
Schema CRUD + error paths |
ast_test.cpp |
AST construction + visitor |
binder_test.cpp |
Semantic validation |
plan_node_test.cpp |
PlanNode iterator contract |
execution_operators_test.cpp |
Scan + filter pipeline (mock storage) |
disk_manager_test.cpp |
Page-aligned file I/O |
buffer_pool_manager_test.cpp |
LRU buffer pool |
disk_storage_engine_test.cpp |
Heap-file storage + restart |
in_memory_storage_engine_test.cpp |
In-memory storage engine + codec |
parser_test.cpp |
Lexer + parser (valid + invalid SQL) |
repl_test.cpp |
End-to-end QueryExecutor (disk + memory) |
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build buildcompile_commands.json is generated for clangd / IDE integration.
- Fork the repository and create a feature branch from
main. - Implement changes with corresponding Google Tests.
- Ensure the build is clean on your platform:
cmake -B build -DCMAKE_BUILD_TYPE=Debug
cmake --build build
ctest --test-dir build --output-on-failure- Open a pull request with a clear description of the architectural impact.
- C++20 with RAII; no owning raw pointers.
- Prefer
std::optional,std::variant, and the project-localExpected<T, E>for error propagation. - Keep headers minimal; use forward declarations where possible.
- Match existing module boundaries—do not couple the parser or executor to a specific storage implementation.
- One logical concern per commit; write descriptive commit messages (no step-number prefixes required).
- Subclass
IStorageEngineandIScanner. - Operate exclusively on
std::vector<uint8_t>payloads. - Use
tuple_codecfor serialization—do not duplicate layout logic. - Add a dedicated test file under
test/. - Wire into
QueryExecutor/ REPL if it should be user-selectable.
MIT — see LICENSE.