Commit 213dfde

Use persistent sorted sets for DB indexes
1 parent 640c683 commit 213dfde

16 files changed

Lines changed: 579 additions & 413 deletions

bench/bench_ocaml.ml

Lines changed: 0 additions & 12 deletions
@@ -80,18 +80,6 @@ let bench config name f =
   in
   Printf.printf "%s\t%s\n%!" name (format_ms (median samples))

-let one =
-  { cardinality = One
-  ; unique = None
-  ; indexed = false
-  ; is_component = false
-  ; no_history = false
-  ; doc = None
-  ; value_type = None
-  ; tuple_attrs = None
-  ; tuple_types = None
-  }
-
 let indexed =
   { cardinality = One
   ; unique = None

datascript_ocaml.opam

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ license: "MIT"
 depends: [
   "ocaml" {>= "5.2.1"}
   "dune" {>= "3.17"}
+  "persistent_sorted_set_ocaml"
   "sqlite3"
   "js_of_ocaml-compiler" {with-test}
   "yojson" {with-test}

docs/design.md

Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# DataScript OCaml Design

This document records the current DB/index design and the parity contract with
upstream DataScript's Clojure/ClojureScript implementation.

Primary reference:

- `/Users/tiensonqin/Codes/projects/datascript/src/datascript/db.cljc`

The goal is semantic compatibility first. Performance work is acceptable only
when it preserves DataScript's immutable DB value model, datom ordering, lazy
public access, and comparator-bound index behavior.

## DB Value Model

DataScript DB values are persistent immutable values. A transaction must keep
the old DB usable and produce a new DB with updated indexes. In upstream
DataScript this is handled by persistent sorted set indexes:

- `:eavt`
- `:aevt`
- `:avet`

The OCaml port follows that model with:

- `eavt_index : datom Persistent_sorted_set.t`
- `aevt_index : datom Persistent_sorted_set.t`
- `avet_index : datom Persistent_sorted_set.t`
- `vaet_index : datom Persistent_sorted_set.t`

`VAET` is explicit in the OCaml port because the public API exposes it directly
for reverse-reference style access. Upstream derives reverse-reference behavior
from indexed ref attrs; the OCaml port keeps a dedicated value-attribute-entity
index for that public surface.

The DB also keeps `datoms : datom list`. That list is the active fact set used
by transaction code, serialization, and code paths that need fact-level
membership independent of index order. It is not a replacement for the sorted
indexes.

## Index Construction

Bulk DB construction sorts datoms with the same index comparators used by the
public access paths, then builds a persistent sorted set:

```ocaml
PSet.of_sorted_array_by (Util.compare_datom index) items
```

This matches upstream `set/from-sorted-array` in `init-db`.

Empty DB construction creates empty persistent sorted sets with the same
comparators. Deserialization reconstructs indexes from serialized datoms through
the normal refresh path; serialized data remains plain schema/datoms/history
data, not serialized index internals.

## Transaction Updates

For append-only fast paths, the OCaml port updates each persistent sorted set
with `Persistent_sorted_set.add`. This gives the same structural-sharing model
as upstream `set/conj`: the old DB keeps the old root, and the new DB points to
updated roots that share unchanged tree structure.

The port still computes `history_datoms`, `unique_index`, and transaction
reports separately:

- `history_datoms` are chronological transaction facts, not an active sorted
  public index.
- `unique_index` is an auxiliary lookup cache for uniqueness checks, not one of
  DataScript's ordered datom indexes.
- transaction reports preserve `db_before`, `db_after`, and `tx_data` values as
  immutable snapshots.
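
The old-root/new-root behavior described in this section can be illustrated with a small sketch. It uses the standard library's `Set` as a stand-in, since that is also a persistent balanced tree; it is not the `Persistent_sorted_set` API, and the `db` record and `transact` function are hypothetical simplifications.

```ocaml
(* Stand-in only: Stdlib.Set is persistent, so [add] returns a new set and
   leaves the old one usable. This is not the Persistent_sorted_set API. *)
module IntSet = Set.Make (Int)

(* Hypothetical, heavily reduced DB value holding a single index. *)
type db = { index : IntSet.t; max_tx : int }

(* A "transaction" builds a new DB value; the argument is never mutated. *)
let transact db x = { index = IntSet.add x db.index; max_tx = db.max_tx + 1 }

let db_before = { index = IntSet.of_list [ 1; 2; 3 ]; max_tx = 0 }
let db_after = transact db_before 4

let () =
  (* The old DB still sees only its own facts. *)
  assert (not (IntSet.mem 4 db_before.index));
  assert (IntSet.mem 4 db_after.index)
```

A transaction report in this sketch would simply hold both `db_before` and `db_after`, which is safe because neither value changes afterwards.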

## Index Order

The OCaml datom comparators mirror upstream:

| Index | Order |
| --- | --- |
| `Eavt` | entity, attribute, value, tx |
| `Aevt` | attribute, entity, value, tx |
| `Avet` | attribute, value, entity, tx |
| `Vaet` | value, attribute, entity, tx |

The value comparator is DataScript-aware. Numeric values compare numerically, so
`Int 1` and `Float 1.0` are comparator-equal for index bounds. Exact AVET
lookups must therefore return both facts when the only difference is numeric
representation, matching upstream DataScript.
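
A minimal sketch of a numeric-aware value comparator and an AVET-ordered datom comparator follows. The `value` and `datom` types are reduced, hypothetical versions of the port's types, and ordering numbers before strings is an assumption made only to keep the example total.

```ocaml
(* Hypothetical, reduced types; the real port has more cases and fields. *)
type value = Int of int | Float of float | Str of string

type datom = { e : int; a : string; v : value; tx : int }

(* Numbers compare numerically across Int/Float. Placing numbers before
   strings is an assumption of this sketch. *)
let compare_value x y =
  match x, y with
  | Int i, Int j -> Int.compare i j
  | Float f, Float g -> Float.compare f g
  | Int i, Float f -> Float.compare (float_of_int i) f
  | Float f, Int i -> Float.compare f (float_of_int i)
  | Str s, Str t -> String.compare s t
  | (Int _ | Float _), Str _ -> -1
  | Str _, (Int _ | Float _) -> 1

(* AVET order: attribute, value, entity, tx. *)
let compare_avet d1 d2 =
  let c = String.compare d1.a d2.a in
  if c <> 0 then c
  else
    let c = compare_value d1.v d2.v in
    if c <> 0 then c
    else
      let c = Int.compare d1.e d2.e in
      if c <> 0 then c else Int.compare d1.tx d2.tx

let d_int = { e = 1; a = "age"; v = Int 1; tx = 100 }
let d_float = { e = 2; a = "age"; v = Float 1.0; tx = 100 }

let () =
  (* The two values are comparator-equal, so only the entity orders them. *)
  assert (compare_value d_int.v d_float.v = 0);
  assert (compare_avet d_int d_float < 0)
```

Because the values tie, a bound on `("age", 1)` that ignores entity and tx sits next to both datoms, which is why an exact AVET lookup has to return both.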

## Index Access

Upstream `IIndexAccess` uses `set/slice` and `set/rslice` for public index
operations. The OCaml port follows the same shape:

- `datoms` uses exact prefix slices when arguments form an index prefix.
- `seek_datoms` uses a lower-bound slice for prefix-compatible bounds.
- `rseek_datoms` uses a reverse upper-bound slice for prefix-compatible bounds.
- `index_range` slices `AVET` from an attribute/value lower bound to an
  attribute/value upper bound.

Non-prefix combinations are still supported because the OCaml public API uses
named optional arguments. When callers provide a combination that is not an
index prefix, the implementation falls back to ordered index iteration plus a
component filter. This preserves compatibility instead of pretending every
named-argument combination is a native sorted-set range.

Filtered DBs apply the filter predicate after slicing. This matches upstream
`FilteredDB`, where the filter wraps `-datoms`, `-seek-datoms`,
`-rseek-datoms`, and `-index-range`.
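
The prefix rule that decides between a bounded slice and the filtered fallback might be checked as in the sketch below. The `component` type and `is_index_prefix` are hypothetical names for illustration; the port's actual dispatch may be structured differently.

```ocaml
type index = Eavt | Aevt | Avet | Vaet
type component = E | A | V | Tx

(* Component order for each index. *)
let order = function
  | Eavt -> [ E; A; V; Tx ]
  | Aevt -> [ A; E; V; Tx ]
  | Avet -> [ A; V; E; Tx ]
  | Vaet -> [ V; A; E; Tx ]

(* [provided] lists the components the caller supplied. They form a prefix
   when, walking the index order, no supplied component follows a missing
   one. *)
let is_index_prefix index provided =
  let rec go seen_gap = function
    | [] -> true
    | c :: rest ->
      let present = List.mem c provided in
      if present && seen_gap then false
      else go (seen_gap || not present) rest
  in
  go false (order index)

let () =
  (* Prefix cases can use a bounded slice. *)
  assert (is_index_prefix Aevt [ A ]);
  assert (is_index_prefix Eavt [ E; A ]);
  (* A value without an attribute on EAVT needs the ordered-scan fallback. *)
  assert (not (is_index_prefix Eavt [ E; V ]))
```

In both branches the filtered-DB predicate would run afterwards, on whatever datoms the slice or the scan produced.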

## Bound Datoms

Upstream bound datoms can contain `nil` components as wildcard-like comparator
markers. OCaml `datom` fields are not optional, so the port uses synthetic bound
datoms plus a bound-field mask. The custom slice comparator compares only the
fields that participate in the requested bound.

This is required for behavior such as:

- `datoms db Aevt ~a`
- `datoms db Eavt ~e ~a`
- `datoms db Avet ~a ~v`
- `seek_datoms db Avet ~a ~v`
- `rseek_datoms db Vaet ~v ~a`
- `index_range db attr ?start ?stop`

The actual datoms stored in the persistent sorted sets are always normal
datoms. Bound masks affect only the temporary comparator used by a slice.
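
One way to picture a synthetic bound datom with a field mask is the sketch below. The `mask` record, the integer-valued `v` field, and `compare_aevt_bound` are all assumptions for illustration; the port's real mask representation is not shown in this document.

```ocaml
(* Simplified datom: values are plain ints here to keep the sketch short. *)
type datom = { e : int; a : string; v : int; tx : int }

(* Hypothetical mask: which fields of the bound datom are meaningful. *)
type mask = { use_e : bool; use_a : bool; use_v : bool; use_tx : bool }

(* Compare a stored datom with a bound in AEVT order, ignoring every field
   the mask switches off. *)
let compare_aevt_bound mask d bound =
  let step use c next = if use && c <> 0 then c else next () in
  step mask.use_a (String.compare d.a bound.a) (fun () ->
  step mask.use_e (Int.compare d.e bound.e) (fun () ->
  step mask.use_v (Int.compare d.v bound.v) (fun () ->
  step mask.use_tx (Int.compare d.tx bound.tx) (fun () -> 0))))

let stored =
  [ { e = 1; a = "age"; v = 30; tx = 1 }
  ; { e = 2; a = "age"; v = 41; tx = 2 }
  ; { e = 1; a = "name"; v = 7; tx = 1 } ]

(* Bound for the shape of [datoms db Aevt ~a]: only the attribute counts;
   e, v, and tx are placeholders that the mask hides. *)
let only_a = { use_e = false; use_a = true; use_v = false; use_tx = false }
let bound = { e = 0; a = "age"; v = 0; tx = 0 }

let matches =
  List.filter (fun d -> compare_aevt_bound only_a d bound = 0) stored

let () = assert (List.length matches = 2)
```

A real slice would hand such a comparator to the sorted set so it can locate the range by tree search; `List.filter` is used here only to keep the example self-contained.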

## Public Laziness

Public `datoms` returns `Seq.t`. The implementation must not eagerly
materialize more than required at API boundaries. The current persistent sorted
set slice API returns lists, so a prefix slice materializes that bounded slice
before converting it to `Seq.t`. This is still materially different from
whole-index materialization, and it preserves the public lazy sequence contract
for callers. If `Persistent_sorted_set` grows streaming slice cursors, the OCaml
DB access layer should switch to them.
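
The shape described here, a bounded list slice wrapped in a lazy `Seq.t` with the filter predicate applied afterwards, can be sketched with the standard library alone. `slice_list` is a hypothetical stand-in for a list-returning slice call, and the `visited` counter exists only to make the laziness observable.

```ocaml
(* Hypothetical stand-in for a bounded slice that returns a list. *)
let slice_list ~lo ~hi sorted =
  List.filter (fun x -> lo <= x && x <= hi) sorted

(* Counts how many elements the filter predicate has inspected. *)
let visited = ref 0

(* Materialize only the bounded slice, then filter lazily. *)
let datoms ~lo ~hi ~pred sorted : int Seq.t =
  slice_list ~lo ~hi sorted
  |> List.to_seq
  |> Seq.filter (fun x -> incr visited; pred x)

let s = datoms ~lo:10 ~hi:20 ~pred:(fun x -> x mod 2 = 0) [ 1; 10; 11; 12; 20; 30 ]

(* Building the sequence runs no predicate checks. *)
let () = assert (!visited = 0)

(* Forcing the first node inspects only as many elements as needed. *)
let first = match s () with Seq.Cons (x, _) -> Some x | Seq.Nil -> None

let () =
  assert (first = Some 10);
  assert (!visited = 1)
```

The bounded slice is still built eagerly in this sketch, which mirrors the limitation the section notes; only the filtering and consumption are lazy.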

## Schema and AVET Accessibility

`AVET` contains datoms whose attributes are accessible through value lookup:

- `:db/index true`
- `:db/unique`
- ref-valued attributes, matching the port's schema rules
- tuple attributes that are installed as indexed tuple attrs

The access layer validates `AVET` attribute access and raises the upstream-style
message:

```text
Attribute :<attr> should be marked as :db/index true
```

Schema changes rebuild or update indexes through the same DB refresh/update
paths, so access reflects the DB value produced by that transaction.

## What Should Not Use PSS

Not every list in the DB is an ordered persistent index.

- `datoms` remains the active fact list for transaction logic and serialization.
- `history_datoms` remains transaction history, not an active index root.
- `unique_index` remains a lightweight uniqueness helper. It is not an
  upstream PSS index and does not provide ordered public access.
- query rows, pull results, transaction reports, schema data, and storage
  payloads remain plain OCaml values.

Moving these to PSS would not improve parity with upstream DataScript and would
make serialization and equality behavior more complex.

## Local Dependency

The repo uses the sibling project:

```text
persistent-sorted-set-ocaml/lib -> ../../persistent-sorted-set-ocaml/lib
```

The bridge contains its own `dune-project` so Dune sees
`persistent_sorted_set_ocaml` as a package while importing only the sibling
library. It intentionally does not import the sibling tests or benchmarks into
this workspace.

## Verification Coverage

Important tests covering this design:

- `test_db__test_indexes_use_persistent_sorted_set`
- `test_db__test_index_lookup_matches_upstream_numeric_comparator_bounds`
- `test_datoms_returns_lazy_sequence`
- `test_datoms_slices_before_filtered_predicate`
- `test_vaet_index_returns_ref_datoms_by_value`
- `test_incremental_writes_keep_public_datoms_indexes_correct`
- `test_seek_datoms_scans_forward_from_index_tuple`
- `test_rseek_datoms_scans_backward_from_index_tuple`
- `test_seek_datoms_continues_across_avet_attributes`
- `test_rseek_datoms_continues_across_avet_attributes`
- tuple AVET lookup and range tests in `test_tuples.ml`

Full `dune runtest` is the verification gate for this design. It includes the
native test suite, SQLite storage tests when the opam environment resolves
`sqlite3`, the JS smoke test, and cross-runtime parity checks against the sibling
DataScript checkout.

## Maintenance Rules

- Keep `eavt/aevt/avet/vaet` backed by `Persistent_sorted_set.t`.
- Use the index comparator for all index membership, bounds, and slices.
- Preserve old DB values after every transaction.
- Apply filtered DB predicates after index slicing.
- Prefer bounded PSS slices over whole-index iteration whenever arguments form
  an index prefix.
- Keep non-prefix named-argument combinations correct, even if they require a
  filtered ordered scan.
- Do not serialize PSS internals as part of DB snapshots.

docs/perf.md

Lines changed: 37 additions & 25 deletions
@@ -71,30 +71,32 @@ boundary. Added coverage checks that:
 - public `datoms` returns a lazy sequence
 - bounded datoms slicing happens before filtered-db predicate checks

-### Bounded Index Iteration
+### Persistent Sorted Indexes

-The DB stores sorted array snapshots for EAVT, AEVT, AVET, and VAET when index
-arrays are valid. Datoms lookup can then binary-search a bounded range and expose
-that range as a lazy sequence.
+The DB stores EAVT, AEVT, AVET, and VAET as `Persistent_sorted_set.t` values.
+This matches upstream DataScript's persistent sorted set model better than
+whole-index arrays because transactions must preserve the old immutable DB while
+producing a new DB with updated indexes.

-This avoids full-index filtering for common component-constrained accesses such
-as entity or attribute slices.
+Bulk construction sorts datoms once and builds each persistent sorted set from
+the sorted array. Incremental safe-add paths update the persistent sorted sets
+with structural sharing instead of marking whole indexes stale.

-Exact prefix slices such as `datoms db Aevt ~a` now compute both start and stop
-bounds up front and then lazily walk the array range by index. This avoids a
-per-datom predicate check inside the hot range.
+Exact prefix reads such as `datoms db Aevt ~a`, lower-bound reads such as
+`seek_datoms db Avet ~a ~v`, reverse reads such as `rseek_datoms`, and
+`index_range` use persistent sorted set `slice`/`rslice` bounds. Non-prefix
+named-argument combinations continue to fall back to ordered filtering for
+compatibility.

 ### Incremental Index Refresh

-Bulk transaction paths can merge newly added datoms into sorted indexes for
-initial loads. For later safe incremental writes into a non-empty DB, the write
-path updates active datoms and metadata immediately, but marks stored index lists
-and arrays stale instead of maintaining all four sorted indexes on every write.
+Bulk construction uses sorted-array PSS builders. For safe incremental writes,
+the write path updates the active datom list and adds new datoms into each
+relevant persistent sorted set. The old DB keeps its previous set roots, and the
+new DB shares unchanged tree structure with them.

-Public `datoms` still returns correct results. If stored indexes are stale, the
-read path builds the requested sorted index from current datoms and applies the
-requested component filters. A regression test covers incremental writes followed
-by public EAVT, AEVT, AVET, and VAET reads.
+A regression test covers incremental writes followed by public EAVT, AEVT,
+AVET, and VAET reads.

 ### Query Candidate Narrowing

@@ -168,21 +170,31 @@ These constraints must remain true for future performance work:
 ## Remaining Work

 The current deterministic benchmark goal is satisfied. Next useful work is to
-validate larger workloads, mixed read/write patterns after stale incremental
-indexes, and public behavior around direct DB record access. DB index fields are
-still present in the public record type, so future API cleanup should either make
-those internals private or clearly document their validity flags.
+validate larger workloads and mixed read/write patterns against the persistent
+sorted set implementation. DB index fields are still present in the public record
+type, so future API cleanup should either make those internals private or keep
+their PSS representation documented.

 ## Verification

-Latest full test command:
+Latest feasible native test command in the current environment:

 ```sh
-dune runtest
+dune runtest test/test_datascript.exe test/test_lru.exe test/test_conn.exe \
+  test/test_core.exe test/test_db.exe test/test_data_readers.exe \
+  test/test_built_ins.exe test/test_issues.exe test/test_entity.exe \
+  test/test_listen.exe test/test_lookup_refs.exe test/test_parser.exe \
+  test/test_parser_find.exe test/test_parser_query.exe \
+  test/test_parser_return_map.exe test/test_parser_rules.exe \
+  test/test_parser_where.exe test/test_pull_api.exe test/test_pull_parser.exe \
+  test/test_query_pull.exe test/test_query_namespace.exe test/test_tuples.exe \
+  test/test_serialize.exe test/test_storage.exe test/test_upsert.exe \
+  test/test_util.exe
 ```

-It passed after the current optimization set. The run emitted linker warnings
-about a missing `/opt/homebrew/opt/node@22/lib` search path, but no test failure.
+It passed after the persistent sorted set index change. Full `dune runtest`
+currently depends on Dune/Findlib resolving the `sqlite3` library for the SQLite
+storage tests.

 Latest cross-runtime benchmark command: