| 1 | +# DataScript OCaml Design |
| 2 | + |
| 3 | +This document records the current DB/index design and the parity contract with |
| 4 | +upstream DataScript's Clojure/ClojureScript implementation. |
| 5 | + |
| 6 | +Primary reference: |
| 7 | + |
| 8 | +- `/Users/tiensonqin/Codes/projects/datascript/src/datascript/db.cljc` |
| 9 | + |
| 10 | +The goal is semantic compatibility first. Performance work is acceptable only |
| 11 | +when it preserves DataScript's immutable DB value model, datom ordering, lazy |
| 12 | +public access, and comparator-bound index behavior. |
| 13 | + |
| 14 | +## DB Value Model |
| 15 | + |
| 16 | +DataScript DB values are persistent immutable values. A transaction must keep |
| 17 | +the old DB usable and produce a new DB with updated indexes. In upstream |
| 18 | +DataScript this is handled by persistent sorted set indexes: |
| 19 | + |
| 20 | +- `:eavt` |
| 21 | +- `:aevt` |
| 22 | +- `:avet` |
| 23 | + |
| 24 | +The OCaml port follows that model with: |
| 25 | + |
| 26 | +- `eavt_index : datom Persistent_sorted_set.t` |
| 27 | +- `aevt_index : datom Persistent_sorted_set.t` |
| 28 | +- `avet_index : datom Persistent_sorted_set.t` |
| 29 | +- `vaet_index : datom Persistent_sorted_set.t` |
| 30 | + |
| 31 | +`VAET` is explicit in the OCaml port because the public API exposes it directly |
| 32 | +for reverse-reference style access. Upstream derives reverse-reference behavior |
| 33 | +from indexed ref attrs; the OCaml port keeps a dedicated value-attribute-entity |
| 34 | +index for that public surface. |
| 35 | + |
| 36 | +The DB also keeps `datoms : datom list`. That list is the active fact set used |
| 37 | +by transaction code, serialization, and code paths that need fact-level |
| 38 | +membership independent of index order. It is not a replacement for the sorted |
| 39 | +indexes. |
| 40 | + |
| 41 | +## Index Construction |
| 42 | + |
| 43 | +Bulk DB construction sorts datoms with the same index comparators used by the |
| 44 | +public access paths, then builds a persistent sorted set: |
| 45 | + |
| 46 | +```ocaml |
| 47 | +PSet.of_sorted_array_by (Util.compare_datom index) items |
| 48 | +``` |
| 49 | + |
| 50 | +This matches upstream `set/from-sorted-array` in `init-db`. |
| 51 | + |
| 52 | +Empty DB construction creates empty persistent sorted sets with the same |
| 53 | +comparators. Deserialization reconstructs indexes from serialized datoms through |
| 54 | +the normal refresh path; serialized data remains plain schema/datoms/history |
| 55 | +data, not serialized index internals. |
| 56 | + |
| 57 | +## Transaction Updates |
| 58 | + |
| 59 | +For append-only fast paths, the OCaml port updates each persistent sorted set |
| 60 | +with `Persistent_sorted_set.add`. This gives the same structural-sharing model |
| 61 | +as upstream `set/conj`: the old DB keeps the old root, and the new DB points to |
| 62 | +updated roots that share unchanged tree structure. |
| 63 | + |
| 64 | +The port still computes `history_datoms`, `unique_index`, and transaction |
| 65 | +reports separately: |
| 66 | + |
| 67 | +- `history_datoms` are chronological transaction facts, not an active sorted |
| 68 | + public index. |
| 69 | +- `unique_index` is an auxiliary lookup cache for uniqueness checks, not one of |
| 70 | + DataScript's ordered datom indexes. |
| 71 | +- transaction reports preserve `db_before`, `db_after`, and `tx_data` values as |
| 72 | + immutable snapshots. |
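
The snapshot behavior described above can be illustrated with a minimal sketch. It assumes the standard library's persistent `Set` as a stand-in for `Persistent_sorted_set`, and the `db`, `report`, `transact`, and `transact_report` names are hypothetical simplifications, not the port's actual definitions.

```ocaml
(* Stdlib's persistent Set stands in for Persistent_sorted_set (assumption). *)
module IntSet = Set.Make (Int)

(* Hypothetical, heavily simplified DB value: one index of int "datoms". *)
type db = { eavt : IntSet.t; max_tx : int }

(* Returns a new DB value; the argument is never mutated. *)
let transact db x = { eavt = IntSet.add x db.eavt; max_tx = db.max_tx + 1 }

(* A report keeps both snapshots plus the facts that were added. *)
type report = { db_before : db; db_after : db; tx_data : int list }

let transact_report db x =
  { db_before = db; db_after = transact db x; tx_data = [ x ] }
```

Because `IntSet.add` returns a new set that shares unchanged subtrees with the old one, `db_before` stays valid without copying the index.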
| 73 | + |
| 74 | +## Index Order |
| 75 | + |
| 76 | +The OCaml datom comparators mirror upstream: |
| 77 | + |
| 78 | +| Index | Order | |
| 79 | +| --- | --- | |
| 80 | +| `Eavt` | entity, attribute, value, tx | |
| 81 | +| `Aevt` | attribute, entity, value, tx | |
| 82 | +| `Avet` | attribute, value, entity, tx | |
| 83 | +| `Vaet` | value, attribute, entity, tx | |
| 84 | + |
| 85 | +The value comparator is DataScript-aware. Numeric values compare numerically, so |
| 86 | +`Int 1` and `Float 1.0` are comparator-equal for index bounds. Exact AVET |
| 87 | +lookups must therefore return both facts when the only difference is numeric |
| 88 | +representation, matching upstream DataScript. |
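
A minimal sketch of the ordering rules in the table, under stated assumptions: the `value` and `datom` types are hypothetical simplifications, the ordering between numbers and strings is an arbitrary choice for the example, and the port's real `Util.compare_datom` may be structured differently.

```ocaml
(* Hypothetical, simplified types; the port's real definitions may differ. *)
type value = Int of int | Float of float | Str of string
type datom = { e : int; a : string; v : value; tx : int }
type index = Eavt | Aevt | Avet | Vaet

(* Numbers compare numerically across representations. Sorting numbers
   before strings is an assumption made only for this sketch. *)
let compare_value x y =
  match x, y with
  | Int i, Int j -> Int.compare i j
  | Float f, Float g -> Float.compare f g
  | Int i, Float f -> Float.compare (float_of_int i) f
  | Float f, Int i -> Float.compare f (float_of_int i)
  | Str s, Str t -> String.compare s t
  | (Int _ | Float _), Str _ -> -1
  | Str _, (Int _ | Float _) -> 1

(* Return the first non-zero result, running comparisons only as needed. *)
let rec first_nonzero = function
  | [] -> 0
  | f :: rest -> (match f () with 0 -> first_nonzero rest | c -> c)

let compare_datom index d1 d2 =
  let e () = Int.compare d1.e d2.e
  and a () = String.compare d1.a d2.a
  and v () = compare_value d1.v d2.v
  and tx () = Int.compare d1.tx d2.tx in
  first_nonzero
    (match index with
     | Eavt -> [ e; a; v; tx ]
     | Aevt -> [ a; e; v; tx ]
     | Avet -> [ a; v; e; tx ]
     | Vaet -> [ v; a; e; tx ])
```

With this comparator, two datoms that differ only in `Int 1` versus `Float 1.0` compare as equal, which is why an exact AVET lookup has to return both.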
| 89 | + |
| 90 | +## Index Access |
| 91 | + |
| 92 | +Upstream `IIndexAccess` uses `set/slice` and `set/rslice` for public index |
| 93 | +operations. The OCaml port follows the same shape: |
| 94 | + |
| 95 | +- `datoms` uses exact prefix slices when arguments form an index prefix. |
| 96 | +- `seek_datoms` uses a lower-bound slice for prefix-compatible bounds. |
| 97 | +- `rseek_datoms` uses a reverse upper-bound slice for prefix-compatible bounds. |
| 98 | +- `index_range` slices `AVET` from an attribute/value lower bound to an |
| 99 | + attribute/value upper bound. |
| 100 | + |
| 101 | +Non-prefix combinations are still supported because the OCaml public API uses |
| 102 | +named optional arguments. When callers provide a combination that is not an |
| 103 | +index prefix, the implementation falls back to ordered index iteration plus a |
| 104 | +component filter. This preserves compatibility instead of pretending every |
| 105 | +named-argument combination is a native sorted-set range. |
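
The prefix test that decides between a bounded slice and the filtered fallback could look like the sketch below. The `component` type and `is_prefix` function are hypothetical names, and the tx component is left out for brevity.

```ocaml
type index = Eavt | Aevt | Avet | Vaet
type component = E | A | V

(* Component order for each index, ignoring tx in this sketch. *)
let order = function
  | Eavt -> [ E; A; V ]
  | Aevt -> [ A; E; V ]
  | Avet -> [ A; V; E ]
  | Vaet -> [ V; A; E ]

(* True when the provided components form a leading run of the index order,
   for example A and V for Avet, but not V alone. *)
let is_prefix index provided =
  let rec go seen_gap = function
    | [] -> true
    | c :: rest ->
        let present = List.mem c provided in
        if present && seen_gap then false
        else go (seen_gap || not present) rest
  in
  go false (order index)
```

When `is_prefix` is false, a caller in this sketch would iterate the index in order and filter by the provided components instead of slicing.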
| 106 | + |
| 107 | +Filtered DBs apply the filter predicate after slicing. This matches upstream |
| 108 | +`FilteredDB`, where the filter wraps `-datoms`, `-seek-datoms`, |
| 109 | +`-rseek-datoms`, and `-index-range`. |
| 110 | + |
| 111 | +## Bound Datoms |
| 112 | + |
| 113 | +Upstream bound datoms can contain `nil` components as wildcard-like comparator |
| 114 | +markers. OCaml `datom` fields are not optional, so the port uses synthetic bound |
| 115 | +datoms plus a bound-field mask. The custom slice comparator compares only the |
| 116 | +fields that participate in the requested bound. |
| 117 | + |
| 118 | +This is required for behavior such as: |
| 119 | + |
| 120 | +- `datoms db Aevt ~a` |
| 121 | +- `datoms db Eavt ~e ~a` |
| 122 | +- `datoms db Avet ~a ~v` |
| 123 | +- `seek_datoms db Avet ~a ~v` |
| 124 | +- `rseek_datoms db Vaet ~v ~a` |
| 125 | +- `index_range db attr ?start ?stop` |
| 126 | + |
| 127 | +The actual datoms stored in the persistent sorted sets are always normal |
| 128 | +datoms. Bound masks affect only the temporary comparator used by a slice. |
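
A rough sketch of a masked bound comparison for the AVET order follows. The `mask` record, `compare_avet_bound`, and `slice` are hypothetical names, values are plain ints for brevity, and the linear `List.filter` stands in for a real sorted-set seek.

```ocaml
(* Hypothetical simplified datom: values are plain ints in this sketch. *)
type datom = { e : int; a : string; v : int; tx : int }

(* Which fields of a synthetic bound datom take part in the comparison. *)
type mask = { use_e : bool; use_a : bool; use_v : bool }

(* AVET-ordered comparison of a stored datom against a bound. Fields the
   mask leaves out always compare equal, so they act as wildcards. *)
let compare_avet_bound mask (d : datom) (bound : datom) =
  let cmp use c = if use then c else 0 in
  match cmp mask.use_a (String.compare d.a bound.a) with
  | 0 -> (
      match cmp mask.use_v (Int.compare d.v bound.v) with
      | 0 -> cmp mask.use_e (Int.compare d.e bound.e)
      | c -> c)
  | c -> c

(* Exact prefix match over an AVET-sorted list. A real sorted set would
   seek to the bound instead of scanning every element. *)
let slice mask bound sorted =
  List.filter (fun d -> compare_avet_bound mask d bound = 0) sorted
```

The stored datoms are ordinary records; only the comparator passed to the slice knows about the mask.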
| 129 | + |
| 130 | +## Public Laziness |
| 131 | + |
| 132 | +The public `datoms` accessor returns a `Seq.t`. The implementation must not |
| 133 | +materialize more than a request needs at API boundaries. The current persistent |
| 134 | +sorted set slice API returns lists, so a prefix slice materializes that bounded |
| 135 | +slice before converting it to `Seq.t`. This allocates memory proportional to |
| 136 | +the slice rather than to the whole index, and callers still see a lazy |
| 137 | +sequence. If `Persistent_sorted_set` grows streaming slice cursors, the OCaml |
| 138 | +DB access layer should switch to them. |
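
The list-to-sequence boundary can be sketched as follows. `slice_list` is a hypothetical stand-in for a slice call that returns a list, and the elements are ints rather than datoms.

```ocaml
(* Stand-in for a sorted-set slice that returns a list (assumption). *)
let slice_list ~lo ~hi sorted =
  List.filter (fun x -> lo <= x && x <= hi) sorted

(* Only the bounded slice is materialized; callers receive a Seq.t. *)
let datoms_seq ~lo ~hi sorted : int Seq.t =
  List.to_seq (slice_list ~lo ~hi sorted)

(* Consume at most n elements of a sequence. *)
let rec take n (s : 'a Seq.t) : 'a list =
  if n <= 0 then []
  else
    match s () with
    | Seq.Nil -> []
    | Seq.Cons (x, rest) -> x :: take (n - 1) rest
```

A caller that only needs the first few results can stop early with `take`, although in this sketch the bounded list has already been built.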
| 139 | + |
| 140 | +## Schema and AVET Accessibility |
| 141 | + |
| 142 | +`AVET` contains datoms whose attributes are accessible through value lookup: |
| 143 | + |
| 144 | +- `:db/index true` |
| 145 | +- `:db/unique` |
| 146 | +- ref-valued attributes, matching the port's schema rules |
| 147 | +- tuple attributes that are installed as indexed tuple attrs |
| 148 | + |
| 149 | +The access layer validates `AVET` attribute access and raises the upstream-style |
| 150 | +message: |
| 151 | + |
| 152 | +```text |
| 153 | +Attribute :<attr> should be marked as :db/index true |
| 154 | +``` |
| 155 | + |
| 156 | +Schema changes rebuild or update indexes through the same DB refresh/update |
| 157 | +paths, so access reflects the DB value produced by that transaction. |
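
A minimal sketch of the accessibility check, assuming a hypothetical `attr_schema` record and an association-list schema. Raising `Failure` is also an assumption; the port may use a different exception type.

```ocaml
(* Hypothetical schema entry; the port's real representation may differ. *)
type attr_schema = {
  index : bool;          (* :db/index true *)
  unique : bool;         (* :db/unique present *)
  is_ref : bool;         (* ref-valued attribute *)
  indexed_tuple : bool;  (* tuple attr installed as indexed *)
}

let avet_accessible s = s.index || s.unique || s.is_ref || s.indexed_tuple

(* Raises with the upstream-style message when the attribute is not in AVET.
   [attr] is given without the leading colon, e.g. "age". *)
let validate_avet_attr (schema : (string * attr_schema) list) attr =
  match List.assoc_opt attr schema with
  | Some s when avet_accessible s -> ()
  | _ ->
      failwith
        (Printf.sprintf "Attribute :%s should be marked as :db/index true" attr)
```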
| 158 | + |
| 159 | +## What Should Not Use PSS |
| 160 | + |
| 161 | +Not every list in the DB is an ordered persistent index. |
| 162 | + |
| 163 | +- `datoms` remains the active fact list for transaction logic and serialization. |
| 164 | +- `history_datoms` remains transaction history, not an active index root. |
| 165 | +- `unique_index` remains a lightweight uniqueness helper. It is not an |
| 166 | + upstream PSS index and does not provide ordered public access. |
| 167 | +- query rows, pull results, transaction reports, schema data, and storage |
| 168 | + payloads remain plain OCaml values. |
| 169 | + |
| 170 | +Moving these to PSS would not improve parity with upstream DataScript and would |
| 171 | +make serialization and equality behavior more complex. |
| 172 | + |
| 173 | +## Local Dependency |
| 174 | + |
| 175 | +The repo uses the sibling project: |
| 176 | + |
| 177 | +```text |
| 178 | +persistent-sorted-set-ocaml/lib -> ../../persistent-sorted-set-ocaml/lib |
| 179 | +``` |
| 180 | + |
| 181 | +The bridge contains its own `dune-project` so Dune sees |
| 182 | +`persistent_sorted_set_ocaml` as a package while importing only the sibling |
| 183 | +library. It intentionally does not import the sibling tests or benchmarks into |
| 184 | +this workspace. |
| 185 | + |
| 186 | +## Verification Coverage |
| 187 | + |
| 188 | +Important tests covering this design: |
| 189 | + |
| 190 | +- `test_db__test_indexes_use_persistent_sorted_set` |
| 191 | +- `test_db__test_index_lookup_matches_upstream_numeric_comparator_bounds` |
| 192 | +- `test_datoms_returns_lazy_sequence` |
| 193 | +- `test_datoms_slices_before_filtered_predicate` |
| 194 | +- `test_vaet_index_returns_ref_datoms_by_value` |
| 195 | +- `test_incremental_writes_keep_public_datoms_indexes_correct` |
| 196 | +- `test_seek_datoms_scans_forward_from_index_tuple` |
| 197 | +- `test_rseek_datoms_scans_backward_from_index_tuple` |
| 198 | +- `test_seek_datoms_continues_across_avet_attributes` |
| 199 | +- `test_rseek_datoms_continues_across_avet_attributes` |
| 200 | +- tuple AVET lookup and range tests in `test_tuples.ml` |
| 201 | + |
| 202 | +Full `dune runtest` is the verification gate for this design. It includes the |
| 203 | +native test suite, SQLite storage tests when the opam environment resolves |
| 204 | +`sqlite3`, the JS smoke test, and cross-runtime parity checks against the sibling |
| 205 | +DataScript checkout. |
| 206 | + |
| 207 | +## Maintenance Rules |
| 208 | + |
| 209 | +- Keep `eavt/aevt/avet/vaet` backed by `Persistent_sorted_set.t`. |
| 210 | +- Use the index comparator for all index membership, bounds, and slices. |
| 211 | +- Preserve old DB values after every transaction. |
| 212 | +- Apply filtered DB predicates after index slicing. |
| 213 | +- Prefer bounded PSS slices over whole-index iteration whenever arguments form |
| 214 | + an index prefix. |
| 215 | +- Keep non-prefix named-argument combinations correct, even if they require a |
| 216 | + filtered ordered scan. |
| 217 | +- Do not serialize PSS internals as part of DB snapshots. |