Skip to content

Relations framework. TPC-H. Parallelised load. CSV driver#68

Merged
Cianidos merged 89 commits into
mainfrom
feat/relations
Apr 29, 2026
Merged

Relations framework. TPC-H. Parallelised load. CSV driver#68
Cianidos merged 89 commits into
mainfrom
feat/relations

Conversation

@Cianidos
Copy link
Copy Markdown
Contributor

Description of Changes

Reimplements stroppy's data generation and load path on a new relational
framework, ships full TPC-H, parallelises the insert path across all SQL
drivers, and adds an ephemeral CSV sink driver. 84 commits, replacing
pkg/common/generate/ and the ParamSource / ExternalGenerator /
TpchTableSource surface with a single, seekable, proto-described pipeline.

Major areas:

  • Datagen framework (pkg/datagen/)
    • seed/ — single Derive(root, path…) seed composition (splitmix64 over
      fnv1a64 path). Only PRNG seeding path in the codebase.
    • dgproto/ + proto/stroppy/datagen.proto — new wire format:
      RelSource, Relationship, Side, Degree, Strategy, Lookup,
      StreamDraw, CohortSchedule, SCD2, Expr, Literal, InsertSpec.
    • compile/ — DAG topo sort, dependency analysis, stream id assignment.
    • expr/ — Expr evaluator (literal, col_ref, row_index, binop, call,
      if, dict_at, choose, cohort_draw/live, stream_draw + per-arm kernels,
      two-phase Draw.grammar template walker).
    • stdlib/ — 10 closed primitives (dates, format, hash, parse, permute,
      strings, uuid). Names are domain-agnostic: std.format, hash.mod,
      std.daysToDate, std.permuteIndex, std.parseInt/Float.
    • runtime/ — flat population iteration, block slots, per-attr null
      handling, relationship iteration, SCD-2 row-split, seek by row-index.
    • cohort/, lookup/ — cohort schedules with LRU + persistence,
      lookup populations; per-clone registries fix concurrent-map races.
  • Proto
    • New datagen.proto; removed Generation, InsertDescriptor,
      QueryParamDescriptor/Group, TpchTableSource, ExternalRuleRef, and
      most of the old common.proto / descriptor.proto wire. No reserved
      ranges, no milestone comments.
    • DRIVER_TYPE_CSV = 5 added for the CSV sink.
  • TS surface (internal/static/datagen.ts, 1.8k LOC)
    • Namespaces: Rel, Attr, Draw, Dict, Expr, std, Alphabet,
      DrawRT. Typed wrappers over every stdlib primitive; raw pb types
      never leak into workload code.
    • Old R.* / S.* / C.* / Dist.* / AB.* surface removed.
    • DrawRT namespace (sobek-bound Draw structs in cmd/xk6air/) for
      tx-time randomness with sample/next/seek/reset.
  • Drivers
    • pkg/driver/common/parallel_insert.go — shared within-table parallel
      insert utility; all drivers go through it.
    • InsertSpec implementation for postgres, mysql, picodata, ydb. Legacy
      InsertValues path removed across every driver.
    • pkg/driver/csv/ — ephemeral CSV driver, URL-configured, NATIVE-only.
      SHA256 golden for TPC-B SF=1 reference output.
    • pkg/driver/noop/ — honours parallelism.workers via RunParallel.
    • YDB native BulkUpsert: time.Time promoted to addressable; dialect
      compatibility fixes for tpcc/tpcb.
  • Workloads
    • workloads/tpch/ — new end-to-end TPC-H: 8-table load, spec-compliant
      orderkeys/dates/prices/variable degree, text helper, query validation.
      Dialect SQL for pg/mysql/pico/ydb. Reference distributions.json and
      answers_sf1.json generated by build-time tools (not embedded Go
      literals — old 19936-line refanswers/answers.go gone).
    • workloads/tpcc/ — rewritten load with Rel.table + driver.insertSpec.
      Spec parity: c_last syllables, ORIGINAL marker, deterministic NULLs,
      o_c_id permutation, CC1-CC4. Tx-time randomness migrated to DrawRT.
    • workloads/tpcb/ — same rewrite pattern; tx-time randomness on DrawRT.
    • workloads/simple/ — rewritten as a framework demo.
    • populate step renamed to load_data for consistency.
    • LOAD_WORKERS env parameterises load parallelism per workload.
  • Tooling (cmd/)
    • dstparse — one-shot parser for upstream .dst files → JSON.
    • tpch-dists, tpch-answers — TPC-H dists.dss and SF=1 .ans → JSON.
    • xk6air — new draw.go / draw_arms.go / draw_ctors.go /
      draw_prng_pool.go; legacy generator_wrappers.go removed.
  • Integration tests (test/integration/)
    • Tmpfs Postgres harness + multi-DB harness for pg/mysql/picodata/ydb.
    • Smoke coverage: datagen pipeline, relationships, stage-D (Uniform
      degree + SCD-2), driver InsertSpec, csv, tpcb, tpcc, tpch multi-DB,
      tpch parallel.
    • Determinism sweep across all primitives (workers ∈ {1,4,16} → identical
      row multiset).
  • Docs
    • docs/datagen-framework.md (1.2k lines) — authoritative framework guide.
    • docs/parallelism.md — parallelism contract + tuning guide.
    • docs/bench/ — TPC-C W=50 pg sweep and two parallelism reruns.
    • Per-workload READMEs for tpcb/tpcc/tpch.
    • Dropped legacy docs/relgen-framework-subplan.md,
      docs/tpcds-empty-queries-analysis.md, docs/dst-to-json-refactor.md.

Motivation and Context

The old data path was three ways to generate a row (ParamSource,
ExternalGenerator, TpchTableSource), none of them seekable, none
expressive enough for TPC-DS, and scattered between Go and TS. Parallel
load was per-driver and inconsistent. TPC-H reference data shipped as a
20k-line Go literal.

Goals of this branch:

  1. One generation framework, under pkg/datagen/, crossing the Go↔TS
    boundary through one strict proto.
  2. Seekable by construction: every value is f(root_seed, attr_path, sub_keys, row_index). Parallel load is free row-range chunking.
  3. Closed, domain-agnostic stdlib. Workload specifics live in TS.
  4. Expressive enough for full TPC-DS green (this PR ships TPC-H; TPC-DS
    uses the same primitives).
  5. Real-DB validation at every stage via tmpfs Postgres.
  6. No backward compatibility. Old wire and old packages deleted outright.

How Has This Been Tested?

  • make build + make linter_fix clean.
  • Unit tests for every datagen package (seed, expr, stdlib, compile,
    runtime, cohort, lookup, parallel_insert).
  • Tmpfs-PG integration tests under test/integration/ — run with
    make tmpfs-up then the relevant go test target:
    • Datagen pipeline smoke + stage-D (Uniform + SCD-2).
    • Relationship parent-child smoke.
    • Driver InsertSpec smoke across pg/mysql/picodata/ydb.
    • TPC-B seed load + balance invariant.
    • TPC-C framework proof at WAREHOUSES=1.
    • TPC-H multi-DB + parallel (SF ≤ 1).
    • CSV driver smoke + determinism + golden SHA256 for TPC-B SF=1.
  • Determinism sweep: every primitive gives identical row multisets at
    workers ∈ {1, 4, 16}.
  • TPC-H SF=1 answer-set check against answers_sf1.json.
  • Parallelism benchmarks recorded under docs/bench/.

Type of Changes

  • Bug fix
  • New feature
  • Documentation improvement
  • Refactoring
  • Other — wire break (proto); legacy packages deleted.

Checklist

  • I have read the CONTRIBUTING.md
  • I have checked build and tests
  • I have updated documentation if needed

Cianidos added 30 commits April 22, 2026 22:09
Cianidos added 29 commits April 24, 2026 03:48
Deletes message Generation wholesale from common.proto (~350 LOC:
Range, Alphabet, Distribution, Rule, all string/dict variants).
Deletes InsertDescriptor, InsertMethod enum, QueryParamDescriptor,
and QueryParamGroup from descriptor.proto; TxIsolationLevel stays.
Scrubs the DriverQuery.method field and DriverConfig.default_insert_method
field that depended on the removed InsertMethod / defaultInsertMethod
surface.

Regenerates Go + TS sources via make proto. proto/ts_bundle/build.js
drops the LegacyInsertMethod collision alias — the name was only
needed while stroppy.InsertMethod and stroppy.datagen.InsertMethod
coexisted.

No reserved ranges added per CLAUDE.md Proto discipline item 2.
…enum

The previous cleanup pass conflated DriverSetup.defaultInsertMethod with
the legacy stroppy.InsertMethod enum from descriptor.proto (deleted) and
removed it entirely. But the new stroppy.datagen.InsertMethod enum in
datagen.proto carries the same NATIVE/PLAIN_BULK/PLAIN_QUERY values, and
every driver implements all three insert paths per the handoff driver
surface table. The knob is legitimate — it pins every InsertSpec's
method so cross-DB runs can compare raw insert throughput on identical
protocols.

Re-adds InsertMethodName / insertMethodMap / DriverSetup.defaultInsertMethod
pointed at the new enum. DriverX.insertSpec applies it as an override
when set, consistent with the "pin for fair comparison" intent. The five
workload literals (tpcb/tx,procs + tpcc/tx,procs + tpch/tx) now type-check.
@Cianidos Cianidos merged commit debf9f1 into main Apr 29, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant