Relations framework. TPC-H. Parallelised load. CSV driver #68
Merged
Deletes message Generation wholesale from common.proto (~350 LOC: Range, Alphabet, Distribution, Rule, all string/dict variants). Deletes InsertDescriptor, InsertMethod enum, QueryParamDescriptor, and QueryParamGroup from descriptor.proto; TxIsolationLevel stays. Scrubs the DriverQuery.method field and DriverConfig.default_insert_method field that depended on the removed InsertMethod / defaultInsertMethod surface. Regenerates Go + TS sources via make proto. proto/ts_bundle/build.js drops the LegacyInsertMethod collision alias — the name was only needed while stroppy.InsertMethod and stroppy.datagen.InsertMethod coexisted. No reserved ranges added per CLAUDE.md Proto discipline item 2.
…enum
The previous cleanup pass conflated DriverSetup.defaultInsertMethod with the legacy stroppy.InsertMethod enum from descriptor.proto (deleted) and removed it entirely. But the new stroppy.datagen.InsertMethod enum in datagen.proto carries the same NATIVE/PLAIN_BULK/PLAIN_QUERY values, and every driver implements all three insert paths per the handoff driver surface table. The knob is legitimate: it pins every InsertSpec's method so cross-DB runs can compare raw insert throughput on identical protocols. Re-adds InsertMethodName / insertMethodMap / DriverSetup.defaultInsertMethod pointed at the new enum. DriverX.insertSpec applies it as an override when set, consistent with the "pin for fair comparison" intent. The five workload literals (tpcb/tx,procs + tpcc/tx,procs + tpch/tx) now type-check.
…is the single dial
Description of Changes
Reimplements stroppy's data generation and load path on a new relational
framework, ships full TPC-H, parallelises the insert path across all SQL
drivers, and adds an ephemeral CSV sink driver. 84 commits, replacing
pkg/common/generate/ and the ParamSource/ExternalGenerator/TpchTableSource surface with a single, seekable, proto-described pipeline.

Major areas:
pkg/datagen/
- seed/: single Derive(root, path…) seed composition (splitmix64 over fnv1a64 path). Only PRNG seeding path in the codebase.
- dgproto/ + proto/stroppy/datagen.proto: new wire format: RelSource, Relationship, Side, Degree, Strategy, Lookup, StreamDraw, CohortSchedule, SCD2, Expr, Literal, InsertSpec.
- compile/: DAG topo sort, dependency analysis, stream id assignment.
- expr/: Expr evaluator (literal, col_ref, row_index, binop, call, if, dict_at, choose, cohort_draw/live, stream_draw + per-arm kernels, two-phase Draw.grammar template walker).
- stdlib/: 10 closed primitives (dates, format, hash, parse, permute, strings, uuid). Names are domain-agnostic: std.format, hash.mod, std.daysToDate, std.permuteIndex, std.parseInt/Float.
- runtime/: flat population iteration, block slots, per-attr null handling, relationship iteration, SCD-2 row-split, seek by row-index.
- cohort/, lookup/: cohort schedules with LRU + persistence, lookup populations; per-clone registries fix concurrent-map races.

- … datagen.proto; removed Generation, InsertDescriptor, QueryParamDescriptor/Group, TpchTableSource, ExternalRuleRef, and most of the old common.proto/descriptor.proto wire. No reserved ranges, no milestone comments.
- DRIVER_TYPE_CSV = 5 added for the CSV sink.

internal/static/datagen.ts (1.8k LOC)
- Rel, Attr, Draw, Dict, Expr, std, Alphabet, DrawRT. Typed wrappers over every stdlib primitive; raw pb types never leak into workload code.
- R.*/S.*/C.*/Dist.*/AB.* surface removed.
- DrawRT namespace (sobek-bound Draw structs in cmd/xk6air/) for tx-time randomness with sample/next/seek/reset.

- pkg/driver/common/parallel_insert.go: shared within-table parallel insert utility; all drivers go through it.
- InsertSpec implementation for postgres, mysql, picodata, ydb. Legacy InsertValues path removed across every driver.
- pkg/driver/csv/: ephemeral CSV driver, URL-configured, NATIVE-only. SHA256 golden for TPC-B SF=1 reference output.
- pkg/driver/noop/: honours parallelism.workers via RunParallel.
- … time.Time promoted to addressable; dialect compatibility fixes for tpcc/tpcb.

- workloads/tpch/: new end-to-end TPC-H: 8-table load, spec-compliant orderkeys/dates/prices/variable degree, text helper, query validation. Dialect SQL for pg/mysql/pico/ydb. Reference distributions.json and answers_sf1.json generated by build-time tools (not embedded Go literals; the old 19936-line refanswers/answers.go is gone).
- workloads/tpcc/: rewritten load with Rel.table + driver.insertSpec. Spec parity: c_last syllables, ORIGINAL marker, deterministic NULLs, o_c_id permutation, CC1-CC4. Tx-time randomness migrated to DrawRT.
- workloads/tpcb/: same rewrite pattern; tx-time randomness on DrawRT.
- workloads/simple/: rewritten as a framework demo.
- populate step renamed to load_data for consistency.
- LOAD_WORKERS env parameterises load parallelism per workload.

cmd/
- dstparse: one-shot parser for upstream .dst files → JSON.
- tpch-dists, tpch-answers: TPC-H dists.dss and SF=1 .ans → JSON.
- xk6air: new draw.go / draw_arms.go / draw_ctors.go / draw_prng_pool.go; legacy generator_wrappers.go removed.

test/integration/
- … degree + SCD-2), driver InsertSpec, csv, tpcb, tpcc, tpch multi-DB, tpch parallel.
- … row multiset).

- docs/datagen-framework.md (1.2k lines): authoritative framework guide.
- docs/parallelism.md: parallelism contract + tuning guide.
- docs/bench/: TPC-C W=50 pg sweep and two parallelism reruns.
- docs/relgen-framework-subplan.md, docs/tpcds-empty-queries-analysis.md, docs/dst-to-json-refactor.md.

Motivation and Context
The old data path offered three ways to generate a row (ParamSource, ExternalGenerator, TpchTableSource), none of them seekable, none expressive enough for TPC-DS, and all scattered between Go and TS. Parallel
load was per-driver and inconsistent. TPC-H reference data shipped as a
20k-line Go literal.
Goals of this branch:
- … pkg/datagen/, crossing the Go↔TS boundary through one strict proto.
- … f(root_seed, attr_path, sub_keys, row_index). Parallel load is free row-range chunking.
- … uses the same primitives).
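The seekability goal can be illustrated with a small sketch: if each cell is a pure function of (root_seed, attr_path, row_index), then any row range can be generated on its own and will match the same rows from a full sequential pass. The cell and generate functions below are hypothetical, sub_keys are omitted for brevity, and stdlib FNV-1a is used only as a stand-in mixer, not as the framework's actual evaluator.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// cell returns the value of one attribute at one row. It reads no
// shared state, so it can be evaluated for any row in any order.
func cell(rootSeed uint64, attrPath string, rowIndex uint64) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], rootSeed)
	h.Write(buf[:])
	h.Write([]byte(attrPath))
	h.Write([]byte{0}) // separator between the path and the row index
	binary.LittleEndian.PutUint64(buf[:], rowIndex)
	h.Write(buf[:])
	return h.Sum64()
}

// generate materialises rows [start, end) of one attribute.
// Seeking to `start` costs nothing: no earlier row is touched.
func generate(rootSeed uint64, attrPath string, start, end uint64) []uint64 {
	out := make([]uint64, 0, end-start)
	for i := start; i < end; i++ {
		out = append(out, cell(rootSeed, attrPath, i))
	}
	return out
}

func main() {
	full := generate(7, "lineitem.l_quantity", 0, 100)
	part := generate(7, "lineitem.l_quantity", 40, 60)
	// A chunk generated in isolation matches the sequential pass.
	fmt.Println(part[0] == full[40], part[19] == full[59])
}
```

Under this model a worker given rows [40, 60) produces exactly the rows a single-worker run would, so the loaded row multiset does not depend on the worker count.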
How Has This Been Tested?
- make build + make linter_fix clean.
- … runtime, cohort, lookup, parallel_insert).
- test/integration/: run with make tmpfs-up then the relevant go test target:
- … workers ∈ {1, 4, 16}.
- … answers_sf1.json.
- … docs/bench/.

Type of Changes
Checklist