This is the Dockerized version of the Berlin SPARQL Benchmark.
- Original work : http://wbsg.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
- Sources : https://github.com/VCityTeam/BSBM
- Images published on Docker hub.
docker run -v "$PWD:/app/data" -e "DATA_DESTINATION=<folder>" vcity/bsbm generate [args]
docker run -v "$PWD:/app/data" -e "DATA_DESTINATION=<folder>" vcity/bsbm generate-n [args]
docker run -v "$PWD:/app/data" -e "DATA_DESTINATION=<folder>" vcity/bsbm qualification [args]
docker run -v "$PWD:/app/data" -e "DATA_DESTINATION=<folder>" vcity/bsbm testdriver [args]

The generate-n command accepts the following arguments:
- `--versions` or `-v` (required): Number of dataset versions to generate
- `--products` or `-p` (default: 100): Base product count. In linear mode it is the additive base (version `i` has `products + i × step`, so the first version is `products + step`, not `products`). In easing mode it is the exact count of the first version.
- `--step` or `-s` (default: 1000): Product increment per version (linear mode only)
- `--target` or `-t` (optional): Final product count (last version). Providing it enables easing mode (see below). When set, `--step` is ignored.
- `--easing` or `-e` (default: linear): Easing curve used in easing mode. Accepts a named curve (monotonic or periodic) or a raw `awk` expression of `t`.
- `--patterns` or `-P` (default: 1): Number of cycles for the periodic curves (`sineWave`, `sineStairs`), modelling repeated data-acquisition campaigns. Ignored by the other curves.
- `--concurrent` or `-c` (optional): Generate several concurrent lineages that share the same first version but evolve differently. Takes a comma-separated list of branch specs `curve[@target[@patterns]]`. Requires `--target`. See Concurrent versioned graphs.
- `--outdir` or `-o` (default: `concurrent`): Parent output directory used by `--concurrent`. The shared first version is written as `dataset-1.<format>` here and each branch gets its own `branch-NN-<curve>/` sub-directory.
- `--format` or `-f` (default: ttl): Output format (nt, ttl, trig, xml, sql, virt, monetdb)
- `--var` (default: 0): Variability percentage (0-100). Controls the percentage of products that change between versions. When set to a value greater than 0, each version generates an update dataset containing the specified percentage of products as changes.
By default generate-n grows the product count linearly: version i contains products + i × step products.
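As a quick check of the linear rule, the shell sketch below prints the product count of each version. The helper name `linear_counts` is made up for this illustration and is not part of generate-n; it only evaluates `products + i × step` with the documented defaults.

```shell
# Hypothetical helper (not part of generate-n): product count per version
# in linear mode, i.e. products + i * step for i = 1..versions.
linear_counts() {
    versions=$1
    products=${2:-100}   # documented default for --products
    step=${3:-1000}      # documented default for --step
    i=1
    while [ "$i" -le "$versions" ]; do
        echo $((products + i * step))
        i=$((i + 1))
    done
}

linear_counts 3   # prints 1100, 2100 and 3100, one per line
```

Note that the first value is `products + step`, not `products`, which matches the description of `--products` in linear mode.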
When you pass --target, the product count is instead interpolated from --products to --target across the versions following an easing curve. Easing mode requires --versions >= 2. For version i of N the progress is t = (i-1)/(N-1) and the count is round(products + easing(t) × (target − products)). For monotonic curves the first version equals --products and the last equals --target; the chosen curve only changes how the intermediate versions are distributed. Descending ranges (target < products) are supported.
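The interpolation formula can be tried outside the tool with a few lines of awk. This is a minimal sketch, not the code of generate-n: the helper name `cubic_counts` is hypothetical, the curve is hard-coded to `t*t*t`, and rounding is done as `int(x + 0.5)`, which is an assumption about how the tool rounds.

```shell
# Hypothetical helper (not part of generate-n): evaluates the documented
# formula round(products + e(t) * (target - products)) with e(t) = t*t*t.
# Rounding uses int(x + 0.5), an assumption that only holds for
# non-negative counts.
cubic_counts() {
    awk -v n="$1" -v p="$2" -v target="$3" 'BEGIN {
        for (i = 1; i <= n; i++) {
            t = (i - 1) / (n - 1)          # progress in [0, 1]
            e = t * t * t                  # easing value e(t)
            printf "%d\n", int(p + e * (target - p) + 0.5)
        }
    }'
}

cubic_counts 6 100 5000   # prints 100 139 414 1158 2609 5000, one per line
```

The first and last values equal `--products` and `--target`, and most of the growth lands in the last versions, as expected for a back-loaded curve.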
The named curves fall into two families that represent different dataset evolutions:
- Monotonic ramps: `linear` and every `ease*` curve. They grow once from `--products` to `--target`; they differ only in where along the versions the growth is concentrated (start, end, or middle).
- Periodic patterns: `sineWave` and `sineStairs`. They repeat `--patterns` times to model several successive data-acquisition campaigns instead of a single ramp. `sineWave` oscillates between `--products` and `--target` (acquire then release), while `sineStairs` accumulates monotonically to `--target` in `--patterns` visible steps.
Pick the curve whose shape matches the evolution you want to benchmark:
| Dataset evolution to model | Curve(s) | Example generate-n args |
|---|---|---|
| Steady, constant-rate growth | `linear` | `-v 10 -t 5000 -e linear` |
| Slow start, late surge (delayed adoption) | `easeInQuad` · `easeInCubic` · `easeInExpo` | `-v 10 -t 5000 -e easeInCubic` |
| Fast start, then saturation / plateau | `easeOutQuad` · `easeOutExpo` · `easeOutCirc` | `-v 10 -t 5000 -e easeOutExpo` |
| S-curve adoption (slow → fast → slow) | `easeInOutSine` · `easeInOutCubic` | `-v 10 -t 5000 -e easeInOutCubic` |
| Growth that overshoots then corrects | `easeOutBack` · `easeOutElastic` | `-v 12 -t 5000 -e easeOutBack` |
| Settling with a few bounces | `easeOutBounce` | `-v 16 -t 5000 -e easeOutBounce` |
| Repeated acquire-then-release cycles | `sineWave` + `--patterns` | `-v 13 -t 5000 -e sineWave -P 3` |
| Cumulative acquisition over N campaigns | `sineStairs` + `--patterns` | `-v 13 -t 5000 -e sineStairs -P 3` |
For periodic curves, use `--versions >= 4 × --patterns` so each cycle is sampled by enough versions to be visible in the generated sequence.
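The sampling rule above is easy to verify before launching a long generation. The check below is a hypothetical pre-flight helper, not something generate-n is known to enforce.

```shell
# Hypothetical pre-flight check (not part of generate-n): succeeds when
# versions >= 4 * patterns, the sampling rule of thumb stated above.
enough_versions() {
    [ "$1" -ge $((4 * $2)) ]
}

if enough_versions 13 3; then
    echo "13 versions are enough to sample 3 cycles"
fi
```

With `-P 3`, the smallest version count that passes is 12; the examples in this document use 13 so that the last version closes the final cycle.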
The options above describe a single history. Sometimes you instead want several concurrent histories of the same dataset — they all start from the same first version and then evolve differently. This models situations such as parallel data-acquisition pipelines, A/B growth scenarios, or competing forecasts feeding the same initial graph. --concurrent produces exactly that:
- Version 1 is generated once and is the common root shared by every branch (`dataset-1.<format>` in `--outdir`); it is byte-identical for all branches.
- Versions 2..N evolve independently per branch, each following its own easing curve and, optionally, its own target and pattern count.
--concurrent takes a comma-separated list of branch specs curve[@target[@patterns]], where the optional @target and @patterns default to the global --target and --patterns. It requires --target (which also enables easing mode). Each branch is written to its own branch-NN-<curve>/ sub-directory under --outdir (default concurrent/), and the diff of its version 2 is computed against the shared root — so the branches share their first version but diverge from version 2 onward.
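To make the branch-spec syntax concrete, the sketch below splits a `--concurrent` value into its branches and applies the global defaults for any omitted `@target` or `@patterns`. The function `list_branches` is hypothetical and only mirrors the documented syntax and directory naming; the real parsing inside generate-n may differ.

```shell
# Hypothetical parser (not part of generate-n) for the documented spec
# syntax curve[@target[@patterns]]; prints one line per branch.
list_branches() {
    specs=$1
    default_target=$2      # stands in for the global --target
    default_patterns=$3    # stands in for the global --patterns
    n=0
    old_ifs=$IFS
    IFS=,
    for spec in $specs; do
        IFS=@
        set -- $spec       # $1=curve, $2=target (optional), $3=patterns (optional)
        n=$((n + 1))
        printf 'branch-%02d-%s target=%s patterns=%s\n' \
            "$n" "$1" "${2:-$default_target}" "${3:-$default_patterns}"
    done
    IFS=$old_ifs
}

list_branches "linear,easeInExpo@8000,sineWave@5000@4" 5000 1
# branch-01-linear target=5000 patterns=1
# branch-02-easeInExpo target=8000 patterns=1
# branch-03-sineWave target=5000 patterns=4
```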
# 3 concurrent 6-version histories sharing the same first version (100 products),
# all growing toward 5000 but along different curves
./generate-n -v 6 -p 100 -t 5000 -c "linear,easeInCubic,easeOutExpo"
# Per-branch target/patterns overrides: steady growth, an early surge to 8000,
# and 4 acquire-then-release acquisition cycles between 100 and 5000
./generate-n -v 13 -p 100 -t 5000 -o histories \
  -c "linear,easeInExpo@8000,sineWave@5000@4"

Resulting layout for the first example:
concurrent/
├── dataset-1.ttl # shared first version (common root)
├── branch-01-linear/
│ ├── dataset-2.ttl … dataset-6.ttl
│ └── dataset-{2..6}_additions.nt / _deletions.nt # diffs vs. the previous version
├── branch-02-easeInCubic/
│ └── …
└── branch-03-easeOutExpo/
└── …
--concurrent overrides the global --easing; combine it with --var and --format exactly as in single-lineage mode (those apply to every branch and to the shared root).
--easing accepts either a named curve or any raw awk expression of t (the version progress in [0, 1]).
Variants. Every family comes in three flavors:
- `easeIn*`: slow start, fast finish (growth back-loaded toward the last versions).
- `easeOut*`: fast start, slow finish (growth front-loaded toward the first versions).
- `easeInOut*`: slow at both ends, fast in the middle.
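The difference between the three flavors shows up clearly when they are sampled at a few points. The sketch below uses the commonly published quadratic easing formulas; that generate-n defines its Quad family with exactly these expressions is an assumption, and `quad_variants` is a made-up name.

```shell
# Commonly published quadratic easing definitions, used here as an
# assumption about the Quad family; prints e(t) for the three variants.
quad_variants() {
    awk 'BEGIN {
        for (k = 1; k <= 3; k++) {
            t = k / 4
            ein = t * t                                   # easeIn: back-loaded
            eout = 1 - (1 - t) * (1 - t)                  # easeOut: front-loaded
            einout = (t < 0.5) ? 2 * t * t : 1 - (2 - 2 * t) * (2 - 2 * t) / 2
            printf "t=%.2f in=%.4f out=%.4f inout=%.4f\n", t, ein, eout, einout
        }
    }'
}

quad_variants
# t=0.25 in=0.0625 out=0.4375 inout=0.1250
# t=0.50 in=0.2500 out=0.7500 inout=0.5000
# t=0.75 in=0.5625 out=0.9375 inout=0.8750
```

At the halfway point the ease-in variant has covered a quarter of the range, the ease-out variant three quarters, and the ease-in-out variant exactly half.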
How to read the plots. Each plot shows the normalized curve e(t) over t ∈ [0, 1]: the dashed square is the [0, 1] reference range, the lower line is the e = 0 baseline (the --products level), the blue line is the curve, and the two red dots are the endpoints. Monotonic curves start at --products (t = 0) and end at --target (t = 1); curves that leave the top or bottom of the square overshoot the range (see Back and Elastic). The periodic curves (sineWave, sineStairs) repeat their shape --patterns times — their plots use --patterns = 3, and sineWave returns to --products at the end rather than reaching --target.
| Curve | Shape | awk formula e(t) | Behavior |
|---|---|---|---|
| `linear` | | `t` | Constant rate: versions are spaced evenly. |
| Sine | | | Gentle trigonometric acceleration. |
| Quad | | | Mild polynomial acceleration (t²). |
| Cubic | | | Stronger acceleration (t³). |
| Quart | | | Steep acceleration (t⁴). |
| Quint | | | Very steep acceleration (t⁵). |
| Expo | | | Extreme: almost flat, then explosive (2^t). |
| Circ | | | Circular arc: abrupt near one end. |
| Back | | | Overshoots slightly past the bound before settling (anticipation). |
| Elastic | | | Springs / oscillates around the bounds (rubber-band). |
| Bounce | | | Bounces like a ball coming to rest. |
The periodic curves (`sineWave`, `sineStairs`) repeat `--patterns` times to model several successive acquisition campaigns. The `awk` expression reads `patterns` from `--patterns`; the plots below use `--patterns = 3`.
Custom curves. Any value that is not a known curve name is treated as a custom `awk` expression of `t`, so you can supply your own (e.g. `-e 't*t*t'`, equivalent to `easeInCubic`). A custom expression must reference `t`; a bare word or a constant is rejected to avoid silently flattening every version. Custom expressions may also use `patterns` (the `--patterns` value) to build their own periodic curve, e.g. `-e '(1-cos(2*pi*patterns*t))/2' -P 4`.

The illustrations above plot these exact `awk` expressions (the same ones defined in `generate-n`).
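A custom expression can be previewed before it is handed to `-e`. The sketch below splices the expression into a small awk program and prints the resulting product counts. The helper `eval_curve` is hypothetical, and both the definition of `pi` through `atan2` and the `int(x + 0.5)` rounding are assumptions rather than the tool's confirmed behavior. Only pass expressions you trust, since the text is executed by awk as-is.

```shell
# Hypothetical previewer (not part of generate-n): evaluates a custom
# expression of t (and optionally patterns) for every version.
# Usage: eval_curve EXPR VERSIONS PRODUCTS TARGET [PATTERNS]
eval_curve() {
    expr_t=$1
    awk -v n="$2" -v p="$3" -v target="$4" -v patterns="${5:-1}" "BEGIN {
        pi = atan2(0, -1)                  # assumption: pi is defined this way
        for (i = 1; i <= n; i++) {
            t = (i - 1) / (n - 1)
            e = $expr_t                    # the custom expression, spliced in
            print int(p + e * (target - p) + 0.5)
        }
    }"
}

# Periodic expression quoted in the text, sampled with 13 versions and 3 cycles.
eval_curve '(1-cos(2*pi*patterns*t))/2' 13 100 5000 3
```

With these arguments the counts cycle through 100, 2550, 5000, 2550 three times and end back at 100, an oscillation between the `--products` and `--target` values.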
Examples:
# 5 versions ramping from 100 to 5000 products along an ease-in/out sine curve
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 5 -p 100 -t 5000 -e easeInOutSine
# Front-loaded growth via a custom expression
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 5 -p 100 -t 5000 -e 'sqrt(t)'
# Multiple data acquisition: 3 campaigns accumulating to 5000 products (13 versions)
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 13 -p 100 -t 5000 -e sineStairs -P 3
# Multiple data acquisition: 3 acquire-then-release cycles between 100 and 5000 products
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 13 -p 100 -t 5000 -e sineWave -P 3
# Concurrent histories: same first version, 3 different evolutions toward 5000
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 6 -p 100 -t 5000 -c "linear,easeInCubic,easeOutExpo"
# Concurrent histories with per-branch targets/patterns, written under "histories/"
docker run -v "$PWD:/app/data" vcity/bsbm generate-n -v 13 -p 100 -t 5000 -o histories -c "linear,easeInExpo@8000,sineWave@5000@4"

For each version >= 2, generate-n automatically computes the RDF diff between consecutive versions and outputs:
- `dataset-X_additions.nt`: triples present in version X but not in version X-1
- `dataset-X_deletions.nt`: triples present in version X-1 but not in version X
These files are always generated in N-Triples format regardless of the --format option, since they are computed by comparing sorted N-Triples representations of each version. In --concurrent mode the diffs live inside each branch-NN-<curve>/ directory, and each branch's dataset-2_* files describe the change from the shared root (dataset-1) to that branch's version 2.
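The idea of diffing sorted N-Triples files can be reproduced with standard tools. The sketch below is a stand-in that uses `sort` and `comm`, not the Java `DatasetDiff` step itself; the function name `nt_diff` and the sample triples are invented for the demonstration.

```shell
# Hypothetical stand-in for the DatasetDiff step (which runs in a JVM):
# derives additions and deletions from two N-Triples files with sort + comm.
nt_diff() {
    old=$1
    new=$2
    prefix=$3
    tmp=$(mktemp -d)
    LC_ALL=C sort -u "$old" > "$tmp/old.sorted"
    LC_ALL=C sort -u "$new" > "$tmp/new.sorted"
    # comm -13 keeps lines only in the second file, comm -23 only in the first.
    LC_ALL=C comm -13 "$tmp/old.sorted" "$tmp/new.sorted" > "${prefix}_additions.nt"
    LC_ALL=C comm -23 "$tmp/old.sorted" "$tmp/new.sorted" > "${prefix}_deletions.nt"
    rm -rf "$tmp"
}

# Tiny demo with two made-up versions.
work=$(mktemp -d)
printf '%s\n' '<urn:a> <urn:p> "1" .' '<urn:b> <urn:p> "2" .' > "$work/dataset-1.nt"
printf '%s\n' '<urn:b> <urn:p> "2" .' '<urn:c> <urn:p> "3" .' > "$work/dataset-2.nt"
nt_diff "$work/dataset-1.nt" "$work/dataset-2.nt" "$work/dataset-2"
cat "$work/dataset-2_additions.nt"   # <urn:c> <urn:p> "3" .
cat "$work/dataset-2_deletions.nt"   # <urn:a> <urn:p> "1" .
```

Sorting under `LC_ALL=C` keeps the ordering byte-based, which `comm` requires to compare the two files reliably.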
Generation runs two JVMs (the dataset generator, and the DatasetDiff step inside generate-n), both with a default heap of 2 GB. A single environment variable, BSBM_XMX, sizes both — raise it when generating large datasets (high --products/--target):
| Process | Footprint grows with |
|---|---|
| `generate` (the dataset generator) | The product count: it holds the whole model in memory while serializing. |
| `generate-n` diff step (`DatasetDiff`) | The version size: it loads both consecutive N-Triples versions into memory and sorts them. |
generate-n runs generate as a child process, so setting BSBM_XMX once in front of generate-n covers both: it is inherited by the generator subprocess and read directly by the diff step.
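The inheritance described above relies on ordinary environment-variable behavior. The sketch below shows one way a wrapper script might map `BSBM_XMX` to the JVM `-Xmx` option with the documented 2 GB default; `heap_flag` is a hypothetical name, and the `2g` spelling of the default is an assumption.

```shell
# Hypothetical wrapper logic (not the actual script): derive the JVM heap
# flag from BSBM_XMX, falling back to the documented 2 GB default.
heap_flag() {
    echo "-Xmx${BSBM_XMX:-2g}"
}

heap_flag                                          # -Xmx2g when BSBM_XMX is unset
# A variable set in front of a command is exported to that command and
# to every child process it starts:
BSBM_XMX=8g sh -c 'echo "child sees $BSBM_XMX"'    # child sees 8g
```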
BSBM_XMX=8g ./generate-n -v 10 -p 100 -t 50000 -c "linear,easeInCubic"

If you want more information about the different arguments, please refer to the original documentation.
docker run vcity/bsbm generate -help
docker run vcity/bsbm generate-n -help
docker run vcity/bsbm qualification -help
docker run vcity/bsbm testdriver -help

$PWD is the directory where the data will be stored. You can change it to any directory you want.
Dockerfile:
- Added new authors
- Dockerized the benchmark
entrypoint.sh:
- Added a new entrypoint script to run the benchmark