blog/2026-04-07-noop-drivers/index.md (new file, 66 additions)
---
slug: measuring-stroppy-before-measuring-databases
title: "Measuring Stroppy Before Measuring Databases"
authors: [stroppy-authors]
tags: [development, drivers, noop, pg-noop, internals]
---

If stroppy tops out at 12 000 iterations per second, no database under test will appear faster than that, regardless of its actual performance. In other words, stroppy's own throughput caps benchmark results.

Stroppy already has two points where it communicates with the outside world: the driver layer, which handles query construction and dispatch, and the wire protocol layer beneath it. We added a noop sink to each — an in-process driver that discards all operations and a standalone pg-noop server that speaks the full PostgreSQL wire protocol but returns empty results. Together they give us throughput ceilings with and without the protocol stack involved.

<!-- truncate -->

## The Built-in Noop Driver

The first layer is a driver inside stroppy that accepts all operations and discards them. The data generator runs in full, queries are built and parameterized correctly, but nothing reaches a socket. It landed in [PR #61](https://github.com/stroppy-io/stroppy/pull/61).

It serves two purposes. During development, it gives us a full E2E path without requiring a database — useful for testing workload scripts and driver logic before spinning anything up. And during benchmarking, it measures stroppy's absolute throughput ceiling: the full stroppy cost with no network and no database in the picture.

Running the TPC-C `pick` workload against the noop driver on our test machine (Intel Core Ultra 7 155H, 22 cores, 32 GB RAM):

```bash
stroppy run tpcc/pick -d noop -- --vus 8 --duration 30s
```

gives about **100 000 iterations/s** at 8 VUs. That's the maximum stroppy can produce on this hardware. If we saw a real database approach this number, we'd know the database isn't the bottleneck — stroppy is.

## pg-noop: Adding the Protocol Layer

The second layer is a standalone server: [pg-noop](https://github.com/stroppy-io/pg-noop). It listens on a regular PostgreSQL port and speaks the full wire protocol — simple query, extended query, COPY — but treats every statement as a no-op and returns mechanically valid empty responses.

```bash
# Install and run
curl -LsSf https://github.com/stroppy-io/pg-noop/releases/latest/download/pg-noop-installer.sh | sh
pgnoop

# Run stroppy against it as a regular postgres target
stroppy run tpcc/pick -- --vus 8 --duration 30s
```

No configuration needed on stroppy's side — it connects to localhost:5432 and sees a normal PostgreSQL server.

The implementation is built on [pgwire](https://github.com/sunng87/pgwire), a Rust library that several newer databases use as their PostgreSQL wire protocol layer. On top of it: a `NoopHandler` implementing four traits (`StartupHandler`, `SimpleQueryHandler`, `ExtendedQueryHandler`, `CopyHandler`), a Tokio acceptor loop, and jemalloc for the musl build. Structurally, pg-noop is a PostgreSQL server — it just has no storage behind it.

The same workload against pg-noop yields about **41 000 iterations/s** at VUS=8 — versus 100 000 against the in-process noop. The gap is the PostgreSQL wire protocol overhead on localhost: connection management, query serialization, network round-trips, response parsing. At VUS=1 the uncontended per-transaction cost difference is about 51 µs — the protocol cost with no database work.

| Driver | VUS=1 | VUS=8 |
|--------|------:|------:|
| noop (in-process) | 29 421/s | 100 419/s |
| pg-noop (TCP localhost) | 11 814/s | 41 352/s |

## What These Numbers Tell Us

The noop ceiling is the hard upper bound on stroppy's throughput on this hardware — any benchmark result near it means stroppy is the bottleneck, not the database. The pg-noop ceiling adds the protocol layer: if a real database sits close to that number, the cost is mostly in the client, not the server.

These numbers also characterize the test machine itself, which matters when comparing results across environments or hardware generations.

One more use: pg-noop can run on a remote node. Pointing stroppy at it over real network infrastructure gives a clean measurement of what latency and bandwidth alone cost, without any database noise. That's useful when evaluating multi-node or cross-datacenter setups before involving an actual database cluster.

---

Writing a driver for a different database follows the same pattern as the noop driver — see [Extensibility](/docs/extensibility) for a walkthrough.

---

The numbers in the tables above are after the generator-pipeline optimizations — the [next post](/blog/stroppy-generator-performance) covers the profiling sessions and changes that produced them.
blog/2026-04-08-stroppy-perf/e2e_throughput.gnu (new file, 29 additions)
set terminal pngcairo size 1800,960 enhanced font "Sans,12" fontscale 2 linewidth 2
set output "e2e_throughput.png"

set style data histogram
set style histogram clustered gap 1
set style fill solid 0.85 border -1
set boxwidth 0.8

set title "tpcc/pick throughput — noop driver (median of 3×30s runs)" font "Sans,13"
set xlabel "Virtual Users"
set ylabel "Iterations / second"
set yrange [0:130000]
set format y "%'.0f"
set grid y lt 0 lc "grey" lw 0.5
set key top left

set xtics ("1" 0, "2" 1, "4" 2, "8" 3, "16" 4)

# before, after
$data << EOD
"1" 25424 29421
"2" 43175 50072
"4" 65919 74566
"8" 90264 100419
"16" 100170 111489
EOD

plot $data using 2:xtic(1) title "before" lc rgb "#5778a4", \
$data using 3 title "after" lc rgb "#e49444"
Binary file added blog/2026-04-08-stroppy-perf/e2e_throughput.png
blog/2026-04-08-stroppy-perf/index.md (new file, 136 additions)
---
slug: stroppy-generator-performance
title: "How Fast Is Stroppy? Profiling the Generator Pipeline"
authors: [stroppy-authors]
tags: [performance, internals, generators, profiling, tpc-c]
---

The [previous post](/blog/measuring-stroppy-before-measuring-databases) set up the measurement methodology: the noop driver and pg-noop let us measure stroppy's own overhead with no database involved. Using that setup, we ran pprof against stroppy to understand where its time goes. The generator pipeline was the main target, and this post covers what the profiles showed and what we changed.

Stroppy's Go generator pipeline had a handful of avoidable allocations and one hot path doing redundant work on every query. Fixing them is the subject of a single PR ([PR #62](https://github.com/stroppy-io/stroppy/pull/62)). The end-to-end throughput improvement is about 11–16% on a steady-state workload; the data loading phase improves by 3.7× — that phase runs entirely in Go, without the per-iteration overhead of sobek, the JavaScript runtime stroppy uses to execute workload scripts.

<!-- truncate -->

## What pprof Showed

To profile, we run stroppy with `--profiling-enabled` and pull a profile from the built-in pprof endpoint:

```bash
stroppy run tpcc/pick -d noop -- --vus 8 --duration 10m --profiling-enabled
# in another terminal:
go tool pprof -http=:8080 http://localhost:6565/debug/pprof/profile?seconds=30
```

Three things stood out in the CPU and heap profiles:

1. **`WordCutter.Cut`** — every generated string copied its buffer via `strings.Builder.String()`. One 16-byte heap allocation per string field, per row, everywhere.
2. **`ProcessArgs`** — the function that rewrites named SQL placeholders (`:w_id` → `$1`) ran a regex scan and `strings.Builder` assembly on every single query execution. SQL templates don't change between calls, but nothing was cached.
3. **`*stroppy.Value` boxing** — every generated value was wrapped in a protobuf oneof struct before being handed to the driver. One allocation per value, regardless of type.

These three accounted for most of the non-sobek heap activity.

## The Fixes

### String generation

The `strings.Builder` approach in `WordCutter` accumulates characters into a buffer and then calls `.String()` to return a copy. The copy is unnecessary — the caller only needs the string long enough to pass it to the driver, and the buffer will be reset immediately after.

The fix: replace `.String()` with `unsafe.String(unsafe.SliceData(buf), len(buf))`. This returns a string header that aliases the internal buffer directly. No allocation. The lifetime contract is the same as before — valid until the next `Cut()` call — which the caller already respected.
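
A minimal sketch of the pattern (Go 1.20+ for `unsafe.String` and `unsafe.SliceData`; the names and the `tape` parameter are illustrative, not stroppy's exact API):

```go
package gen

import "unsafe"

// WordCutter reuses a single buffer across calls.
type WordCutter struct {
	buf []byte
}

// Cut refills the buffer and returns a string header that aliases it
// directly: no copy, no allocation. The result is valid only until the
// next Cut call, the same lifetime contract the Builder version had.
func (w *WordCutter) Cut(n int, tape func() byte) string {
	w.buf = w.buf[:0]
	for i := 0; i < n; i++ {
		w.buf = append(w.buf, tape())
	}
	return unsafe.String(unsafe.SliceData(w.buf), len(w.buf))
}
```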

We also rewrote the character tape. The original picked a random Unicode range, then called `IntN` twice per character. The new version caches one `uint64` from the PRNG and extracts `log2(alphabetSize)` bits per character with a bitmask. For a 52-character alphabet (A-Z, a-z), that's one PRNG call per ~10 characters instead of two per character. The lookup table is sized to the next power of two for cheap masking.
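
A sketch of the tape, under the same caveat about illustrative names. Note that the wrap-around padding makes the first twelve letters slightly more likely; the real implementation may compensate for that differently:

```go
const letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

const (
	alphaBits = 6                // log2(64), next power of two above 52
	alphaMask = 1<<alphaBits - 1 // 0b111111
)

// Table padded to 64 entries by wrapping, so indexing is a plain mask.
var table = func() (t [1 << alphaBits]byte) {
	for i := range t {
		t[i] = letters[i%len(letters)]
	}
	return
}()

type charTape struct {
	rng  func() uint64 // PRNG source, e.g. (*rand.Rand).Uint64
	bits uint64        // cached random bits
	left int           // characters still extractable from the cache
}

func (c *charTape) next() byte {
	if c.left == 0 {
		c.bits = c.rng()        // one PRNG call per ~10 characters
		c.left = 64 / alphaBits // ten 6-bit slices per uint64
	}
	ch := table[c.bits&alphaMask]
	c.bits >>= alphaBits
	c.left--
	return ch
}
```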

Combined: `WordCutter.Cut` went from 161 ns / 1 alloc to 39 ns / 0 allocs.

### SQL template caching

`ProcessArgs` was doing real work on every call: a regex scan to find `:param` tokens, and a `strings.Builder` pass to rebuild the SQL with positional placeholders. All of this is determined by the SQL template and the dialect — it doesn't depend on the argument values at all.

The fix is a `sync.Map` keyed by `dialect.Placeholder(0) + "|" + sqlStr`. The first call for each `(dialect, sql)` pair parses and stores a `parsedQuery` struct. Every subsequent call does a map lookup and fills the argument slice. The cache is capped at 1 000 entries.
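
In outline (the `parsedQuery` fields and `parseTemplate` are stand-ins, and the 1 000-entry cap is omitted for brevity):

```go
package sqlcache

import "sync"

// parsedQuery caches everything that depends only on the template and the
// dialect: the rewritten SQL and the positional parameter order.
type parsedQuery struct {
	sql    string   // ":w_id" rewritten to "$1", "$2", ...
	params []string // parameter names in positional order
}

var cache sync.Map // key: dialect.Placeholder(0) + "|" + sqlStr

// processArgs returns the rewritten SQL plus a positional argument slice;
// parseTemplate stands in for the original regex + strings.Builder pass.
func processArgs(dialectKey, sqlStr string, args map[string]any,
	parseTemplate func(string) *parsedQuery) (string, []any) {
	key := dialectKey + "|" + sqlStr
	v, ok := cache.Load(key)
	if !ok { // cold path: runs once per (dialect, sql) pair
		v, _ = cache.LoadOrStore(key, parseTemplate(sqlStr))
	}
	pq := v.(*parsedQuery)
	out := make([]any, len(pq.params)) // hot path: lookup + slice fill
	for i, name := range pq.params {
		out[i] = args[name]
	}
	return pq.sql, out
}
```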

Result: 4 636 ns / 13 allocs → 186 ns / 2 allocs. The 96% reduction is large because the cached path is almost nothing — one map lookup, one slice fill.

### `*stroppy.Value` boxing

The generator interface previously returned `(*stroppy.Value, error)`, where `*stroppy.Value` is a protobuf oneof that wraps the actual value. Every call to `Next()` allocated a new wrapper struct.

We changed `Next()` to return `(any, error)` with native Go types: `int64`, `float64`, `uuid.UUID`, `time.Time`, `decimal.Decimal`, `*string`. Types narrower than 64 bits are widened so they fit in the interface word without a `convT` allocation.

For types that still heap-allocate (time.Time, decimal.Decimal), we use a "slotted" generator: the generator owns a single value slot in a closure and returns `*T`. A pointer is always pointer-sized, so boxing it as `any` is zero-alloc. The caller must not hold the pointer past the next `Next()` call — the same constraint as `WordCutter.Cut`.
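
A minimal sketch of the slotted shape for `time.Time`, assuming `math/rand/v2`; the constructor name is made up:

```go
package gen

import (
	"math/rand/v2"
	"time"
)

// newDateTimeGen owns a single time.Time slot inside the closure.
// Returning *time.Time means the `any` box holds a pointer, so each call
// allocates nothing; the caller must not keep the pointer past the next call.
func newDateTimeGen(rng *rand.Rand) func() (any, error) {
	var slot time.Time // reused on every call
	return func() (any, error) {
		slot = time.Unix(rng.Int64N(1<<32), 0) // overwrite the slot in place
		return &slot, nil
	}
}
```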

Result: integer and string generators dropped to zero allocations per call. DateTime: 127 ns / 4 allocs → 17 ns / 0 allocs.

### Smaller fixes

A few smaller wins rounded out the PR:

- **`UniqueDistribution`**: replaced `atomic.Pointer[T]` (allocates a new value on every CAS) with `atomic.Uint64` (a plain counter); see the sketch after this list. −68%, 1 alloc → 0.
- **`UniformDistribution` integer path**: was using `Float64()` + `math.Round()` for integer ranges. Replaced with `Uint64N(span+1)`. −43%.
- **`TupleGenerator`**: replaced a goroutine + buffered channel with an inline depth-first state machine.
- **genIDs**: `QueryBuilder` now precomputes generated ID lists at construction, not per batch.
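
Sketches of the first two, again with illustrative names and `math/rand/v2`:

```go
package dist

import (
	"math/rand/v2"
	"sync/atomic"
)

// uniqueDist as a plain counter: Add is a single atomic instruction and
// allocates nothing, unlike an atomic.Pointer[T] CAS loop that boxes a
// fresh value on every round.
type uniqueDist struct {
	next atomic.Uint64
}

func (u *uniqueDist) Next() uint64 { return u.next.Add(1) - 1 }

// uniformInt stays on the integer PRNG path: no Float64 + math.Round trip.
func uniformInt(rng *rand.Rand, lo, hi int64) int64 {
	span := uint64(hi - lo)
	return lo + int64(rng.Uint64N(span+1))
}
```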

## The Numbers

### Microbenchmarks

Running `go test -bench=. -benchmem -count=10` before and after, with `benchstat` for comparison:

![Microbenchmark ns/op, log scale](./micro_ns.png)

| Benchmark | Before | After | Δ |
|-----------|-------:|------:|---|
| `CharTape_Next` | 9.7 ns | 2.4 ns | −75% |
| `WordCutter_Cut` | 161 ns / 1 alloc | 39 ns / **0** | −76% |
| `StringGenerator_Next` | 169 ns / 1 alloc | 40 ns / **0** | −76% |
| `UniqueNumber_Next` | 17.4 ns / 1 alloc | 5.5 ns / **0** | −68% |
| `Generator_String` | 232 ns / 2 allocs | 58 ns / **0** | −75% |
| `Generator_DateTime` | 127 ns / 4 allocs | 17 ns / **0** | −87% |
| `Generator_Decimal` | 349 ns / 9 allocs | 116 ns / 1 | −67% |
| `ProcessArgs` | 4 636 ns / 13 allocs | 186 ns / 2 | **−96%** |

The log scale on the chart is needed because `ProcessArgs` (4 636 ns → 186 ns) and `CharTape_Next` (9.7 ns → 2.4 ns) live in completely different ranges.

### End-to-end throughput

The `tpcc/pick` workload against the noop driver, varying VU count. Each data point is the median of three 30-second runs (Intel Core Ultra 7 155H, 22 cores, 32 GB RAM).

![E2E throughput before/after, noop driver](./e2e_throughput.png)

| VUS | Before | After | Δ |
|----:|-------:|------:|---|
| 1 | 25 424/s | 29 421/s | +15.7% |
| 2 | 43 175/s | 50 072/s | +16.0% |
| 4 | 65 919/s | 74 566/s | +13.1% |
| 8 | 90 264/s | 100 419/s | +11.2% |
| 16 | 100 170/s | 111 489/s | +11.3% |

The gain is uniform across VU counts, which confirms it's per-operation work, not something contention-related. The scaling curve itself doesn't change — both versions plateau at the same point, limited by the sobek event loop scheduler, not the generators.

Against **pg-noop** (real TCP, full PostgreSQL wire protocol on localhost) the gain is +11.5% at VUS=1, +8.8% at VUS=8. Smaller, because the ~51 µs protocol overhead per transaction dilutes the per-operation generator savings.

### Data generation

The most visible improvement is in `load_data` — the phase that generates and inserts all TPC-C tables before the workload starts. This runs entirely in Go with no JS overhead per row, so it directly measures generation throughput.

![Per-table speedup at SF=20](./load_speedup.png)

| Scale factor | Rows | Before | After | Speedup |
|-------------:|-----:|-------:|------:|--------:|
| 1 | 231K | 1.20s | 0.32s | 3.8× |
| 20 | 2.7M | 20.2s | 5.5s | 3.7× |
| 100 | 13.1M | 103s | 27.7s | 3.7× |

The 3.7× figure is consistent across all scale factors — it's a per-row cost reduction, so the benefit scales linearly with data volume.

The `district` table (7.6×) benefits more than the others because it generates many short strings relative to its row count. `customer` and `stock` have more varied fields but the string-heavy columns (`c_data` at 300–500 characters, the ten `s_dist_*` fields) still drive most of the work.

## Where We Stopped

After these changes, a fresh pprof profile shows sobek (the Go JavaScript runtime that runs workload scripts) as the dominant cost. That's not ours to optimize. The generator pipeline is no longer visible as a distinct entry.

The steady-state improvement (+11–16%) is real but not large. It's limited by how much of the per-transaction time is spent in Go generators versus sobek. The `load_data` improvement (3.7×) is larger because that path doesn't cross the JS boundary at all — it's pure Go generation.

For practical purposes: with this change, stroppy's generator overhead is unlikely to be the bottleneck in most setups. The data loading phase is substantially faster at any scale, which makes iteration on large-scale tests faster.

---

The [previous post](/blog/measuring-stroppy-before-measuring-databases) covers the noop driver and pg-noop setup used throughout this work. The full changes are in [PR #62](https://github.com/stroppy-io/stroppy/pull/62).
blog/2026-04-08-stroppy-perf/load_speedup.gnu (new file, 31 additions)
set terminal pngcairo size 1640,880 enhanced font "Sans,12" fontscale 2 linewidth 2
set output "load_speedup.png"

set style fill solid 0.85 border -1
set style data histograms
set boxwidth 0.6

set title "load\\_data speedup by table — SF=20, noop driver" font "Sans,13"
set xlabel ""
set ylabel "Speedup (×)"
set yrange [0:9]
set grid y lt 0 lc "grey" lw 0.5
set key off

# label bars with value
set label 1 "3.6×" at 0, 3.6+0.25 center font "Sans,11"
set label 2 "7.6×" at 1, 7.6+0.25 center font "Sans,11"
set label 3 "3.9×" at 2, 3.9+0.25 center font "Sans,11"
set label 4 "3.6×" at 3, 3.6+0.25 center font "Sans,11"

# reference line at 1×
set arrow from -0.5, 1 to 3.5, 1 nohead lc rgb "#cc3333" lw 1.5 dt 2

$data << EOD
"item" 3.6
"district" 7.6
"customer" 3.9
"stock" 3.6
EOD

plot $data using 2:xtic(1) lc rgb "#e49444"
Binary file added blog/2026-04-08-stroppy-perf/load_speedup.png
blog/2026-04-08-stroppy-perf/micro_ns.gnu (new file, 29 additions)
set terminal pngcairo size 1920,1000 enhanced font "Sans,11" fontscale 2 linewidth 2
set output "micro_ns.png"

set style data histogram
set style histogram clustered gap 1
set style fill solid 0.85 border -1
set boxwidth 0.85

set title "Key microbenchmarks — ns/op (log scale)" font "Sans,13"
set xlabel ""
set ylabel "ns / op (log scale)"
set logscale y
set yrange [1:10000]
set grid y lt 0 lc "grey" lw 0.5
set key top right
set xtics rotate by -20 noenhanced

$data << EOD
# columns: label, before (ns/op), after (ns/op)
"CharTape" 9.747 2.444
"WordCutter" 160.70 39.11
"Uniq_Next" 17.355 5.483
"Gen_String" 232.10 58.46
"Gen_DateTime" 127.45 16.64
"ProcessArgs" 4635.5 185.7
EOD

plot $data using 2:xtic(1) title "before" lc rgb "#5778a4", \
$data using 3 title "after" lc rgb "#e49444"
Binary file added blog/2026-04-08-stroppy-perf/micro_ns.png