From ee105e8c0bb96bdf25d2d7ba497cbc1a82e761dd Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Sat, 18 Apr 2026 06:50:42 -0700 Subject: [PATCH 1/6] docs: complete bench methodology + tooling + ops gotchas under benchmark/ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a strictly-additive benchmark/ directory documenting the methodology, tooling, and operational lessons from the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench that backs #61 and PR #62. - README.md: entry point + quick-start - METHODOLOGY.md: methodology fix per review feedback - OPS_GOTCHAS.md: 15 operational lessons (NEW — NVMe mount, partman stale rows, que func leftovers, pgboss covering index, pgq ticker, pgque xid8 bug, spot reclaim, ASH prereqs, NOTICE instrumentation, etc.) - HARDWARE.md: i4i.2xlarge specs, PG tuning, microbench baselines - tooling/, runners/, consumers/, producers/, install/, charts/, gifs/ No pgque production SQL is touched. Refs: #61, #62. Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmark/HARDWARE.md | 71 ++ benchmark/METHODOLOGY.md | 206 ++++++ benchmark/OPS_GOTCHAS.md | 259 ++++++++ benchmark/README.md | 97 +++ benchmark/charts/r5_analyze.py | 137 ++++ benchmark/charts/r6_smoke_chart.py | 123 ++++ benchmark/consumers/consumer_pgboss.sql | 18 + .../consumers/consumer_pgmq-partitioned.sql | 14 + benchmark/consumers/consumer_pgmq.sql | 15 + benchmark/consumers/consumer_pgq.sql | 14 + benchmark/consumers/consumer_pgque.sql | 16 + benchmark/consumers/consumer_que.sql | 18 + benchmark/consumers/consumer_river.sql | 18 + benchmark/gifs/r4_gif_tps_solarized.py | 618 ++++++++++++++++++ benchmark/gifs/r4_gif_v17_solarized.py | 610 +++++++++++++++++ benchmark/install/README.md | 19 + benchmark/install/bootstrap.sh | 95 +++ benchmark/install/install_pgboss.sh | 70 ++ benchmark/install/install_pgmq.sh | 11 + benchmark/install/install_pgq.sh | 30 + benchmark/install/install_river.sh | 80 +++ .../install/pgmq-partitioned_setup_5min.sql | 6 + benchmark/producers/producer_pgboss.sql | 2 + .../producers/producer_pgmq-partitioned.sql | 1 + benchmark/producers/producer_pgmq.sql | 1 + benchmark/producers/producer_pgq.sql | 1 + benchmark/producers/producer_pgque.sql | 1 + benchmark/producers/producer_que.sql | 4 + benchmark/producers/producer_river.sql | 2 + benchmark/runners/clean_reinstall.sh | 76 +++ benchmark/runners/fix_nvme_mount.sh | 65 ++ benchmark/runners/run_r7.sh | 115 ++++ benchmark/tooling/bloat_sampler.py | 86 +++ benchmark/tooling/idle_in_tx.py | 30 + benchmark/tooling/microbench.sh | 33 + benchmark/tooling/parse_events_consumed.py | 169 +++++ .../tooling/pg_stat_statements_snapshot.py | 72 ++ benchmark/tooling/pgq_ticker_daemon.py | 42 ++ benchmark/tooling/sys_metrics_sampler.py | 121 ++++ 39 files changed, 3366 insertions(+) create mode 100644 benchmark/HARDWARE.md create mode 100644 benchmark/METHODOLOGY.md create mode 100644 benchmark/OPS_GOTCHAS.md create mode 100644 benchmark/README.md create mode 100644 benchmark/charts/r5_analyze.py create mode 100644 benchmark/charts/r6_smoke_chart.py create mode 100644 benchmark/consumers/consumer_pgboss.sql create mode 100644 benchmark/consumers/consumer_pgmq-partitioned.sql create mode 100644 benchmark/consumers/consumer_pgmq.sql create mode 100644 benchmark/consumers/consumer_pgq.sql create mode 100644 benchmark/consumers/consumer_pgque.sql create mode 100644 benchmark/consumers/consumer_que.sql create mode 100644 benchmark/consumers/consumer_river.sql create mode 100644 benchmark/gifs/r4_gif_tps_solarized.py create mode 100644 benchmark/gifs/r4_gif_v17_solarized.py create mode 100644 benchmark/install/README.md create mode 100644 benchmark/install/bootstrap.sh create mode 100644 benchmark/install/install_pgboss.sh create mode 100644 benchmark/install/install_pgmq.sh create mode 100644 benchmark/install/install_pgq.sh create mode 100644 benchmark/install/install_river.sh create mode 100644 benchmark/install/pgmq-partitioned_setup_5min.sql create mode 100644 benchmark/producers/producer_pgboss.sql create mode 100644 benchmark/producers/producer_pgmq-partitioned.sql create mode 100644 benchmark/producers/producer_pgmq.sql create mode 100644 benchmark/producers/producer_pgq.sql create mode 100644 benchmark/producers/producer_pgque.sql create mode 100644 benchmark/producers/producer_que.sql create mode 100644 benchmark/producers/producer_river.sql create mode 100644 benchmark/runners/clean_reinstall.sh create mode 100644 benchmark/runners/fix_nvme_mount.sh create mode 100644 benchmark/runners/run_r7.sh create mode 100644 benchmark/tooling/bloat_sampler.py create mode 100644 benchmark/tooling/idle_in_tx.py create mode 100644 benchmark/tooling/microbench.sh create mode 100755 benchmark/tooling/parse_events_consumed.py create mode 100755 benchmark/tooling/pg_stat_statements_snapshot.py create mode 100644 benchmark/tooling/pgq_ticker_daemon.py create mode 100644 benchmark/tooling/sys_metrics_sampler.py diff --git a/benchmark/HARDWARE.md b/benchmark/HARDWARE.md new file mode 100644 index 00000000..e171b937 --- /dev/null +++ b/benchmark/HARDWARE.md @@ -0,0 +1,71 @@ +# Hardware — VM sizing and microbench baselines + +## VM: AWS i4i.2xlarge (us-east-2) + +| Spec | Value | +|---|---| +| vCPU | 8 (Intel Ice Lake Xeon 8375C, 2.9 GHz base / 3.5 GHz turbo) | +| RAM | 64 GB | +| NVMe instance store | 1 × 1.75 TB (physical attach, NVMe) | +| Network | Up to 12 Gbps | +| EBS (root) | 8 GB gp3 (Ubuntu 24.04 AMI default) | +| Spot price (us-east-2, 2026-04) | ~$0.20–0.30 / hour | +| On-demand price | $0.686 / hour | + +See [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md) — the NVMe instance store is **not** auto-mounted on Ubuntu 24.04 boot. + +## Expected microbench baselines + +Run via [tooling/microbench.sh](tooling/microbench.sh). Expected order-of-magnitude numbers: + +| Probe | Expected | +|---|---| +| sysbench cpu (1-thread events/sec) | ~25 k | +| sysbench memory bandwidth (1-thread) | ~15 GiB/sec | +| fio 4 k randwrite, QD=32, direct=1, on NVMe | ~300 k IOPS | +| fio 4 k randwrite, bandwidth | ~1.2 GiB/sec | +| fio 4 k randwrite, 99 p latency | ~100 µs | +| fio 4 k randread, QD=32, on NVMe | ~400 k IOPS | + +*Actual R7 microbench numbers to be filled in from the R7 microbench pass.* + +## Postgres tuning (shared across all 7 VMs) + +Applied by each VM's bootstrap (see `install/install_*.sh`): + +``` +shared_preload_libraries = 'pg_stat_statements,pg_cron' +cron.database_name = 'bench' + +shared_buffers = 4GB +effective_cache_size = 12GB + +synchronous_commit = off +wal_level = minimal +wal_compression = lz4 +max_wal_size = 16GB +checkpoint_completion_target = 0.9 + +bgwriter_delay = 50ms +bgwriter_lru_maxpages = 400 +bgwriter_lru_multiplier = 4.0 + +random_page_cost = 1.1 +effective_io_concurrency = 200 +max_connections = 200 + +max_wal_senders = 0 + +autovacuum_vacuum_scale_factor = 0.01 +autovacuum_analyze_scale_factor = 0.01 +autovacuum_vacuum_cost_delay = 2ms + +jit = off +listen_addresses = 'localhost' +``` + +`synchronous_commit=off` is deliberate — queue workloads are almost always idempotent at the application layer, and the WAL-flush path is the dominant cost for low-latency producers. It's the only PG knob we touch that materially changes safety posture. + +`jit=off` because 5-s JIT warmups on `DO` blocks dominated our first-transaction latency in R4. + +`autovacuum_*_scale_factor=0.01` is aggressive on purpose — we want autovacuum attempting to clean every 1 % dead-tuple ratio so the held-xmin phase of the bench exposes the *inability* to vacuum, not a lazy autovacuum schedule. diff --git a/benchmark/METHODOLOGY.md b/benchmark/METHODOLOGY.md new file mode 100644 index 00000000..a61898a4 --- /dev/null +++ b/benchmark/METHODOLOGY.md @@ -0,0 +1,206 @@ +# Bench methodology — the definitive reference + +Adapted from GitLab issue [postgres-ai/postgresql-consulting/tests-and-benchmarks#77, note 3263767264](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/77#note_3263767264). + +This document is the single source of truth for how the PostgreSQL-queue bench is structured, what it measures, and every script/config that makes it work. It is written so a reviewer can reproduce the whole thing end-to-end. + +Cross-links: + +- Upstream fix: [pgque PR #62](https://github.com/NikolayS/pgque/pull/62) · issue [NikolayS/pgque#61](https://github.com/NikolayS/pgque/issues/61) +- GitLab rounds: R4 final narrative — note 3262968456; R5 root — note 3263000815; R5 full writeup — note 3263438287; pgque rotation PoC under held xmin — note 3263417609. + +--- + +## 1. Goals and pathology + +We reproduce Brandur Leach's "Postgres queues" death spiral (see [Brandur's 2015 post](https://brandur.org/postgres-queues)): a DELETE-based work-queue becomes unvacuumably bloated the moment an unrelated backend holds `xmin` in the past (long transactions, logical-replication slots, stuck standby feedback). Dead tuples accumulate; bitmap / index scans must traverse them; latency explodes; throughput collapses even after the holder releases. + +Seven systems, each on its own VM: + +| System | Version | Pattern | +|---|---|---| +| **pgque** | patched via PR #62 | batch ticker + rotating event tables (TRUNCATE) | +| pgq | v3.5.1 PL-only | same model as pgque; upstream baseline | +| pgmq | v1.11.0 | single queue table, VISIBILITY + DELETE | +| pgmq-partitioned | v1.11.0 + pg_partman | range-partitioned queue table | +| river | v0.34 | SQL-level SKIP LOCKED + DELETE (consumer emulated) | +| que | v2.4 | SKIP LOCKED + DELETE (consumer emulated) | +| pg-boss | v12.15 | SKIP LOCKED + DELETE on partitioned `pgboss.job` | + +The Go/Ruby/Node workers are installed end-to-end so the schema is authentic, but the actual consumer load is driven via pgbench running each system's *SQL claim pattern*. This isolates the DB-side behaviour from runtime/GC artefacts. + +Workload shape (all runs): + +- Producer: 1 client, pgbench `-R 1000` rate-cap (may move to `-R 5000` in R7/R8) +- Consumer: 1 client for pgque/pgq, 4 clients for everything else +- Three 30-minute phases back-to-back = **1.5 h per run**: + 1. **Clean baseline** — no held xmin + 2. **Held xmin** — `idle_in_tx.py` holds `REPEATABLE READ` open + 3. **Clean recovery** — holder killed, observe regrowth / catchup + +--- + +## 2. Infrastructure + +- AWS us-east-2, **i4i.2xlarge** (8 vCPU, 64 GB, NVMe instance store) +- Spot where available; on-demand only as last resort (see Section 10) +- Ubuntu 24.04, **PG18 from PGDG**, `pg_cron`, `pg_stat_statements`, pg_ash, pgfr +- Data dir moved to NVMe: `/mnt/pgdata/postgresql/18/main` symlink (see [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) and [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md)) +- One VM per system so any tuning / runtime / GC behaviour is contained +- SSH key: `` (us-east-2) + +Live R7 VMs (redacted): ``, ``, ``, ``, ``, ``, ``. + +--- + +## 3. Observability stack + +Seven parallel streams run during every bench. All CSVs land in `/tmp/bench/` and are rsynced to the local `/tmp/bench_r//` tree every 30 min. + +### (a) bloat_sampler.py — pg_stat_user_tables every 30 s + +Per-system filter covers each queue's event/metadata tables (including pg-boss partitions and pgmq-partitioned's partman partitions). Writes `bloat.csv`. + +Source: [tooling/bloat_sampler.py](tooling/bloat_sampler.py). + +### (b) pg_ash — 1 Hz wait-event sampling + +pg_cron runs ash sampling at 1 Hz throughout; at end of bench we export: + +```sql +COPY (SELECT sample_time, database_name, active_backends, wait_event, query_id + FROM ash.samples(p_interval => '2 hour'::interval, p_limit => 2000000)) +TO '/tmp/bench/ash.csv' CSV HEADER; +``` + +### (c) pgfr (pg-flight-recorder) — snapshot observability + +pgfr writes snapshots on a pg_cron schedule into `pgfr_record.*`. At end of run we export: + +```sql +COPY (SELECT * FROM pgfr_record.snapshots) TO '/tmp/bench/pgfr_snapshots.csv' CSV HEADER; +COPY (SELECT * FROM pgfr_record.table_snapshots) TO '/tmp/bench/pgfr_table_snapshots.csv' CSV HEADER; +COPY (SELECT * FROM pgfr_record.statement_snapshots) TO '/tmp/bench/pgfr_statement_snapshots.csv' CSV HEADER; +``` + +Full pgfr_record schema on every VM includes: `snapshots`/`snapshots_v2` (partitioned), `table_snapshots(_v2)`, `statement_snapshots(_v2)`, `index_snapshots(_v2)`, `replication_snapshots(_v2)`, `vacuum_progress_snapshots(_v2)`, `activity_samples(_archive_v2)`, `lock_samples(_archive_v2)`, `config_snapshots`, `db_role_config_snapshots`. + +### (d) sys_metrics_sampler.py — CPU / mem / disk every 10 s (R7+) + +Reads `/proc/stat`, `/proc/meminfo`, `/proc/diskstats` directly (psutil-optional). NVMe device is `nvme1n1` (the instance store, which is what `/mnt/pgdata` sits on). v2 adds per-device IOPS and latency columns. + +Source: [tooling/sys_metrics_sampler.py](tooling/sys_metrics_sampler.py). + +### (e) pg_stat_statements_snapshot.py — pgss time-series every 10 s + +Polls pgss with a regex filter over our queue-related query shapes, diffs consecutive snapshots downstream. Used as a cross-check for NOTICE-based ev/s (important for pgque/pgq/pgmq where DO-block wrappers hide per-statement rows). + +Source: [tooling/pg_stat_statements_snapshot.py](tooling/pg_stat_statements_snapshot.py). + +### (f) pgbench `--aggregate-interval=10 --log` + +Both producer and consumer run with per-10 s aggregate logs (min/max/sum/sumsq latency). Files: `producer_agg.` and `consumer_agg..` under `/tmp/bench/`. + +### (g) Consumer NOTICE instrumentation (R6+) + +Each instrumented `consumer.sql` is wrapped in a DO block that emits exactly one `RAISE NOTICE 'ev ts= n='` per call. This gives us an authoritative per-call consumed-events stream that is immune to the pgss DO-wrapper opacity problem. + +The seven instrumented consumers are in [consumers/](consumers/). The parser for `consumer.log` NOTICE lines is [tooling/parse_events_consumed.py](tooling/parse_events_consumed.py). + +### (h) idle_in_tx.py — the death-spiral inducer + +Opens a `REPEATABLE READ` transaction, pins xmin, sleeps forever. Kill with SIGTERM to release the horizon. Source: [tooling/idle_in_tx.py](tooling/idle_in_tx.py). + +### (i) pgq_ticker_daemon.py — tight ticker loop for pgq + +pgq upstream has no built-in ticker daemon (the C one was always external). pgque runs inline; for a fair comparison we run a tight Python loop on the pgq VM that calls `pgq.ticker()` at 1 Hz and `pgq.maint_operations()` every 5 s. Source: [tooling/pgq_ticker_daemon.py](tooling/pgq_ticker_daemon.py). + +--- + +## 4. Bench runner + +Orchestrates the phase scheduler. Forks producer, consumer (both pgbench), bloat_sampler, sys_metrics_sampler, pgss snapshotter; at t=1800 s opens `idle_in_tx`; runs `VACUUM VERBOSE` at phase boundaries; at t=3600 s kills the holder; another `VACUUM VERBOSE`; at t=5400 s harvests ash/pgfr/pgss. + +Source: [runners/run_r7.sh](runners/run_r7.sh). + +--- + +## 5. Clean-slate reset + +Before every run we kill stragglers, DROP the system's schema/extensions, unschedule lingering `pg_cron` jobs, re-run `install.sh`, and reset all stats. The companion `full_reset.sql` calls `pg_stat_statements_reset()`, `pg_stat_reset()`, `pg_stat_reset_shared()` for bgwriter/checkpointer/wal/io, `TRUNCATE ash.sample`, and `DELETE FROM cron.job_run_details`. + +Source: [runners/clean_reinstall.sh](runners/clean_reinstall.sh). See also [OPS_GOTCHAS.md §4, §5, §7](OPS_GOTCHAS.md) for adjacent-schema pitfalls. + +--- + +## 6. Per-system install scripts + +Each VM has its own `/tmp/install.sh` (mirrored into [install/](install/)). Pattern differs by system: + +- **pgque** — `git clone` the PR #62 branch, `make USE_PGXS=1 install`, run the SQL installer, `SELECT pgque.create_queue('bench_queue')`, schedule `pgque.ticker()` via `pg_cron` (also called inline in `consumer.sql`). Driven from AMI user-data; see [install/README.md](install/README.md) and [install/bootstrap.sh](install/bootstrap.sh). +- **pgq** — PGDG `postgresql-18-pgq3` or `git clone --branch v3.5.1`, then `CREATE EXTENSION pgq` and immediately apply `switch_plonly.sql` to replace the C `insert_event_raw` with PL/pgSQL (pg_proc lang check `= 'plpgsql'` as a gate). See [install/install_pgq.sh](install/install_pgq.sh). +- **pgmq** — PGDG `postgresql-18-pgmq` + `CREATE EXTENSION pgmq` + `pgmq.create('bench_queue')`. See [install/install_pgmq.sh](install/install_pgmq.sh). +- **pgmq-partitioned** — pgmq + `pg_partman` + `pgmq.create_partitioned('bench_queue', 'id', '10000')`. Reuses [install/install_pgmq.sh](install/install_pgmq.sh) plus [install/pgmq-partitioned_setup_5min.sql](install/pgmq-partitioned_setup_5min.sql) for the partman schedule. +- **river** — `go install github.com/riverqueue/river/cmd/river@v0.34.x` and `river migrate-up` (sets schema), then the consumer SQL emulates the SKIP-LOCKED claim pattern. See [install/install_river.sh](install/install_river.sh). +- **que** — Ruby + `gem install que -v 2.4.x`, `bundle exec que:install` creates `que_jobs`; consumer SQL emulates the Ruby Que SELECT/DELETE. Driven from AMI user-data — see [install/README.md](install/README.md) and [OPS_GOTCHAS.md §7](OPS_GOTCHAS.md). +- **pg-boss** — local `npm install pg-boss@12.15`, run `new PgBoss(DSN).start()` once to migrate schema; consumer SQL emulates the `SKIP LOCKED` claim on `pgboss.job`. See [install/install_pgboss.sh](install/install_pgboss.sh). + +--- + +## 7. Per-system consumer.sql patterns + +All seven instrumented consumers are in [consumers/](consumers/): + +- [consumer_pgque.sql](consumers/consumer_pgque.sql) — ticker + `next_batch` + `get_batch_events` + `finish_batch` +- [consumer_pgq.sql](consumers/consumer_pgq.sql) — identical, schema swap pgque → pgq +- [consumer_pgmq.sql](consumers/consumer_pgmq.sql) — `pgmq.read(50)` + `pgmq.delete` +- [consumer_pgmq-partitioned.sql](consumers/consumer_pgmq-partitioned.sql) — identical SQL to pgmq (schema hides partitioning) +- [consumer_river.sql](consumers/consumer_river.sql) — SKIP LOCKED + DELETE on `river_job` +- [consumer_que.sql](consumers/consumer_que.sql) — SKIP LOCKED + DELETE on `que_jobs` +- [consumer_pgboss.sql](consumers/consumer_pgboss.sql) — SKIP LOCKED + DELETE on `pgboss.job` + +All producers are in [producers/](producers/). + +--- + +## 8. Analysis and chart generation + +- [charts/r5_analyze.py](charts/r5_analyze.py) — R5 full 2-panel chart (dead tuples + consumer latency, linear scale, no symlog — per [MEMORY rule](../CLAUDE.md) "Never use log/symlog on charts"). +- [charts/r6_smoke_chart.py](charts/r6_smoke_chart.py) — R6 smoke Solarized-Dark 2-panel chart (events/s + pgque per-table dead tuples). +- [gifs/r4_gif_v17_solarized.py](gifs/r4_gif_v17_solarized.py) — R4 dead-tuples animated GIF (7 systems, Solarized-Dark). +- [gifs/r4_gif_tps_solarized.py](gifs/r4_gif_tps_solarized.py) — R4 TPS/latency animated GIF. + +--- + +## 9. GitLab posting style + +- Threads open under `### header`; replies stay inside the same discussion. +- Every `
` block has **one blank line** between the opening tag and the triple-backtick fence, and **one blank line** between the closing fence and `
` — and another blank line after ``. This is what makes them render correctly on GitLab. +- Every significant result embeds `![alt](/uploads/…)` — charts are first-class citizens. GIFs too. +- Post bodies via `curl --data-urlencode "body@/tmp/path"` to the discussions endpoint — new thread, not a reply under an existing one. + +--- + +## 10. Cost discipline + +- us-east-2 spot preferred (~$0.22/h); when spot is exhausted we hop to another region before going on-demand. +- At R7 time 2 VMs are on-demand (pgmq-partitioned and pgboss; pgque+que moved on-demand after R6 spot reclaim). +- **~$15 per 1.5 h bench round** (7 VMs). +- Total spend through R7 is ~$100. +- Rsync every 30 min pulls `/tmp/bench/` to local, so a spot reclaim loses at most a partial phase. + +See [OPS_GOTCHAS.md §12](OPS_GOTCHAS.md) for spot-reclaim mitigations. + +--- + +## Reproducing from scratch + +1. Launch 7 × i4i.2xlarge in us-east-2 (spot first) with `bench_userdata.sh` (symlinks `/mnt/pgdata`, installs PGDG PG18, pg_cron, pg_ash, pgfr, pg_stat_statements). See [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) for the NVMe side if the userdata path breaks. +2. `scp` per-system `install.sh`, `producer.sql`, instrumented `consumer.sql`, plus the samplers in [tooling/](tooling/). +3. `bash clean_reinstall.sh ` on each VM. +4. `bash run_r7.sh ` — 1.5 h, writes `/tmp/bench/*.csv` + `*.log`. +5. Rsync everything back to `/tmp/bench_r7//`. +6. Run [charts/r5_analyze.py](charts/r5_analyze.py) (or R7 successor) for the verdict table + 2-panel PNG. +7. Post results as a threaded reply with the `
`/blank-line style above. + +That is the whole rig. diff --git a/benchmark/OPS_GOTCHAS.md b/benchmark/OPS_GOTCHAS.md new file mode 100644 index 00000000..6c78d190 --- /dev/null +++ b/benchmark/OPS_GOTCHAS.md @@ -0,0 +1,259 @@ +# Operational gotchas + +Every operational lesson accumulated while running the pgque-vs-pgq-vs-pgmq-vs-river-vs-que-vs-pgboss-vs-pgmq-partitioned bench on AWS i4i.2xlarge. Read this before reproducing. + +--- + +## 1. AWS i4i.2xlarge: NVMe instance store is NOT auto-mounted + +The Ubuntu 24.04 AMI boots with `/dev/nvme1n1` **present but not formatted, not mounted**. You MUST format and mount it BEFORE installing Postgres data dir, or the data goes to the 8 GB root EBS volume and the bench dies the moment you fill the page cache. + +```bash +sudo mkfs.xfs /dev/nvme1n1 +sudo mkdir -p /mnt/pgdata +sudo mount -o noatime,nodiratime /dev/nvme1n1 /mnt/pgdata +sudo chown postgres:postgres /mnt/pgdata +``` + +If PG is installed first (default path `/var/lib/postgresql/18/main`), data goes to the 8 GB root EBS → disaster under bench load. + +**Symlink pattern** after moving data: + +```bash +sudo systemctl stop postgresql@18-main +sudo mv /var/lib/postgresql/18/main /mnt/pgdata/postgresql/18/main +sudo ln -s /mnt/pgdata/postgresql/18/main /var/lib/postgresql/18/main +sudo systemctl start postgresql@18-main +``` + +See [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) for the recovery procedure when this happens post-hoc. + +--- + +## 2. `/tmp/bench/` observability dir — must be on NVMe + +Default `/tmp` is on the root disk. pgbench `--log-prefix=/tmp/bench/producer_agg` + `consumer.log` with 17 k NOTICE/s + `bloat.csv` + `sys_metrics.csv` all write there. + +At bench rates this is 1–5 MB/s of root-disk I/O. Not catastrophic but adds noise you can see in the sys_metrics disk panel of the non-subject VM. + +**Fix:** + +```bash +sudo mkdir -p /mnt/pgdata/bench && sudo chown postgres:postgres /mnt/pgdata/bench +ln -s /mnt/pgdata/bench /tmp/bench +``` + +Now observability output lands on NVMe alongside PG data. + +--- + +## 3. `logging_collector=off` means server logs go to systemd journal + +With `logging_collector=off` (our default), server logs go to the systemd journal, not a file on disk. Means zero PG log I/O during bench — good for the throughput numbers. + +If investigating slow queries later, enable `log_statement='all'` + `logging_collector=on` with `log_directory` inside the NVMe data dir (relative path `'log'` resolves correctly since `data_dir` is symlinked to NVMe). + +--- + +## 4. Clean-slate reset without losing adjacent schemas + +`DROP SCHEMA pgque CASCADE` can cascade into objects it doesn't own if a function somewhere references pgque. R6 lost the `ash` schema this way on the pgque VM. + +**Safer pattern for pgque:** `TRUNCATE` all `pgque.*` tables instead of drop-and-recreate. If a reinstall is actually needed: + +```bash +sudo -u postgres psql -d bench -c "DROP SCHEMA pgque CASCADE" +sudo -u postgres psql -d bench -f /tmp/pgque.sql +sudo -u postgres psql -d bench -f /tmp/pg_ash/sql/ash-install.sql +sudo -u postgres psql -d bench -f /tmp/pgfr/_record/sql/install.sql +sudo -u postgres psql -d bench -f /tmp/pgfr/_analyze/sql/install.sql +sudo -u postgres psql -d bench -f /tmp/pgfr/_control/sql/install.sql +``` + +Re-run the ash/pgfr installers even if you think they weren't touched. + +**Verify after reset:** + +```sql +SELECT nspname FROM pg_namespace + WHERE nspname IN ('ash','pgfr_record','pgfr_analyze','pgfr_control','pgque'); +``` + +All five must appear. + +--- + +## 5. pg_partman + pgmq-partitioned: stale `part_config` rows + +After `DROP SCHEMA pgmq CASCADE`, `public.part_config` retains rows pointing to non-existent tables. `run_maintenance_proc()` then throws: + +``` +ERROR: Given parent table not found in system catalogs: pgmq.q_bench_queue +``` + +**Fix** before reinstalling pgmq: + +```sql +DELETE FROM public.part_config WHERE parent_table LIKE 'pgmq.%'; +``` + +pg_partman is installed in schema `public` (not `partman` as the docs sometimes suggest); `part_config` is `public.part_config`. + +--- + +## 6. pg_partman premake tuning + +Default `premake=4` works for bench (4 future 5-min partitions = 20 min buffer, enough for 1.5 h bench given 1-min maintenance cron cadence). + +R6 tried `premake=20` = 24 partitions steady-state, which killed pgmq-partitioned consumer perf (525 TPS vs 6621 with premake=4) due to planner cost across all partitions per query. + +`infinite_time_partitions=true` is needed for the maintenance job to keep creating future partitions indefinitely: + +```sql +UPDATE public.part_config + SET premake=4, infinite_time_partitions=true + WHERE parent_table='pgmq.q_bench_queue'; +``` + +--- + +## 7. que (Ruby) reinstall: residual `que_validate_tags` function + +`DROP TABLE que_jobs, que_lockers, que_values CASCADE` doesn't drop associated functions (they're in `public` schema). + +Re-running `que.migrate!` fails: + +``` +PG::DuplicateFunction: function que_validate_tags already exists with same argument types +``` + +**Fix** — explicitly drop all que_* functions before migrate: + +```sql +DO $$ DECLARE r record; BEGIN + FOR r IN SELECT proname FROM pg_proc WHERE proname LIKE 'que\_%' OR proname='que_validate_tags' + LOOP EXECUTE format('DROP FUNCTION IF EXISTS public.%I CASCADE', r.proname); END LOOP; +END $$; +``` + +--- + +## 8. river: no clean_reinstall.sh on the VM + +river's `install.sh` was simpler than the others — it installs the Go binary + creates `river_job` via `river migrate up`. + +Reset = `TRUNCATE river_job` + stats reset. No schema-drop needed. If you really want a nuke-and-pave, `DROP TABLE river_job, river_leader, river_migration, river_client, river_client_queue, river_queue CASCADE` then re-run migrate. + +--- + +## 9. pgboss: covering index for bench workload + +pg-boss v12.15 DELETE path scans `pgboss.job_common` which partitions by name; the `(name, state, start_after)` index isn't covering under bench-heavy DELETE. + +**Add a covering index:** + +```sql +CREATE INDEX bench_covering +ON pgboss.job_common (name, state, start_after, created_on, id) +WHERE state < 'active'; +``` + +This lines up the consumer's claim path and turns the DELETE into an index-only prefilter. + +--- + +## 10. pgq: no built-in ticker in the v3.5.1 PL-only fork + +Upstream pgq relies on the C ticker binary. The PL-only fork we use for fair comparison doesn't ship one. + +**Workaround:** tight Python loop calling `pgq.ticker('queue_name')` every 1 s and `pgq.maint_operations()` every 5 s — see [tooling/pgq_ticker_daemon.py](tooling/pgq_ticker_daemon.py). + +Runs on the pgq VM only. Killed at bench end via `pkill -f pgq_ticker_daemon`. + +--- + +## 11. pgque PG18 xid8 cast bug (pre-PR #62) + +`batch_event_sql` function casts `xid8` to `bigint`, which fails on PG18's stricter type rules: + +``` +ERROR: cannot cast type xid8 to bigint +``` + +**Fix:** `::text::bigint` as intermediate cast. Landed in [PR #62](https://github.com/NikolayS/pgque/pull/62). + +If you see this on a VM that wasn't freshly installed from the PR #62 branch, re-run [install/install_pgque.sh](install/install_pgque.sh). + +--- + +## 12. Spot-instance termination risk + +i4i.2xlarge spot price is ~$0.20–0.30/h vs on-demand $0.686/h. The us-east-2 spot market is reliable for short bench runs but DOES reclaim. + +- R4 lost `pgmq-partitioned` spot mid-run. +- R6 lost `pgque` AND `que` spots between runs (hours-long windows). + +**Mitigation:** on-demand for the primary subject under test (pgque for PR #62 work). Cost delta: ~$3–4 per 7-VM 1.5 h run. Worth it for the subject you're landing a PR on; leave the comparison set on spot. + +Rsync every 30 min (see Section 10 of [METHODOLOGY.md](METHODOLOGY.md)) so a reclaim loses at most a partial phase, not a whole run. + +--- + +## 13. ASH + pgfr install prerequisites + +Install order: `pg_cron` → `pg_stat_statements` → pg_ash → pgfr (`_record` / `_analyze` / `_control`). + +``` +ALTER SYSTEM SET shared_preload_libraries='pg_cron,pg_stat_statements'; +``` + +…then **restart** BEFORE installing ash. + +ASH uses `pg_cron.schedule('ash_sampler','1 seconds', ...)` — requires `pg_cron.database_name` to match the bench DB: + +``` +cron.database_name = 'bench' +``` + +If you see `ash.samples()` returning zero rows with no obvious error, the cron job is almost certainly scheduled against the wrong database. + +--- + +## 14. pgbench aggregate-interval logging + +`--aggregate-interval=10 --log --log-prefix=/tmp/bench/producer_agg` writes per-10 s buckets, one line per bucket per client. + +Low observer effect: one `fwrite` per 10 s per client, not per transaction. + +Captures `latency min/max/sum/sumsq` → enough to reconstruct mean + stddev + approximate min/max band per window. This is how we plot "consumer latency over time" without running a separate sampler. + +Log file naming: `producer_agg.` for `-c 1 -j 1`; `consumer_agg..` for `-c N -j N` with N>1. + +--- + +## 15. NOTICE-based event-count instrumentation + +We use `RAISE NOTICE 'ev ts=% n=%'` inside DO blocks inside the consumer SQL. + +At 17 k calls/sec: ~40 % CPU overhead on the **consumer client** (the pgbench process parsing NOTICE lines into stdout/log), but no observed server-side cost. Safe for bench purposes, NOT for production. + +pgbench faithfully prints NOTICE lines to stdout (captured as `consumer.log`). Parser: [tooling/parse_events_consumed.py](tooling/parse_events_consumed.py). Output: `events_consumed_per_sec.csv` + `events_consumed_summary.txt`. + +Why NOTICE rather than pgss: DO-wrappers hide per-statement rows from `pg_stat_statements`, so the pgss queryid reported for a consumer is the DO wrapper as a whole — not the SELECT/DELETE inside it — which makes it useless for counting consumed events when the internal shape differs across systems. NOTICE is schema-neutral and authoritative. + +--- + +## Summary checklist + +Before a run: + +- [ ] `/dev/nvme1n1` formatted + mounted + PG data dir symlinked +- [ ] `/tmp/bench` symlinked to `/mnt/pgdata/bench` +- [ ] `shared_preload_libraries='pg_cron,pg_stat_statements'` + restart +- [ ] `cron.database_name='bench'` +- [ ] pg_ash, pgfr (`_record`, `_analyze`, `_control`) installed +- [ ] per-system install script re-run after any schema drop +- [ ] for pgmq-partitioned: stale `public.part_config` rows deleted, `premake=4` +- [ ] for que: que_* functions dropped before migrate +- [ ] for pgq: ticker daemon started (and remembered to be killed at end) +- [ ] rsync cron scheduled on the local workstation (not the VM) diff --git a/benchmark/README.md b/benchmark/README.md new file mode 100644 index 00000000..0ef65520 --- /dev/null +++ b/benchmark/README.md @@ -0,0 +1,97 @@ +# pgque benchmark harness + +The whole bench rig: how we compared pgque (PR #62) against pgq, pgmq, pgmq-partitioned, river, que, and pg-boss under held-xmin pathology conditions on AWS i4i.2xlarge. + +Backs [NikolayS/pgque#61](https://github.com/NikolayS/pgque/issues/61) (the held-xmin bloat issue) and [PR #62](https://github.com/NikolayS/pgque/pull/62) (the subscription/tick rotation fix). + +## Start here + +- **[METHODOLOGY.md](METHODOLOGY.md)** — full bench methodology, workload shape, observability stack, phase scheduler. +- **[OPS_GOTCHAS.md](OPS_GOTCHAS.md)** — every operational lesson from running this on AWS (NVMe mount, pg_partman stale rows, que function leftovers, pgboss index, etc.). +- **[HARDWARE.md](HARDWARE.md)** — i4i.2xlarge specs, PG tuning, expected microbench baselines. + +## Credit + +The pathology this bench exercises is the one Brandur Leach documented in [Postgres Queues (2015)](https://brandur.org/postgres-queues) — DELETE-based queues become unvacuumably bloated under held xmin, and the DB-level death spiral that follows. The bench is a controlled reproduction of that scenario across seven queue systems. + +## Quick-start + +```bash +# on the VM, after bootstrap: +bash install/install_.sh # system-specific install +bash runners/run_r7.sh # orchestrator; 1.5h per system + +# outputs (all under /tmp/bench/): +# producer_agg.* pgbench producer aggregate log +# consumer_agg.* pgbench consumer aggregate log +# consumer.log NOTICE-instrumented per-call event counts +# bloat.csv pg_stat_user_tables every 30 s +# sys_metrics.csv CPU/mem/disk every 10 s +# pgss_timeseries.csv pg_stat_statements every 10 s +# ash.csv pg_ash 1 Hz wait-event samples (harvested at end) +# pgfr_snapshots.csv pg-flight-recorder snapshots (harvested at end) +# pgfr_table_snapshots.csv pgfr table-level snapshots +# pgfr_statement_snapshots.csv pgfr per-statement snapshots +# events_consumed_per_sec.csv parsed from consumer.log +``` + +Example instrumented consumer: [consumers/consumer_pgque.sql](consumers/consumer_pgque.sql). + +ASH and pgfr output (`ash.csv`, `pgfr_snapshots.csv`) land in `/tmp/bench/` post-run when the harvest step of `run_r7.sh` completes. + +## Directory layout + +``` +benchmark/ + README.md # this file + METHODOLOGY.md # full methodology (from GitLab #77) + OPS_GOTCHAS.md # operational lessons + HARDWARE.md # VM + PG tuning + tooling/ + bloat_sampler.py # pg_stat_user_tables sampler + sys_metrics_sampler.py # /proc/{stat,meminfo,diskstats} sampler (v2) + pg_stat_statements_snapshot.py # pgss time-series + parse_events_consumed.py # NOTICE log → events_consumed_per_sec.csv + idle_in_tx.py # REPEATABLE READ xmin holder (the inducer) + pgq_ticker_daemon.py # tight ticker loop for pgq (no built-in daemon) + microbench.sh # sysbench + fio baseline + runners/ + run_r7.sh # phase-scheduled orchestrator + clean_reinstall.sh # reset between runs + fix_nvme_mount.sh # recover from NVMe-not-mounted boot + consumers/ + consumer_pgque.sql # instrumented (DO + NOTICE) + consumer_pgq.sql + consumer_pgmq.sql + consumer_pgmq-partitioned.sql + consumer_river.sql + consumer_que.sql + consumer_pgboss.sql + producers/ + producer_pgque.sql # pgque.send() + producer_pgq.sql # pgq.insert_event + producer_pgmq.sql # pgmq.send + producer_river.sql # INSERT INTO river_job + producer_que.sql # INSERT INTO que_jobs + producer_pgboss.sql # INSERT INTO pgboss.job + producer_pgmq-partitioned.sql # same as producer_pgmq + install/ + README.md # which system uses which installer + bootstrap.sh # shared PG18 + pg_cron + ash + pgfr base + install_pgq.sh + install_pgmq.sh # also used by pgmq-partitioned + pgmq-partitioned_setup_5min.sql # partman cron schedule + install_river.sh + install_pgboss.sh + # note: pgque + que driven from AMI userdata + charts/ + r5_analyze.py # R5 2-panel chart + r6_smoke_chart.py # R6 smoke Solarized-Dark chart + gifs/ + r4_gif_v17_solarized.py # dead-tuples animated GIF + r4_gif_tps_solarized.py # TPS/latency animated GIF +``` + +## Scope + +Only a harness. Produces no test coverage for pgque itself — that lives under `tests/` in the main repo. This directory adds nothing to the pgque production SQL; it is pure ops tooling + docs. diff --git a/benchmark/charts/r5_analyze.py b/benchmark/charts/r5_analyze.py new file mode 100644 index 00000000..7a5b76aa --- /dev/null +++ b/benchmark/charts/r5_analyze.py @@ -0,0 +1,137 @@ +#!/usr/bin/env python3 +"""R5 full — verdict table + 2-panel chart (dead tuples + consumer latency).""" +import csv, re +from pathlib import Path +from datetime import datetime +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import numpy as np +from matplotlib.ticker import FuncFormatter + +ROOT = Path("/tmp/bench_r5") +SYSTEMS = ["pgque", "pgq", "pgmq", "pgmq-partitioned", "river", "que", "pgboss"] +COLORS = { + "pgque": "#1F4E79", "pgq": "#6B9BD1", "pgmq": "#C0392B", + "pgmq-partitioned": "#D98880", "river": "#E67E22", + "que": "#8E44AD", "pgboss": "#16A085", +} +EVENT_RE = { + "pgque": re.compile(r"^pgque\.event_\d+_\d+$"), + "pgq": re.compile(r"^pgq\.event_\d+_\d+$"), + "pgmq": re.compile(r"^pgmq\.q_bench_queue$"), + "pgmq-partitioned": re.compile(r"^pgmq\.q_bench_queue_(p\d+_\d+|default)$"), + "river": re.compile(r"^public\.river_job$"), + "que": re.compile(r"^public\.que_jobs$"), + "pgboss": re.compile(r"^pgboss\.job_common$|^pgboss\.j[a-f0-9]+$"), +} +PROG_RE = re.compile(r"^progress:\s*([\d.]+)\s*s,\s*([\d.]+)\s*tps,\s*lat\s*([\d.]+)\s*ms\s*stddev\s*([\d.]+)") + +def load_dead(sysn): + p = ROOT / sysn / "bloat.csv" + if not p.exists(): return np.array([]), np.array([]) + by_ts = {} + with open(p) as f: + r = csv.reader(f); next(r) + for row in r: + if not EVENT_RE[sysn].match(row[1]): continue + by_ts.setdefault(row[0], 0); by_ts[row[0]] += int(row[3]) + if not by_ts: return np.array([]), np.array([]) + tss = sorted(by_ts) + t0 = datetime.fromisoformat(tss[0].replace("Z","")) + mins = np.array([(datetime.fromisoformat(t.replace("Z","")) - t0).total_seconds()/60 for t in tss]) + dead = np.array([by_ts[t] for t in tss], dtype=float) + return mins, dead + +def load_lat(sysn): + p = ROOT / sysn / "consumer.log" + if not p.exists(): return np.array([]), np.array([]), np.array([]) + secs, tps, lats = [], [], [] + with open(p) as f: + for ln in f: + m = PROG_RE.match(ln) + if m: + secs.append(float(m.group(1))/60.0) + tps.append(float(m.group(2))) + lats.append(float(m.group(3))) + return np.array(secs), np.array(tps), np.array(lats) + +# verdict table +print(f"\n{'system':<20} {'peak_dead':>12} {'avg_tps':>10} {'max_lat':>10} {'min_tps_TX':>11}") +print("-"*80) +for sysn in SYSTEMS: + _, dead = load_dead(sysn) + mins_lat, tps, lats = load_lat(sysn) + peak = int(dead.max()) if dead.size else 0 + avg_tps = float(tps.mean()) if tps.size else 0 + max_lat = float(lats.max()) if lats.size else 0 + # TX window: 30-60 min (ideal) — actually started ~30 but check phases + tx_mask = (mins_lat >= 32) & (mins_lat <= 58) + min_tps_tx = float(tps[tx_mask].min()) if tx_mask.any() else 0 + print(f"{sysn:<20} {peak:>12,} {avg_tps:>10,.0f} {max_lat:>10.1f} {min_tps_tx:>11,.1f}") + +# chart +fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(11, 7.6), dpi=110, + gridspec_kw={'height_ratios':[1.0, 0.58], 'hspace':0.32, + 'top':0.85, 'bottom':0.075, 'left':0.08, 'right':0.94}) +for ax in (ax1, ax2): + ax.axvspan(30, 60, color="#fdeeee", alpha=0.9, zorder=0) + for b in (30, 60): + ax.axvline(x=b, color="#ccc", lw=0.8, zorder=0.5) + ax.set_xlim(0, 90) + ax.set_xticks([0, 15, 30, 45, 60, 75, 90]) + ax.set_xticklabels(["0", "15m", "30m", "45m", "1h", "1h15", "1h30"]) + ax.grid(True, color="#e8e8e8", lw=0.8) + ax.set_axisbelow(True) + for sp in ("top","right"): ax.spines[sp].set_visible(False) + # strict bottom at 0 — no below-axis whitespace + ax.set_ymargin(0) + +# Phase labels — sit in the whitespace between title and the top of ax1 (not overlapping title) +for (s, e, lbl, col) in [(0,30,"clean baseline · 30m","#888"), + (30,60,"xmin horizon blocked · 30m","#c0392b"), + (60,90,"clean recovery · 30m","#888")]: + ax1.text((s+e)/2, 1.03, lbl, transform=ax1.get_xaxis_transform(), + ha="center", color=col, fontsize=10, + fontweight="bold" if col=="#c0392b" else "normal") + +# Top: dead tuples +for sysn in SYSTEMS: + mins, dead = load_dead(sysn) + if mins.size == 0: continue + lw = 3.0 if sysn=="pgque" else 2.0 + ax1.plot(mins, dead, color=COLORS[sysn], lw=lw, label=sysn, zorder=3 if sysn=="pgque" else 2) + +def fmt_dead(v,_): + v = abs(v) + if v>=1e6: return f"{v/1e6:.1f}M" + if v>=1e3: return f"{v/1e3:.0f}k" + return f"{v:.0f}" +ax1.yaxis.set_major_formatter(FuncFormatter(fmt_dead)) +ax1.set_ylabel("event-table dead tuples", labelpad=8) +ax1.legend(loc="upper left", frameon=False, fontsize=9, ncol=2) + +# Bottom: latency (clipped) +LAT_CLIP = 2000 +for sysn in SYSTEMS: + mins, _, lats = load_lat(sysn) + if mins.size == 0: continue + y = np.minimum(lats, LAT_CLIP) + lw = 3.0 if sysn=="pgque" else 2.0 + ax2.plot(mins, y, color=COLORS[sysn], lw=lw, zorder=3 if sysn=="pgque" else 2) + +def fmt_lat(v,_): + if v>=1000: return f"{v/1000:.1f}s" + if v>=10: return f"{v:.0f}ms" + if v>=1: return f"{v:.1f}ms" + return f"{v:.2f}ms" +ax2.yaxis.set_major_formatter(FuncFormatter(fmt_lat)) +ax2.set_ylabel("consumer latency (mean)", labelpad=8) + +fig.text(0.08, 0.97, "R5 full — 7 systems · 1.5 h · held xmin horizon for 30 min", + ha="left", fontsize=13, fontweight="bold") +fig.text(0.08, 0.948, "producer -R 1000, pgbench --aggregate-interval=10 --log · consumer latency clipped at 2s", + ha="left", fontsize=9, color="#666") + +fig.savefig("/tmp/r5_full_chart.png", dpi=110, bbox_inches="tight") +print(f"\nwrote /tmp/r5_full_chart.png ({Path('/tmp/r5_full_chart.png').stat().st_size/1024:.0f} KB)") diff --git a/benchmark/charts/r6_smoke_chart.py b/benchmark/charts/r6_smoke_chart.py new file mode 100644 index 00000000..e455cc8a --- /dev/null +++ b/benchmark/charts/r6_smoke_chart.py @@ -0,0 +1,123 @@ +#!/usr/bin/env python3 +"""R6 combined smoke chart — events consumed/s + pgque table dead tuples.""" +import csv, re +from pathlib import Path +from datetime import datetime +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import numpy as np +from matplotlib.ticker import FuncFormatter + +ROOT = Path("/tmp/bench_r6_combined") +OUT = Path("/tmp/r6_smoke_chart.png") + +# Phase windows for the 20-min smoke (seconds): clean 0-300, xmin 300-900, recovery 900-1200 +PHASE_TX = (300, 900) +X_MAX_S = 1200 + +# Load events_consumed_per_sec.csv +ev_t, ev_rate = [], [] +with open(ROOT / "events_consumed_per_sec.csv") as f: + r = csv.reader(f); next(r) + for row in r: + ev_t.append(int(row[0])) + ev_rate.append(int(row[1])) +ev_t = np.array(ev_t); ev_rate = np.array(ev_rate) + +# Load bloat.csv — track key pgque tables over time +TRACKED = { + "event_1_0": "#268bd2", # blue (line like pgque accent) + "event_1_1": "#2aa198", # cyan + "event_1_2": "#859900", # green + "subscription_0": "#cb4b16", # orange + "subscription_1": "#dc322f", # red + "subscription_2": "#d33682", # magenta + "tick_0": "#6c71c4", # violet + "tick_1": "#b58900", # yellow + "tick_2": "#93a1a1", # base1 + "meta_rotation": "#586e75", # base01 + "queue": "#eee8d5", # base2 +} +tbl_series = {t: ([],[],[]) for t in TRACKED} # (t_seconds, n_live, n_dead) +t0 = None +with open(ROOT / "bloat.csv") as f: + r = csv.reader(f); next(r) + for row in r: + relname_full = row[1] + relname = relname_full.split(".")[-1] + if relname not in TRACKED: continue + ts = datetime.fromisoformat(row[0].replace("Z","")) + if t0 is None: t0 = ts + s = (ts - t0).total_seconds() + tbl_series[relname][0].append(s) + tbl_series[relname][1].append(int(row[2])) # n_live_tup + tbl_series[relname][2].append(int(row[3])) # n_dead_tup + +# --- Solarized Dark palette --- +BG = "#002b36"; SURF = "#073642" +FG = "#839496"; FG_EMPH = "#93a1a1"; FG_DIM = "#586e75" +ALERT = "#dc322f" + +plt.rcParams.update({ + 'figure.facecolor': BG, 'axes.facecolor': BG, 'savefig.facecolor': BG, + 'text.color': FG, 'axes.labelcolor': FG_EMPH, + 'xtick.color': FG, 'ytick.color': FG, + 'axes.edgecolor': FG_DIM, + 'grid.color': SURF, 'grid.linewidth': 0.8, + 'font.family': ['Helvetica', 'Arial', 'DejaVu Sans'], + 'font.size': 10, +}) + +fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(11, 6.5), dpi=110, + gridspec_kw={'height_ratios':[1.0, 0.75], 'hspace':0.35, + 'top':0.91, 'bottom':0.09, 'left':0.09, 'right':0.94}) + +for ax in (ax1, ax2): + ax.axvspan(PHASE_TX[0], PHASE_TX[1], color=SURF, alpha=1.0, zorder=0) + for b in PHASE_TX: ax.axvline(x=b, color=FG_DIM, lw=0.8, zorder=0.5) + ax.set_xlim(0, X_MAX_S) + ax.set_xticks([0, 300, 600, 900, 1200]) + ax.set_xticklabels(["0", "5m", "10m", "15m", "20m"]) + ax.grid(True); ax.set_axisbelow(True) + for sp in ("top","right"): ax.spines[sp].set_visible(False) + for sp in ("left","bottom"): ax.spines[sp].set_color(FG_DIM) + +# Phase labels +for (s, e, lbl, col) in [(0, PHASE_TX[0], "clean baseline · 5m", FG_DIM), + (PHASE_TX[0], PHASE_TX[1], "xmin horizon blocked · 10m", ALERT), + (PHASE_TX[1], X_MAX_S, "clean recovery · 5m", FG_DIM)]: + ax1.text((s+e)/2, 1.04, lbl, transform=ax1.get_xaxis_transform(), + ha="center", color=col, fontsize=10, + fontweight="bold" if col==ALERT else "normal") + +# Top: events consumed/sec +ax1.plot(ev_t, ev_rate, color="#268bd2", lw=1.5, alpha=0.9) +ax1.axhline(1000, color=FG_DIM, ls="--", lw=0.8, alpha=0.7) +ax1.text(X_MAX_S*0.98, 1020, "producer rate 1000 ev/s (-R 1000)", + color=FG_DIM, ha="right", fontsize=9, fontstyle="italic") +ax1.set_ylabel("events consumed / s", labelpad=8, color=FG_EMPH) +ax1.set_ylim(0, max(2200, ev_rate.max()*1.1)) +ax1.yaxis.set_major_formatter(FuncFormatter(lambda v,_: f"{v:.0f}")) + +# Bottom: dead tuples per pgque table +for name, color in TRACKED.items(): + t_s, nlive, ndead = tbl_series[name] + if not t_s: continue + ax2.plot(t_s, ndead, color=color, lw=2.0, label=name, alpha=0.95) + +ax2.set_ylabel("n_dead_tup per pgque table", labelpad=8, color=FG_EMPH) +ax2.yaxis.set_major_formatter(FuncFormatter(lambda v,_: f"{v:.0f}")) +# Legend top-right +ax2.legend(loc="upper left", frameon=False, fontsize=8, ncol=3, labelcolor=FG) +# Expand y so zero lines are visible; max is meta_rotation=33 +ax2.set_ylim(0, max(50, max([max(s[2]) for s in tbl_series.values() if s[2]], default=10)*1.15)) + +# Titles +fig.text(0.09, 0.965, "R6 combined smoke — pgque PR #62 rotation + events-consumed instrumentation", + ha="left", fontsize=12, fontweight="bold", color=FG_EMPH) +fig.text(0.09, 0.94, "20 min (5m clean + 10m held xmin + 5m recovery). Event-rate: producer was rate-capped at 1000 ev/s.", + ha="left", fontsize=9, color=FG_DIM) + +fig.savefig(OUT, dpi=110, bbox_inches="tight", facecolor=BG) +print(f"wrote {OUT} ({OUT.stat().st_size/1024:.0f} KB)") diff --git a/benchmark/consumers/consumer_pgboss.sql b/benchmark/consumers/consumer_pgboss.sql new file mode 100644 index 00000000..514c3b0a --- /dev/null +++ b/benchmark/consumers/consumer_pgboss.sql @@ -0,0 +1,18 @@ +-- R6 instrumented consumer for pg-boss (SKIP LOCKED + DELETE USING on partitioned job table). +-- pg-boss 'default' queue, state='created'. +DO $$ +DECLARE + n int := 0; +BEGIN + WITH claimed AS ( + SELECT id FROM pgboss.job + WHERE name='bench_queue' AND state='created' + ORDER BY priority DESC, created_on, id + LIMIT 100 FOR UPDATE SKIP LOCKED + ), deleted AS ( + DELETE FROM pgboss.job USING claimed WHERE pgboss.job.id = claimed.id + RETURNING pgboss.job.id + ) + SELECT count(*) INTO n FROM deleted; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_pgmq-partitioned.sql b/benchmark/consumers/consumer_pgmq-partitioned.sql new file mode 100644 index 00000000..962fcd7b --- /dev/null +++ b/benchmark/consumers/consumer_pgmq-partitioned.sql @@ -0,0 +1,14 @@ +-- R6 instrumented consumer for pgmq-partitioned (identical SQL to pgmq; +-- the partitioning is on the schema side only). +DO $$ +DECLARE + msg_ids bigint[]; + n int := 0; +BEGIN + SELECT array_agg(msg_id) FROM pgmq.read('bench_queue', 30, 50) INTO msg_ids; + IF msg_ids IS NOT NULL THEN + n := array_length(msg_ids, 1); + PERFORM pgmq.delete('bench_queue', msg_ids); + END IF; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_pgmq.sql b/benchmark/consumers/consumer_pgmq.sql new file mode 100644 index 00000000..51c730f0 --- /dev/null +++ b/benchmark/consumers/consumer_pgmq.sql @@ -0,0 +1,15 @@ +-- R6 instrumented consumer for pgmq (read-then-delete). +-- pgmq.delete is a function body so pg_stat_statements can't see row counts; +-- NOTICE is the authoritative source. +DO $$ +DECLARE + msg_ids bigint[]; + n int := 0; +BEGIN + SELECT array_agg(msg_id) FROM pgmq.read('bench_queue', 30, 50) INTO msg_ids; + IF msg_ids IS NOT NULL THEN + n := array_length(msg_ids, 1); + PERFORM pgmq.delete('bench_queue', msg_ids); + END IF; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_pgq.sql b/benchmark/consumers/consumer_pgq.sql new file mode 100644 index 00000000..df251f8c --- /dev/null +++ b/benchmark/consumers/consumer_pgq.sql @@ -0,0 +1,14 @@ +-- R6 instrumented consumer for pgq (identical to pgque; schema swap only). +DO $$ +DECLARE + b bigint; + n int := 0; +BEGIN + PERFORM pgq.ticker('bench_queue'); + SELECT pgq.next_batch('bench_queue', 'bench_consumer') INTO b; + IF b IS NOT NULL THEN + SELECT count(*) INTO n FROM pgq.get_batch_events(b); + PERFORM pgq.finish_batch(b); + END IF; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_pgque.sql b/benchmark/consumers/consumer_pgque.sql new file mode 100644 index 00000000..4f6a4cd6 --- /dev/null +++ b/benchmark/consumers/consumer_pgque.sql @@ -0,0 +1,16 @@ +-- R6 instrumented consumer for pgque (batch ticker model). +-- Preserves original behavior; adds one NOTICE per call for honest-events/s. +-- Format: NOTICE: ev ts= n= +DO $$ +DECLARE + b bigint; + n int := 0; +BEGIN + PERFORM pgque.ticker('bench_queue'); + SELECT pgque.next_batch('bench_queue', 'bench_consumer') INTO b; + IF b IS NOT NULL THEN + SELECT count(*) INTO n FROM pgque.get_batch_events(b); + PERFORM pgque.finish_batch(b); + END IF; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_que.sql b/benchmark/consumers/consumer_que.sql new file mode 100644 index 00000000..14b188f8 --- /dev/null +++ b/benchmark/consumers/consumer_que.sql @@ -0,0 +1,18 @@ +-- R6 instrumented consumer for que (Ruby Que gem; SKIP LOCKED + DELETE). +-- Mirror R5 shape (que_jobs with default queue and priority ordering). +DO $$ +DECLARE + n int := 0; +BEGIN + WITH claimed AS ( + SELECT id FROM que_jobs + WHERE finished_at IS NULL AND expired_at IS NULL + ORDER BY priority, run_at, id + LIMIT 100 FOR UPDATE SKIP LOCKED + ), deleted AS ( + DELETE FROM que_jobs USING claimed WHERE que_jobs.id = claimed.id + RETURNING que_jobs.id + ) + SELECT count(*) INTO n FROM deleted; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/consumers/consumer_river.sql b/benchmark/consumers/consumer_river.sql new file mode 100644 index 00000000..d9be6a3c --- /dev/null +++ b/benchmark/consumers/consumer_river.sql @@ -0,0 +1,18 @@ +-- R6 instrumented consumer for river (SKIP LOCKED + DELETE USING). +-- Wrapped in DO block so NOTICE carries the count. +-- Raw DELETE rowcount is ALSO visible in pg_stat_statements via the +-- wrapped statement (total row count over the run — cross-check). +DO $$ +DECLARE + n int := 0; +BEGIN + WITH claimed AS ( + SELECT id FROM river_job WHERE state='available' AND queue='default' + ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED + ), deleted AS ( + DELETE FROM river_job USING claimed WHERE river_job.id = claimed.id + RETURNING river_job.id + ) + SELECT count(*) INTO n FROM deleted; + RAISE NOTICE 'ev ts=% n=%', extract(epoch from clock_timestamp())::bigint, n; +END $$; diff --git a/benchmark/gifs/r4_gif_tps_solarized.py b/benchmark/gifs/r4_gif_tps_solarized.py new file mode 100644 index 00000000..fa6a4b27 --- /dev/null +++ b/benchmark/gifs/r4_gif_tps_solarized.py @@ -0,0 +1,618 @@ +#!/usr/bin/env python3 +"""Failure by MVCC — TPS variant (Solarized Dark). + +Solarized Dark palette applied to the TPS-bottom-panel version. Same +palette + treatment as r4_gif_v17_solarized.py: bg = base03 (#002b36), +TX phase wash = base02 (#073642), text in base0/base1/base01 tiers, +alert in Solarized red (#dc322f). Bottom panel shows consumer TPS with +healthy lines near the top and collapses plunging DOWN toward the +x-axis baseline during the TX window. + +Line colors (Solarized accents): + pgque blue #268bd2 (hero) pgq cyan #2aa198 + pgmq red #dc322f river orange #cb4b16 + que violet #6c71c4 pgboss green #859900 + +Outputs: + /tmp/bench_r4_24h/failure_by_mvcc_tps.gif + /tmp/bench_r4_24h/failure_by_mvcc_tps_hero.png +""" +import csv +import os +import re +import shutil +import subprocess +from datetime import datetime +from pathlib import Path + +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.transforms as mtransforms +import numpy as np +from matplotlib.ticker import FuncFormatter +from PIL import Image + +# ---- tunables ---------------------------------------------------------------- +ROOT = Path("/tmp/bench_r4_24h") +OUT_GIF = ROOT / "failure_by_mvcc_tps.gif" +OUT_HERO = ROOT / "failure_by_mvcc_tps_hero.png" +FPS = 25 +DURATION_S = 22 +FRAME_COUNT = FPS * DURATION_S +HOLD_FRAMES = FPS * 1 +FIG_W_IN, FIG_H_IN, DPI = 9.0, 7.0, 100 # 900x700 px +X_MAX_MIN = 150 +FLASH_FRAMES = 12 +INLINE_MIN_DEAD = 50_000 # suppress inline dead-tuple labels below this +# TPS: draw the inline label when the line has dropped BELOW this threshold +# (the collapse is the story — healthy-steady-state lines at 10-17k are busy +# at the top of the panel and an inline label there would just add clutter). +INLINE_MAX_TPS_COLLAPSED = 3000 +# ----------------------------------------------------------------------------- + +SYSTEMS = ["pgque", "pgq", "pgmq", "river", "que", "pgboss"] +# --- Solarized Dark palette (Ethan Schoonover) --------------------------- +# Backgrounds +BG = "#002b36" # base03 — primary figure + axes background +SURFACE = "#073642" # base02 — surface/highlight + TX phase-band wash +# Text tiers +FG = "#839496" # base0 — primary text (ticks, values, inline) +FG_EMPH = "#93a1a1" # base1 — emphasized (axis y-labels, column hdrs) +FG_DIM = "#586e75" # base01 — de-emphasized (inactive phase labels) +# Grid / rules / structure +GRID = SURFACE # grid matches surface: reads as structure +SPINE = FG_DIM # base01 for spines +PHASE_RULE = FG_DIM # base01 for boundary axvlines at t=30/90 +# Alert (Solarized's canonical red, no separate ember wash) +RED = "#dc322f" # hinge flash + TX-phase ribbon + active phase label +TX_BAND = SURFACE # TX phase wash = surface, not an alert hue +# Solarized accents (for reference): +# blue #268bd2 cyan #2aa198 red #dc322f magenta #d33682 +# orange #cb4b16 violet #6c71c4 green #859900 yellow #b58900 +# ------------------------------------------------------------------------- + +COLORS = { + "pgque": "#268bd2", # Solarized blue — hero, most saturated + "pgq": "#2aa198", # Solarized cyan — sibling (cool family) + "pgmq": "#dc322f", # Solarized red + "river": "#cb4b16", # Solarized orange + "que": "#6c71c4", # Solarized violet + "pgboss": "#859900", # Solarized green (warm yellow-green) +} +LW = {s: (3.0 if s == "pgque" else 2.2) for s in SYSTEMS} +EVENT_RE = { + "pgque": re.compile(r"^pgque\.event_\d+_\d+$"), + "pgq": re.compile(r"^pgq\.event_\d+_\d+$"), + "pgmq": re.compile(r"^pgmq\.q_bench_queue$"), + "river": re.compile(r"^public\.river_job$"), + "que": re.compile(r"^public\.que_jobs$"), + "pgboss": re.compile(r"^pgboss\.job_common$|^pgboss\.j[a-f0-9]+$"), +} +PHASE_BOUNDS = [(0, 30, "clean"), (30, 90, "tx"), (90, 150, "clean")] + +plt.rcParams.update({ + "font.family": ["Helvetica", "Arial", "DejaVu Sans"], + "font.size": 11, + "axes.spines.top": False, + "axes.spines.right": False, + "axes.spines.bottom": True, + "axes.spines.left": True, + "axes.edgecolor": SPINE, + "axes.linewidth": 0.9, + "axes.grid": True, + "grid.color": GRID, + "grid.linewidth": 0.8, + "axes.axisbelow": True, + # --- dark theme colors (applied to figure, axes, text, ticks) --- + "figure.facecolor": BG, + "axes.facecolor": BG, + "savefig.facecolor": BG, + "text.color": FG, + "axes.labelcolor": FG, + "axes.titlecolor": FG, + "xtick.color": FG, + "ytick.color": FG, +}) + + +def _mins_from(tss): + t0 = datetime.fromisoformat(tss[0].replace("Z", "")) + return np.array([ + (datetime.fromisoformat(t.replace("Z", "")) - t0).total_seconds() / 60 + for t in tss + ]) + + +def load_bloat(): + series = {} + for sysn in SYSTEMS: + p = ROOT / sysn / "bloat.csv" + if not p.exists(): + continue + by_ts = {} + with open(p) as f: + r = csv.reader(f) + next(r) + for row in r: + try: + if not EVENT_RE[sysn].match(row[1]): + continue + by_ts.setdefault(row[0], 0) + by_ts[row[0]] += int(row[3]) + except Exception: + pass + tss = sorted(by_ts) + if not tss: + continue + mins = _mins_from(tss) + dead = np.array([by_ts[t] for t in tss], dtype=float) + series[sysn] = {"mins": mins, "dead": dead} + return series + + +_PROGRESS_RE = re.compile( + r"^progress:\s*([\d.]+)\s*s,\s*([\d.]+)\s*tps,\s*lat\s*([\d.]+)\s*ms" +) + + +def load_tps(): + """Parse the tps field out of 'progress: X s, Y tps, lat Z ms ...' lines + in each system's consumer.log. Returns {sysn: {"mins": ..., "tps": ...}}. + """ + series = {} + for sysn in SYSTEMS: + p = ROOT / sysn / "consumer.log" + if not p.exists(): + continue + mins, tps = [], [] + with open(p) as f: + for ln in f: + m = _PROGRESS_RE.match(ln) + if not m: + continue + t_s = float(m.group(1)) + mins.append(t_s / 60.0) + tps.append(float(m.group(2))) + if not mins: + continue + mins = np.array(mins); tps = np.array(tps) + series[sysn] = {"mins": mins, "tps": tps} + return series + + +def fmt_y_dead(v, _): + if v >= 1e6: return f"{v/1e6:.1f}M" + if v >= 1e3: return f"{v/1e3:.0f}k" + return f"{v:.0f}" + + +def fmt_y_tps(v, _): + v = abs(v) + if v >= 1000: return f"{v/1000:.0f}k" + return f"{v:.0f}" + + +def current_phase(now_min): + for start, end, kind in PHASE_BOUNDS: + if start <= now_min < end: + return kind, start, end + s, e, k = PHASE_BOUNDS[-1] + return k, s, e + + +def _apply_phase_bands(ax): + for start, end, kind in PHASE_BOUNDS: + if kind == "tx": + # Solarized: TX band uses surface (base02) at full alpha so + # the phase reads as a visibly-darker-than-bg region without + # any alert-color wash. + ax.axvspan(start, end, color=TX_BAND, alpha=1.0, zorder=0) + for b in (30, 90): + ax.axvline(x=b, color=PHASE_RULE, linewidth=0.8, zorder=0.5) + + +def _plot_series(ax, series, sysn, now_min, y_key, y_clip=None, + inline_threshold=None, inline_fmt=None, inline_out=None, + inline_label_value_fn=None, inline_cmp="above"): + """Plot a series; return the endpoint y value. + + If inline_threshold/inline_fmt/inline_out are provided, append a + proposed inline-label dict to inline_out for later anti-overlap + placement. inline_cmp="above" (default) appends when y_end >= + threshold; inline_cmp="below" appends when y_end <= threshold — used + for TPS collapse (showing the label ONLY when the line has dropped). + + inline_label_value_fn, if given, is used to compute the text value + instead of raw y_end (useful for display-vs-plot-value differences). + """ + mins = series[sysn]["mins"] + y = series[sysn][y_key].copy() + if y_clip is not None: + y = np.minimum(y, y_clip) + mask = mins <= now_min + if mask.sum() < 1: + return None + x_sel = mins[mask]; y_sel = y[mask] + if mask.sum() < len(mins): + idx = mask.sum() + if 0 < idx < len(mins): + prev_x, next_x = mins[idx - 1], mins[idx] + if next_x > prev_x: + frac = min(max((now_min - prev_x) / (next_x - prev_x), 0), 1) + y_i = y[idx - 1] + frac * (y[idx] - y[idx - 1]) + x_sel = np.append(x_sel, now_min) + y_sel = np.append(y_sel, y_i) + alpha = 1.0 if sysn == "pgque" else 0.85 + ax.plot(x_sel, y_sel, color=COLORS[sysn], linewidth=LW[sysn], alpha=alpha, + zorder=3 if sysn == "pgque" else 2, solid_capstyle="round") + if len(x_sel) > 0: + x_end, y_end = x_sel[-1], y_sel[-1] + ax.plot([x_end], [y_end], marker="o", markersize=5, + color=COLORS[sysn], markeredgecolor=BG, + markeredgewidth=1.0, zorder=5) + if (inline_threshold is not None and inline_fmt is not None + and inline_out is not None): + pass_test = (y_end >= inline_threshold if inline_cmp == "above" + else y_end <= inline_threshold) + if pass_test: + disp_val = (inline_label_value_fn(y_end) + if inline_label_value_fn is not None else y_end) + inline_out.append({ + "sysn": sysn, + "x": x_end, + "y_true": y_end, + "text": f"{sysn}: {inline_fmt(disp_val, None)}", + "color": COLORS[sysn], + "weight": "bold" if sysn == "pgque" else "normal", + }) + return y_sel[-1] if len(y_sel) else None + + +# v13: arrows form a "V" — pgque tail LEFT of cursor, pgq tail RIGHT of +# cursor. Both heads meet at the cursor on the pgque/pgq lines, so the +# two shafts diverge outward at ~45°. Labels sit directly under each tail +# (no separate stagger variable — label x = tail x). +# Tail + label closer to the axis than v12: -0.06 / -0.10 axes-fraction. +TAIL_DX_MIN = 5.0 # tail offset in data-minutes; tuned for ~45° +ARROW_TAIL_AX_FRAC = -0.06 +ARROW_LABEL_AX_FRAC = -0.10 + + +def _draw_cursor_arrows(ax, series, y_key, y_peak, now_min): + """Draw two hairline up-arrows in a divergent "V" pattern. Head of each + arrow is on the pgque/pgq line at x = now_min (pure data coords). Tail + is offset outward by TAIL_DX_MIN minutes (pgque tail left, pgq tail + right) and below the axes by ARROW_TAIL_AX_FRAC in axes-fraction — so + each shaft tilts outward at ~45° in pixel space. + + If a tail would fall outside the axes x-range (near t=0 / t=150), flip + that arrow's direction so it stays on-axis, then widen the gap if the + flip would make the two labels collide. + """ + x_data_min, x_data_max = ax.get_xlim() + now_clamped = max(min(now_min, x_data_max), x_data_min) + + # Directional tail offset per system (pgque goes left, pgq goes right). + tail_dir = {"pgque": -1, "pgq": +1} + tail_x_raw = {s: now_clamped + tail_dir[s] * TAIL_DX_MIN for s in tail_dir} + + # Edge handling: if a tail is off-axis, flip it to the other side. If + # BOTH end up on the same side after flipping, spread them apart with + # ±TAIL_DX_MIN around the cursor (falling back to clamped positions). + for s in ("pgque", "pgq"): + if tail_x_raw[s] < x_data_min or tail_x_raw[s] > x_data_max: + tail_x_raw[s] = now_clamped - tail_dir[s] * TAIL_DX_MIN + # If the flip made them coincide / cross, rebalance. + if abs(tail_x_raw["pgque"] - tail_x_raw["pgq"]) < TAIL_DX_MIN: + # place them symmetrically around the cursor, then clamp joint-wise + lo = max(now_clamped - TAIL_DX_MIN, x_data_min) + hi = lo + 2 * TAIL_DX_MIN + if hi > x_data_max: + hi = x_data_max + lo = hi - 2 * TAIL_DX_MIN + tail_x_raw = {"pgque": lo, "pgq": hi} + # Final per-axis clamp (belt & suspenders). + tail_x_for = {s: max(min(tail_x_raw[s], x_data_max), x_data_min) + for s in ("pgque", "pgq")} + + # Blended transform: x = data, y = axes-fraction. + tail_trans = mtransforms.blended_transform_factory(ax.transData, ax.transAxes) + + for sysn in ("pgque", "pgq"): + if sysn not in series: + continue + mins = series[sysn]["mins"] + ys = series[sysn][y_key] + if len(mins) == 0 or now_clamped < mins[0] or now_clamped > mins[-1]: + continue + y_target = float(np.interp(now_clamped, mins, ys)) + weight = "bold" if sysn == "pgque" else "normal" + tail_x = tail_x_for[sysn] + # Arrow: head on the line (data coords at now_clamped), tail at + # (tail_x, ARROW_TAIL_AX_FRAC). Tilted ~45° because head x != tail x. + ax.annotate( + "", + xy=(now_clamped, y_target), + xycoords=ax.transData, + xytext=(tail_x, ARROW_TAIL_AX_FRAC), + textcoords=tail_trans, + arrowprops=dict( + arrowstyle="->", + color=COLORS[sysn], + linewidth=0.8, + shrinkA=0, shrinkB=2, + ), + annotation_clip=False, + zorder=12, + ) + # Label sits directly under its OWN tail. + ax.text(tail_x, ARROW_LABEL_AX_FRAC, sysn, + transform=tail_trans, + ha="center", va="top", + fontsize=8.5, color=COLORS[sysn], + fontweight=weight, + clip_on=False, zorder=12) + + +def _place_inline_labels(ax, proposals, y_peak, gap_frac=0.04): + """Top-down vertical nudge so labels don't overlap. + Drops any label pushed out of [0, y_peak]. Draws with monospace. + """ + if not proposals: + return + gap = y_peak * gap_frac + # Sort descending by true y; walk top-down, push each label down if too close + ordered = sorted(proposals, key=lambda p: -p["y_true"]) + prev_y = None + for p in ordered: + y = p["y_true"] + if prev_y is not None and y > prev_y - gap: + y = prev_y - gap + p["y_draw"] = y + prev_y = y + dx = X_MAX_MIN * 0.015 + for p in ordered: + y = p["y_draw"] + if y < 0 or y > y_peak: + continue # label forced out of frame — drop it + ax.text(p["x"] + dx, y, p["text"], + color=p["color"], fontsize=8.5, + fontfamily="monospace", fontweight=p["weight"], + ha="left", va="center", zorder=6) + + +def render_frame(bloat, tps_series, frame_idx, total_frames, y_peak_dead, y_peak_tps): + now_min = X_MAX_MIN * (frame_idx / max(total_frames - 1, 1)) + + fig = plt.figure(figsize=(FIG_W_IN, FIG_H_IN), dpi=DPI) + # No title block — top whitespace reclaimed for the axes. + # v12: hspace + bottom margin sized so each panel has figure-level + # whitespace BELOW its x-axis tick labels for the pgque/pgq arrows + # + labels (which live outside the axes via negative axes-fraction y). + # Tuned so top panel bottom lands at ~42% of figure height (not + # compressed) while leaving room below bot-panel for arrow+label. + gs = fig.add_gridspec(2, 1, hspace=0.45, top=0.95, bottom=0.10, + left=0.10, right=0.96, + height_ratios=[1.0, 0.58]) + ax1 = fig.add_subplot(gs[0]) + ax2 = fig.add_subplot(gs[1]) + + for ax in (ax1, ax2): + _apply_phase_bands(ax) + + # Phase labels on top axis only + for start, end, kind in PHASE_BOUNDS: + label = ("idle-in-tx · 1h" if kind == "tx" + else ("clean baseline · 30m" if start == 0 else "clean recovery · 1h")) + color = RED if kind == "tx" else FG_DIM + is_current = (start <= now_min < end) + weight = "bold" if (is_current or kind == "tx") else "normal" + alpha = 1.0 if is_current else (0.7 if kind == "tx" else 0.55) + ax1.text((start + end) / 2, 1.04, label, + transform=ax1.get_xaxis_transform(), + ha="center", color=color, fontsize=10, + fontweight=weight, alpha=alpha) + + # --- panel 1: dead tuples --- + dead_vals = {} + dead_inline = [] + for sysn in SYSTEMS: + if sysn in bloat: + v = _plot_series(ax1, bloat, sysn, now_min, "dead", + inline_threshold=INLINE_MIN_DEAD, + inline_fmt=fmt_y_dead, + inline_out=dead_inline) + if v is not None: + dead_vals[sysn] = v + _place_inline_labels(ax1, dead_inline, y_peak_dead) + _draw_cursor_arrows(ax1, bloat, "dead", y_peak_dead, now_min) + + # --- panel 2: TPS (normal orientation — healthy lines near top, + # collapse plunges down toward the x-axis baseline at y=0) --- + tps_vals = {} + tps_inline = [] + for sysn in SYSTEMS: + if sysn in tps_series: + v = _plot_series(ax2, tps_series, sysn, now_min, "tps", + inline_threshold=INLINE_MAX_TPS_COLLAPSED, + inline_cmp="below", + inline_fmt=fmt_y_tps, + inline_out=tps_inline) + if v is not None: + tps_vals[sysn] = v + _place_inline_labels(ax2, tps_inline, y_peak_tps) + _draw_cursor_arrows(ax2, tps_series, "tps", y_peak_tps, now_min) + + # Hinge flash (v2 style: red rule + white-box callout) + hinge_min = 30 + frames_since_hinge = (now_min - hinge_min) / X_MAX_MIN * (total_frames - 1) + if 0 <= frames_since_hinge <= FLASH_FRAMES: + flash_alpha = 1.0 - (frames_since_hinge / FLASH_FRAMES) + ax1.axvline(x=hinge_min, color=RED, linewidth=3.5, + alpha=flash_alpha * 0.85, zorder=6) + ax2.axvline(x=hinge_min, color=RED, linewidth=3.5, + alpha=flash_alpha * 0.85, zorder=6) + ax1.text(hinge_min + 1, y_peak_dead * 0.92, + "idle-in-tx opens\nvacuum frozen", + color=RED, fontsize=11, fontweight="bold", + alpha=flash_alpha, zorder=7, + bbox=dict(boxstyle="round,pad=0.3", facecolor=BG, + edgecolor=RED, linewidth=1.2, alpha=flash_alpha)) + + # Now cursor on both panels + for ax, y_peak in ((ax1, y_peak_dead), (ax2, y_peak_tps)): + ax.axvline(x=now_min, color=FG_EMPH, linewidth=1.2, alpha=0.70, zorder=8) + ax.plot([now_min], [y_peak * 1.005], marker="v", color=FG_EMPH, + markersize=7, zorder=9, clip_on=False) + + # Phase / time readout + kind, p_start, p_end = current_phase(now_min) + ribbon_color = RED if kind == "tx" else FG_DIM + ribbon_label = ("IDLE-IN-TX (vacuum blocked)" if kind == "tx" + else ("CLEAN BASELINE" if p_start == 0 else "CLEAN RECOVERY")) + # Footer: phase label stays sans; t = ... is mono for stable digit columns. + # Compose as two adjacent fig.text calls anchored around figure center. + phase_text = f"phase: {ribbon_label}" + time_text = f"t = {now_min:5.1f} min / 150 min" + fig.text(0.5, 0.015, phase_text + " | ", + ha="right", fontsize=10, color=ribbon_color, fontweight="bold") + fig.text(0.5, 0.015, time_text, + ha="left", fontsize=10, color=ribbon_color, fontweight="bold", + fontfamily="monospace") + + # Axes + for ax in (ax1, ax2): + ax.set_xlim(-1, X_MAX_MIN) + ax.set_xticks([0, 30, 60, 90, 120, 150]) + ax.set_xticklabels(["0", "30m", "1h", "1h30", "2h", "2h30"]) + # v8: no below-axis headroom (start at 0), but keep '0' tick label hidden + # so the baseline gridline stays without the "0" text crowding the axis. + # v12: clean axis bounds — no negative headroom. The arrows live + # OUTSIDE the axes rectangle via mixed (data, axes-fraction) transforms + # and annotation_clip=False, so the plot data area is not compressed. + ax1.set_ylim(0, y_peak_dead * 1.05) + ax1.yaxis.set_major_formatter(FuncFormatter(fmt_y_dead)) + ax1.set_ylabel("event-table dead tuples", labelpad=8) + + ax2.set_ylim(0, y_peak_tps * 1.05) + ax2.yaxis.set_major_formatter(FuncFormatter(fmt_y_tps)) + ax2.set_ylabel("consumer TPS", labelpad=8) + + # Mono-only for numeric y-tick labels; hide the '0' tick. + for ax in (ax1, ax2): + for lbl, pos in zip(ax.get_yticklabels(), ax.get_yticks()): + lbl.set_fontfamily("monospace") + if pos == 0: + lbl.set_visible(False) + + # Legend: uniform row grid, baseline-aligned, two cells per row. + # Name (sans) right-aligned at x=0.88; value (mono) right-aligned at x=0.985. + # Using va="baseline" for every cell so sans and mono glyphs sit on the + # same typographic baseline regardless of family metrics. + # ROW_SPACING is in axes-fraction — bottom panel is ~0.58x the height of + # top, so we scale to keep visual row rhythm consistent in physical pixels. + def _draw_legend(ax, vals, fmt, header, row_spacing, header_gap): + header_y = 0.97 + rows_start = header_y - header_gap + ax.text(0.985, header_y, header, + transform=ax.transAxes, ha="right", va="baseline", + color=FG_EMPH, fontsize=8.5, zorder=10) + visible = [s for s in SYSTEMS if s in vals] + for i, sysn in enumerate(visible): + cur = vals[sysn] + weight = "bold" if sysn == "pgque" else "normal" + y = rows_start - i * row_spacing + ax.text(0.88, y, sysn, + transform=ax.transAxes, ha="right", va="baseline", + color=COLORS[sysn], fontsize=9, fontweight=weight, + zorder=10) + ax.text(0.985, y, fmt(cur, None), + transform=ax.transAxes, ha="right", va="baseline", + color=COLORS[sysn], fontsize=9, fontweight=weight, + fontfamily="monospace", zorder=10) + + # Scale row spacing by inverse of panel height so visual pixel rhythm + # matches between the hero (top) and supporting (bottom) panels. + _draw_legend(ax1, dead_vals, fmt_y_dead, "dead tuples", + row_spacing=0.065, header_gap=0.05) + _draw_legend(ax2, tps_vals, fmt_y_tps, "TPS", + row_spacing=0.112, header_gap=0.086) + + fig.canvas.draw() + buf = np.asarray(fig.canvas.buffer_rgba()) + img = Image.fromarray(buf).convert("RGB") + plt.close(fig) + return img + + +def build_animation(): + print("Loading bloat CSVs...") + bloat = load_bloat() + print(f" bloat series: {sorted(bloat.keys())}") + print("Loading TPS from consumer logs...") + tps_series = load_tps() + print(f" tps series: {sorted(tps_series.keys())}") + + y_peak_dead = max(bloat[s]["dead"].max() for s in bloat) + y_peak_tps = max(tps_series[s]["tps"].max() for s in tps_series) + print(f" y_peak dead = {y_peak_dead:.0f}, y_peak tps = {y_peak_tps:.0f}") + # Log each system's peak + min during TX phase so the report can cite + # which ones drop furthest (just informational). + for s in sorted(tps_series): + mins = tps_series[s]["mins"]; tps = tps_series[s]["tps"] + tx_mask = (mins >= 30) & (mins <= 90) + if tx_mask.any(): + peak_all = tps.max() + min_in_tx = tps[tx_mask].min() + print(f" {s}: steady_peak={peak_all:.0f} tps min_in_tx={min_in_tx:.0f} tps") + + print(f"Rendering {FRAME_COUNT} frames ({FIG_W_IN*DPI:.0f}x{FIG_H_IN*DPI:.0f} px)...") + frames = [] + for i in range(FRAME_COUNT): + if i % 50 == 0: + print(f" frame {i}/{FRAME_COUNT}") + frames.append(render_frame(bloat, tps_series, i, FRAME_COUNT, y_peak_dead, y_peak_tps)) + for _ in range(HOLD_FRAMES): + frames.append(frames[-1]) + + frames[-1].save(OUT_HERO, format="PNG", optimize=True) + print(f"Wrote hero PNG: {OUT_HERO}") + + gifski = shutil.which("gifski") + if gifski: + print("Using gifski...") + tmp_dir = Path("/tmp/r4_gif_frames_tps_solarized") + if tmp_dir.exists(): + shutil.rmtree(tmp_dir) + tmp_dir.mkdir() + for idx, fr in enumerate(frames): + fr.save(tmp_dir / f"frame_{idx:04d}.png") + # Solarized dark has large flat-color background regions (base03 + # figure bg + base02 TX phase wash) that low gifski quality turns + # into visible speckle. Push quality to default=90, add --extra for + # the slower-but-cleaner palette path, and --lossy-quality=100 to + # disable the noise/streak dither that produces the speckle. + subprocess.run( + [gifski, "-o", str(OUT_GIF), "--fps", str(FPS), + "--width", str(int(FIG_W_IN * DPI)), + "--quality", "68", + "--lossy-quality", "75", + "--extra", + *sorted(str(p) for p in tmp_dir.glob("*.png"))], + check=True, + ) + shutil.rmtree(tmp_dir) + else: + print("gifski missing; using PIL...") + pal_frames = [f.quantize(colors=128, method=Image.MEDIANCUT, dither=Image.Dither.NONE) + for f in frames] + pal_frames[0].save(OUT_GIF, save_all=True, + append_images=pal_frames[1:], + duration=int(1000 / FPS), loop=0, + optimize=True, disposal=2) + print(f"Wrote GIF: {OUT_GIF} ({OUT_GIF.stat().st_size/1e6:.2f} MB)") + + +if __name__ == "__main__": + build_animation() diff --git a/benchmark/gifs/r4_gif_v17_solarized.py b/benchmark/gifs/r4_gif_v17_solarized.py new file mode 100644 index 00000000..b5afff5b --- /dev/null +++ b/benchmark/gifs/r4_gif_v17_solarized.py @@ -0,0 +1,610 @@ +#!/usr/bin/env python3 +"""Failure by MVCC — animated GIF v17 Solarized Dark (v16 dark palette swap). + +Ethan Schoonover's Solarized Dark palette. Backgrounds: base03 (#002b36) +for the figure + axes, base02 (#073642) for the TX phase-band wash and +gridlines. Text in three tiers: base0 (#839496) primary, base1 (#93a1a1) +emphasized (axis y-labels, column headers, now-cursor), base01 (#586e75) +de-emphasized (spines, phase-boundary axvlines, inactive phase labels). +Alert: Solarized red (#dc322f) for the hinge flash box + active TX-phase +label + TX ribbon. + +Line palette uses the Solarized accent colors: + pgque = blue #268bd2 (hero — most saturated) + pgq = cyan #2aa198 (sibling, cool family) + pgmq = red #dc322f + river = orange #cb4b16 + que = violet #6c71c4 + pgboss = green #859900 (warm yellow-green) +Yellow (#b58900) reserved for future highlighting. + +Edge handling: if a tail would fall off-axis (near t=0 or t=150), flip +that arrow's direction only so it stays on-axis; if flipping would make +the two tails coincide, rebalance with joint-clamp ±TAIL_DX_MIN. + +Everything else unchanged from v12: clean set_ylim(0, y_peak*1.05), +uniform legend, mono numbers, inline endpoint labels + anti-overlap, +no title, hinge flash, 6 systems. + +Outputs: + /tmp/bench_r4_24h/death_spiral.gif + /tmp/bench_r4_24h/death_spiral_hero.png +""" +import csv +import os +import re +import shutil +import subprocess +from datetime import datetime +from pathlib import Path + +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.transforms as mtransforms +import numpy as np +from matplotlib.ticker import FuncFormatter +from PIL import Image + +# ---- tunables ---------------------------------------------------------------- +ROOT = Path("/tmp/bench_r4_24h") +OUT_GIF = ROOT / "death_spiral.gif" +OUT_HERO = ROOT / "death_spiral_hero.png" +FPS = 25 +DURATION_S = 22 +FRAME_COUNT = FPS * DURATION_S +HOLD_FRAMES = FPS * 1 +FIG_W_IN, FIG_H_IN, DPI = 9.0, 7.0, 100 # 900x700 px +X_MAX_MIN = 150 +FLASH_FRAMES = 12 +LAT_CLIP_MS = 1200 +INLINE_MIN_DEAD = 50_000 # suppress inline dead-tuple labels below this +INLINE_MIN_LAT_MS = 10 # suppress inline latency labels below this +# ----------------------------------------------------------------------------- + +SYSTEMS = ["pgque", "pgq", "pgmq", "river", "que", "pgboss"] +# --- Solarized Dark palette (Ethan Schoonover) --------------------------- +# Backgrounds +BG = "#002b36" # base03 — primary figure + axes background +SURFACE = "#073642" # base02 — surface/highlight + TX phase-band wash +# Text tiers +FG = "#839496" # base0 — primary text (ticks, values, inline) +FG_EMPH = "#93a1a1" # base1 — emphasized (axis y-labels, column hdrs) +FG_DIM = "#586e75" # base01 — de-emphasized (inactive phase labels) +# Grid / rules / structure +GRID = SURFACE # grid matches surface: reads as structure +SPINE = FG_DIM # base01 for spines +PHASE_RULE = FG_DIM # base01 for boundary axvlines at t=30/90 +# Alert (Solarized's canonical red, no separate ember wash) +RED = "#dc322f" # hinge flash + TX-phase ribbon + active phase label +TX_BAND = SURFACE # TX phase wash = surface, not an alert hue +# Solarized accents (for reference): +# blue #268bd2 cyan #2aa198 red #dc322f magenta #d33682 +# orange #cb4b16 violet #6c71c4 green #859900 yellow #b58900 +# ------------------------------------------------------------------------- + +COLORS = { + "pgque": "#268bd2", # Solarized blue — hero, most saturated + "pgq": "#2aa198", # Solarized cyan — sibling (cool family) + "pgmq": "#dc322f", # Solarized red + "river": "#cb4b16", # Solarized orange + "que": "#6c71c4", # Solarized violet + "pgboss": "#859900", # Solarized green (warm yellow-green) +} +LW = {s: (3.0 if s == "pgque" else 2.2) for s in SYSTEMS} +EVENT_RE = { + "pgque": re.compile(r"^pgque\.event_\d+_\d+$"), + "pgq": re.compile(r"^pgq\.event_\d+_\d+$"), + "pgmq": re.compile(r"^pgmq\.q_bench_queue$"), + "river": re.compile(r"^public\.river_job$"), + "que": re.compile(r"^public\.que_jobs$"), + "pgboss": re.compile(r"^pgboss\.job_common$|^pgboss\.j[a-f0-9]+$"), +} +PHASE_BOUNDS = [(0, 30, "clean"), (30, 90, "tx"), (90, 150, "clean")] + +plt.rcParams.update({ + "font.family": ["Helvetica", "Arial", "DejaVu Sans"], + "font.size": 11, + "axes.spines.top": False, + "axes.spines.right": False, + "axes.spines.bottom": True, + "axes.spines.left": True, + "axes.edgecolor": SPINE, + "axes.linewidth": 0.9, + "axes.grid": True, + "grid.color": GRID, + "grid.linewidth": 0.8, + "axes.axisbelow": True, + # --- dark theme colors (applied to figure, axes, text, ticks) --- + "figure.facecolor": BG, + "axes.facecolor": BG, + "savefig.facecolor": BG, + "text.color": FG, + "axes.labelcolor": FG, + "axes.titlecolor": FG, + "xtick.color": FG, + "ytick.color": FG, +}) + + +def _mins_from(tss): + t0 = datetime.fromisoformat(tss[0].replace("Z", "")) + return np.array([ + (datetime.fromisoformat(t.replace("Z", "")) - t0).total_seconds() / 60 + for t in tss + ]) + + +def load_bloat(): + series = {} + for sysn in SYSTEMS: + p = ROOT / sysn / "bloat.csv" + if not p.exists(): + continue + by_ts = {} + with open(p) as f: + r = csv.reader(f) + next(r) + for row in r: + try: + if not EVENT_RE[sysn].match(row[1]): + continue + by_ts.setdefault(row[0], 0) + by_ts[row[0]] += int(row[3]) + except Exception: + pass + tss = sorted(by_ts) + if not tss: + continue + mins = _mins_from(tss) + dead = np.array([by_ts[t] for t in tss], dtype=float) + series[sysn] = {"mins": mins, "dead": dead} + return series + + +_PROGRESS_RE = re.compile( + r"^progress:\s*([\d.]+)\s*s,\s*[\d.]+\s*tps,\s*lat\s*([\d.]+)\s*ms\s*stddev\s*([\d.]+)" +) + + +def load_latency(): + series = {} + for sysn in SYSTEMS: + p = ROOT / sysn / "consumer.log" + if not p.exists(): + continue + mins, lat, std = [], [], [] + with open(p) as f: + for ln in f: + m = _PROGRESS_RE.match(ln) + if not m: + continue + t_s = float(m.group(1)) + mins.append(t_s / 60.0) + lat.append(float(m.group(2))) + std.append(float(m.group(3))) + if not mins: + continue + mins = np.array(mins); lat = np.array(lat); std = np.array(std) + series[sysn] = {"mins": mins, "lat": lat, "std": std} + return series + + +def fmt_y_dead(v, _): + if v >= 1e6: return f"{v/1e6:.1f}M" + if v >= 1e3: return f"{v/1e3:.0f}k" + return f"{v:.0f}" + + +def fmt_y_lat(v, _): + if v >= 1000: return f"{v/1000:.1f}s" + if v >= 10: return f"{v:.0f}ms" + if v >= 1: return f"{v:.1f}ms" + return f"{v:.2f}ms" # sub-ms: show real number, not 0 + + +def current_phase(now_min): + for start, end, kind in PHASE_BOUNDS: + if start <= now_min < end: + return kind, start, end + s, e, k = PHASE_BOUNDS[-1] + return k, s, e + + +def _apply_phase_bands(ax): + for start, end, kind in PHASE_BOUNDS: + if kind == "tx": + # Solarized: TX band uses surface (base02) at full alpha so + # the phase reads as a visibly-darker-than-bg region without + # any alert-color wash. + ax.axvspan(start, end, color=TX_BAND, alpha=1.0, zorder=0) + for b in (30, 90): + ax.axvline(x=b, color=PHASE_RULE, linewidth=0.8, zorder=0.5) + + +def _plot_series(ax, series, sysn, now_min, y_key, y_clip=None, + inline_threshold=None, inline_fmt=None, inline_out=None): + """Plot a series; return the endpoint y value. + + If inline_threshold/inline_fmt/inline_out are provided and the endpoint + y is above threshold, append a proposed inline-label dict to + inline_out for later anti-overlap placement in data coordinates. + """ + mins = series[sysn]["mins"] + y = series[sysn][y_key].copy() + if y_clip is not None: + y = np.minimum(y, y_clip) + mask = mins <= now_min + if mask.sum() < 1: + return None + x_sel = mins[mask]; y_sel = y[mask] + if mask.sum() < len(mins): + idx = mask.sum() + if 0 < idx < len(mins): + prev_x, next_x = mins[idx - 1], mins[idx] + if next_x > prev_x: + frac = min(max((now_min - prev_x) / (next_x - prev_x), 0), 1) + y_i = y[idx - 1] + frac * (y[idx] - y[idx - 1]) + x_sel = np.append(x_sel, now_min) + y_sel = np.append(y_sel, y_i) + alpha = 1.0 if sysn == "pgque" else 0.85 + ax.plot(x_sel, y_sel, color=COLORS[sysn], linewidth=LW[sysn], alpha=alpha, + zorder=3 if sysn == "pgque" else 2, solid_capstyle="round") + if len(x_sel) > 0: + x_end, y_end = x_sel[-1], y_sel[-1] + ax.plot([x_end], [y_end], marker="o", markersize=5, + color=COLORS[sysn], markeredgecolor=BG, + markeredgewidth=1.0, zorder=5) + if (inline_threshold is not None and inline_fmt is not None + and inline_out is not None and y_end >= inline_threshold): + inline_out.append({ + "sysn": sysn, + "x": x_end, + "y_true": y_end, + "text": f"{sysn}: {inline_fmt(y_end, None)}", + "color": COLORS[sysn], + "weight": "bold" if sysn == "pgque" else "normal", + }) + return y_sel[-1] if len(y_sel) else None + + +# v13: arrows form a "V" — pgque tail LEFT of cursor, pgq tail RIGHT of +# cursor. Both heads meet at the cursor on the pgque/pgq lines, so the +# two shafts diverge outward at ~45°. Labels sit directly under each tail +# (no separate stagger variable — label x = tail x). +# v14 spacing: tail at -0.06, label at -0.07 (label sits just below the +# arrow tail — much snugger than v13's -0.06 / -0.10). +TAIL_DX_MIN = 5.0 # tail offset in data-minutes; tuned for ~45° +ARROW_TAIL_AX_FRAC = -0.06 # unchanged from v13 +ARROW_LABEL_AX_FRAC = -0.07 # v14: raised from -0.10, snug under tail + + +def _draw_cursor_arrows(ax, series, y_key, y_peak, now_min): + """Draw two hairline up-arrows in a divergent "V" pattern. Head of each + arrow is on the pgque/pgq line at x = now_min (pure data coords). Tail + is offset outward by TAIL_DX_MIN minutes (pgque tail left, pgq tail + right) and below the axes by ARROW_TAIL_AX_FRAC in axes-fraction — so + each shaft tilts outward at ~45° in pixel space. + + If a tail would fall outside the axes x-range (near t=0 / t=150), flip + that arrow's direction so it stays on-axis, then widen the gap if the + flip would make the two labels collide. + """ + x_data_min, x_data_max = ax.get_xlim() + now_clamped = max(min(now_min, x_data_max), x_data_min) + + # Directional tail offset per system (pgque goes left, pgq goes right). + tail_dir = {"pgque": -1, "pgq": +1} + tail_x_raw = {s: now_clamped + tail_dir[s] * TAIL_DX_MIN for s in tail_dir} + + # Edge handling: if a tail is off-axis, flip it to the other side. If + # BOTH end up on the same side after flipping, spread them apart with + # ±TAIL_DX_MIN around the cursor (falling back to clamped positions). + for s in ("pgque", "pgq"): + if tail_x_raw[s] < x_data_min or tail_x_raw[s] > x_data_max: + tail_x_raw[s] = now_clamped - tail_dir[s] * TAIL_DX_MIN + # If the flip made them coincide / cross, rebalance. + if abs(tail_x_raw["pgque"] - tail_x_raw["pgq"]) < TAIL_DX_MIN: + # place them symmetrically around the cursor, then clamp joint-wise + lo = max(now_clamped - TAIL_DX_MIN, x_data_min) + hi = lo + 2 * TAIL_DX_MIN + if hi > x_data_max: + hi = x_data_max + lo = hi - 2 * TAIL_DX_MIN + tail_x_raw = {"pgque": lo, "pgq": hi} + # Final per-axis clamp (belt & suspenders). + tail_x_for = {s: max(min(tail_x_raw[s], x_data_max), x_data_min) + for s in ("pgque", "pgq")} + + # Blended transform: x = data, y = axes-fraction. + tail_trans = mtransforms.blended_transform_factory(ax.transData, ax.transAxes) + + for sysn in ("pgque", "pgq"): + if sysn not in series: + continue + mins = series[sysn]["mins"] + ys = series[sysn][y_key] + if len(mins) == 0 or now_clamped < mins[0] or now_clamped > mins[-1]: + continue + y_target = float(np.interp(now_clamped, mins, ys)) + weight = "bold" if sysn == "pgque" else "normal" + tail_x = tail_x_for[sysn] + # Arrow: head on the line (data coords at now_clamped), tail at + # (tail_x, ARROW_TAIL_AX_FRAC). Tilted ~45° because head x != tail x. + ax.annotate( + "", + xy=(now_clamped, y_target), + xycoords=ax.transData, + xytext=(tail_x, ARROW_TAIL_AX_FRAC), + textcoords=tail_trans, + arrowprops=dict( + arrowstyle="->", + color=COLORS[sysn], + linewidth=0.8, + shrinkA=0, shrinkB=2, + ), + annotation_clip=False, + zorder=12, + ) + # Label sits directly under its OWN tail. + ax.text(tail_x, ARROW_LABEL_AX_FRAC, sysn, + transform=tail_trans, + ha="center", va="top", + fontsize=8.5, color=COLORS[sysn], + fontweight=weight, + clip_on=False, zorder=12) + + +def _place_inline_labels(ax, proposals, y_peak, gap_frac=0.04): + """Top-down vertical nudge so labels don't overlap. + Drops any label pushed out of [0, y_peak]. Draws with monospace. + """ + if not proposals: + return + gap = y_peak * gap_frac + # Sort descending by true y; walk top-down, push each label down if too close + ordered = sorted(proposals, key=lambda p: -p["y_true"]) + prev_y = None + for p in ordered: + y = p["y_true"] + if prev_y is not None and y > prev_y - gap: + y = prev_y - gap + p["y_draw"] = y + prev_y = y + dx = X_MAX_MIN * 0.015 + for p in ordered: + y = p["y_draw"] + if y < 0 or y > y_peak: + continue # label forced out of frame — drop it + ax.text(p["x"] + dx, y, p["text"], + color=p["color"], fontsize=8.5, + fontfamily="monospace", fontweight=p["weight"], + ha="left", va="center", zorder=6) + + +def render_frame(bloat, latency, frame_idx, total_frames, y_peak_dead, y_peak_lat): + now_min = X_MAX_MIN * (frame_idx / max(total_frames - 1, 1)) + + fig = plt.figure(figsize=(FIG_W_IN, FIG_H_IN), dpi=DPI) + # No title block — top whitespace reclaimed for the axes. + # v12: hspace + bottom margin sized so each panel has figure-level + # whitespace BELOW its x-axis tick labels for the pgque/pgq arrows + # + labels (which live outside the axes via negative axes-fraction y). + # Tuned so top panel bottom lands at ~42% of figure height (not + # compressed) while leaving room below bot-panel for arrow+label. + gs = fig.add_gridspec(2, 1, hspace=0.45, top=0.95, bottom=0.10, + left=0.10, right=0.96, + height_ratios=[1.0, 0.58]) + ax1 = fig.add_subplot(gs[0]) + ax2 = fig.add_subplot(gs[1]) + + for ax in (ax1, ax2): + _apply_phase_bands(ax) + + # Phase labels on top axis only + for start, end, kind in PHASE_BOUNDS: + label = ("xmin horizon blocked · 1h" if kind == "tx" + else ("clean baseline · 30m" if start == 0 else "clean recovery · 1h")) + color = RED if kind == "tx" else FG_DIM + is_current = (start <= now_min < end) + weight = "bold" if (is_current or kind == "tx") else "normal" + alpha = 1.0 if is_current else (0.7 if kind == "tx" else 0.55) + ax1.text((start + end) / 2, 1.04, label, + transform=ax1.get_xaxis_transform(), + ha="center", color=color, fontsize=10, + fontweight=weight, alpha=alpha) + + # --- panel 1: dead tuples --- + dead_vals = {} + dead_inline = [] + for sysn in SYSTEMS: + if sysn in bloat: + v = _plot_series(ax1, bloat, sysn, now_min, "dead", + inline_threshold=INLINE_MIN_DEAD, + inline_fmt=fmt_y_dead, + inline_out=dead_inline) + if v is not None: + dead_vals[sysn] = v + _place_inline_labels(ax1, dead_inline, y_peak_dead) + _draw_cursor_arrows(ax1, bloat, "dead", y_peak_dead, now_min) + + # --- panel 2: latency --- + lat_vals = {} + lat_inline = [] + for sysn in SYSTEMS: + if sysn in latency: + v = _plot_series(ax2, latency, sysn, now_min, "lat", y_clip=LAT_CLIP_MS, + inline_threshold=INLINE_MIN_LAT_MS, + inline_fmt=fmt_y_lat, + inline_out=lat_inline) + if v is not None: + lat_vals[sysn] = v + _place_inline_labels(ax2, lat_inline, y_peak_lat) + _draw_cursor_arrows(ax2, latency, "lat", y_peak_lat, now_min) + + # Hinge flash — dark theme: black-bg + bright-red border + bright-red text + hinge_min = 30 + frames_since_hinge = (now_min - hinge_min) / X_MAX_MIN * (total_frames - 1) + if 0 <= frames_since_hinge <= FLASH_FRAMES: + flash_alpha = 1.0 - (frames_since_hinge / FLASH_FRAMES) + ax1.axvline(x=hinge_min, color=RED, linewidth=3.5, + alpha=flash_alpha * 0.85, zorder=6) + ax2.axvline(x=hinge_min, color=RED, linewidth=3.5, + alpha=flash_alpha * 0.85, zorder=6) + ax1.text(hinge_min + 1, y_peak_dead * 0.92, + "xmin horizon blocked\nvacuum can't advance", + color=RED, fontsize=11, fontweight="bold", + alpha=flash_alpha, zorder=7, + bbox=dict(boxstyle="round,pad=0.3", facecolor=BG, + edgecolor=RED, linewidth=1.2, alpha=flash_alpha)) + + # Now cursor on both panels + for ax, y_peak in ((ax1, y_peak_dead), (ax2, y_peak_lat)): + ax.axvline(x=now_min, color=FG_EMPH, linewidth=1.2, alpha=0.70, zorder=8) + ax.plot([now_min], [y_peak * 1.005], marker="v", color=FG_EMPH, + markersize=7, zorder=9, clip_on=False) + + # Phase / time readout + kind, p_start, p_end = current_phase(now_min) + ribbon_color = RED if kind == "tx" else FG_DIM + ribbon_label = ("XMIN HORIZON BLOCKED (vacuum stalled)" if kind == "tx" + else ("CLEAN BASELINE" if p_start == 0 else "CLEAN RECOVERY")) + # Footer: phase label stays sans; t = ... is mono for stable digit columns. + # Compose as two adjacent fig.text calls anchored around figure center. + phase_text = f"phase: {ribbon_label}" + time_text = f"t = {now_min:5.1f} min / 150 min" + fig.text(0.5, 0.015, phase_text + " | ", + ha="right", fontsize=10, color=ribbon_color, fontweight="bold") + fig.text(0.5, 0.015, time_text, + ha="left", fontsize=10, color=ribbon_color, fontweight="bold", + fontfamily="monospace") + + # Axes + for ax in (ax1, ax2): + ax.set_xlim(-1, X_MAX_MIN) + ax.set_xticks([0, 30, 60, 90, 120, 150]) + ax.set_xticklabels(["0", "30m", "1h", "1h30", "2h", "2h30"]) + # v8: no below-axis headroom (start at 0), but keep '0' tick label hidden + # so the baseline gridline stays without the "0" text crowding the axis. + # v12: clean axis bounds — no negative headroom. The arrows live + # OUTSIDE the axes rectangle via mixed (data, axes-fraction) transforms + # and annotation_clip=False, so the plot data area is not compressed. + ax1.set_ylim(0, y_peak_dead * 1.05) + ax1.yaxis.set_major_formatter(FuncFormatter(fmt_y_dead)) + ax1.set_ylabel("event-table dead tuples", labelpad=8) + + ax2.set_ylim(0, y_peak_lat * 1.05) + ax2.yaxis.set_major_formatter(FuncFormatter(fmt_y_lat)) + ax2.set_ylabel("consumer latency (mean)", labelpad=8) + + # Mono-only for numeric y-tick labels; hide the '0' tick. + for ax in (ax1, ax2): + for lbl, pos in zip(ax.get_yticklabels(), ax.get_yticks()): + lbl.set_fontfamily("monospace") + if pos == 0: + lbl.set_visible(False) + + # Legend: uniform row grid, baseline-aligned, two cells per row. + # Name (sans) right-aligned at x=0.88; value (mono) right-aligned at x=0.985. + # Using va="baseline" for every cell so sans and mono glyphs sit on the + # same typographic baseline regardless of family metrics. + # ROW_SPACING is in axes-fraction — bottom panel is ~0.58x the height of + # top, so we scale to keep visual row rhythm consistent in physical pixels. + def _draw_legend(ax, vals, fmt, header, row_spacing, header_gap): + header_y = 0.97 + rows_start = header_y - header_gap + ax.text(0.985, header_y, header, + transform=ax.transAxes, ha="right", va="baseline", + color=FG_EMPH, fontsize=8.5, zorder=10) + visible = [s for s in SYSTEMS if s in vals] + for i, sysn in enumerate(visible): + cur = vals[sysn] + weight = "bold" if sysn == "pgque" else "normal" + y = rows_start - i * row_spacing + ax.text(0.88, y, sysn, + transform=ax.transAxes, ha="right", va="baseline", + color=COLORS[sysn], fontsize=9, fontweight=weight, + zorder=10) + ax.text(0.985, y, fmt(cur, None), + transform=ax.transAxes, ha="right", va="baseline", + color=COLORS[sysn], fontsize=9, fontweight=weight, + fontfamily="monospace", zorder=10) + + # Scale row spacing by inverse of panel height so visual pixel rhythm + # matches between the hero (top) and supporting (bottom) panels. + _draw_legend(ax1, dead_vals, fmt_y_dead, "dead tuples (event tables)", + row_spacing=0.065, header_gap=0.05) + _draw_legend(ax2, lat_vals, fmt_y_lat, "latency (mean)", + row_spacing=0.112, header_gap=0.086) + + fig.canvas.draw() + buf = np.asarray(fig.canvas.buffer_rgba()) + img = Image.fromarray(buf).convert("RGB") + plt.close(fig) + return img + + +def build_animation(): + print("Loading bloat CSVs...") + bloat = load_bloat() + print(f" bloat series: {sorted(bloat.keys())}") + print("Loading latency from consumer logs...") + latency = load_latency() + print(f" latency series: {sorted(latency.keys())}") + + y_peak_dead = max(bloat[s]["dead"].max() for s in bloat) + peak_lat_observed = max( + min(latency[s]["lat"].max(), LAT_CLIP_MS) for s in latency + ) + y_peak_lat = peak_lat_observed + print(f" y_peak dead = {y_peak_dead:.0f}, y_peak lat = {y_peak_lat:.1f} ms") + + print(f"Rendering {FRAME_COUNT} frames ({FIG_W_IN*DPI:.0f}x{FIG_H_IN*DPI:.0f} px)...") + frames = [] + for i in range(FRAME_COUNT): + if i % 50 == 0: + print(f" frame {i}/{FRAME_COUNT}") + frames.append(render_frame(bloat, latency, i, FRAME_COUNT, y_peak_dead, y_peak_lat)) + for _ in range(HOLD_FRAMES): + frames.append(frames[-1]) + + frames[-1].save(OUT_HERO, format="PNG", optimize=True) + print(f"Wrote hero PNG: {OUT_HERO}") + + gifski = shutil.which("gifski") + if gifski: + print("Using gifski...") + tmp_dir = Path("/tmp/r4_gif_frames_v17_solarized") + if tmp_dir.exists(): + shutil.rmtree(tmp_dir) + tmp_dir.mkdir() + for idx, fr in enumerate(frames): + fr.save(tmp_dir / f"frame_{idx:04d}.png") + # Solarized dark has large flat-color background regions (base03 + # figure bg + base02 TX phase wash) that low gifski quality turns + # into visible speckle. Push quality to default=90, add --extra for + # the slower-but-cleaner palette path, and --lossy-quality=100 to + # disable the noise/streak dither that produces the speckle. + subprocess.run( + [gifski, "-o", str(OUT_GIF), "--fps", str(FPS), + "--width", str(int(FIG_W_IN * DPI)), + "--quality", "68", + "--lossy-quality", "75", + "--extra", + *sorted(str(p) for p in tmp_dir.glob("*.png"))], + check=True, + ) + shutil.rmtree(tmp_dir) + else: + print("gifski missing; using PIL...") + pal_frames = [f.quantize(colors=128, method=Image.MEDIANCUT, dither=Image.Dither.NONE) + for f in frames] + pal_frames[0].save(OUT_GIF, save_all=True, + append_images=pal_frames[1:], + duration=int(1000 / FPS), loop=0, + optimize=True, disposal=2) + print(f"Wrote GIF: {OUT_GIF} ({OUT_GIF.stat().st_size/1e6:.2f} MB)") + + +if __name__ == "__main__": + build_animation() diff --git a/benchmark/install/README.md b/benchmark/install/README.md new file mode 100644 index 00000000..3517d838 --- /dev/null +++ b/benchmark/install/README.md @@ -0,0 +1,19 @@ +# Install scripts + +All seven VMs share `bootstrap.sh` as the base — installs PG18 (PGDG) + pg_cron + pg_stat_statements + pg_ash + pgfr, applies postgresql.conf tuning, enables local trust auth, restarts PG. + +Then each VM layers a system-specific install on top: + +| System | Per-system installer | Notes | +|---|---|---| +| pgque | — (cloned + `make install` in userdata before bootstrap — PR #62 branch) | The pgque binary install happens in AMI user-data; bootstrap only prepares PG. Actual pgque SQL install: `git clone NikolayS/pgque -b feat/rotation`, `make USE_PGXS=1 install`, `psql -f build/pgque.sql`, `SELECT pgque.create_queue('bench_queue')`. | +| pgq | [install_pgq.sh](install_pgq.sh) | PL-only fork via `switch_plonly.sql`. | +| pgmq | [install_pgmq.sh](install_pgmq.sh) | SQL-only install from tembo-io/pgmq v1.11.0 raw URL. | +| pgmq-partitioned | [install_pgmq.sh](install_pgmq.sh) + `pgmq.create_partitioned(...)` + pg_partman | See [pgmq-partitioned_setup_5min.sql](pgmq-partitioned_setup_5min.sql) for the partman cron schedule. | +| river | [install_river.sh](install_river.sh) | `go install` the CLI, run `river migrate-up`. | +| que | — (Ruby gem installed in VM userdata) | `gem install que -v 2.4.x`, `bundle exec que:install`. The VM userdata handles this; the bench only consumes the schema. | +| pgboss | [install_pgboss.sh](install_pgboss.sh) | `npm install pg-boss@12.15` locally on the VM, run `new PgBoss(DSN).start()` once to migrate schema. | + +There is **no** `install_que.sh` or `install_pgmq-partitioned.sh` as separate shell files — the que gem install is driven from the AMI user-data (Ruby + bundler), and pgmq-partitioned reuses `install_pgmq.sh` + `pg_partman` + a single-call SQL script. + +Reset procedure between runs: [../runners/clean_reinstall.sh](../runners/clean_reinstall.sh). See [../OPS_GOTCHAS.md §4, §5, §7](../OPS_GOTCHAS.md) for the schema-drop pitfalls. diff --git a/benchmark/install/bootstrap.sh b/benchmark/install/bootstrap.sh new file mode 100644 index 00000000..84c6c8ad --- /dev/null +++ b/benchmark/install/bootstrap.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +set -euo pipefail +export DEBIAN_FRONTEND=noninteractive + +echo "=== [$(hostname)] bootstrap start $(date -u +%FT%TZ) ===" + +# Wait for any cloud-init apt locks to clear +for i in $(seq 1 60); do + if ! sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; then break; fi + echo " waiting for dpkg lock..." + sleep 5 +done + +sudo apt-get update -qq +sudo apt-get install -y curl gnupg lsb-release build-essential git python3-pip python3-psycopg2 tmux jq -qq + +# PGDG +curl -fsSL https://www.postgresql.org/media/keys/ACCC4CF8.asc \ + | sudo gpg --dearmor -o /usr/share/keyrings/postgresql.gpg +echo "deb [signed-by=/usr/share/keyrings/postgresql.gpg] https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" \ + | sudo tee /etc/apt/sources.list.d/postgresql.list >/dev/null + +sudo apt-get update -qq +sudo apt-get install -y postgresql-18 postgresql-server-dev-18 postgresql-18-cron -qq + +sudo pip3 install psycopg2-binary --break-system-packages >/dev/null 2>&1 || true + +# pg_ash (pure SQL, no make) +sudo rm -rf /tmp/pg_ash +sudo git clone https://github.com/NikolayS/pg_ash /tmp/pg_ash + +# pg-flight-recorder (pure SQL, no make) +sudo rm -rf /tmp/pgfr +sudo git clone https://github.com/NikolayS/pg-flight-recorder /tmp/pgfr +sudo ln -sfn /tmp/pgfr/_record/sql /record_sql +sudo ln -sfn /tmp/pgfr/_analyze/sql /analyze_sql + +sudo mkdir -p /data +sudo chown postgres:postgres /data + +# ── postgresql.conf tuning (same as issue #77) ─────────────────────────────── +sudo tee -a /etc/postgresql/18/main/postgresql.conf >/dev/null <<'CONF' + +# ── pgq bench tuning (issue #77) ───────────────────────────────────────────── +shared_preload_libraries = 'pg_stat_statements,pg_cron' +cron.database_name = 'bench' + +shared_buffers = 4GB +effective_cache_size = 12GB + +synchronous_commit = off +wal_level = minimal +wal_compression = lz4 +max_wal_size = 16GB +checkpoint_completion_target = 0.9 + +bgwriter_delay = 50ms +bgwriter_lru_maxpages = 400 +bgwriter_lru_multiplier = 4.0 + +random_page_cost = 1.1 +effective_io_concurrency = 200 +max_connections = 200 + +max_wal_senders = 0 + +autovacuum_vacuum_scale_factor = 0.01 +autovacuum_analyze_scale_factor = 0.01 +autovacuum_vacuum_cost_delay = 2ms + +jit = off + +listen_addresses = 'localhost' +CONF + +# trust local for bench workers +sudo sed -i '/^host.*127.0.0.1.*scram-sha-256/i host all postgres 127.0.0.1/32 trust' \ + /etc/postgresql/18/main/pg_hba.conf + +sudo systemctl restart postgresql@18-main + +# create bench DB +sudo -u postgres psql -c "CREATE DATABASE bench;" +sudo -u postgres psql -d bench -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements; CREATE EXTENSION IF NOT EXISTS pg_cron;" + +# pg_ash install + start sampling +sudo -u postgres psql -d bench -f /tmp/pg_ash/sql/ash-install.sql >/dev/null +sudo -u postgres psql -d bench -c "SELECT * FROM ash.start();" >/dev/null + +# pgfr install +sudo -u postgres psql -d bench -f /tmp/pgfr/_record/install.sql >/dev/null 2>&1 || true +sudo -u postgres psql -d bench -f /tmp/pgfr/_analyze/install.sql >/dev/null 2>&1 || true +sudo -u postgres psql -d bench -f /tmp/pgfr/_control/install.sql >/dev/null 2>&1 || true + +echo "=== [$(hostname)] bootstrap DONE $(date -u +%FT%TZ) ===" diff --git a/benchmark/install/install_pgboss.sh b/benchmark/install/install_pgboss.sh new file mode 100644 index 00000000..5a04d458 --- /dev/null +++ b/benchmark/install/install_pgboss.sh @@ -0,0 +1,70 @@ +#!/usr/bin/env bash +# pgboss install with covering index for polling query (avoids 7.9 GB disk sort pathology). +set -euo pipefail +echo "=== install pg-boss v12.15.0 + Node worker + covering index ===" + +curl -fsSL https://deb.nodesource.com/setup_20.x | sudo bash - >/dev/null 2>&1 +sudo apt-get install -y nodejs -qq +sudo npm install -g pg-boss@12.15.0 >/dev/null 2>&1 + +# Schema + queue creation +sudo NODE_PATH=/usr/lib/node_modules node -e " +const { PgBoss } = require('pg-boss'); +const boss = new PgBoss('postgres://postgres@127.0.0.1/bench'); +boss.start().then(async () => { + await boss.createQueue('bench_queue'); + console.log('pgboss schema + bench_queue ready'); + process.exit(0); +}).catch(e => { console.error(e.message); process.exit(1); }); +" + +# Covering index for the polling query ORDER BY priority DESC, created_on, id WHERE state<'active' +# pgboss.job is PARTITIONED — must target the specific leaf partition for bench_queue. +# The partition name is j. +sudo -u postgres psql -d bench > /tmp/pgboss_index.log 2>&1 <<'SQL' +DO $$ +DECLARE + partition_name text; +BEGIN + -- Find the leaf partition for bench_queue by asking pg_inherits. + -- bench_queue was created via boss.createQueue; pg-boss 12 creates a named partition per queue. + SELECT c.relname INTO partition_name + FROM pg_inherits i + JOIN pg_class c ON c.oid = i.inhrelid + JOIN pg_class p ON p.oid = i.inhparent + JOIN pg_namespace np ON np.oid = p.relnamespace + WHERE np.nspname = 'pgboss' AND p.relname = 'job' + AND pg_get_expr(c.relpartbound, c.oid) LIKE '%bench_queue%' + LIMIT 1; + + IF partition_name IS NULL THEN + -- Fallback: try job_common (shared partition when no dedicated one exists) + partition_name := 'job_common'; + END IF; + + RAISE NOTICE 'Creating covering index on pgboss.%', partition_name; + EXECUTE format( + 'CREATE INDEX IF NOT EXISTS idx_bench_polling ON pgboss.%I (name, priority DESC, created_on, id) WHERE state < ''active''', + partition_name); +END +$$; + +-- Verify: +SELECT schemaname, tablename, indexname FROM pg_indexes +WHERE schemaname='pgboss' AND indexname='idx_bench_polling'; +SQL +cat /tmp/pgboss_index.log + +# Worker script +sudo tee /root/pgboss_worker.js >/dev/null <<'EOF' +const { PgBoss } = require('pg-boss'); +const boss = new PgBoss('postgres://postgres@127.0.0.1/bench'); +boss.start().then(() => { + boss.work('bench_queue', { teamSize: 8, pollingIntervalSeconds: 0.5 }, async jobs => Promise.resolve()); + console.log('pgboss worker running'); +}).catch(e => { console.error(e); process.exit(1); }); +process.on('SIGTERM', () => boss.stop({ graceful: false }).then(() => process.exit(0))); +process.on('SIGINT', () => boss.stop({ graceful: false }).then(() => process.exit(0))); +EOF + +echo "=== pgboss installed + covering index ===" diff --git a/benchmark/install/install_pgmq.sh b/benchmark/install/install_pgmq.sh new file mode 100644 index 00000000..0b9a802e --- /dev/null +++ b/benchmark/install/install_pgmq.sh @@ -0,0 +1,11 @@ +#!/usr/bin/env bash +set -euo pipefail +echo "=== install pgmq v1.11.0 (SQL-only) ===" + +curl -fsSL \ + https://raw.githubusercontent.com/tembo-io/pgmq/v1.11.0/pgmq-extension/sql/pgmq.sql \ + -o /tmp/pgmq.sql + +sudo -u postgres psql -d bench -f /tmp/pgmq.sql +sudo -u postgres psql -d bench -c "SELECT pgmq.create('bench_queue');" +echo "=== pgmq installed ===" diff --git a/benchmark/install/install_pgq.sh b/benchmark/install/install_pgq.sh new file mode 100644 index 00000000..f8ab8823 --- /dev/null +++ b/benchmark/install/install_pgq.sh @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +set -euo pipefail +echo "=== install pgq v3.5.1 ===" + +sudo rm -rf /tmp/pgq +sudo git clone --branch v3.5.1 --depth 1 https://github.com/pgq/pgq /tmp/pgq +cd /tmp/pgq +sudo make USE_PGXS=1 PG_CONFIG=/usr/lib/postgresql/18/bin/pg_config >/dev/null +sudo make USE_PGXS=1 install >/dev/null + +sudo -u postgres psql -d bench -c "CREATE EXTENSION pgq;" +# Switch to PL-only mode — replace C insert_event_raw with PL/pgSQL version +sudo -u postgres psql -d bench -f /tmp/pgq/sql/switch_plonly.sql + +# Verify PL-only +lang=$(sudo -u postgres psql -d bench -Atc \ + "SELECT l.lanname FROM pg_proc p JOIN pg_language l ON p.prolang=l.oid WHERE proname='insert_event_raw' AND pronamespace=(SELECT oid FROM pg_namespace WHERE nspname='pgq');") +if [ "$lang" != "plpgsql" ]; then + echo "ERROR: pgq still on C (lang=$lang)"; exit 1 +fi +echo "pgq PL-only confirmed (lang=$lang)" + +sudo -u postgres psql -d bench -c "SELECT pgq.create_queue('bench_queue');" +sudo -u postgres psql -d bench -c "SELECT pgq.ticker('bench_queue');" +sudo -u postgres psql -d bench -c "SELECT pgq.register_consumer('bench_queue', 'bench_consumer');" +# Ticker via pg_cron (60s interval, bench.py also ticks inline) +sudo -u postgres psql -d bench -c "SELECT cron.schedule('pgq-ticker', '* * * * *', \$\$SELECT pgq.ticker()\$\$);" +# And maint (rotation etc) +sudo -u postgres psql -d bench -c "SELECT cron.schedule('pgq-maint', '* * * * *', \$\$SELECT pgq.maint_operations()\$\$);" || true +echo "=== pgq installed ===" diff --git a/benchmark/install/install_river.sh b/benchmark/install/install_river.sh new file mode 100644 index 00000000..afdf8ee2 --- /dev/null +++ b/benchmark/install/install_river.sh @@ -0,0 +1,80 @@ +#!/usr/bin/env bash +set -euo pipefail +echo "=== install river v0.34.0 + Go worker ===" + +# i4i.2xlarge is x86_64 +curl -fsSL https://go.dev/dl/go1.22.4.linux-amd64.tar.gz | sudo tar -C /usr/local -xz +export PATH=$PATH:/usr/local/go/bin +echo 'export PATH=$PATH:/usr/local/go/bin:/root/go/bin' | sudo tee -a /root/.bashrc >/dev/null + +sudo bash -c "export PATH=\$PATH:/usr/local/go/bin; go install github.com/riverqueue/river/cmd/river@v0.34.0" + +sudo -u postgres psql -d bench -c "CREATE EXTENSION IF NOT EXISTS pgcrypto;" +sudo /root/go/bin/river migrate-up --database-url "postgres://postgres@127.0.0.1/bench" + +# Go worker program (river CLI has no 'work' subcommand in 0.34) +sudo mkdir -p /root/riverworker +sudo tee /root/riverworker/go.mod >/dev/null <<'EOF' +module riverworker + +go 1.22 +EOF + +sudo tee /root/riverworker/main.go >/dev/null <<'EOF' +package main + +import ( + "context" + "log" + "os/signal" + "syscall" + + "github.com/jackc/pgx/v5/pgxpool" + "github.com/riverqueue/river" + "github.com/riverqueue/river/riverdriver/riverpgxv5" +) + +type BenchArgs struct{} + +func (BenchArgs) Kind() string { return "bench" } + +type BenchWorker struct { + river.WorkerDefaults[BenchArgs] +} + +func (w *BenchWorker) Work(ctx context.Context, job *river.Job[BenchArgs]) error { + return nil +} + +func main() { + ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) + defer cancel() + + pool, err := pgxpool.New(ctx, "postgres://postgres@127.0.0.1/bench?sslmode=disable") + if err != nil { + log.Fatal(err) + } + defer pool.Close() + + workers := river.NewWorkers() + river.AddWorker(workers, &BenchWorker{}) + + client, err := river.NewClient(riverpgxv5.New(pool), &river.Config{ + Queues: map[string]river.QueueConfig{"default": {MaxWorkers: 8}}, + Workers: workers, + }) + if err != nil { + log.Fatal(err) + } + + if err := client.Start(ctx); err != nil { + log.Fatal(err) + } + <-ctx.Done() +} +EOF + +cd /root/riverworker +sudo bash -c "cd /root/riverworker && export PATH=\$PATH:/usr/local/go/bin && go mod tidy && go build -o /root/riverworker/bench_worker main.go" + +echo "=== river installed, worker binary at /root/riverworker/bench_worker ===" diff --git a/benchmark/install/pgmq-partitioned_setup_5min.sql b/benchmark/install/pgmq-partitioned_setup_5min.sql new file mode 100644 index 00000000..f1fd1ffb --- /dev/null +++ b/benchmark/install/pgmq-partitioned_setup_5min.sql @@ -0,0 +1,6 @@ +-- Schedule partman run_maintenance every minute (partitions are 5-min) +SELECT cron.schedule('partman_run', '* * * * *', 'CALL public.run_maintenance_proc()'); +-- Verify +SELECT jobname, schedule FROM cron.job WHERE jobname = 'partman_run'; +-- Test ash +SELECT count(*) FROM ash.samples(p_interval => '1 minute'::interval, p_limit => 10); diff --git a/benchmark/producers/producer_pgboss.sql b/benchmark/producers/producer_pgboss.sql new file mode 100644 index 00000000..e4b90933 --- /dev/null +++ b/benchmark/producers/producer_pgboss.sql @@ -0,0 +1,2 @@ +INSERT INTO pgboss.job(name,data,priority) +VALUES('bench_queue', json_build_object('data', repeat('x', 200))::jsonb, 0); diff --git a/benchmark/producers/producer_pgmq-partitioned.sql b/benchmark/producers/producer_pgmq-partitioned.sql new file mode 100644 index 00000000..27efcbb4 --- /dev/null +++ b/benchmark/producers/producer_pgmq-partitioned.sql @@ -0,0 +1 @@ +SELECT pgmq.send('bench_queue', json_build_object('data', repeat('x', 200))::jsonb); diff --git a/benchmark/producers/producer_pgmq.sql b/benchmark/producers/producer_pgmq.sql new file mode 100644 index 00000000..27efcbb4 --- /dev/null +++ b/benchmark/producers/producer_pgmq.sql @@ -0,0 +1 @@ +SELECT pgmq.send('bench_queue', json_build_object('data', repeat('x', 200))::jsonb); diff --git a/benchmark/producers/producer_pgq.sql b/benchmark/producers/producer_pgq.sql new file mode 100644 index 00000000..b816c778 --- /dev/null +++ b/benchmark/producers/producer_pgq.sql @@ -0,0 +1 @@ +SELECT pgq.insert_event('bench_queue', 'bench', repeat('x', 200)); diff --git a/benchmark/producers/producer_pgque.sql b/benchmark/producers/producer_pgque.sql new file mode 100644 index 00000000..2f5bdfd2 --- /dev/null +++ b/benchmark/producers/producer_pgque.sql @@ -0,0 +1 @@ +SELECT pgque.send('bench_queue', json_build_object('data', repeat('x', 200))::jsonb); diff --git a/benchmark/producers/producer_que.sql b/benchmark/producers/producer_que.sql new file mode 100644 index 00000000..fe426577 --- /dev/null +++ b/benchmark/producers/producer_que.sql @@ -0,0 +1,4 @@ +INSERT INTO public.que_jobs(job_class, args, priority, queue, run_at, job_schema_version) +VALUES('Bench', + jsonb_build_array(jsonb_build_object('data', repeat('x', 200))), + 100, 'default', now(), 2); diff --git a/benchmark/producers/producer_river.sql b/benchmark/producers/producer_river.sql new file mode 100644 index 00000000..71aaf4c8 --- /dev/null +++ b/benchmark/producers/producer_river.sql @@ -0,0 +1,2 @@ +INSERT INTO river_job(kind,args,max_attempts,queue,state) +VALUES('bench', json_build_object('data', repeat('x', 200))::jsonb, 25, 'default', 'available'); diff --git a/benchmark/runners/clean_reinstall.sh b/benchmark/runners/clean_reinstall.sh new file mode 100644 index 00000000..30e54f53 --- /dev/null +++ b/benchmark/runners/clean_reinstall.sh @@ -0,0 +1,76 @@ +#!/usr/bin/env bash +# Per-system clean reinstall + stats reset. +set -uo pipefail +SYS=${1:?system} + +echo "=== [$(date -u +%FT%TZ)] clean reinstall: $SYS ===" + +# Kill any leftover processes +sudo pkill -f 'idle_in_tx' 2>/dev/null || true +sudo pkill -f 'bench.py' 2>/dev/null || true +sudo pkill -f 'bloat_sampler' 2>/dev/null || true +sudo pkill -f 'pgq_ticker_daemon' 2>/dev/null || true +sudo pkill -f 'bench_worker' 2>/dev/null || true +sudo pkill -f 'que_worker' 2>/dev/null || true +sudo pkill -f 'pgboss_worker' 2>/dev/null || true +sleep 2 + +# Terminate any dangling backends +sudo -u postgres psql -d bench -c " + SELECT pg_terminate_backend(pid) + FROM pg_stat_activity + WHERE datname='bench' AND pid<>pg_backend_pid() + AND (state='idle in transaction' OR application_name LIKE 'idle_in_tx%' OR application_name LIKE 'pgq_ticker%'); +" >/dev/null 2>&1 + +# Drop queue-specific artifacts +case $SYS in + pgque) + sudo -u postgres psql -d bench -c "DROP SCHEMA IF EXISTS pgque CASCADE" >/dev/null + # Also unschedule any leftover pg_cron jobs named pgque_* + sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgque%'" | while read j; do + [ -n "$j" ] && sudo -u postgres psql -d bench -c "SELECT cron.unschedule('$j')" >/dev/null + done + ;; + pgq) + sudo -u postgres psql -d bench -c "DROP EXTENSION IF EXISTS pgq CASCADE; DROP SCHEMA IF EXISTS pgq CASCADE" >/dev/null + sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgq%'" | while read j; do + [ -n "$j" ] && sudo -u postgres psql -d bench -c "SELECT cron.unschedule('$j')" >/dev/null + done + ;; + pgmq) + sudo -u postgres psql -d bench -c "DROP SCHEMA IF EXISTS pgmq CASCADE" >/dev/null + ;; + river) + sudo -u postgres psql -d bench -c " + DO \$\$ DECLARE t text; + BEGIN + FOR t IN SELECT tablename FROM pg_tables WHERE schemaname='public' AND tablename LIKE 'river_%' + LOOP EXECUTE 'DROP TABLE IF EXISTS '||t||' CASCADE'; END LOOP; + END \$\$; + " >/dev/null + ;; + que) + sudo -u postgres psql -d bench -c " + DO \$\$ DECLARE t text; + BEGIN + FOR t IN SELECT tablename FROM pg_tables WHERE schemaname='public' AND tablename LIKE 'que_%' + LOOP EXECUTE 'DROP TABLE IF EXISTS '||t||' CASCADE'; END LOOP; + END \$\$; + " >/dev/null + ;; + pgboss) + sudo -u postgres psql -d bench -c "DROP SCHEMA IF EXISTS pgboss CASCADE" >/dev/null + ;; +esac + +# Re-run install +bash /tmp/install.sh + +# Full stats reset AFTER reinstall so the reinstall's own writes don't count +sudo -u postgres psql -d bench -f /tmp/full_reset.sql >/dev/null + +# ANALYZE for fresh stats +sudo -u postgres psql -d bench -c "ANALYZE" >/dev/null + +echo "=== [$(date -u +%FT%TZ)] clean reinstall done: $SYS ===" diff --git a/benchmark/runners/fix_nvme_mount.sh b/benchmark/runners/fix_nvme_mount.sh new file mode 100644 index 00000000..74bebc19 --- /dev/null +++ b/benchmark/runners/fix_nvme_mount.sh @@ -0,0 +1,65 @@ +#!/bin/bash +# Format + mount NVMe instance store, move PG data dir to NVMe, re-setup symlink. +set -euo pipefail +DEV=/dev/nvme1n1 +MNT=/mnt/pgdata +PG_DATA_SRC=/var/lib/postgresql/18/main +PG_DATA_DST=$MNT/postgresql/18/main + +echo "=== pre-flight ===" +lsblk | grep nvme1n1 +findmnt $MNT 2>&1 | head -1 || echo "$MNT not mounted" + +# Stop PG +echo "=== stopping postgresql ===" +sudo systemctl stop postgresql@18-main || true +sleep 2 + +# Format if needed +if ! sudo blkid $DEV 2>&1 | grep -q xfs; then + echo "=== formatting $DEV as xfs ===" + sudo mkfs.xfs -f $DEV +fi + +# Mount +echo "=== mounting $DEV → $MNT (noatime,nodiratime) ===" +sudo mkdir -p $MNT +# /mnt/pgdata may currently be a non-empty dir (has the 'bench' symlink target). +# Move contents aside first. +sudo mkdir -p /tmp/_pgdata_old +sudo mv $MNT/* /tmp/_pgdata_old/ 2>/dev/null || true +sudo mount -o noatime,nodiratime $DEV $MNT +# Restore any content (e.g. bench dir) +sudo mv /tmp/_pgdata_old/* $MNT/ 2>/dev/null || true +sudo rmdir /tmp/_pgdata_old 2>/dev/null || true + +# Create PG data path on NVMe +sudo mkdir -p $MNT/postgresql/18 +sudo chown postgres:postgres $MNT/postgresql $MNT/postgresql/18 + +# Move current data (if symlink points somewhere, follow) +REAL_SRC=$(sudo readlink -f $PG_DATA_SRC) +echo "=== PG data currently at: $REAL_SRC (mv to $PG_DATA_DST) ===" +if [ -d "$REAL_SRC" ] && [ "$REAL_SRC" != "$PG_DATA_DST" ]; then + sudo mv "$REAL_SRC" $PG_DATA_DST +fi +# Remove the source (it's either a symlink now or gone) +sudo rm -rf $PG_DATA_SRC +# Make parent +sudo mkdir -p $(dirname $PG_DATA_SRC) +# Symlink +sudo ln -s $PG_DATA_DST $PG_DATA_SRC +sudo chown -h postgres:postgres $PG_DATA_SRC + +# Start PG +echo "=== starting postgresql ===" +sudo systemctl start postgresql@18-main +sleep 3 +sudo systemctl status postgresql@18-main --no-pager | head -5 + +# Verify +echo "=== verify ===" +findmnt $MNT +df $MNT +sudo -u postgres psql -tAc "SHOW data_directory" +sudo stat --format='%F %N' $PG_DATA_SRC diff --git a/benchmark/runners/run_r7.sh b/benchmark/runners/run_r7.sh new file mode 100644 index 00000000..44eb9386 --- /dev/null +++ b/benchmark/runners/run_r7.sh @@ -0,0 +1,115 @@ +#!/usr/bin/env bash +# run_r7.sh [duration_s] +# R7: 1.5h bench (30m clean + 30m idle-in-tx + 30m recovery) +# Adds over R6: +# * sys_metrics_sampler.py running alongside bloat_sampler.py +# * pg_ash ASH + pgfr snapshots copied to CSV at end (schemas preserved) +# +# Same instrumented consumer NOTICE format as R6 for honest ev/s parsing. +set -uo pipefail +SYS=${1:?system} +DUR=${2:-5400} # 1.5h = 5400s + +R7_DIR=${R7_DIR:-/tmp/r7} +R6_DIR=${R6_DIR:-/tmp/r6} +# Consumer SQL may be in either /tmp/consumer.sql (pushed), /tmp/r7/consumer_.sql, or /tmp/r6/consumer_.sql +for c in "$R7_DIR/consumer_${SYS}.sql" "/tmp/consumer.sql" "$R6_DIR/consumer_${SYS}.sql"; do + if [[ -f "$c" ]]; then CONSUMER_SQL="$c"; break; fi +done +if [[ -z "${CONSUMER_SQL:-}" ]]; then + echo "ERROR: no consumer SQL found for $SYS" >&2 + exit 1 +fi +echo "consumer: $CONSUMER_SQL" + +mkdir -p /tmp/bench && chmod 777 /tmp/bench +rm -f /tmp/bench/* + +case $SYS in + pgque|pgq) CONS_C=1 ;; + *) CONS_C=4 ;; +esac + +# System-specific pre-run nudges +if [[ "$SYS" == "pgque" ]]; then + sudo -u postgres psql -d bench -c "UPDATE pgque.queue SET queue_rotation_period='30 seconds'::interval WHERE queue_name='bench_queue';" >/dev/null +fi +if [[ "$SYS" == "pgq" ]]; then + sudo -u postgres psql -d bench -c "UPDATE pgq.queue SET queue_rotation_period='30 seconds'::interval WHERE queue_name='bench_queue';" >/dev/null + [[ -f /tmp/pgq_ticker_daemon.py ]] && sudo -u postgres nohup python3 /tmp/pgq_ticker_daemon.py > /tmp/bench/pgq_ticker.log 2>&1 < /dev/null & + sleep 1 +fi + +sudo -u postgres psql -d bench -c "SELECT pg_stat_statements_reset()" >/dev/null + +# --- samplers --- +python3 /tmp/bloat_sampler.py --system "$SYS" --interval 30 --duration "$DUR" > /tmp/bench/bloat.csv & +BLOAT_PID=$! + +# NEW in R7: sys_metrics_sampler.py — CPU/mem/disk every 10s, NVMe instance-store device +python3 "$R7_DIR/sys_metrics_sampler.py" --interval 10 --duration "$DUR" --device nvme1n1 --out /tmp/bench/sys_metrics.csv > /tmp/bench/sys_metrics.log 2>&1 & +SYSM_PID=$! + +# pgss snapshot cross-check (carried over from R6) +python3 "$R6_DIR/pg_stat_statements_snapshot.py" \ + --dsn "host=127.0.0.1 dbname=bench user=postgres" \ + --interval 10 --duration "$DUR" \ + --out /tmp/bench/pgss_timeseries.csv \ + > /tmp/bench/pgss_snapshotter.log 2>&1 & +PGSS_PID=$! + +# Producer — R7 full: -R 5000 (was 1000 in smoke/R6) +pgbench -h 127.0.0.1 -U postgres -d bench -n -f /tmp/producer.sql \ + -c 1 -j 1 -R 5000 -T "$DUR" -P 30 \ + --aggregate-interval=10 --log --log-prefix=/tmp/bench/producer_agg \ + > /tmp/bench/producer.log 2>&1 & +PROD_PID=$! + +# Consumer (R6 instrumented SQL) +pgbench -h 127.0.0.1 -U postgres -d bench -n -f "$CONSUMER_SQL" \ + -c $CONS_C -j $CONS_C -T "$DUR" -P 30 \ + --aggregate-interval=10 --log --log-prefix=/tmp/bench/consumer_agg \ + > /tmp/bench/consumer.log 2>&1 & +CONS_PID=$! + +# Phase scheduler — only open idle_in_tx for long runs. For --duration < 3000s (smoke), skip. +if [[ "$DUR" -ge 3000 ]]; then +( + sleep 1800 + echo "[$(date -u +%FT%TZ)] opening idle_in_tx" >> /tmp/bench/phases.log + python3 /tmp/idle_in_tx.py > /tmp/bench/idle.log 2>&1 & + IDLE_PID=$! + sudo -u postgres psql -d bench -c "VACUUM VERBOSE" > /tmp/bench/vacuum_preTX.txt 2>&1 + sleep 1800 + echo "[$(date -u +%FT%TZ)] closing idle_in_tx" >> /tmp/bench/phases.log + kill $IDLE_PID 2>/dev/null + sudo -u postgres psql -d bench -c "VACUUM VERBOSE" > /tmp/bench/vacuum_postTX.txt 2>&1 +) & +fi + +wait $PROD_PID +wait $CONS_PID 2>/dev/null +kill $BLOAT_PID 2>/dev/null +kill $SYSM_PID 2>/dev/null +kill $PGSS_PID 2>/dev/null +sudo pkill -f idle_in_tx 2>/dev/null || true +[[ "$SYS" == "pgq" ]] && sudo pkill -f pgq_ticker_daemon 2>/dev/null || true + +# End-of-run copies +sudo -u postgres psql -d bench -c "COPY (SELECT sample_time, database_name, active_backends, wait_event, query_id FROM ash.samples(p_interval => '2 hour'::interval, p_limit => 2000000)) TO '/tmp/bench/ash.csv' CSV HEADER" 2>&1 | tee /tmp/bench/ash_copy.log + +# pgfr: dump the main snapshot tables (partitioned & v2 forms) +sudo -u postgres psql -d bench -c "COPY (SELECT * FROM pgfr_record.snapshots) TO '/tmp/bench/pgfr_snapshots.csv' CSV HEADER" 2>&1 | tee /tmp/bench/pgfr_snapshots_copy.log +sudo -u postgres psql -d bench -c "COPY (SELECT * FROM pgfr_record.table_snapshots) TO '/tmp/bench/pgfr_table_snapshots.csv' CSV HEADER" 2>&1 | tee /tmp/bench/pgfr_tables_copy.log || true +sudo -u postgres psql -d bench -c "COPY (SELECT * FROM pgfr_record.statement_snapshots) TO '/tmp/bench/pgfr_statement_snapshots.csv' CSV HEADER" 2>&1 | tee /tmp/bench/pgfr_stmts_copy.log || true + +sudo -u postgres psql -d bench -c "COPY (SELECT query, calls, total_exec_time::bigint, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 100) TO '/tmp/bench/pgss.csv' CSV HEADER" + +# Post-process: produce events_consumed_per_sec.csv +python3 "$R6_DIR/parse_events_consumed.py" \ + --bench-dir /tmp/bench \ + --bucket 1 \ + --system "$SYS" \ + > /tmp/bench/events_consumed_parse.log 2>&1 || true + +echo "=== R7 done: $SYS ===" diff --git a/benchmark/tooling/bloat_sampler.py b/benchmark/tooling/bloat_sampler.py new file mode 100644 index 00000000..3498ae85 --- /dev/null +++ b/benchmark/tooling/bloat_sampler.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +"""bloat_sampler.py — sample queue table bloat every N seconds. +For pgboss, also samples all partitions of pgboss.job. +Output CSV: ts,table,n_live_tup,n_dead_tup,heap_bytes,toast_bytes,index_bytes,total_bytes +""" +import argparse, time, sys +from datetime import datetime, timezone +import psycopg2 + +DSN = "host=127.0.0.1 dbname=bench user=postgres" + +# Per-system base queries; pgboss gets special treatment (partitions) +TABLES_QUERY = { + "pgque": """ + SELECT schemaname||'.'||relname FROM pg_stat_user_tables + WHERE schemaname = 'pgque' ORDER BY 1 + """, + "pgq": """ + SELECT schemaname||'.'||relname FROM pg_stat_user_tables + WHERE schemaname = 'pgq' ORDER BY 1 + """, + "pgmq": """ + SELECT schemaname||'.'||relname FROM pg_stat_user_tables + WHERE schemaname = 'pgmq' AND relname LIKE 'q_%' ORDER BY 1 + """, + "river": "SELECT 'public.river_job'", + "que": "SELECT 'public.que_jobs'", + # pgboss: include parent + all partitions of pgboss.job + archive + "pgboss": """ + SELECT n.nspname||'.'||c.relname + FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace + WHERE n.nspname = 'pgboss' + AND (c.relname IN ('job','archive') + OR c.relispartition + OR c.relkind = 'p') + ORDER BY 1 + """, + "pgmq-partitioned": """ + SELECT n.nspname||'.'||c.relname + FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace + WHERE n.nspname = 'pgmq' + AND (c.relname LIKE 'q_%' OR c.relname LIKE 'a_%') + AND (c.relispartition OR c.relkind = 'p' OR c.relkind = 'r') + ORDER BY 1 + """, +} + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--system", required=True) + ap.add_argument("--interval", type=int, default=30) + ap.add_argument("--duration", type=int, default=1800) + args = ap.parse_args() + + conn = psycopg2.connect(DSN); conn.autocommit = True + cur = conn.cursor() + + print("ts,table,n_live_tup,n_dead_tup,heap_bytes,toast_bytes,index_bytes,total_bytes", flush=True) + t_end = time.monotonic() + args.duration + while time.monotonic() < t_end: + ts = datetime.now(timezone.utc).isoformat() + try: + cur.execute(TABLES_QUERY[args.system]) + tables = [r[0] for r in cur.fetchall()] + for tbl in tables: + cur.execute(""" + SELECT + COALESCE((SELECT n_live_tup FROM pg_stat_user_tables + WHERE schemaname||'.'||relname = %s), 0), + COALESCE((SELECT n_dead_tup FROM pg_stat_user_tables + WHERE schemaname||'.'||relname = %s), 0), + COALESCE(pg_relation_size(%s::regclass), 0), + COALESCE(pg_table_size(%s::regclass) - pg_relation_size(%s::regclass), 0), + COALESCE(pg_indexes_size(%s::regclass), 0), + COALESCE(pg_total_relation_size(%s::regclass), 0) + """, (tbl, tbl, tbl, tbl, tbl, tbl, tbl)) + row = cur.fetchone() + print(f"{ts},{tbl},{row[0]},{row[1]},{row[2]},{row[3]},{row[4]},{row[5]}", flush=True) + except Exception as e: + print(f"# err: {e}", file=sys.stderr, flush=True) + time.sleep(args.interval) + + conn.close() + +if __name__ == "__main__": + main() diff --git a/benchmark/tooling/idle_in_tx.py b/benchmark/tooling/idle_in_tx.py new file mode 100644 index 00000000..16e0302b --- /dev/null +++ b/benchmark/tooling/idle_in_tx.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python3 +"""idle_in_tx.py — open a REPEATABLE READ transaction and hold xmin forever. +Brandur's recipe for reproducing the death spiral. Kill with SIGTERM. +""" +import time, signal, sys +import psycopg2 + +DSN = "host=127.0.0.1 dbname=bench user=postgres application_name=idle_in_tx" + +conn = psycopg2.connect(DSN) +conn.autocommit = False +cur = conn.cursor() +cur.execute("BEGIN ISOLATION LEVEL REPEATABLE READ") +cur.execute("SELECT 1") +print(f"idle_in_tx: holding xmin via pid=", conn.get_backend_pid(), flush=True) + +def shutdown(signum, frame): + try: + conn.rollback() + except: pass + sys.exit(0) + +signal.signal(signal.SIGTERM, shutdown) +signal.signal(signal.SIGINT, shutdown) +while True: + time.sleep(60) + # Keep the transaction alive with a trivial query + try: + cur.execute("SELECT 1") + except: shutdown(None, None) diff --git a/benchmark/tooling/microbench.sh b/benchmark/tooling/microbench.sh new file mode 100644 index 00000000..9feb032b --- /dev/null +++ b/benchmark/tooling/microbench.sh @@ -0,0 +1,33 @@ +#!/usr/bin/env bash +# Per-VM microbench: CPU (sysbench), memory (sysbench), disk (fio on NVMe) +set -uo pipefail +HOST=$(hostname) + +# Ensure tools installed +sudo apt-get install -q -y sysbench fio >/dev/null 2>&1 + +echo "=== HOST: $HOST at $(date -u +%FT%TZ) ===" + +echo "--- CPU (sysbench cpu --cpu-max-prime=20000 --threads=8 --time=10) ---" +sysbench cpu --cpu-max-prime=20000 --threads=8 --time=10 run 2>/dev/null | grep -E "events per second|total time|avg:|min:|max:" + +echo "" +echo "--- Memory (sysbench memory --memory-block-size=1M --memory-total-size=20G --threads=8 --time=10) ---" +sysbench memory --memory-block-size=1M --memory-total-size=20G --threads=8 --time=10 run 2>/dev/null | grep -E "transferred|MiB/sec|total time" + +echo "" +echo "--- FIO 4k randwrite iodepth=32 numjobs=4 runtime=15 on NVMe ---" +FIO_FILE=/mnt/pgdata/_microbench_fio_test +sudo fio --name=rand_write --ioengine=libaio --iodepth=32 --rw=randwrite --bs=4k --size=512M \ + --numjobs=4 --time_based --runtime=15 --filename=$FIO_FILE --direct=1 --group_reporting \ + --output-format=normal 2>/dev/null | grep -E "WRITE:|iops|bw" | head -5 +sudo rm -f $FIO_FILE + +echo "" +echo "--- FIO 1M seq write iodepth=8 runtime=10 on NVMe ---" +sudo fio --name=seq_write --ioengine=libaio --iodepth=8 --rw=write --bs=1M --size=2G \ + --numjobs=1 --time_based --runtime=10 --filename=$FIO_FILE --direct=1 --group_reporting \ + --output-format=normal 2>/dev/null | grep -E "WRITE:|iops|bw" | head -5 +sudo rm -f $FIO_FILE + +echo "=== DONE: $HOST ===" diff --git a/benchmark/tooling/parse_events_consumed.py b/benchmark/tooling/parse_events_consumed.py new file mode 100755 index 00000000..8afa831e --- /dev/null +++ b/benchmark/tooling/parse_events_consumed.py @@ -0,0 +1,169 @@ +#!/usr/bin/env python3 +"""parse_events_consumed.py — turn R6 NOTICE-instrumented consumer.log into +an honest events-delivered-per-second time series. + +Inputs (by convention, --bench-dir): + consumer.log : pgbench stdout+stderr merged; contains lines like + "NOTICE: ev ts= n=" + (and the usual pgbench progress + banner lines) + producer.log : pgbench producer output — used only to derive run t0 + when no NOTICE is present, or for sanity cross-checks. + pgss.csv : optional; queried at end of run; used for sanity check. + bloat.csv : optional; used by the bench-script post-processor only. + +Outputs (to --out-dir or --bench-dir): + events_consumed_per_sec.csv columns: second_since_start,events_consumed + events_consumed_summary.txt human summary + +Method: + - Primary: parse NOTICE lines. Group by second_since_start=(ts - t0). + - Fallback: if zero NOTICE lines found, emit a single row with total=0 + and log a warning (caller should switch to pgss sampling). + - t0: earliest NOTICE ts; or first timestamp in producer_agg log if present. + +Robust to: missing files, truncated runs, interleaved stderr from +multiple pgbench clients, oversized files (streamed line-by-line). +""" +from __future__ import annotations +import argparse, csv, os, re, sys +from collections import defaultdict +from pathlib import Path + +NOTICE_RE = re.compile(r"NOTICE:\s+ev\s+ts=(\d+)\s+n=(\d+)") +# pgbench aggregate log format: "start_ts.usec n_tx lat sqlat ..." +# but our producer --aggregate-interval=10 --log writes one row per interval. +AGG_FIRST_TS_RE = re.compile(r"^(\d{10})(?:\.\d+)?\s") + + +def parse_notice_log(path: Path) -> list[tuple[int, int]]: + """Return list of (ts_epoch_s, n_events) tuples from NOTICE lines.""" + out: list[tuple[int, int]] = [] + if not path.is_file(): + return out + with path.open("r", errors="replace") as f: + for line in f: + m = NOTICE_RE.search(line) + if m: + out.append((int(m.group(1)), int(m.group(2)))) + return out + + +def find_t0_from_producer(bench_dir: Path) -> int | None: + """Find earliest timestamp from producer_agg. file (pgbench aggregate log).""" + candidates = sorted(bench_dir.glob("producer_agg.*")) + for c in candidates: + try: + with c.open("r", errors="replace") as f: + for line in f: + m = AGG_FIRST_TS_RE.match(line) + if m: + return int(m.group(1)) + except OSError: + continue + return None + + +def bucket(events: list[tuple[int, int]], t0: int, width_s: int) -> dict[int, int]: + """Return {second_since_start_bucket_start: total_events}.""" + hist: dict[int, int] = defaultdict(int) + for ts, n in events: + s = ts - t0 + if s < 0: + s = 0 + bucket_start = (s // width_s) * width_s + hist[bucket_start] += n + return hist + + +def main() -> int: + ap = argparse.ArgumentParser() + ap.add_argument("--bench-dir", default="/tmp/bench", + help="Directory with consumer.log, producer.log, etc.") + ap.add_argument("--out-dir", default=None, + help="Where to write events_consumed_per_sec.csv (default: --bench-dir)") + ap.add_argument("--bucket", type=int, default=1, + help="Bucket width in seconds (1 or 10 typical).") + ap.add_argument("--system", default=None, + help="Informational: system name; recorded in summary.") + args = ap.parse_args() + + bench_dir = Path(args.bench_dir) + out_dir = Path(args.out_dir) if args.out_dir else bench_dir + out_dir.mkdir(parents=True, exist_ok=True) + + consumer_log = bench_dir / "consumer.log" + events = parse_notice_log(consumer_log) + + out_csv = out_dir / "events_consumed_per_sec.csv" + summary_txt = out_dir / "events_consumed_summary.txt" + + if not events: + # No NOTICE lines — emit empty CSV and a warning summary. + with out_csv.open("w", newline="") as f: + w = csv.writer(f) + w.writerow(["second_since_start", "events_consumed"]) + with summary_txt.open("w") as f: + f.write( + f"system={args.system or 'unknown'}\n" + f"bench_dir={bench_dir}\n" + f"method=NOTICE\n" + f"notice_lines_found=0\n" + "WARNING: no NOTICE events_consumed data — verify consumer.sql\n" + " is the R6 instrumented version. Fallback: use\n" + " pg_stat_statements snapshot (pgss_timeseries.csv)\n" + " or compute via producer_rate - queue_bloat_delta.\n" + ) + print(f"WARN: no NOTICE events in {consumer_log}; wrote empty {out_csv}", + file=sys.stderr) + return 0 + + # t0: prefer producer_agg earliest ts; else first NOTICE ts. + t0 = find_t0_from_producer(bench_dir) + if t0 is None: + t0 = min(e[0] for e in events) + + hist = bucket(events, t0, args.bucket) + + # Write CSV, sorted. + sec_keys = sorted(hist.keys()) + max_sec = sec_keys[-1] if sec_keys else 0 + + with out_csv.open("w", newline="") as f: + w = csv.writer(f) + w.writerow(["second_since_start", "events_consumed"]) + # Emit every bucket from 0..max, filling zeros for empty buckets — + # this produces a proper TS the plotter can consume without gaps. + for s in range(0, max_sec + args.bucket, args.bucket): + w.writerow([s, hist.get(s, 0)]) + + total_events = sum(n for _, n in events) + total_calls = len(events) + # Calls with n>0 (a.k.a. useful polls) + useful_calls = sum(1 for _, n in events if n > 0) + run_s = max_sec + args.bucket + avg_eps = total_events / run_s if run_s > 0 else 0.0 + + with summary_txt.open("w") as f: + f.write( + f"system={args.system or 'unknown'}\n" + f"bench_dir={bench_dir}\n" + f"method=NOTICE\n" + f"t0_epoch_s={t0}\n" + f"bucket_s={args.bucket}\n" + f"total_notice_lines={total_calls}\n" + f"useful_calls_n_gt_0={useful_calls}\n" + f"total_events_consumed={total_events}\n" + f"approx_run_duration_s={run_s}\n" + f"avg_events_per_sec={avg_eps:.1f}\n" + f"hit_rate_useful_over_total={useful_calls}/{total_calls}" + f" = {useful_calls / total_calls:.4f}\n" + f"csv={out_csv}\n" + ) + print(f"OK: wrote {out_csv} and {summary_txt}") + print(f" events={total_events} over ~{run_s}s -> {avg_eps:.1f} ev/s" + f" ({useful_calls}/{total_calls} useful polls)") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/benchmark/tooling/pg_stat_statements_snapshot.py b/benchmark/tooling/pg_stat_statements_snapshot.py new file mode 100755 index 00000000..2ade4722 --- /dev/null +++ b/benchmark/tooling/pg_stat_statements_snapshot.py @@ -0,0 +1,72 @@ +#!/usr/bin/env python3 +"""pg_stat_statements_snapshot.py — take periodic pg_stat_statements +snapshots. For each snapshot, record (ts, queryid, query_head, calls, rows, +total_exec_time_ms). Output appended as CSV — the downstream analyzer +diffs consecutive snapshots per queryid to obtain per-interval rates. + +Used as a cross-check for the NOTICE-based instrumentation, and as a +fallback for systems where the DO-block wrapper makes statement-level +rows invisible (pgque, pgq, pgmq). +""" +from __future__ import annotations +import argparse, csv, sys, time +from datetime import datetime, timezone +import psycopg2 + +DSN_DEFAULT = "host=127.0.0.1 dbname=bench user=postgres" + +SAMPLE_SQL = """ +SELECT + queryid::text, + left(regexp_replace(query, '\\s+', ' ', 'g'), 200) AS query_head, + calls, + rows, + total_exec_time::bigint AS total_exec_time_ms +FROM pg_stat_statements +WHERE query ~ ANY (ARRAY[ + 'next_batch', 'pgque\\.ticker', 'pgq\\.ticker', + 'pgmq\\.read', 'pgmq\\.delete', + 'river_job', 'que_jobs', 'pgboss\\.job' +]) +ORDER BY calls DESC +""" + + +def main() -> int: + ap = argparse.ArgumentParser() + ap.add_argument("--dsn", default=DSN_DEFAULT) + ap.add_argument("--interval", type=int, default=10) + ap.add_argument("--duration", type=int, default=5400) + ap.add_argument("--out", default="/tmp/bench/pgss_timeseries.csv") + args = ap.parse_args() + + conn = psycopg2.connect(args.dsn) + conn.autocommit = True + cur = conn.cursor() + + f = open(args.out, "w", newline="") + w = csv.writer(f) + w.writerow(["ts", "queryid", "query_head", "calls", "rows", "total_exec_time_ms"]) + f.flush() + + t_end = time.monotonic() + args.duration + try: + while time.monotonic() < t_end: + ts = datetime.now(timezone.utc).isoformat() + try: + cur.execute(SAMPLE_SQL) + rows = cur.fetchall() + for r in rows: + w.writerow([ts, *r]) + f.flush() + except Exception as e: # noqa: BLE001 + print(f"# pgss snap err: {e}", file=sys.stderr, flush=True) + time.sleep(args.interval) + finally: + f.close() + conn.close() + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/benchmark/tooling/pgq_ticker_daemon.py b/benchmark/tooling/pgq_ticker_daemon.py new file mode 100644 index 00000000..5829dae1 --- /dev/null +++ b/benchmark/tooling/pgq_ticker_daemon.py @@ -0,0 +1,42 @@ +#!/usr/bin/env python3 +"""pgq_ticker_daemon.py — tight loop calling pgq.ticker + pgq.maint_operations. +Replaces the 1-min pg_cron invocation to match pgque's inline cadence. +Run as: sudo -u postgres python3 /root/pgq_ticker_daemon.py +""" +import time, signal, sys +import psycopg2 + +DSN = "host=127.0.0.1 dbname=bench user=postgres application_name=pgq_ticker" +conn = psycopg2.connect(DSN); conn.autocommit = True +cur = conn.cursor() + +def shutdown(signum, frame): + try: conn.close() + except: pass + sys.exit(0) +signal.signal(signal.SIGTERM, shutdown) +signal.signal(signal.SIGINT, shutdown) + +print("pgq_ticker_daemon: starting loop", flush=True) +last_maint = 0 +while True: + try: + cur.execute("SELECT pgq.ticker()") + # Run maint_operations() periodically (every 5s) — each op in its own implicit tx + now = time.time() + if now - last_maint >= 5: + cur.execute("SELECT func_name, func_arg FROM pgq.maint_operations()") + for fn, arg in cur.fetchall(): + try: + if arg is None: + cur.execute(f"SELECT {fn}()") + else: + cur.execute(f"SELECT {fn}(%s)", (arg,)) + except Exception as me: + # Some operations are expected to fail (e.g. NOWAIT lock) — ignore + pass + last_maint = now + except Exception as e: + print(f"ticker err: {e}", file=sys.stderr, flush=True) + time.sleep(1) + time.sleep(1) # 1 Hz ticker diff --git a/benchmark/tooling/sys_metrics_sampler.py b/benchmark/tooling/sys_metrics_sampler.py new file mode 100644 index 00000000..4897e27b --- /dev/null +++ b/benchmark/tooling/sys_metrics_sampler.py @@ -0,0 +1,121 @@ +#!/usr/bin/env python3 +"""R7 v2 sys metrics sampler: CPU + RAM + per-device NVMe IOPS / MiB/s / latency. + +Columns: + ts_iso, + cpu_user_pct, cpu_system_pct, cpu_iowait_pct, cpu_idle_pct, + mem_total_mb, mem_used_mb, mem_buff_cache_mb, mem_available_mb, + disk_read_iops, disk_write_iops, + disk_read_mib_s, disk_write_mib_s, + disk_read_lat_ms, disk_write_lat_ms + +Reads /proc/stat, /proc/meminfo, /proc/diskstats. No external deps. +""" +import argparse, time, os +from datetime import datetime, timezone + +def read_proc_stat(): + with open('/proc/stat') as f: + fields = f.readline().split() + # fields: cpu user nice system idle iowait irq softirq steal guest guest_nice + vals = [int(x) for x in fields[1:]] + return { + 'user': vals[0] + vals[1], # user + nice + 'system': vals[2] + vals[5] + vals[6], # system + irq + softirq + 'iowait': vals[4], + 'idle': vals[3], + 'total': sum(vals), + } + +def read_meminfo(): + info = {} + with open('/proc/meminfo') as f: + for ln in f: + key, v = ln.split(':', 1) + info[key] = int(v.strip().split()[0]) # kB + total = info['MemTotal'] / 1024 + free = info['MemFree'] / 1024 + buff = info.get('Buffers', 0) / 1024 + cache = info.get('Cached', 0) / 1024 + info.get('SReclaimable', 0) / 1024 + avail = info.get('MemAvailable', free) / 1024 + used = total - avail + return total, used, buff + cache, avail + +def read_diskstat(device): + """Returns dict for named device, or None.""" + with open('/proc/diskstats') as f: + for ln in f: + parts = ln.split() + if len(parts) < 14: continue + if parts[2] != device: continue + return { + 'reads': int(parts[3]), + 'sectors_read': int(parts[5]), + 'read_ms': int(parts[6]), + 'writes': int(parts[7]), + 'sectors_writ': int(parts[9]), + 'write_ms': int(parts[10]), + } + return None + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument('--interval', type=int, default=10) + ap.add_argument('--duration', type=int, default=5400) + ap.add_argument('--device', default='nvme1n1') + ap.add_argument('--out', default='/tmp/bench/sys_metrics.csv') + args = ap.parse_args() + + os.makedirs(os.path.dirname(args.out), exist_ok=True) + fh = open(args.out, 'w', buffering=1) # line-buffered + fh.write('ts_iso,cpu_user_pct,cpu_system_pct,cpu_iowait_pct,cpu_idle_pct,' + 'mem_total_mb,mem_used_mb,mem_buff_cache_mb,mem_available_mb,' + 'disk_read_iops,disk_write_iops,disk_read_mib_s,disk_write_mib_s,' + 'disk_read_lat_ms,disk_write_lat_ms\n') + + prev_cpu = read_proc_stat() + prev_disk = read_diskstat(args.device) or {} + t_end = time.monotonic() + args.duration + + while time.monotonic() < t_end: + time.sleep(args.interval) + ts = datetime.now(timezone.utc).isoformat() + + # CPU delta + cur_cpu = read_proc_stat() + dt_total = cur_cpu['total'] - prev_cpu['total'] or 1 + cpu_user = 100.0 * (cur_cpu['user'] - prev_cpu['user']) / dt_total + cpu_sys = 100.0 * (cur_cpu['system'] - prev_cpu['system']) / dt_total + cpu_iow = 100.0 * (cur_cpu['iowait'] - prev_cpu['iowait']) / dt_total + cpu_idle = 100.0 * (cur_cpu['idle'] - prev_cpu['idle']) / dt_total + prev_cpu = cur_cpu + + mem_total, mem_used, mem_bc, mem_avail = read_meminfo() + + # Disk delta + cur_disk = read_diskstat(args.device) + if cur_disk and prev_disk: + dr = cur_disk['reads'] - prev_disk.get('reads', 0) + dw = cur_disk['writes'] - prev_disk.get('writes', 0) + d_sr = cur_disk['sectors_read'] - prev_disk.get('sectors_read', 0) + d_sw = cur_disk['sectors_writ'] - prev_disk.get('sectors_writ', 0) + d_rms = cur_disk['read_ms'] - prev_disk.get('read_ms', 0) + d_wms = cur_disk['write_ms'] - prev_disk.get('write_ms', 0) + iops_r = dr / args.interval + iops_w = dw / args.interval + mib_r = (d_sr * 512) / (args.interval * (1 << 20)) + mib_w = (d_sw * 512) / (args.interval * (1 << 20)) + lat_r = d_rms / dr if dr > 0 else 0 + lat_w = d_wms / dw if dw > 0 else 0 + else: + iops_r = iops_w = mib_r = mib_w = lat_r = lat_w = 0 + prev_disk = cur_disk or {} + + fh.write(f'{ts},{cpu_user:.2f},{cpu_sys:.2f},{cpu_iow:.2f},{cpu_idle:.2f},' + f'{mem_total:.1f},{mem_used:.1f},{mem_bc:.1f},{mem_avail:.1f},' + f'{iops_r:.1f},{iops_w:.1f},{mib_r:.3f},{mib_w:.3f},' + f'{lat_r:.3f},{lat_w:.3f}\n') + fh.close() + +if __name__ == '__main__': + main() From 18a11b71d6170ccea140423eec04c5fe68fd92a8 Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Wed, 29 Apr 2026 20:18:18 -0700 Subject: [PATCH 2/6] chore(benchmark): shell-style polish per CLAUDE.md - benchmark/runners/fix_nvme_mount.sh: switch to /usr/bin/env bash shebang; use set -Eeuo pipefail (was set -euo pipefail without -E) - benchmark/runners/run_r7.sh: add -Ee flags to existing pipefail - benchmark/runners/clean_reinstall.sh: read -r in two while-loops Co-Authored-By: Claude Opus 4.7 (1M context) --- benchmark/runners/clean_reinstall.sh | 4 ++-- benchmark/runners/fix_nvme_mount.sh | 4 ++-- benchmark/runners/run_r7.sh | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/benchmark/runners/clean_reinstall.sh b/benchmark/runners/clean_reinstall.sh index 30e54f53..e6d94213 100644 --- a/benchmark/runners/clean_reinstall.sh +++ b/benchmark/runners/clean_reinstall.sh @@ -28,13 +28,13 @@ case $SYS in pgque) sudo -u postgres psql -d bench -c "DROP SCHEMA IF EXISTS pgque CASCADE" >/dev/null # Also unschedule any leftover pg_cron jobs named pgque_* - sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgque%'" | while read j; do + sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgque%'" | while read -r j; do [ -n "$j" ] && sudo -u postgres psql -d bench -c "SELECT cron.unschedule('$j')" >/dev/null done ;; pgq) sudo -u postgres psql -d bench -c "DROP EXTENSION IF EXISTS pgq CASCADE; DROP SCHEMA IF EXISTS pgq CASCADE" >/dev/null - sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgq%'" | while read j; do + sudo -u postgres psql -d bench -Atc "SELECT jobname FROM cron.job WHERE jobname LIKE 'pgq%'" | while read -r j; do [ -n "$j" ] && sudo -u postgres psql -d bench -c "SELECT cron.unschedule('$j')" >/dev/null done ;; diff --git a/benchmark/runners/fix_nvme_mount.sh b/benchmark/runners/fix_nvme_mount.sh index 74bebc19..fb8a747d 100644 --- a/benchmark/runners/fix_nvme_mount.sh +++ b/benchmark/runners/fix_nvme_mount.sh @@ -1,6 +1,6 @@ -#!/bin/bash +#!/usr/bin/env bash # Format + mount NVMe instance store, move PG data dir to NVMe, re-setup symlink. -set -euo pipefail +set -Eeuo pipefail DEV=/dev/nvme1n1 MNT=/mnt/pgdata PG_DATA_SRC=/var/lib/postgresql/18/main diff --git a/benchmark/runners/run_r7.sh b/benchmark/runners/run_r7.sh index 44eb9386..0ecf4b59 100644 --- a/benchmark/runners/run_r7.sh +++ b/benchmark/runners/run_r7.sh @@ -6,7 +6,7 @@ # * pg_ash ASH + pgfr snapshots copied to CSV at end (schemas preserved) # # Same instrumented consumer NOTICE format as R6 for honest ev/s parsing. -set -uo pipefail +set -Eeuo pipefail SYS=${1:?system} DUR=${2:-5400} # 1.5h = 5400s From 916a4f42d50b95e94de6f82115e1c99fd2d78d6f Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Thu, 30 Apr 2026 02:26:17 -0700 Subject: [PATCH 3/6] chore(benchmark): anti-leak scrub + polish pass MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Remove all references to private internal URLs, round numbering (R4/R5/R6/R8), and private repository paths from benchmark/ files. - METHODOLOGY.md: drop internal URL + note IDs from header; remove internal posting-style section (§9); neutralize round refs; fix /tmp/bench_r path reference - README.md: drop internal reference comment - HARDWARE.md: fix binary units (GB→GiB, TB→TiB); drop R7 round ref - OPS_GOTCHAS.md: neutralize R4/R6 round refs in lessons; fix binary units (GB→GiB, MB/s→MiB/s) - consumers/*.sql: drop "R6 instrumented" prefix from all 7 files - runners/run_r7.sh: remove R6/R7 round refs from inline comments - tooling/sys_metrics_sampler.py: remove R7 from docstring - tooling/parse_events_consumed.py: remove R6 from docstring + msg - charts/r5_analyze.py, r6_smoke_chart.py: remove Rn from docstrings, chart titles, and file-size output (KB→KiB) PR description updated separately via gh pr edit. Co-Authored-By: Claude Sonnet 4.6 --- benchmark/HARDWARE.md | 10 ++--- benchmark/METHODOLOGY.md | 44 +++++++------------ benchmark/OPS_GOTCHAS.md | 14 +++--- benchmark/README.md | 6 +-- benchmark/charts/r5_analyze.py | 6 +-- benchmark/charts/r6_smoke_chart.py | 6 +-- benchmark/consumers/consumer_pgboss.sql | 2 +- .../consumers/consumer_pgmq-partitioned.sql | 2 +- benchmark/consumers/consumer_pgmq.sql | 2 +- benchmark/consumers/consumer_pgq.sql | 2 +- benchmark/consumers/consumer_pgque.sql | 2 +- benchmark/consumers/consumer_que.sql | 4 +- benchmark/consumers/consumer_river.sql | 2 +- benchmark/runners/run_r7.sh | 22 +++++----- benchmark/tooling/parse_events_consumed.py | 4 +- benchmark/tooling/sys_metrics_sampler.py | 2 +- 16 files changed, 57 insertions(+), 73 deletions(-) diff --git a/benchmark/HARDWARE.md b/benchmark/HARDWARE.md index e171b937..d525dd9f 100644 --- a/benchmark/HARDWARE.md +++ b/benchmark/HARDWARE.md @@ -5,10 +5,10 @@ | Spec | Value | |---|---| | vCPU | 8 (Intel Ice Lake Xeon 8375C, 2.9 GHz base / 3.5 GHz turbo) | -| RAM | 64 GB | -| NVMe instance store | 1 × 1.75 TB (physical attach, NVMe) | +| RAM | 64 GiB | +| NVMe instance store | 1 × 1.75 TiB (physical attach, NVMe) | | Network | Up to 12 Gbps | -| EBS (root) | 8 GB gp3 (Ubuntu 24.04 AMI default) | +| EBS (root) | 8 GiB gp3 (Ubuntu 24.04 AMI default) | | Spot price (us-east-2, 2026-04) | ~$0.20–0.30 / hour | | On-demand price | $0.686 / hour | @@ -27,7 +27,7 @@ Run via [tooling/microbench.sh](tooling/microbench.sh). Expected order-of-magnit | fio 4 k randwrite, 99 p latency | ~100 µs | | fio 4 k randread, QD=32, on NVMe | ~400 k IOPS | -*Actual R7 microbench numbers to be filled in from the R7 microbench pass.* +*Actual numbers to be filled in after running [tooling/microbench.sh](tooling/microbench.sh) on a fresh VM.* ## Postgres tuning (shared across all 7 VMs) @@ -66,6 +66,6 @@ listen_addresses = 'localhost' `synchronous_commit=off` is deliberate — queue workloads are almost always idempotent at the application layer, and the WAL-flush path is the dominant cost for low-latency producers. It's the only PG knob we touch that materially changes safety posture. -`jit=off` because 5-s JIT warmups on `DO` blocks dominated our first-transaction latency in R4. +`jit=off` because 5-s JIT warmups on `DO` blocks dominated first-transaction latency in early bench runs. `autovacuum_*_scale_factor=0.01` is aggressive on purpose — we want autovacuum attempting to clean every 1 % dead-tuple ratio so the held-xmin phase of the bench exposes the *inability* to vacuum, not a lazy autovacuum schedule. diff --git a/benchmark/METHODOLOGY.md b/benchmark/METHODOLOGY.md index a61898a4..79bbb6fc 100644 --- a/benchmark/METHODOLOGY.md +++ b/benchmark/METHODOLOGY.md @@ -1,13 +1,10 @@ # Bench methodology — the definitive reference -Adapted from GitLab issue [postgres-ai/postgresql-consulting/tests-and-benchmarks#77, note 3263767264](https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/issues/77#note_3263767264). - This document is the single source of truth for how the PostgreSQL-queue bench is structured, what it measures, and every script/config that makes it work. It is written so a reviewer can reproduce the whole thing end-to-end. Cross-links: - Upstream fix: [pgque PR #62](https://github.com/NikolayS/pgque/pull/62) · issue [NikolayS/pgque#61](https://github.com/NikolayS/pgque/issues/61) -- GitLab rounds: R4 final narrative — note 3262968456; R5 root — note 3263000815; R5 full writeup — note 3263438287; pgque rotation PoC under held xmin — note 3263417609. --- @@ -31,7 +28,7 @@ The Go/Ruby/Node workers are installed end-to-end so the schema is authentic, bu Workload shape (all runs): -- Producer: 1 client, pgbench `-R 1000` rate-cap (may move to `-R 5000` in R7/R8) +- Producer: 1 client, pgbench `-R 1000` rate-cap (full runs use `-R 5000`) - Consumer: 1 client for pgque/pgq, 4 clients for everything else - Three 30-minute phases back-to-back = **1.5 h per run**: 1. **Clean baseline** — no held xmin @@ -42,20 +39,20 @@ Workload shape (all runs): ## 2. Infrastructure -- AWS us-east-2, **i4i.2xlarge** (8 vCPU, 64 GB, NVMe instance store) +- AWS us-east-2, **i4i.2xlarge** (8 vCPU, 64 GiB, NVMe instance store) - Spot where available; on-demand only as last resort (see Section 10) - Ubuntu 24.04, **PG18 from PGDG**, `pg_cron`, `pg_stat_statements`, pg_ash, pgfr - Data dir moved to NVMe: `/mnt/pgdata/postgresql/18/main` symlink (see [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) and [OPS_GOTCHAS.md §1](OPS_GOTCHAS.md)) - One VM per system so any tuning / runtime / GC behaviour is contained - SSH key: `` (us-east-2) -Live R7 VMs (redacted): ``, ``, ``, ``, ``, ``, ``. +VM IPs use placeholder form: ``, ``, ``, ``, ``, ``, ``. --- ## 3. Observability stack -Seven parallel streams run during every bench. All CSVs land in `/tmp/bench/` and are rsynced to the local `/tmp/bench_r//` tree every 30 min. +Seven parallel streams run during every bench. All CSVs land in `/tmp/bench/` and are rsynced to the local `/tmp/bench//` tree every 30 min. ### (a) bloat_sampler.py — pg_stat_user_tables every 30 s @@ -85,7 +82,7 @@ COPY (SELECT * FROM pgfr_record.statement_snapshots) TO '/tmp/bench/pgfr_stateme Full pgfr_record schema on every VM includes: `snapshots`/`snapshots_v2` (partitioned), `table_snapshots(_v2)`, `statement_snapshots(_v2)`, `index_snapshots(_v2)`, `replication_snapshots(_v2)`, `vacuum_progress_snapshots(_v2)`, `activity_samples(_archive_v2)`, `lock_samples(_archive_v2)`, `config_snapshots`, `db_role_config_snapshots`. -### (d) sys_metrics_sampler.py — CPU / mem / disk every 10 s (R7+) +### (d) sys_metrics_sampler.py — CPU / mem / disk every 10 s Reads `/proc/stat`, `/proc/meminfo`, `/proc/diskstats` directly (psutil-optional). NVMe device is `nvme1n1` (the instance store, which is what `/mnt/pgdata` sits on). v2 adds per-device IOPS and latency columns. @@ -101,7 +98,7 @@ Source: [tooling/pg_stat_statements_snapshot.py](tooling/pg_stat_statements_snap Both producer and consumer run with per-10 s aggregate logs (min/max/sum/sumsq latency). Files: `producer_agg.` and `consumer_agg..` under `/tmp/bench/`. -### (g) Consumer NOTICE instrumentation (R6+) +### (g) Consumer NOTICE instrumentation Each instrumented `consumer.sql` is wrapped in a DO block that emits exactly one `RAISE NOTICE 'ev ts= n='` per call. This gives us an authoritative per-call consumed-events stream that is immune to the pgss DO-wrapper opacity problem. @@ -165,28 +162,18 @@ All producers are in [producers/](producers/). ## 8. Analysis and chart generation -- [charts/r5_analyze.py](charts/r5_analyze.py) — R5 full 2-panel chart (dead tuples + consumer latency, linear scale, no symlog — per [MEMORY rule](../CLAUDE.md) "Never use log/symlog on charts"). -- [charts/r6_smoke_chart.py](charts/r6_smoke_chart.py) — R6 smoke Solarized-Dark 2-panel chart (events/s + pgque per-table dead tuples). -- [gifs/r4_gif_v17_solarized.py](gifs/r4_gif_v17_solarized.py) — R4 dead-tuples animated GIF (7 systems, Solarized-Dark). -- [gifs/r4_gif_tps_solarized.py](gifs/r4_gif_tps_solarized.py) — R4 TPS/latency animated GIF. - ---- - -## 9. GitLab posting style - -- Threads open under `### header`; replies stay inside the same discussion. -- Every `
` block has **one blank line** between the opening tag and the triple-backtick fence, and **one blank line** between the closing fence and `
` — and another blank line after `
`. This is what makes them render correctly on GitLab. -- Every significant result embeds `![alt](/uploads/…)` — charts are first-class citizens. GIFs too. -- Post bodies via `curl --data-urlencode "body@/tmp/path"` to the discussions endpoint — new thread, not a reply under an existing one. +- [charts/r5_analyze.py](charts/r5_analyze.py) — full 2-panel chart (dead tuples + consumer latency, linear scale, no symlog). +- [charts/r6_smoke_chart.py](charts/r6_smoke_chart.py) — smoke Solarized-Dark 2-panel chart (events/s + pgque per-table dead tuples). +- [gifs/r4_gif_v17_solarized.py](gifs/r4_gif_v17_solarized.py) — dead-tuples animated GIF (7 systems, Solarized-Dark). +- [gifs/r4_gif_tps_solarized.py](gifs/r4_gif_tps_solarized.py) — TPS/latency animated GIF. --- -## 10. Cost discipline +## 9. Cost discipline - us-east-2 spot preferred (~$0.22/h); when spot is exhausted we hop to another region before going on-demand. -- At R7 time 2 VMs are on-demand (pgmq-partitioned and pgboss; pgque+que moved on-demand after R6 spot reclaim). -- **~$15 per 1.5 h bench round** (7 VMs). -- Total spend through R7 is ~$100. +- On-demand for the primary subject under test when data integrity matters most. +- **~$15 per 1.5 h bench run** (7 VMs). - Rsync every 30 min pulls `/tmp/bench/` to local, so a spot reclaim loses at most a partial phase. See [OPS_GOTCHAS.md §12](OPS_GOTCHAS.md) for spot-reclaim mitigations. @@ -199,8 +186,7 @@ See [OPS_GOTCHAS.md §12](OPS_GOTCHAS.md) for spot-reclaim mitigations. 2. `scp` per-system `install.sh`, `producer.sql`, instrumented `consumer.sql`, plus the samplers in [tooling/](tooling/). 3. `bash clean_reinstall.sh ` on each VM. 4. `bash run_r7.sh ` — 1.5 h, writes `/tmp/bench/*.csv` + `*.log`. -5. Rsync everything back to `/tmp/bench_r7//`. -6. Run [charts/r5_analyze.py](charts/r5_analyze.py) (or R7 successor) for the verdict table + 2-panel PNG. -7. Post results as a threaded reply with the `
`/blank-line style above. +5. Rsync everything back to `/tmp/bench//`. +6. Run [charts/r5_analyze.py](charts/r5_analyze.py) for the verdict table + 2-panel PNG. That is the whole rig. diff --git a/benchmark/OPS_GOTCHAS.md b/benchmark/OPS_GOTCHAS.md index 6c78d190..d2391822 100644 --- a/benchmark/OPS_GOTCHAS.md +++ b/benchmark/OPS_GOTCHAS.md @@ -6,7 +6,7 @@ Every operational lesson accumulated while running the pgque-vs-pgq-vs-pgmq-vs-r ## 1. AWS i4i.2xlarge: NVMe instance store is NOT auto-mounted -The Ubuntu 24.04 AMI boots with `/dev/nvme1n1` **present but not formatted, not mounted**. You MUST format and mount it BEFORE installing Postgres data dir, or the data goes to the 8 GB root EBS volume and the bench dies the moment you fill the page cache. +The Ubuntu 24.04 AMI boots with `/dev/nvme1n1` **present but not formatted, not mounted**. You MUST format and mount it BEFORE installing Postgres data dir, or the data goes to the 8 GiB root EBS volume and the bench dies the moment you fill the page cache. ```bash sudo mkfs.xfs /dev/nvme1n1 @@ -15,7 +15,7 @@ sudo mount -o noatime,nodiratime /dev/nvme1n1 /mnt/pgdata sudo chown postgres:postgres /mnt/pgdata ``` -If PG is installed first (default path `/var/lib/postgresql/18/main`), data goes to the 8 GB root EBS → disaster under bench load. +If PG is installed first (default path `/var/lib/postgresql/18/main`), data goes to the 8 GiB root EBS → disaster under bench load. **Symlink pattern** after moving data: @@ -34,7 +34,7 @@ See [runners/fix_nvme_mount.sh](runners/fix_nvme_mount.sh) for the recovery proc Default `/tmp` is on the root disk. pgbench `--log-prefix=/tmp/bench/producer_agg` + `consumer.log` with 17 k NOTICE/s + `bloat.csv` + `sys_metrics.csv` all write there. -At bench rates this is 1–5 MB/s of root-disk I/O. Not catastrophic but adds noise you can see in the sys_metrics disk panel of the non-subject VM. +At bench rates this is 1–5 MiB/s of root-disk I/O. Not catastrophic but adds noise you can see in the sys_metrics disk panel of the non-subject VM. **Fix:** @@ -57,7 +57,7 @@ If investigating slow queries later, enable `log_statement='all'` + `logging_col ## 4. Clean-slate reset without losing adjacent schemas -`DROP SCHEMA pgque CASCADE` can cascade into objects it doesn't own if a function somewhere references pgque. R6 lost the `ash` schema this way on the pgque VM. +`DROP SCHEMA pgque CASCADE` can cascade into objects it doesn't own if a function somewhere references pgque. This happened during testing — the `ash` schema was silently dropped on the pgque VM. **Safer pattern for pgque:** `TRUNCATE` all `pgque.*` tables instead of drop-and-recreate. If a reinstall is actually needed: @@ -105,7 +105,7 @@ pg_partman is installed in schema `public` (not `partman` as the docs sometimes Default `premake=4` works for bench (4 future 5-min partitions = 20 min buffer, enough for 1.5 h bench given 1-min maintenance cron cadence). -R6 tried `premake=20` = 24 partitions steady-state, which killed pgmq-partitioned consumer perf (525 TPS vs 6621 with premake=4) due to planner cost across all partitions per query. +`premake=20` (24 partitions steady-state) killed pgmq-partitioned consumer perf (525 TPS vs 6621 with premake=4) due to planner cost across all partitions per query. `infinite_time_partitions=true` is needed for the maintenance job to keep creating future partitions indefinitely: @@ -190,8 +190,8 @@ If you see this on a VM that wasn't freshly installed from the PR #62 branch, re i4i.2xlarge spot price is ~$0.20–0.30/h vs on-demand $0.686/h. The us-east-2 spot market is reliable for short bench runs but DOES reclaim. -- R4 lost `pgmq-partitioned` spot mid-run. -- R6 lost `pgque` AND `que` spots between runs (hours-long windows). +- `pgmq-partitioned` spot was reclaimed mid-run during one bench session. +- `pgque` AND `que` spots were reclaimed between runs (hours-long windows) in another. **Mitigation:** on-demand for the primary subject under test (pgque for PR #62 work). Cost delta: ~$3–4 per 7-VM 1.5 h run. Worth it for the subject you're landing a PR on; leave the comparison set on spot. diff --git a/benchmark/README.md b/benchmark/README.md index 0ef65520..beaa7ce9 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -44,7 +44,7 @@ ASH and pgfr output (`ash.csv`, `pgfr_snapshots.csv`) land in `/tmp/bench/` post ``` benchmark/ README.md # this file - METHODOLOGY.md # full methodology (from GitLab #77) + METHODOLOGY.md # full methodology OPS_GOTCHAS.md # operational lessons HARDWARE.md # VM + PG tuning tooling/ @@ -85,8 +85,8 @@ benchmark/ install_pgboss.sh # note: pgque + que driven from AMI userdata charts/ - r5_analyze.py # R5 2-panel chart - r6_smoke_chart.py # R6 smoke Solarized-Dark chart + r5_analyze.py # 2-panel chart (dead tuples + consumer latency) + r6_smoke_chart.py # smoke Solarized-Dark chart (events/s + dead tuples) gifs/ r4_gif_v17_solarized.py # dead-tuples animated GIF r4_gif_tps_solarized.py # TPS/latency animated GIF diff --git a/benchmark/charts/r5_analyze.py b/benchmark/charts/r5_analyze.py index 7a5b76aa..212dd2c4 100644 --- a/benchmark/charts/r5_analyze.py +++ b/benchmark/charts/r5_analyze.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -"""R5 full — verdict table + 2-panel chart (dead tuples + consumer latency).""" +"""r5_analyze — verdict table + 2-panel chart (dead tuples + consumer latency).""" import csv, re from pathlib import Path from datetime import datetime @@ -128,10 +128,10 @@ def fmt_lat(v,_): ax2.yaxis.set_major_formatter(FuncFormatter(fmt_lat)) ax2.set_ylabel("consumer latency (mean)", labelpad=8) -fig.text(0.08, 0.97, "R5 full — 7 systems · 1.5 h · held xmin horizon for 30 min", +fig.text(0.08, 0.97, "pgque bench — 7 systems · 1.5 h · held xmin horizon for 30 min", ha="left", fontsize=13, fontweight="bold") fig.text(0.08, 0.948, "producer -R 1000, pgbench --aggregate-interval=10 --log · consumer latency clipped at 2s", ha="left", fontsize=9, color="#666") fig.savefig("/tmp/r5_full_chart.png", dpi=110, bbox_inches="tight") -print(f"\nwrote /tmp/r5_full_chart.png ({Path('/tmp/r5_full_chart.png').stat().st_size/1024:.0f} KB)") +print(f"\nwrote /tmp/r5_full_chart.png ({Path('/tmp/r5_full_chart.png').stat().st_size/1024:.0f} KiB)") diff --git a/benchmark/charts/r6_smoke_chart.py b/benchmark/charts/r6_smoke_chart.py index e455cc8a..12d40f07 100644 --- a/benchmark/charts/r6_smoke_chart.py +++ b/benchmark/charts/r6_smoke_chart.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -"""R6 combined smoke chart — events consumed/s + pgque table dead tuples.""" +"""r6_smoke_chart — events consumed/s + pgque table dead tuples.""" import csv, re from pathlib import Path from datetime import datetime @@ -114,10 +114,10 @@ ax2.set_ylim(0, max(50, max([max(s[2]) for s in tbl_series.values() if s[2]], default=10)*1.15)) # Titles -fig.text(0.09, 0.965, "R6 combined smoke — pgque PR #62 rotation + events-consumed instrumentation", +fig.text(0.09, 0.965, "pgque smoke — PR #62 rotation + events-consumed instrumentation", ha="left", fontsize=12, fontweight="bold", color=FG_EMPH) fig.text(0.09, 0.94, "20 min (5m clean + 10m held xmin + 5m recovery). Event-rate: producer was rate-capped at 1000 ev/s.", ha="left", fontsize=9, color=FG_DIM) fig.savefig(OUT, dpi=110, bbox_inches="tight", facecolor=BG) -print(f"wrote {OUT} ({OUT.stat().st_size/1024:.0f} KB)") +print(f"wrote {OUT} ({OUT.stat().st_size/1024:.0f} KiB)") diff --git a/benchmark/consumers/consumer_pgboss.sql b/benchmark/consumers/consumer_pgboss.sql index 514c3b0a..479a961d 100644 --- a/benchmark/consumers/consumer_pgboss.sql +++ b/benchmark/consumers/consumer_pgboss.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for pg-boss (SKIP LOCKED + DELETE USING on partitioned job table). +-- Instrumented consumer for pg-boss (SKIP LOCKED + DELETE USING on partitioned job table). -- pg-boss 'default' queue, state='created'. DO $$ DECLARE diff --git a/benchmark/consumers/consumer_pgmq-partitioned.sql b/benchmark/consumers/consumer_pgmq-partitioned.sql index 962fcd7b..20b48cd8 100644 --- a/benchmark/consumers/consumer_pgmq-partitioned.sql +++ b/benchmark/consumers/consumer_pgmq-partitioned.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for pgmq-partitioned (identical SQL to pgmq; +-- Instrumented consumer for pgmq-partitioned (identical SQL to pgmq; -- the partitioning is on the schema side only). DO $$ DECLARE diff --git a/benchmark/consumers/consumer_pgmq.sql b/benchmark/consumers/consumer_pgmq.sql index 51c730f0..52e842d5 100644 --- a/benchmark/consumers/consumer_pgmq.sql +++ b/benchmark/consumers/consumer_pgmq.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for pgmq (read-then-delete). +-- Instrumented consumer for pgmq (read-then-delete). -- pgmq.delete is a function body so pg_stat_statements can't see row counts; -- NOTICE is the authoritative source. DO $$ diff --git a/benchmark/consumers/consumer_pgq.sql b/benchmark/consumers/consumer_pgq.sql index df251f8c..536c28f7 100644 --- a/benchmark/consumers/consumer_pgq.sql +++ b/benchmark/consumers/consumer_pgq.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for pgq (identical to pgque; schema swap only). +-- Instrumented consumer for pgq (identical to pgque; schema swap only). DO $$ DECLARE b bigint; diff --git a/benchmark/consumers/consumer_pgque.sql b/benchmark/consumers/consumer_pgque.sql index 4f6a4cd6..989b4f62 100644 --- a/benchmark/consumers/consumer_pgque.sql +++ b/benchmark/consumers/consumer_pgque.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for pgque (batch ticker model). +-- Instrumented consumer for pgque (batch ticker model). -- Preserves original behavior; adds one NOTICE per call for honest-events/s. -- Format: NOTICE: ev ts= n= DO $$ diff --git a/benchmark/consumers/consumer_que.sql b/benchmark/consumers/consumer_que.sql index 14b188f8..3498ab5b 100644 --- a/benchmark/consumers/consumer_que.sql +++ b/benchmark/consumers/consumer_que.sql @@ -1,5 +1,5 @@ --- R6 instrumented consumer for que (Ruby Que gem; SKIP LOCKED + DELETE). --- Mirror R5 shape (que_jobs with default queue and priority ordering). +-- Instrumented consumer for que (Ruby Que gem; SKIP LOCKED + DELETE). +-- que_jobs with default queue and priority ordering. DO $$ DECLARE n int := 0; diff --git a/benchmark/consumers/consumer_river.sql b/benchmark/consumers/consumer_river.sql index d9be6a3c..727c9463 100644 --- a/benchmark/consumers/consumer_river.sql +++ b/benchmark/consumers/consumer_river.sql @@ -1,4 +1,4 @@ --- R6 instrumented consumer for river (SKIP LOCKED + DELETE USING). +-- Instrumented consumer for river (SKIP LOCKED + DELETE USING). -- Wrapped in DO block so NOTICE carries the count. -- Raw DELETE rowcount is ALSO visible in pg_stat_statements via the -- wrapped statement (total row count over the run — cross-check). diff --git a/benchmark/runners/run_r7.sh b/benchmark/runners/run_r7.sh index 0ecf4b59..9f15582d 100644 --- a/benchmark/runners/run_r7.sh +++ b/benchmark/runners/run_r7.sh @@ -1,18 +1,16 @@ #!/usr/bin/env bash # run_r7.sh [duration_s] -# R7: 1.5h bench (30m clean + 30m idle-in-tx + 30m recovery) -# Adds over R6: -# * sys_metrics_sampler.py running alongside bloat_sampler.py -# * pg_ash ASH + pgfr snapshots copied to CSV at end (schemas preserved) -# -# Same instrumented consumer NOTICE format as R6 for honest ev/s parsing. +# run_r7.sh: 1.5h bench (30m clean + 30m idle-in-tx + 30m recovery) +# sys_metrics_sampler.py runs alongside bloat_sampler.py. +# pg_ash ASH + pgfr snapshots copied to CSV at end (schemas preserved). +# Consumer NOTICE format: NOTICE: ev ts= n= set -Eeuo pipefail SYS=${1:?system} DUR=${2:-5400} # 1.5h = 5400s R7_DIR=${R7_DIR:-/tmp/r7} R6_DIR=${R6_DIR:-/tmp/r6} -# Consumer SQL may be in either /tmp/consumer.sql (pushed), /tmp/r7/consumer_.sql, or /tmp/r6/consumer_.sql +# Consumer SQL may be in either /tmp/consumer.sql (pushed), $R7_DIR/consumer_.sql, or $R6_DIR/consumer_.sql for c in "$R7_DIR/consumer_${SYS}.sql" "/tmp/consumer.sql" "$R6_DIR/consumer_${SYS}.sql"; do if [[ -f "$c" ]]; then CONSUMER_SQL="$c"; break; fi done @@ -46,11 +44,11 @@ sudo -u postgres psql -d bench -c "SELECT pg_stat_statements_reset()" >/dev/null python3 /tmp/bloat_sampler.py --system "$SYS" --interval 30 --duration "$DUR" > /tmp/bench/bloat.csv & BLOAT_PID=$! -# NEW in R7: sys_metrics_sampler.py — CPU/mem/disk every 10s, NVMe instance-store device +# sys_metrics_sampler.py — CPU/mem/disk every 10s, NVMe instance-store device python3 "$R7_DIR/sys_metrics_sampler.py" --interval 10 --duration "$DUR" --device nvme1n1 --out /tmp/bench/sys_metrics.csv > /tmp/bench/sys_metrics.log 2>&1 & SYSM_PID=$! -# pgss snapshot cross-check (carried over from R6) +# pgss snapshot cross-check python3 "$R6_DIR/pg_stat_statements_snapshot.py" \ --dsn "host=127.0.0.1 dbname=bench user=postgres" \ --interval 10 --duration "$DUR" \ @@ -58,14 +56,14 @@ python3 "$R6_DIR/pg_stat_statements_snapshot.py" \ > /tmp/bench/pgss_snapshotter.log 2>&1 & PGSS_PID=$! -# Producer — R7 full: -R 5000 (was 1000 in smoke/R6) +# Producer — full run: -R 5000 pgbench -h 127.0.0.1 -U postgres -d bench -n -f /tmp/producer.sql \ -c 1 -j 1 -R 5000 -T "$DUR" -P 30 \ --aggregate-interval=10 --log --log-prefix=/tmp/bench/producer_agg \ > /tmp/bench/producer.log 2>&1 & PROD_PID=$! -# Consumer (R6 instrumented SQL) +# Consumer (NOTICE-instrumented SQL) pgbench -h 127.0.0.1 -U postgres -d bench -n -f "$CONSUMER_SQL" \ -c $CONS_C -j $CONS_C -T "$DUR" -P 30 \ --aggregate-interval=10 --log --log-prefix=/tmp/bench/consumer_agg \ @@ -112,4 +110,4 @@ python3 "$R6_DIR/parse_events_consumed.py" \ --system "$SYS" \ > /tmp/bench/events_consumed_parse.log 2>&1 || true -echo "=== R7 done: $SYS ===" +echo "=== bench done: $SYS ===" diff --git a/benchmark/tooling/parse_events_consumed.py b/benchmark/tooling/parse_events_consumed.py index 8afa831e..da8b6b51 100755 --- a/benchmark/tooling/parse_events_consumed.py +++ b/benchmark/tooling/parse_events_consumed.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -"""parse_events_consumed.py — turn R6 NOTICE-instrumented consumer.log into +"""parse_events_consumed.py — turn NOTICE-instrumented consumer.log into an honest events-delivered-per-second time series. Inputs (by convention, --bench-dir): @@ -109,7 +109,7 @@ def main() -> int: f"method=NOTICE\n" f"notice_lines_found=0\n" "WARNING: no NOTICE events_consumed data — verify consumer.sql\n" - " is the R6 instrumented version. Fallback: use\n" + " uses NOTICE instrumentation. Fallback: use\n" " pg_stat_statements snapshot (pgss_timeseries.csv)\n" " or compute via producer_rate - queue_bloat_delta.\n" ) diff --git a/benchmark/tooling/sys_metrics_sampler.py b/benchmark/tooling/sys_metrics_sampler.py index 4897e27b..e2f562e7 100644 --- a/benchmark/tooling/sys_metrics_sampler.py +++ b/benchmark/tooling/sys_metrics_sampler.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -"""R7 v2 sys metrics sampler: CPU + RAM + per-device NVMe IOPS / MiB/s / latency. +"""sys_metrics_sampler.py v2: CPU + RAM + per-device NVMe IOPS / MiB/s / latency. Columns: ts_iso, From d586e3510cafc08a17611e9ca9d761257453b2c3 Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Thu, 30 Apr 2026 02:35:48 -0700 Subject: [PATCH 4/6] chore(benchmark): address REV r1 findings (anti-leak + style) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Remove private WI refs from bootstrap.sh:41,44 (replace with "see methodology notes") - Fix set -Eeuo pipefail in 7 shell scripts that only had partial flags - Fix broken OPS_GOTCHAS.md:185 link (install_pgque.sh → install/README.md) - Fix binary unit in install_pgboss.sh:2 (GB → GiB) - Fix run_r7.sh tool paths: resolve from benchmark/tooling/ by default instead of hardcoded /tmp/r7 and /tmp/r6; override via R7_DIR/R6_DIR Co-Authored-By: Claude Sonnet 4.6 --- benchmark/OPS_GOTCHAS.md | 2 +- benchmark/install/bootstrap.sh | 6 +++--- benchmark/install/install_pgboss.sh | 4 ++-- benchmark/install/install_pgmq.sh | 2 +- benchmark/install/install_pgq.sh | 2 +- benchmark/install/install_river.sh | 2 +- benchmark/runners/clean_reinstall.sh | 2 +- benchmark/runners/run_r7.sh | 9 +++++++-- benchmark/tooling/microbench.sh | 2 +- 9 files changed, 18 insertions(+), 13 deletions(-) diff --git a/benchmark/OPS_GOTCHAS.md b/benchmark/OPS_GOTCHAS.md index d2391822..1dcc40b6 100644 --- a/benchmark/OPS_GOTCHAS.md +++ b/benchmark/OPS_GOTCHAS.md @@ -182,7 +182,7 @@ ERROR: cannot cast type xid8 to bigint **Fix:** `::text::bigint` as intermediate cast. Landed in [PR #62](https://github.com/NikolayS/pgque/pull/62). -If you see this on a VM that wasn't freshly installed from the PR #62 branch, re-run [install/install_pgque.sh](install/install_pgque.sh). +If you see this on a VM that wasn't freshly installed from the PR #62 branch, re-clone the branch and run `make USE_PGXS=1 install` followed by `psql -f build/pgque.sql` per [install/README.md](install/README.md). --- diff --git a/benchmark/install/bootstrap.sh b/benchmark/install/bootstrap.sh index 84c6c8ad..ace2e7e2 100644 --- a/benchmark/install/bootstrap.sh +++ b/benchmark/install/bootstrap.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -set -euo pipefail +set -Eeuo pipefail export DEBIAN_FRONTEND=noninteractive echo "=== [$(hostname)] bootstrap start $(date -u +%FT%TZ) ===" @@ -38,10 +38,10 @@ sudo ln -sfn /tmp/pgfr/_analyze/sql /analyze_sql sudo mkdir -p /data sudo chown postgres:postgres /data -# ── postgresql.conf tuning (same as issue #77) ─────────────────────────────── +# ── postgresql.conf tuning (see methodology notes) ─────────────────────────── sudo tee -a /etc/postgresql/18/main/postgresql.conf >/dev/null <<'CONF' -# ── pgq bench tuning (issue #77) ───────────────────────────────────────────── +# ── pgq bench tuning (see methodology notes) ───────────────────────────────── shared_preload_libraries = 'pg_stat_statements,pg_cron' cron.database_name = 'bench' diff --git a/benchmark/install/install_pgboss.sh b/benchmark/install/install_pgboss.sh index 5a04d458..615dd4d5 100644 --- a/benchmark/install/install_pgboss.sh +++ b/benchmark/install/install_pgboss.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash -# pgboss install with covering index for polling query (avoids 7.9 GB disk sort pathology). -set -euo pipefail +# pgboss install with covering index for polling query (avoids 7.9 GiB disk sort pathology). +set -Eeuo pipefail echo "=== install pg-boss v12.15.0 + Node worker + covering index ===" curl -fsSL https://deb.nodesource.com/setup_20.x | sudo bash - >/dev/null 2>&1 diff --git a/benchmark/install/install_pgmq.sh b/benchmark/install/install_pgmq.sh index 0b9a802e..a6736f8f 100644 --- a/benchmark/install/install_pgmq.sh +++ b/benchmark/install/install_pgmq.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -set -euo pipefail +set -Eeuo pipefail echo "=== install pgmq v1.11.0 (SQL-only) ===" curl -fsSL \ diff --git a/benchmark/install/install_pgq.sh b/benchmark/install/install_pgq.sh index f8ab8823..3cb86799 100644 --- a/benchmark/install/install_pgq.sh +++ b/benchmark/install/install_pgq.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -set -euo pipefail +set -Eeuo pipefail echo "=== install pgq v3.5.1 ===" sudo rm -rf /tmp/pgq diff --git a/benchmark/install/install_river.sh b/benchmark/install/install_river.sh index afdf8ee2..897e10d2 100644 --- a/benchmark/install/install_river.sh +++ b/benchmark/install/install_river.sh @@ -1,5 +1,5 @@ #!/usr/bin/env bash -set -euo pipefail +set -Eeuo pipefail echo "=== install river v0.34.0 + Go worker ===" # i4i.2xlarge is x86_64 diff --git a/benchmark/runners/clean_reinstall.sh b/benchmark/runners/clean_reinstall.sh index e6d94213..1fbdf1a9 100644 --- a/benchmark/runners/clean_reinstall.sh +++ b/benchmark/runners/clean_reinstall.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash # Per-system clean reinstall + stats reset. -set -uo pipefail +set -Eeuo pipefail SYS=${1:?system} echo "=== [$(date -u +%FT%TZ)] clean reinstall: $SYS ===" diff --git a/benchmark/runners/run_r7.sh b/benchmark/runners/run_r7.sh index 9f15582d..b39243c5 100644 --- a/benchmark/runners/run_r7.sh +++ b/benchmark/runners/run_r7.sh @@ -4,12 +4,17 @@ # sys_metrics_sampler.py runs alongside bloat_sampler.py. # pg_ash ASH + pgfr snapshots copied to CSV at end (schemas preserved). # Consumer NOTICE format: NOTICE: ev ts= n= +# +# Tool resolution: scripts in benchmark/tooling/ are used by default. +# Override with R7_DIR or R6_DIR env vars if you need /tmp/r7 or /tmp/r6 copies. set -Eeuo pipefail SYS=${1:?system} DUR=${2:-5400} # 1.5h = 5400s -R7_DIR=${R7_DIR:-/tmp/r7} -R6_DIR=${R6_DIR:-/tmp/r6} +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +TOOLING_DIR="${SCRIPT_DIR}/../tooling" +R7_DIR=${R7_DIR:-"${TOOLING_DIR}"} +R6_DIR=${R6_DIR:-"${TOOLING_DIR}"} # Consumer SQL may be in either /tmp/consumer.sql (pushed), $R7_DIR/consumer_.sql, or $R6_DIR/consumer_.sql for c in "$R7_DIR/consumer_${SYS}.sql" "/tmp/consumer.sql" "$R6_DIR/consumer_${SYS}.sql"; do if [[ -f "$c" ]]; then CONSUMER_SQL="$c"; break; fi diff --git a/benchmark/tooling/microbench.sh b/benchmark/tooling/microbench.sh index 9feb032b..4d1d5978 100644 --- a/benchmark/tooling/microbench.sh +++ b/benchmark/tooling/microbench.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash # Per-VM microbench: CPU (sysbench), memory (sysbench), disk (fio on NVMe) -set -uo pipefail +set -Eeuo pipefail HOST=$(hostname) # Ensure tools installed From 0aa0d41c34e6283f64301cc7552ce7068c87bc40 Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Thu, 30 Apr 2026 03:15:45 -0700 Subject: [PATCH 5/6] docs(benchmark): correct three methodology claims - logging_collector=off does not mean zero log I/O; journald still writes to disk (#123) - premake=20 planner cost is first-query-in-session, not per-query; root cause of steady-state regression is a follow-up (#124) - add RAISE NOTICE observer-effect caveat for high-frequency use (#127) Co-Authored-By: Claude Sonnet 4.6 --- benchmark/OPS_GOTCHAS.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/benchmark/OPS_GOTCHAS.md b/benchmark/OPS_GOTCHAS.md index 1dcc40b6..9fe1375b 100644 --- a/benchmark/OPS_GOTCHAS.md +++ b/benchmark/OPS_GOTCHAS.md @@ -49,7 +49,7 @@ Now observability output lands on NVMe alongside PG data. ## 3. `logging_collector=off` means server logs go to systemd journal -With `logging_collector=off` (our default), server logs go to the systemd journal, not a file on disk. Means zero PG log I/O during bench — good for the throughput numbers. +We run with `logging_collector=off`, so server logs go to systemd-journald (and from there to disk) rather than a PG-managed log file. Log I/O is small relative to WAL + queue table writes at the bench's TPS, but it is not zero — quantifying its share is a follow-up (#123). If investigating slow queries later, enable `log_statement='all'` + `logging_collector=on` with `log_directory` inside the NVMe data dir (relative path `'log'` resolves correctly since `data_dir` is symlinked to NVMe). @@ -105,7 +105,7 @@ pg_partman is installed in schema `public` (not `partman` as the docs sometimes Default `premake=4` works for bench (4 future 5-min partitions = 20 min buffer, enough for 1.5 h bench given 1-min maintenance cron cadence). -`premake=20` (24 partitions steady-state) killed pgmq-partitioned consumer perf (525 TPS vs 6621 with premake=4) due to planner cost across all partitions per query. +`premake=20` (24 partitions steady-state) collapsed pgmq-partitioned consumer perf to 525 TPS vs 6621 with premake=4. The dominant cost in PG's per-partition planning is the first-query-in-session penalty (Postgres caches the plan for subsequent queries in the same connection). Why this still hurt the consumer at steady state is a follow-up (#124) — likely short-lived connections (or PgBouncer transaction-mode recycling) paying the penalty repeatedly. `infinite_time_partitions=true` is needed for the maintenance job to keep creating future partitions indefinitely: @@ -241,6 +241,8 @@ pgbench faithfully prints NOTICE lines to stdout (captured as `consumer.log`). P Why NOTICE rather than pgss: DO-wrappers hide per-statement rows from `pg_stat_statements`, so the pgss queryid reported for a consumer is the DO wrapper as a whole — not the SELECT/DELETE inside it — which makes it useless for counting consumed events when the internal shape differs across systems. NOTICE is schema-neutral and authoritative. +Caveat: `RAISE NOTICE` itself has observer effect (server→client protocol message + log subsystem write per fire). The choice was pragmatic for boundary events; high-frequency measurements would benefit from named functions + `pg_stat_statements` instead of DO-wrappers (#127). + --- ## Summary checklist From dbfdfebe16694f29e0f2c387a6d703ae77954446 Mon Sep 17 00:00:00 2001 From: Nik Samokhvalov Date: Thu, 30 Apr 2026 03:19:37 -0700 Subject: [PATCH 6/6] docs(benchmark): drop incorrect PgBouncer speculation Co-Authored-By: Claude Sonnet 4.6 --- benchmark/OPS_GOTCHAS.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/benchmark/OPS_GOTCHAS.md b/benchmark/OPS_GOTCHAS.md index 9fe1375b..fcc62c05 100644 --- a/benchmark/OPS_GOTCHAS.md +++ b/benchmark/OPS_GOTCHAS.md @@ -105,7 +105,7 @@ pg_partman is installed in schema `public` (not `partman` as the docs sometimes Default `premake=4` works for bench (4 future 5-min partitions = 20 min buffer, enough for 1.5 h bench given 1-min maintenance cron cadence). -`premake=20` (24 partitions steady-state) collapsed pgmq-partitioned consumer perf to 525 TPS vs 6621 with premake=4. The dominant cost in PG's per-partition planning is the first-query-in-session penalty (Postgres caches the plan for subsequent queries in the same connection). Why this still hurt the consumer at steady state is a follow-up (#124) — likely short-lived connections (or PgBouncer transaction-mode recycling) paying the penalty repeatedly. +`premake=20` (24 partitions steady-state) collapsed pgmq-partitioned consumer perf to 525 TPS vs 6621 with premake=4. The dominant cost in PG's per-partition planning is the first-query-in-session penalty (Postgres caches the plan for subsequent queries in the same connection). Why this still hurt the consumer at steady state is a follow-up (#124) — the bench used direct connections, so the actual cause needs measurement (e.g., consumer reconnect frequency, cached-plan invalidation, or per-query overhead distinct from initial planning). `infinite_time_partitions=true` is needed for the maintenance job to keep creating future partitions indefinitely: