From df83ceb4aaf76302dbaa06d0588c5e043c017798 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 May 2026 19:54:50 +0000 Subject: [PATCH 1/2] docs: draft delayed-delivery rotation blueprint Two-table TRUNCATE rotation for pgque.delayed_events modeled on PgQ's event_N pattern. Removes DELETE/UPDATE/VACUUM from the delayed-delivery hot path so send_at() can serve as a primary write path without dead-tuple churn or catalog growth. --- blueprints/DELAYED_ROTATION.md | 361 +++++++++++++++++++++++++++++++++ 1 file changed, 361 insertions(+) create mode 100644 blueprints/DELAYED_ROTATION.md diff --git a/blueprints/DELAYED_ROTATION.md b/blueprints/DELAYED_ROTATION.md new file mode 100644 index 00000000..7ec69fbb --- /dev/null +++ b/blueprints/DELAYED_ROTATION.md @@ -0,0 +1,361 @@ +# Delayed-delivery rotation + +Blueprint version: `0.1-draft.1` + +## Change log + +| Version | Date | Notes | +|---|---|---| +| `0.1-draft.1` | 2026-05-19 | First draft. Two-table TRUNCATE rotation for `pgque.delayed_events`, modeled on PgQ's `event_N` rotation. Replaces the current single-heap design before promotion of `send_at()` out of experimental. | + +## Goal + +Make `pgque.send_at()` viable as a primary write path, not just a sparingly +used scheduling hook. + +The experimental design in `sql/experimental/delayed.sql` stores every +scheduled message in a single `pgque.delayed_events` heap, INSERTs on +`send_at()`, and DELETEs on delivery. That works while delayed delivery is a +rare path, but it is the exact MVCC-on-hot-path pattern PgQ exists to avoid. +At throughputs where `send_at()` becomes the main write path (scheduled +notifications, reminders, "fire at T+N" workloads), dead-tuple churn and +index bloat would put pgque squarely back into the same operational hole +that pg_boss / Delayed::Job / typical SKIP LOCKED queues already occupy. + +This blueprint specifies a replacement storage scheme that: + +- never DELETEs and never UPDATEs rows on the hot path +- never requires `VACUUM` to keep up with delivery throughput +- never grows the pg_class / pg_inherits catalog +- keeps `send_at()` as a pure INSERT +- delivers events at the granularity of the maint tick (no bucket-width + latency floor) + +## Non-goals + +- No new C extension and no `shared_preload_libraries` dependency. +- No change to `pgque.send()` / `pgque.receive()` / `pgque.ack()` / + `pgque.nack()` API. +- No change to the PgQ-style `event_N` rotation for in-flight messages. +- No cancellation API in 0.2 (see "Open questions"). + +## Why not native range partitioning + +Range-partitioning `pgque.delayed_events` on `de_deliver_at` (one partition +per minute / hour, drop-and-create on rotation) does work and was the first +shape considered. It has two costs the two-table scheme avoids: + +1. **Catalog churn.** A 1-minute bucket with year-long horizons is ~525k + live partition rows over a year and a steady stream of `create table` / + `drop table` DDL on the maintenance path. Planning time on the parent + scales with partition count. +2. **Latency floor.** An event at the start of a bucket waits for the + entire bucket window to close before it can be delivered without + reintroducing per-row DELETE-on-delivery (which is what breaks the + no-bloat invariant). At 1-minute buckets that is up to ~60 s extra + delivery latency. + +Both costs are avoidable. PgQ already solved this exact "deliver a stream +of events without per-row MVCC churn" problem for `event_N`, and the +same solution applies here. + +## Design + +Two physical tables, `pgque.delayed_events_a` and `pgque.delayed_events_b`, +each with the same columns as today's `pgque.delayed_events`. At any +moment one is the **drainer** (rows scheduled for the current rotation +window) and the other is the **future** table (rows scheduled for the next +window or beyond). Roles are stored in a one-row state table and flipped +at rotation. + +```text + send_at(t) + | + v ++----------+ +----------+ +| drainer | | future | +| (today) | | (tomorrow+) | ++----------+ +----------+ + | ^ + | scan rows | re-INSERT rows whose + | with | actual_deliver_at falls + | actual_ | beyond the next window + | deliver_at | + | <= now() | + v | + insert_event() ----+ +``` + +The rotation period (the meaning of "today") is configurable. Default is +24 h aligned to UTC midnight. Per-queue overrides are out of scope for +0.2. + +### Tables + +```sql +create table pgque.delayed_events_a ( + de_id bigserial primary key, + de_queue_name text not null, + de_deliver_at timestamptz not null, + de_type text, + de_data text, + de_extra1 text, + de_extra2 text, + de_extra3 text, + de_extra4 text +); + +create index de_a_deliver_idx + on pgque.delayed_events_a (de_deliver_at); + +-- delayed_events_b is identical +``` + +The parent name `pgque.delayed_events` is kept as a view that unions the +two tables, so introspection (`select * from pgque.delayed_events`) still +shows every scheduled row. + +### State + +```sql +create table pgque.delayed_state ( + singleton boolean primary key default true, + drainer_table regclass not null, + future_table regclass not null, + window_start timestamptz not null, + window_end timestamptz not null, + deliver_watermark timestamptz not null, + constraint delayed_state_singleton check (singleton) +); +``` + +`deliver_watermark` advances during the drain scan. `window_end` is the +boundary that decides "drainer vs future" for new inserts and triggers +rotation. The `singleton` column with a CHECK constraint guarantees at most +one row. + +### `send_at()` + +```text +function send_at(queue, type, payload, deliver_at): + if deliver_at <= now(): + return insert_event(queue, type, payload) + + state = select_for_share from delayed_state + if deliver_at <= state.window_end: + target = state.drainer_table + else: + target = state.future_table + + insert into target (de_queue_name, de_deliver_at, de_type, de_data, ...) + values (queue, deliver_at, type, payload, ...) + + return currval('pgque.delayed_events__de_id_seq') +``` + +Routing is decided at INSERT time against the current window boundary. +No partition pruning at planning time, no dynamic SQL, just a single +INSERT into one of two named tables. + +The shared lock on `delayed_state` is light (no row modification) and +serializes only against rotation, which acquires an exclusive lock for +its single transaction. + +### Drain scan (called from `pgque.maint()`) + +```text +function maint_deliver_delayed(): + begin + state = select * from delayed_state for update + for row in select * from drainer + where actual_deliver_at > state.deliver_watermark + and actual_deliver_at <= now() + order by actual_deliver_at + limit drain_batch_size + loop + insert_event(row.de_queue_name, row.de_type, row.de_data, ...) + end loop + update delayed_state set deliver_watermark = now() + commit +``` + +Key invariant: the watermark advance and the `insert_event()` calls are in +the same transaction. Either the batch is delivered AND the watermark +advanced, or neither. + +The watermark is monotonic within a rotation window and reset on rotation. + +### Rotation + +Triggered when `now() >= state.window_end`. Single transaction: + +```text +function maint_rotate_delayed(): + begin + state = select * from delayed_state for update + if now() < state.window_end: + return -- another worker rotated us + + -- final drain pass for anything scheduled before window_end + drain(state.drainer_table) + + -- carry far-future rows from the future table forward + new_window_end = state.window_end + rotation_interval + insert into state.drainer_table (...) + select ... from state.future_table + where de_deliver_at > new_window_end + + -- the drainer is wiped; far-future rows we just inserted survive + -- in the future table, which is about to become the new drainer + truncate state.drainer_table + + -- swap roles + update delayed_state set + drainer_table = state.future_table, + future_table = state.drainer_table, + window_start = state.window_end, + window_end = new_window_end, + deliver_watermark = state.window_end + commit +``` + +A row scheduled inside the next window survives unchanged in the (newly +named) drainer table. A row scheduled beyond the next window was re-INSERTed +into the (newly named) future table before truncate, and will be carried +forward again at the next rotation if its delivery time is still beyond +that window. + +If the system is paused longer than one rotation interval, `maint_rotate_delayed()` +must loop until `now() < window_end`. Each iteration is its own transaction so +catch-up does not hold a long lock. + +## Properties + +1. **No DELETE on hot path.** Drain writes only to event tables (via + `insert_event()`) and to the state row. +2. **No UPDATE on a delayed-event row.** State row is updated; row data + is immutable from INSERT to TRUNCATE. +3. **No VACUUM dependence.** All reclamation is via TRUNCATE. +4. **Stable catalog.** Two tables, one state table, one view. Forever. +5. **Delivery latency.** Bounded by `maint()` cadence, not by rotation + interval. A row scheduled for `now() + 5 s` delivers on the next tick + after `now() + 5 s`. +6. **Write amplification.** A row scheduled `N` rotation intervals out incurs + one extra INSERT per rotation crossed. For 24 h rotation and a 30-day-out + message: ~30 INSERTs total. For 1 h rotation: ~720. +7. **Crash safety.** Drain, rotation, and watermark advance are each a + single transaction. Recovery is automatic. +8. **Concurrent safety.** `select for update` on the singleton state row + serializes rotation; drain holds it for at most one batch. + +## Tradeoffs + +### Write amplification under long horizons + +A workload that mixes mostly-soon delivery with a small population of +multi-month scheduled messages pays write-amp on the long-tail rows only, +once per rotation. Concretely: at 24 h rotation, 1M scheduled-far-future +messages cause 1M extra INSERTs per day, batched into the rotation +transaction. That is acceptable; it can be chunked if the single TX is +too large in practice. The hot path (soon-to-deliver messages) is +untouched. + +### Rotation interval choice + +24 h (UTC midnight) is the default. Two alternative pickers worth +mentioning: + +- **Shorter rotation (e.g., 1 h or 6 h)** reduces the "future" table's + steady-state population at the cost of higher write amplification on + long-horizon rows. +- **Adaptive rotation** based on observed long-horizon population. Out of + scope for 0.2. Mention only. + +### Comparison vs partition+drop + +| | this design | declarative partition + drop | +|---|---|---| +| catalog rows | 2 tables, 1 view, 1 state row, stable | partition per bucket, churns | +| hot-path DDL | none | `create` / `drop` per bucket | +| `DELETE` / `UPDATE` on rows | none | none | +| `VACUUM` dependence | none | none | +| latency floor | maint tick | bucket width | +| write amplification | `1 + rotations_crossed` per row | 1× always | +| code complexity | low (2 tables + state) | medium (partition manager) | + +Both designs satisfy the no-bloat goal. The two-table scheme wins on +catalog stability, hot-path DDL, and latency floor. It loses on write +amplification for long-horizon rows. + +## Open questions + +1. **Cancellation.** `pgque.cancel_scheduled(de_id)` would need to either + DELETE (breaks the invariant) or push the cancelled id onto a tombstone + side-table that the drain scan filters against. Recommend deferring to + a follow-up blueprint. Stays out of 0.2. +2. **Rotation cadence configurability.** 0.2 ships UTC-midnight 24 h + rotation, hard-coded. Per-queue rotation interval would need a column + on `pgque.queue` and a more elaborate state model (per queue, not + singleton). +3. **Far-future migration chunking.** Default is one big `insert ... select` + inside the rotation transaction. Above some threshold (say 100k rows) + it should split into chunks. Threshold is a tuning knob; default + probably 50k. +4. **View vs. parent table.** The public `pgque.delayed_events` name is + exposed as a view here. Alternative: keep the existing parent name as + the active drainer (via a synonym-ish mechanism). View is simpler and + stable; reconsider only if introspection latency turns out to matter. +5. **`pg_cron` integration.** Rotation must run shortly after + `window_end`. The default pgque ticker calls `maint()` frequently + enough that lazy rotation (triggered on the first maint after + `now() >= window_end`) is sufficient and does not need a dedicated + pg_cron job. Document this. + +## Implementation plan + +Sequenced TDD slices on the branch +`claude/check-delayed-messages-IGSpF`: + +1. **Failing acceptance test.** Extend `tests/acceptance/us4_delayed_delivery.sql` + with a "no dead tuples after N deliveries" assertion against + `pg_stat_user_tables.n_dead_tup` on both tables. Existing test must + continue to pass. +2. **Schema + state.** Add `pgque.delayed_events_a`, `pgque.delayed_events_b`, + `pgque.delayed_state`, and the `pgque.delayed_events` view. Drop the + old `pgque.delayed_events` table inside the same migration (experimental + file, no preserved data needed). +3. **`send_at()` rewrite.** Route INSERTs by window boundary. +4. **`maint_deliver_delayed()` rewrite.** Watermark-driven drain. +5. **`maint_rotate_delayed()` and `maint()` wrapper.** Lazy rotation + call sequenced before the drain pass. +6. **Bench harness.** Reuse `benchmark/` to measure delivery rate and dead + tuple counts under sustained `send_at()` load. Capture numbers in the + PR description. +7. **Docs.** Update `docs/reference.md` experimental section. Promote + `send_at()` documentation into `docs/tutorial.md` / `docs/examples.md` + only after benchmarks confirm the no-bloat property. + +## Acceptance criteria + +- `pg_stat_user_tables.n_dead_tup` on both `pgque.delayed_events_a` and + `pgque.delayed_events_b` remains `0` (modulo state-table updates) after + any number of `send_at` / drain / rotation cycles. +- `pg_class` row count contributed by delayed-delivery objects stays + constant across one full week of simulated traffic. +- `send_at()` followed by `select pgque.maint()` after the scheduled time + produces exactly one event in the target queue (no duplicates, no + losses) under concurrent producers and concurrent maint workers. +- A row scheduled `N` rotation intervals out is delivered exactly once on + the maint pass following its `de_deliver_at` and incurs exactly `N+1` + total row writes to delayed tables. +- Pausing the ticker for `> 1` rotation interval and resuming does not + lose, duplicate, or reorder any scheduled rows. + +## Out of scope + +- Cancellation API. +- Per-queue rotation cadence. +- Migration from a populated 0.1 / 0.2-pre-rotation `delayed_events` + table. The experimental file's existing table is dropped on upgrade. + The first stable release of `send_at()` will need a real migration + story; that lives in its own blueprint. From 92254b8b17fea2d9606e69fa8c11e10b12ae3a0e Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 20 May 2026 17:02:15 +0000 Subject: [PATCH 2/2] docs: make cancellation first-class in delayed-rotation blueprint cancel_scheduled() is the single exception to the no-DELETE invariant. Dead tuples it produces are bounded above by one rotation interval and TRUNCATEd away on the next rotation that targets the affected table. The bloat-free property of the design is therefore conditional on a low cancellation rate. That caveat is required to appear: - in the SQL comment on cancel_scheduled() - in docs/reference.md next to the function entry - in tutorials / examples that introduce cancellation - in the send_at() function comment For very-high-cancel workloads, docs must steer users to the application-level "skip on consume" pattern instead. --- blueprints/DELAYED_ROTATION.md | 194 +++++++++++++++++++++++++++------ 1 file changed, 160 insertions(+), 34 deletions(-) diff --git a/blueprints/DELAYED_ROTATION.md b/blueprints/DELAYED_ROTATION.md index 7ec69fbb..4f572d9a 100644 --- a/blueprints/DELAYED_ROTATION.md +++ b/blueprints/DELAYED_ROTATION.md @@ -1,11 +1,12 @@ # Delayed-delivery rotation -Blueprint version: `0.1-draft.1` +Blueprint version: `0.1-draft.2` ## Change log | Version | Date | Notes | |---|---|---| +| `0.1-draft.2` | 2026-05-19 | Promote cancellation to first-class. `cancel_scheduled()` uses a row DELETE; document that the bloat-free property is conditional on low cancellation rate and that high-cancel workloads should be discouraged in docs and function comments. | | `0.1-draft.1` | 2026-05-19 | First draft. Two-table TRUNCATE rotation for `pgque.delayed_events`, modeled on PgQ's `event_N` rotation. Replaces the current single-heap design before promotion of `send_at()` out of experimental. | ## Goal @@ -24,12 +25,20 @@ that pg_boss / Delayed::Job / typical SKIP LOCKED queues already occupy. This blueprint specifies a replacement storage scheme that: -- never DELETEs and never UPDATEs rows on the hot path +- never DELETEs and never UPDATEs rows on the **delivery** hot path - never requires `VACUUM` to keep up with delivery throughput - never grows the pg_class / pg_inherits catalog - keeps `send_at()` as a pure INSERT - delivers events at the granularity of the maint tick (no bucket-width latency floor) +- supports out-of-band cancellation via `pgque.cancel_scheduled(de_id)` + as an explicit, low-volume DELETE path with documented caveats + +The bloat-free property holds when cancellation is rare relative to +delivery. The design does not attempt to make cancellation itself +bloat-free; high-cancel workloads (e.g., > a few percent of scheduled +messages cancelled) are explicitly out of the supported envelope and +must be called out in user-facing docs and function comments. ## Non-goals @@ -37,7 +46,9 @@ This blueprint specifies a replacement storage scheme that: - No change to `pgque.send()` / `pgque.receive()` / `pgque.ack()` / `pgque.nack()` API. - No change to the PgQ-style `event_N` rotation for in-flight messages. -- No cancellation API in 0.2 (see "Open questions"). +- No support for high-cancel workloads. `cancel_scheduled()` exists, + but the no-bloat guarantee is contingent on cancellation being a + rare exception, not a routine path (see "Cancellation"). ## Why not native range partitioning @@ -229,27 +240,125 @@ If the system is paused longer than one rotation interval, `maint_rotate_delayed must loop until `now() < window_end`. Each iteration is its own transaction so catch-up does not hold a long lock. +### Cancellation + +```sql +create or replace function pgque.cancel_scheduled(i_de_id bigint) +returns boolean as $$ +declare + rows_deleted integer; +begin + delete from pgque.delayed_events_a where de_id = i_de_id; + get diagnostics rows_deleted = row_count; + if rows_deleted > 0 then + return true; + end if; + + delete from pgque.delayed_events_b where de_id = i_de_id; + get diagnostics rows_deleted = row_count; + return rows_deleted > 0; +end; +$$ language plpgsql security definer set search_path = pgque, pg_catalog; +``` + +Cancellation is the **only** path that performs a row-level DELETE +against `pgque.delayed_events_a` / `pgque.delayed_events_b`. The +DELETE produces a dead tuple in whichever physical table holds the +row. That dead tuple is reclaimed at the next TRUNCATE (when the +table next rotates out as the drainer-becoming-future), so the +bloat is bounded above by **(cancellation rate) × (rotation interval)**. + +Concretely: with default 24 h rotation, dead tuples from cancels +accumulate for at most one rotation cycle before being TRUNCATEd +away with the rest of the table. Autovacuum may also reclaim them +sooner; nothing in the design depends on that. + +The bloat-free guarantee is therefore conditional: + +> `pgque.send_at()` is bloat-free **provided cancellation is rare**. +> A workload that cancels a large fraction of scheduled messages +> (rule of thumb: more than a few percent) is outside the supported +> envelope; consider an application-level "skip on consume" pattern +> instead, where messages are delivered and then dropped by the +> consumer based on application state. + +This caveat must appear: + +- in the SQL comment on `pgque.cancel_scheduled()` itself +- in `docs/reference.md` next to the function entry +- in any tutorial / example that introduces cancellation + +Behavioral notes: + +- Cancellation by `de_id` is O(1) given the primary-key lookup; the + function tries the drainer first (likely target for soon-to-deliver + rows), then the future table. +- A cancellation that races a delivery scan can lose: if the drain + transaction commits first, the event has already been delivered + and `cancel_scheduled()` returns `false`. The function's return + value indicates "row was still pending and is now gone." + Cancellation has no effect on already-delivered events; the + consumer-side `ack`/`nack` flow remains the only post-delivery + control. +- Concurrent cancellations of the same `de_id` are safe — the second + DELETE simply matches zero rows. +- `cancel_scheduled()` does not take any state-table locks; it + contends only with the row it targets. + ## Properties -1. **No DELETE on hot path.** Drain writes only to event tables (via - `insert_event()`) and to the state row. +1. **No DELETE on the delivery hot path.** Drain writes only to event + tables (via `insert_event()`) and to the state row. The only + row-level DELETE on delayed tables is `cancel_scheduled()`, which + is an off-hot-path exception (see "Cancellation"). 2. **No UPDATE on a delayed-event row.** State row is updated; row data - is immutable from INSERT to TRUNCATE. -3. **No VACUUM dependence.** All reclamation is via TRUNCATE. + is immutable from INSERT to either delivery or cancellation. +3. **VACUUM-independent under nominal use.** All reclamation on the + delivery path is via TRUNCATE. Cancellation introduces dead tuples, + but they are bounded by one rotation interval and TRUNCATEd away; + `VACUUM` only matters if the cancellation rate is high enough to + pressure the table between rotations. 4. **Stable catalog.** Two tables, one state table, one view. Forever. 5. **Delivery latency.** Bounded by `maint()` cadence, not by rotation interval. A row scheduled for `now() + 5 s` delivers on the next tick after `now() + 5 s`. -6. **Write amplification.** A row scheduled `N` rotation intervals out incurs - one extra INSERT per rotation crossed. For 24 h rotation and a 30-day-out - message: ~30 INSERTs total. For 1 h rotation: ~720. -7. **Crash safety.** Drain, rotation, and watermark advance are each a - single transaction. Recovery is automatic. +6. **Write amplification.** A row scheduled `N` rotation intervals out + incurs one extra INSERT per rotation crossed. For 24 h rotation and a + 30-day-out message: ~30 INSERTs total. For 1 h rotation: ~720. +7. **Crash safety.** Drain, rotation, watermark advance, and + cancellation are each a single transaction. Recovery is automatic. 8. **Concurrent safety.** `select for update` on the singleton state row serializes rotation; drain holds it for at most one batch. + Cancellation does not touch the state row. ## Tradeoffs +### Conditional bloat-free guarantee + +The no-VACUUM-on-hot-path property holds for workloads where cancellation +is rare. A workload that cancels a large fraction of `send_at()` rows +(say > a few percent) pushes dead tuples into the physical tables faster +than rotation can TRUNCATE them away. In that regime, autovacuum has to +keep up — at which point the design has lost its primary advantage over +the original single-heap. + +This is an explicit, documented contract, not a hidden limitation: + +- The SQL comment on `cancel_scheduled()` warns about the cancellation + rate. +- `docs/reference.md` repeats the caveat next to the function entry and + in the "When to use `send_at()`" preamble. +- Tutorials and examples that introduce cancellation must include the + caveat inline. +- Function comment on `send_at()` mentions that bloat-free behavior + assumes cancellation is exceptional. + +For workloads with very high cancel rates, the documented alternative is +the application-level "skip on consume" pattern: always `send_at()`, +always deliver, and let the consumer drop the message based on +application state at receive time. That keeps cancels off the storage +layer entirely. + ### Write amplification under long horizons A workload that mixes mostly-soon delivery with a small population of @@ -277,39 +386,40 @@ mentioning: |---|---|---| | catalog rows | 2 tables, 1 view, 1 state row, stable | partition per bucket, churns | | hot-path DDL | none | `create` / `drop` per bucket | -| `DELETE` / `UPDATE` on rows | none | none | -| `VACUUM` dependence | none | none | +| `DELETE` / `UPDATE` on rows | only `cancel_scheduled()` | only `cancel_scheduled()` | +| `VACUUM` dependence | none (assuming low cancel rate) | none (assuming low cancel rate) | | latency floor | maint tick | bucket width | | write amplification | `1 + rotations_crossed` per row | 1× always | | code complexity | low (2 tables + state) | medium (partition manager) | -Both designs satisfy the no-bloat goal. The two-table scheme wins on -catalog stability, hot-path DDL, and latency floor. It loses on write -amplification for long-horizon rows. +Both designs satisfy the no-bloat goal under the same cancellation +assumption. The two-table scheme wins on catalog stability, hot-path +DDL, and latency floor. It loses on write amplification for +long-horizon rows. ## Open questions -1. **Cancellation.** `pgque.cancel_scheduled(de_id)` would need to either - DELETE (breaks the invariant) or push the cancelled id onto a tombstone - side-table that the drain scan filters against. Recommend deferring to - a follow-up blueprint. Stays out of 0.2. -2. **Rotation cadence configurability.** 0.2 ships UTC-midnight 24 h +1. **Rotation cadence configurability.** 0.2 ships UTC-midnight 24 h rotation, hard-coded. Per-queue rotation interval would need a column on `pgque.queue` and a more elaborate state model (per queue, not singleton). -3. **Far-future migration chunking.** Default is one big `insert ... select` +2. **Far-future migration chunking.** Default is one big `insert ... select` inside the rotation transaction. Above some threshold (say 100k rows) it should split into chunks. Threshold is a tuning knob; default probably 50k. -4. **View vs. parent table.** The public `pgque.delayed_events` name is +3. **View vs. parent table.** The public `pgque.delayed_events` name is exposed as a view here. Alternative: keep the existing parent name as the active drainer (via a synonym-ish mechanism). View is simpler and stable; reconsider only if introspection latency turns out to matter. -5. **`pg_cron` integration.** Rotation must run shortly after +4. **`pg_cron` integration.** Rotation must run shortly after `window_end`. The default pgque ticker calls `maint()` frequently enough that lazy rotation (triggered on the first maint after `now() >= window_end`) is sufficient and does not need a dedicated pg_cron job. Document this. +5. **High-cancel observability.** Should `pgque.queue_stats()` expose + a cancel-rate metric so users can detect when they have drifted out + of the supported envelope? Cheap to add; defer to the implementation + PR. ## Implementation plan @@ -328,18 +438,30 @@ Sequenced TDD slices on the branch 4. **`maint_deliver_delayed()` rewrite.** Watermark-driven drain. 5. **`maint_rotate_delayed()` and `maint()` wrapper.** Lazy rotation call sequenced before the drain pass. -6. **Bench harness.** Reuse `benchmark/` to measure delivery rate and dead - tuple counts under sustained `send_at()` load. Capture numbers in the - PR description. -7. **Docs.** Update `docs/reference.md` experimental section. Promote - `send_at()` documentation into `docs/tutorial.md` / `docs/examples.md` - only after benchmarks confirm the no-bloat property. +6. **`cancel_scheduled()`.** Add the function, the SQL comment with the + rate caveat, and an acceptance test that asserts both the + "row-gone-after-cancel" success case and the "no-op-after-delivery" + race case. +7. **Bench harness.** Reuse `benchmark/` to measure delivery rate and dead + tuple counts under sustained `send_at()` load, plus a separate run + under sustained `send_at()` + steady cancel rate to validate the + bounded-bloat claim. Capture numbers in the PR description. +8. **Docs.** Update `docs/reference.md` experimental section, including + the cancellation caveat in both the `cancel_scheduled()` entry and + the `send_at()` preamble. Promote `send_at()` documentation into + `docs/tutorial.md` / `docs/examples.md` only after benchmarks + confirm the no-bloat property under nominal (low-cancel) workload. ## Acceptance criteria - `pg_stat_user_tables.n_dead_tup` on both `pgque.delayed_events_a` and - `pgque.delayed_events_b` remains `0` (modulo state-table updates) after - any number of `send_at` / drain / rotation cycles. + `pgque.delayed_events_b` remains `0` after any number of `send_at` / + drain / rotation cycles with no cancellations. +- With a low cancellation rate (≤ 1% of scheduled rows), dead tuples + on either physical table never exceed the count of cancellations + performed since that table last served as the rotation TRUNCATE + target, and return to `0` after the next rotation that moves through + it. - `pg_class` row count contributed by delayed-delivery objects stays constant across one full week of simulated traffic. - `send_at()` followed by `select pgque.maint()` after the scheduled time @@ -348,12 +470,16 @@ Sequenced TDD slices on the branch - A row scheduled `N` rotation intervals out is delivered exactly once on the maint pass following its `de_deliver_at` and incurs exactly `N+1` total row writes to delayed tables. +- `cancel_scheduled(id)` against a pending row returns `true` and the + row is not subsequently delivered. Against an already-delivered or + non-existent row, returns `false` and is a no-op. - Pausing the ticker for `> 1` rotation interval and resuming does not lose, duplicate, or reorder any scheduled rows. ## Out of scope -- Cancellation API. +- High-cancel workloads as a supported use case (see "Cancellation" + and "Tradeoffs"). - Per-queue rotation cadence. - Migration from a populated 0.1 / 0.2-pre-rotation `delayed_events` table. The experimental file's existing table is dropped on upgrade.