Background
The security review identified two High-severity DoS findings (D-1, D-2) and a P0 recommendation to add rate limiting / quotas to df.start(). This issue tracks the analysis and a proposed implementation.
Why rate-limit df.start()
df.start() is the only entry point that creates new durable workflow instances. Every call:
- Inserts rows into
df.instances and df.nodes (storage cost — graphs can be large for df.seq/df.loop/df.parallel).
- Enqueues a work item into
duroxide.* tables that the background worker must poll and execute.
- Eventually causes the worker to open a per-user sqlx connection (pool size = 5) and run user SQL / outbound HTTP.
Because the call is cheap for the caller but expensive downstream, an unprivileged user (or a buggy app) can trivially:
- Exhaust disk by mass-inserting graphs into
df.nodes (no quota today).
- Starve the worker pool — only 5 per-user connections exist; one user flooding
df.start() blocks every other user's instances from making progress (noisy-neighbor DoS).
- Saturate outbound HTTP indirectly (since
df.start is the gate to df.http() execution) — relevant to D-3.
- Inflate duroxide history tables — every retry/replay amplifies the storage cost.
In a single-tenant deployment this is "just" a self-inflicted footgun, but the threat model treats database roles as a trust boundary (RLS, per-user connections), so one role abusing `df.start()` violates that isolation by denying service to others.
Proposed Implementation
Two complementary limits, both enforced inside `df_start()` in `src/dsl.rs` before the instance is enqueued (so a rejected start never touches duroxide). All counts use the captured `submitted_by`/`login_role` identity (already unforgeable per S-2). Superuser bypasses both checks (consistent with I-8).
1. Concurrency cap — `df.max_concurrent_per_user`
GUC, default e.g. `100`. On each `df.start()`:
```sql
SELECT count(*) FROM df.instances
WHERE submitted_by = $1
AND lower(status) IN ('pending','running');
```
If `>= max_concurrent_per_user`, raise:
```
ERROR: df.start rejected: user % has % active instances (limit %)
HINT: wait for in-flight instances to complete or raise df.max_concurrent_per_user
```
Use SPI with `Spi::get_one_with_args` (same parameterized pattern as `df.status`/`df.result`, per T-1).
2. Storage / lifetime quota — `df.max_instances_per_user`
GUC, default e.g. `10000`. Counts all rows in `df.instances` for that user (including completed) so a user can't accumulate unbounded history. Same SPI pattern, different predicate (no status filter). Cleanup is the user's responsibility via a future `df.purge()` — document this in the error hint.
3. (Optional, future) Token-bucket rate (calls/sec)
A true rate limit (vs. concurrency limit) needs shared state across backends. Two options:
- In-memory shared map via `pg_shmem_init_hook` — `HashMap<Oid, (last_refill, tokens)>`. Cheap, no write amplification, lost on restart.
- `df.rate_limit` table keyed by `(role oid, window_start)` updated with `INSERT ... ON CONFLICT DO UPDATE`. Durable, survives restarts.
For v1, the concurrency cap (#1) should be sufficient and avoids the new table entirely.
GUCs to register
In `src/lib.rs` `_PG_init`, alongside `pg_durable.worker_role`:
- `df.max_concurrent_per_user` — int, `PGC_SUSET` (only superuser can change), default 100, range `[0, INT_MAX]` where `0` = unlimited.
- `df.max_instances_per_user` — int, `PGC_SUSET`, default 10000, `0` = unlimited.
`PGC_SUSET` is critical — otherwise an attacker just calls `SET df.max_concurrent_per_user = 1000000` and bypasses the limit.
Where the check goes
Inside `df_start()`, after identity capture (`GetSessionUserId`/`GetOuterUserId`) and before the call that inserts into `df.instances` / enqueues the duroxide work item. Failing here means no rows are written and no worker capacity is consumed.
Testing
Add an E2E test under `tests/e2e/sql/` following the numbered convention:
- `SET df.max_concurrent_per_user = 2` (as superuser).
- As a non-superuser role, start 2 long-running instances (e.g. a slow SQL).
- Assert the 3rd `df.start()` raises the expected error.
- Cancel one, assert the next `df.start()` succeeds.
Also a test for `df.max_instances_per_user` covering completed-instance accumulation.
Acceptance Criteria
Related
- Security review findings: D-1, D-2 (P0)
- Related but out of scope here: D-3 (df.http rate limit), D-4 (statement_timeout), D-7 (queue depth)
- See: `docs/security-review/security-review.md` § 5.1
Background
The security review identified two High-severity DoS findings (D-1, D-2) and a P0 recommendation to add rate limiting / quotas to
df.start(). This issue tracks the analysis and a proposed implementation.Why rate-limit
df.start()df.start()is the only entry point that creates new durable workflow instances. Every call:df.instancesanddf.nodes(storage cost — graphs can be large fordf.seq/df.loop/df.parallel).duroxide.*tables that the background worker must poll and execute.Because the call is cheap for the caller but expensive downstream, an unprivileged user (or a buggy app) can trivially:
df.nodes(no quota today).df.start()blocks every other user's instances from making progress (noisy-neighbor DoS).df.startis the gate todf.http()execution) — relevant to D-3.In a single-tenant deployment this is "just" a self-inflicted footgun, but the threat model treats database roles as a trust boundary (RLS, per-user connections), so one role abusing `df.start()` violates that isolation by denying service to others.
Proposed Implementation
Two complementary limits, both enforced inside `df_start()` in `src/dsl.rs` before the instance is enqueued (so a rejected start never touches duroxide). All counts use the captured `submitted_by`/`login_role` identity (already unforgeable per S-2). Superuser bypasses both checks (consistent with I-8).
1. Concurrency cap — `df.max_concurrent_per_user`
GUC, default e.g. `100`. On each `df.start()`:
```sql
SELECT count(*) FROM df.instances
WHERE submitted_by = $1
AND lower(status) IN ('pending','running');
```
If `>= max_concurrent_per_user`, raise:
```
ERROR: df.start rejected: user % has % active instances (limit %)
HINT: wait for in-flight instances to complete or raise df.max_concurrent_per_user
```
Use SPI with `Spi::get_one_with_args` (same parameterized pattern as `df.status`/`df.result`, per T-1).
2. Storage / lifetime quota — `df.max_instances_per_user`
GUC, default e.g. `10000`. Counts all rows in `df.instances` for that user (including completed) so a user can't accumulate unbounded history. Same SPI pattern, different predicate (no status filter). Cleanup is the user's responsibility via a future `df.purge()` — document this in the error hint.
3. (Optional, future) Token-bucket rate (calls/sec)
A true rate limit (vs. concurrency limit) needs shared state across backends. Two options:
For v1, the concurrency cap (#1) should be sufficient and avoids the new table entirely.
GUCs to register
In `src/lib.rs` `_PG_init`, alongside `pg_durable.worker_role`:
`PGC_SUSET` is critical — otherwise an attacker just calls `SET df.max_concurrent_per_user = 1000000` and bypasses the limit.
Where the check goes
Inside `df_start()`, after identity capture (`GetSessionUserId`/`GetOuterUserId`) and before the call that inserts into `df.instances` / enqueues the duroxide work item. Failing here means no rows are written and no worker capacity is consumed.
Testing
Add an E2E test under `tests/e2e/sql/` following the numbered convention:
Also a test for `df.max_instances_per_user` covering completed-instance accumulation.
Acceptance Criteria
Related