Rate-limit df.start() to prevent DoS (security review D-1, D-2)

## Background

The [security review](docs/security-review/security-review.md) identified two High-severity DoS findings (D-1, D-2) and a P0 recommendation to add rate limiting / quotas to `df.start()`. This issue tracks the analysis and a proposed implementation.

## Why rate-limit `df.start()`

`df.start()` is the only entry point that creates new durable workflow instances. Every call:

1. Inserts rows into `df.instances` and `df.nodes` (storage cost — graphs can be large for `df.seq`/`df.loop`/`df.parallel`).
2. Enqueues a work item into `duroxide.*` tables that the background worker must poll and execute.
3. Eventually causes the worker to open a per-user sqlx connection (pool size = 5) and run user SQL / outbound HTTP.

Because the call is cheap for the caller but expensive downstream, an unprivileged user (or a buggy app) can trivially:

- **Exhaust disk** by mass-inserting graphs into `df.nodes` (no quota today).
- **Starve the worker pool** — only 5 per-user connections exist; one user flooding `df.start()` blocks every other user's instances from making progress (noisy-neighbor DoS).
- **Saturate outbound HTTP** indirectly (since `df.start` is the gate to `df.http()` execution) — relevant to D-3.
- **Inflate duroxide history tables** — every retry/replay amplifies the storage cost.

In a single-tenant deployment this is \"just\" a self-inflicted footgun, but the threat model treats database roles as a trust boundary (RLS, per-user connections), so one role abusing \`df.start()\` violates that isolation by denying service to others.

## Proposed Implementation

Two complementary limits, both enforced inside \`df_start()\` in \`src/dsl.rs\` **before** the instance is enqueued (so a rejected start never touches duroxide). All counts use the captured \`submitted_by\`/\`login_role\` identity (already unforgeable per S-2). Superuser bypasses both checks (consistent with I-8).

### 1. Concurrency cap — \`df.max_concurrent_per_user\`

GUC, default e.g. \`100\`. On each \`df.start()\`:

\`\`\`sql
SELECT count(*) FROM df.instances
WHERE submitted_by = \$1
  AND lower(status) IN ('pending','running');
\`\`\`

If \`>= max_concurrent_per_user\`, raise:

\`\`\`
ERROR: df.start rejected: user % has % active instances (limit %)
HINT: wait for in-flight instances to complete or raise df.max_concurrent_per_user
\`\`\`

Use SPI with \`Spi::get_one_with_args\` (same parameterized pattern as \`df.status\`/\`df.result\`, per T-1).

### 2. Storage / lifetime quota — \`df.max_instances_per_user\`

GUC, default e.g. \`10000\`. Counts **all** rows in \`df.instances\` for that user (including completed) so a user can't accumulate unbounded history. Same SPI pattern, different predicate (no status filter). Cleanup is the user's responsibility via a future \`df.purge()\` — document this in the error hint.

### 3. (Optional, future) Token-bucket rate (calls/sec)

A true rate limit (vs. concurrency limit) needs shared state across backends. Two options:

- **In-memory shared map** via \`pg_shmem_init_hook\` — \`HashMap<Oid, (last_refill, tokens)>\`. Cheap, no write amplification, lost on restart.
- **\`df.rate_limit\` table** keyed by \`(role oid, window_start)\` updated with \`INSERT ... ON CONFLICT DO UPDATE\`. Durable, survives restarts.

For v1, the concurrency cap (#1) should be sufficient and avoids the new table entirely.

### GUCs to register

In \`src/lib.rs\` \`_PG_init\`, alongside \`pg_durable.worker_role\`:

- \`df.max_concurrent_per_user\` — int, **\`PGC_SUSET\`** (only superuser can change), default 100, range \`[0, INT_MAX]\` where \`0\` = unlimited.
- \`df.max_instances_per_user\` — int, **\`PGC_SUSET\`**, default 10000, \`0\` = unlimited.

\`PGC_SUSET\` is critical — otherwise an attacker just calls \`SET df.max_concurrent_per_user = 1000000\` and bypasses the limit.

### Where the check goes

Inside \`df_start()\`, **after** identity capture (\`GetSessionUserId\`/\`GetOuterUserId\`) and **before** the call that inserts into \`df.instances\` / enqueues the duroxide work item. Failing here means no rows are written and no worker capacity is consumed.

### Testing

Add an E2E test under \`tests/e2e/sql/\` following the numbered convention:

1. \`SET df.max_concurrent_per_user = 2\` (as superuser).
2. As a non-superuser role, start 2 long-running instances (e.g. a slow SQL).
3. Assert the 3rd \`df.start()\` raises the expected error.
4. Cancel one, assert the next \`df.start()\` succeeds.

Also a test for \`df.max_instances_per_user\` covering completed-instance accumulation.

## Acceptance Criteria

- [ ] \`df.max_concurrent_per_user\` GUC registered (PGC_SUSET, default 100, 0 = unlimited)
- [ ] \`df.max_instances_per_user\` GUC registered (PGC_SUSET, default 10000, 0 = unlimited)
- [ ] \`df.start()\` enforces both limits before any insert / enqueue
- [ ] Superuser bypasses both checks
- [ ] Clear error messages with HINT
- [ ] E2E test for concurrency cap
- [ ] E2E test for instance quota
- [ ] User guide documents the GUCs and how to tune them

## Related

- Security review findings: D-1, D-2 (P0)
- Related but out of scope here: D-3 (df.http rate limit), D-4 (statement_timeout), D-7 (queue depth)
- See: \`docs/security-review/security-review.md\` § 5.1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rate-limit df.start() to prevent DoS (security review D-1, D-2) #139

Background

Why rate-limit `df.start()`

Proposed Implementation

1. Concurrency cap — `df.max_concurrent_per_user`

2. Storage / lifetime quota — `df.max_instances_per_user`

3. (Optional, future) Token-bucket rate (calls/sec)

GUCs to register

Where the check goes

Testing

Acceptance Criteria

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Rate-limit df.start() to prevent DoS (security review D-1, D-2) #139

Description

Background

Why rate-limit df.start()

Proposed Implementation

1. Concurrency cap — `df.max_concurrent_per_user`

2. Storage / lifetime quota — `df.max_instances_per_user`

3. (Optional, future) Token-bucket rate (calls/sec)

GUCs to register

Where the check goes

Testing

Acceptance Criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Why rate-limit `df.start()`