Skip to content

Rate-limit df.start() to prevent DoS (security review D-1, D-2) #139

Description

@pinodeca

Background

The security review identified two High-severity DoS findings (D-1, D-2) and a P0 recommendation to add rate limiting / quotas to df.start(). This issue tracks the analysis and a proposed implementation.

Why rate-limit df.start()

df.start() is the only entry point that creates new durable workflow instances. Every call:

  1. Inserts rows into df.instances and df.nodes (storage cost — graphs can be large for df.seq/df.loop/df.parallel).
  2. Enqueues a work item into duroxide.* tables that the background worker must poll and execute.
  3. Eventually causes the worker to open a per-user sqlx connection (pool size = 5) and run user SQL / outbound HTTP.

Because the call is cheap for the caller but expensive downstream, an unprivileged user (or a buggy app) can trivially:

  • Exhaust disk by mass-inserting graphs into df.nodes (no quota today).
  • Starve the worker pool — only 5 per-user connections exist; one user flooding df.start() blocks every other user's instances from making progress (noisy-neighbor DoS).
  • Saturate outbound HTTP indirectly (since df.start is the gate to df.http() execution) — relevant to D-3.
  • Inflate duroxide history tables — every retry/replay amplifies the storage cost.

In a single-tenant deployment this is "just" a self-inflicted footgun, but the threat model treats database roles as a trust boundary (RLS, per-user connections), so one role abusing `df.start()` violates that isolation by denying service to others.

Proposed Implementation

Two complementary limits, both enforced inside `df_start()` in `src/dsl.rs` before the instance is enqueued (so a rejected start never touches duroxide). All counts use the captured `submitted_by`/`login_role` identity (already unforgeable per S-2). Superuser bypasses both checks (consistent with I-8).

1. Concurrency cap — `df.max_concurrent_per_user`

GUC, default e.g. `100`. On each `df.start()`:

```sql
SELECT count(*) FROM df.instances
WHERE submitted_by = $1
AND lower(status) IN ('pending','running');
```

If `>= max_concurrent_per_user`, raise:

```
ERROR: df.start rejected: user % has % active instances (limit %)
HINT: wait for in-flight instances to complete or raise df.max_concurrent_per_user
```

Use SPI with `Spi::get_one_with_args` (same parameterized pattern as `df.status`/`df.result`, per T-1).

2. Storage / lifetime quota — `df.max_instances_per_user`

GUC, default e.g. `10000`. Counts all rows in `df.instances` for that user (including completed) so a user can't accumulate unbounded history. Same SPI pattern, different predicate (no status filter). Cleanup is the user's responsibility via a future `df.purge()` — document this in the error hint.

3. (Optional, future) Token-bucket rate (calls/sec)

A true rate limit (vs. concurrency limit) needs shared state across backends. Two options:

  • In-memory shared map via `pg_shmem_init_hook` — `HashMap<Oid, (last_refill, tokens)>`. Cheap, no write amplification, lost on restart.
  • `df.rate_limit` table keyed by `(role oid, window_start)` updated with `INSERT ... ON CONFLICT DO UPDATE`. Durable, survives restarts.

For v1, the concurrency cap (#1) should be sufficient and avoids the new table entirely.

GUCs to register

In `src/lib.rs` `_PG_init`, alongside `pg_durable.worker_role`:

  • `df.max_concurrent_per_user` — int, `PGC_SUSET` (only superuser can change), default 100, range `[0, INT_MAX]` where `0` = unlimited.
  • `df.max_instances_per_user` — int, `PGC_SUSET`, default 10000, `0` = unlimited.

`PGC_SUSET` is critical — otherwise an attacker just calls `SET df.max_concurrent_per_user = 1000000` and bypasses the limit.

Where the check goes

Inside `df_start()`, after identity capture (`GetSessionUserId`/`GetOuterUserId`) and before the call that inserts into `df.instances` / enqueues the duroxide work item. Failing here means no rows are written and no worker capacity is consumed.

Testing

Add an E2E test under `tests/e2e/sql/` following the numbered convention:

  1. `SET df.max_concurrent_per_user = 2` (as superuser).
  2. As a non-superuser role, start 2 long-running instances (e.g. a slow SQL).
  3. Assert the 3rd `df.start()` raises the expected error.
  4. Cancel one, assert the next `df.start()` succeeds.

Also a test for `df.max_instances_per_user` covering completed-instance accumulation.

Acceptance Criteria

  • `df.max_concurrent_per_user` GUC registered (PGC_SUSET, default 100, 0 = unlimited)
  • `df.max_instances_per_user` GUC registered (PGC_SUSET, default 10000, 0 = unlimited)
  • `df.start()` enforces both limits before any insert / enqueue
  • Superuser bypasses both checks
  • Clear error messages with HINT
  • E2E test for concurrency cap
  • E2E test for instance quota
  • User guide documents the GUCs and how to tune them

Related

  • Security review findings: D-1, D-2 (P0)
  • Related but out of scope here: D-3 (df.http rate limit), D-4 (statement_timeout), D-7 (queue depth)
  • See: `docs/security-review/security-review.md` § 5.1

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions