
schema: add partial index on channel_members.pubkey (#359)

Open
tlongwell-block wants to merge 1 commit into main from tlongwell/add-channel-members-pubkey-index

Conversation

@tlongwell-block (Collaborator)

Problem

Every Nostr REQ (subscription) message calls get_accessible_channel_ids() as its first database operation after auth. This query filters channel_members by WHERE pubkey = $1 AND removed_at IS NULL — but the only index on the table is the primary key (channel_id, pubkey), which leads with channel_id. PostgreSQL cannot use a B-tree index efficiently for an equality lookup on a non-leading column, so every subscription triggers a sequential scan of the entire table.
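This is easy to confirm by asking the planner directly. A sketch with a placeholder pubkey value — the plan text below is illustrative, not captured from staging:

```sql
-- With only the (channel_id, pubkey) PK available, a pubkey-first lookup
-- has no usable index and falls back to a sequential scan:
EXPLAIN
SELECT channel_id
FROM channel_members
WHERE pubkey = 'npub1...' AND removed_at IS NULL;

--  Seq Scan on channel_members
--    Filter: ((removed_at IS NULL) AND (pubkey = 'npub1...'))
```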

On staging right now:

| Metric | Value |
| --- | --- |
| Sequential scans on channel_members | 5.6 million |
| Total rows read by those scans | 7.2 billion |
| Scan rate | ~5/sec steady |
| Rows per scan | ~1,284 (full table) |
| channels table seq scans (same pattern) | 5.8 million |

The table is small today (1,360 rows) so each scan completes in <0.3ms and fits in shared_buffers (99.99% cache hit ratio). But this is O(N) per subscription — as users and channels grow, it will degrade linearly and become a real bottleneck.

Symptoms observed

Users are seeing "Failed to refresh channel history after subscribing" and "Timed out while loading channel history" on staging. While investigating, we also found:

  • A rogue redis-cli MONITOR session that had been running for ~6 hours, accumulating a 73MB output buffer in a single client connection (Redis pod limit is 256Mi). Killed it — memory dropped from 71MB → 1.45MB instantly.
  • Datadog agent unreachable from the Istio sidecar — repeated 503 errors in envoy tracing logs adding latency overhead to every request through the mesh.

The index fix addresses the underlying database inefficiency; the Redis MONITOR issue was the acute trigger.

Solution

Add a partial index on channel_members.pubkey for active members:

CREATE INDEX idx_channel_members_pubkey ON channel_members (pubkey)
    WHERE removed_at IS NULL;

This covers the exact predicate used by all hot-path queries in sprout-db/src/channel.rs:

| Function | Line | Query pattern |
| --- | --- | --- |
| get_accessible_channel_ids | 529 | WHERE cm.pubkey = $1 AND cm.removed_at IS NULL |
| channel_ids_for_pubkey | 531 | WHERE cm.pubkey = $1 AND cm.removed_at IS NULL |
| is_member | 491 | WHERE cm.channel_id = $1 AND cm.pubkey = $2 AND cm.removed_at IS NULL |
| get_member_role | 604 | WHERE channel_id = $1 AND pubkey = $2 AND removed_at IS NULL |
| list_accessible_channels | 722 | LEFT JOIN ... AND cm.pubkey = $1 AND cm.removed_at IS NULL |

Also used by DM lookups in sprout-db/src/dm.rs (lines 244-245, 266-267).

Why partial?

  • No queries in the codebase read removed members (removed_at IS NOT NULL)
  • Partial index is smaller and more cache-friendly
  • Matches the exact WHERE clause PostgreSQL needs to prove index applicability
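On the last point, a sketch of the applicability rule: the planner considers a partial index only when it can prove the query's WHERE clause implies the index predicate.

```sql
-- Eligible: the predicate includes removed_at IS NULL, which implies the
-- index predicate, so the partial index can serve the lookup.
SELECT channel_id FROM channel_members
WHERE pubkey = $1 AND removed_at IS NULL;

-- Not eligible: without removed_at IS NULL the query may need rows for
-- removed members, which the partial index does not contain.
SELECT channel_id FROM channel_members
WHERE pubkey = $1;
```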

Why not composite (pubkey, channel_id)?

A composite index would enable index-only scans for SELECT channel_id WHERE pubkey = $1, but that is an optional future optimization. The single-column partial index is sufficient to eliminate the seq scans and is the minimal correct fix.
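For reference, the composite variant would look like this — hypothetical index name, not part of this PR:

```sql
-- Optional future optimization: adding channel_id as a second key column
-- enables index-only scans for
--   SELECT channel_id ... WHERE pubkey = $1 AND removed_at IS NULL
CREATE INDEX idx_channel_members_pubkey_channel
    ON channel_members (pubkey, channel_id)
    WHERE removed_at IS NULL;
```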

Queries already covered by existing PK

Queries that filter on (channel_id, pubkey) — like get_member, remove_member, get_member_role — are already well-served by the PK index (channel_id, pubkey). No additional index needed for those.

Rollout

  • CREATE INDEX (not CONCURRENTLY) takes a brief write lock on channel_members. At 1,360 rows this is sub-millisecond and safe.
  • Verify with EXPLAIN ANALYZE after deploy that the planner picks the new index for get_accessible_channel_ids().
  • Pairs well with block-coder-tf-stacks#1101 which bumps pod resource limits (more CPU/memory headroom), but that PR addresses capacity while this one addresses efficiency.
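A minimal post-deploy check, with a placeholder pubkey value (expected plan shape, not verified output):

```sql
EXPLAIN ANALYZE
SELECT channel_id
FROM channel_members
WHERE pubkey = 'npub1...' AND removed_at IS NULL;

-- Expect something like:
--  Index Scan using idx_channel_members_pubkey on channel_members
--    Index Cond: (pubkey = 'npub1...')
```

Note that on a table this small the planner may still estimate a seq scan as cheaper for some queries; the index scan should win consistently as the table grows.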

Commit message

The channel_members table only has a PK index on (channel_id, pubkey).
Every query that looks up channels-for-a-user (WHERE pubkey = $1) does a
sequential scan because the PK has channel_id first.

get_accessible_channel_ids() runs on every REQ (subscription) message —
it is the first thing the relay does after auth. On staging this has
accumulated 5.6M seq scans reading 7.2B rows total (~5 scans/sec steady).

Add a partial index on (pubkey) WHERE removed_at IS NULL, which covers
the exact predicate used by the hot-path queries in sprout-db/channel.rs:
  - get_accessible_channel_ids (line 529)
  - channel_ids_for_pubkey (line 531)
  - is_member (line 491)
  - get_member_role (line 604)

The table is small today (1,360 rows) so each scan is <0.3ms, but this
is O(N) per subscription and will degrade linearly as users grow.
