fix: M:N capture_connectors table for connector provenance #85

Merged
graydawnc merged 1 commit into main from fix/connector-provenance-m2n
Apr 15, 2026

@graydawnc graydawnc commented Apr 15, 2026

Problem

Two long-standing bugs shared one root cause: the captures table treats connector ownership as a single FK (source_id) plus a single string in metadata.connectorId, but connector → capture is fundamentally M:N:

  • A Reddit post can be both saved and upvoted
  • HN hot/saved overlap on the same story
  • github-stars / github-notifications overlap on the same repo

Symptom A — schema lies about origin

Every connector capture was inserted with source_id = 1 (claude) via a hardcoded workaround in sync-engine.ts, because the sources table only had four rows (claude, codex, opencli, gemini) and connectors didn't fit. Per-connector identity was shoved into metadata.connectorId. Empirically low-impact (no query JOINs captures to sources for connector data), but a real schema smell and a trap for future maintainers.

Symptom B — per-connector counts oscillate (user-visible)

metadata.connectorId is single-valued. When two connectors legitimately share a platform_id, the UPSERT path on (platform, platform_id) overwrites whichever connector was already there. Last sync wins. Reproduced live with Reddit:

  • DB state: 2 saved posts, 4 upvoted posts, 1 overlap
  • UI counts oscillated saved 2↔1 / upvoted 4↔3 across repeated syncs as the overlap post's connectorId flipped ownership

Same trap waiting for HN hot+saved, github-stars+github-notifications, and any future overlapping pair.
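
The oscillation can be reproduced with a small in-memory model. This is only a sketch of the failure mode, not code from sync-engine.ts: the `Capture` type, `syncSingleValued`, `countFor`, and the post ids are hypothetical names chosen for illustration, and a `Map` keyed by platform id stands in for the UPSERT on (platform, platform_id).

```typescript
// Hypothetical model of the old behaviour: one connectorId per capture.
type Capture = { platformId: string; connectorId: string };

// Stand-in for the UPSERT path: the incoming connector overwrites the owner.
function syncSingleValued(
  store: Map<string, Capture>,
  connectorId: string,
  platformIds: string[],
): void {
  for (const platformId of platformIds) {
    store.set(platformId, { platformId, connectorId }); // last sync wins
  }
}

// Per-connector count, as the UI would have derived it from metadata.connectorId.
function countFor(store: Map<string, Capture>, connectorId: string): number {
  let n = 0;
  store.forEach((c) => {
    if (c.connectorId === connectorId) n++;
  });
  return n;
}

const store = new Map<string, Capture>();
const saved = ["s1", "shared"]; // 2 saved posts
const upvoted = ["u1", "u2", "u3", "shared"]; // 4 upvoted posts, 1 overlap

syncSingleValued(store, "reddit-saved", saved);
syncSingleValued(store, "reddit-upvoted", upvoted);
// "shared" now belongs to reddit-upvoted only.
const afterUpvotedSync = {
  saved: countFor(store, "reddit-saved"),
  upvoted: countFor(store, "reddit-upvoted"),
};

syncSingleValued(store, "reddit-saved", saved);
// "shared" flipped back to reddit-saved.
const afterSavedSync = {
  saved: countFor(store, "reddit-saved"),
  upvoted: countFor(store, "reddit-upvoted"),
};
```

In this model the counts move between saved 1 / upvoted 4 and saved 2 / upvoted 3 depending on which connector synced last, which mirrors the 2↔1 and 4↔3 flips observed in the UI.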

Design — why M:N over alternatives

Considered four options:

Option | Symptom A | Symptom B | Schema churn
1. NULL-able source_id | partial | | small
2. Per-connector sources rows | | ❌ (still single-valued FK) | small
3. M:N capture_connectors | | | medium
4. metadata.connectorIds[] | | | small, but JSON array filtering is ugly

Option 2 was the most tempting alternative — drop a row per connector into sources. But source_id is a single-valued FK, so the Reddit overlap post still can't carry both reddit-saved and reddit-upvoted simultaneously. Option 2 is a dead end for Symptom B; making it work would require either duplicating rows (breaking (platform, platform_id) dedup and doubling FTS cost) or never overwriting source_id (which biases counts toward whichever connector synced first — still wrong).

Only M:N can natively represent "this capture belongs to N connectors at once" without breaking dedup or duplicating index entries. Picked Option 3.

A pragmatic deviation from the original Option 3 sketch: instead of rebuilding the captures table to make source_id nullable (which entangles FTS triggers, indexes, and FK references), I added a generic 'connector' row to sources and repointed all connector captures at it. The result is equally truthful (the schema no longer claims a Reddit post came from claude) without touching the captures table definition.

Implementation

Schema

CREATE TABLE capture_connectors (
  capture_id   INTEGER NOT NULL REFERENCES captures(id) ON DELETE CASCADE,
  connector_id TEXT NOT NULL,
  PRIMARY KEY (capture_id, connector_id)
);
CREATE INDEX idx_capture_connectors_connector ON capture_connectors(connector_id);

Plus a new ('connector', '<plugin>') row in sources.

Migration v3 (idempotent, transactional)

  1. Backfill M:N from existing metadata.connectorId
  2. Strip connectorId from metadata via json_remove
  3. Repoint captures.source_id for connector rows from claude to connector
  4. Drop dead idx_captures_source (no query ever used it)

PRAGMA user_version only advances after the transaction commits, so a partial run retries cleanly. All four steps are idempotent (INSERT OR IGNORE, json_remove on already-stripped JSON is a no-op, UPDATE to current value is a no-op, DROP INDEX IF EXISTS).
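
The retry-safety argument can be illustrated with an in-memory stand-in for the database. This is a sketch under assumptions, not the actual migration: the `Db` shape, `applyV3Steps`, `migrateToV3`, and `snapshot` are hypothetical names, a `Set` of "captureId:connectorId" strings models the composite primary key, and selecting rows to repoint by "has at least one M:N link" is an assumed criterion rather than something the PR states.

```typescript
// Hypothetical in-memory stand-in for the SQLite database.
type CaptureRow = { id: number; source: string; metadata: Record<string, string> };
type Db = {
  userVersion: number;
  captures: CaptureRow[];
  captureConnectors: Set<string>; // "captureId:connectorId", models the composite PK
  indexes: Set<string>;
};

// The four steps, each written so that running it again changes nothing.
function applyV3Steps(db: Db): void {
  // 1. Backfill: Set.add on an existing member is a no-op, like INSERT OR IGNORE.
  for (const row of db.captures) {
    const connectorId = row.metadata["connectorId"];
    if (connectorId !== undefined) db.captureConnectors.add(`${row.id}:${connectorId}`);
  }
  // 2. Strip: deleting an absent key is a no-op, like json_remove on stripped JSON.
  for (const row of db.captures) delete row.metadata["connectorId"];
  // 3. Repoint rows that have M:N attribution (assumed selection criterion).
  const linked = new Set<number>();
  db.captureConnectors.forEach((key) => linked.add(Number(key.split(":")[0])));
  for (const row of db.captures) if (linked.has(row.id)) row.source = "connector";
  // 4. Drop the dead index: deleting a missing name is a no-op, like DROP INDEX IF EXISTS.
  db.indexes.delete("idx_captures_source");
}

function migrateToV3(db: Db): void {
  if (db.userVersion >= 3) return;
  applyV3Steps(db);
  db.userVersion = 3; // advance the version only after the steps have completed
}

// Order-independent serialisation used to compare two states.
function snapshot(db: Db): string {
  const links: string[] = [];
  db.captureConnectors.forEach((k) => links.push(k));
  const indexes: string[] = [];
  db.indexes.forEach((k) => indexes.push(k));
  return JSON.stringify({
    v: db.userVersion,
    captures: db.captures,
    links: links.sort(),
    indexes: indexes.sort(),
  });
}

const db: Db = {
  userVersion: 2,
  captures: [
    { id: 1, source: "claude", metadata: { connectorId: "reddit-saved", title: "a" } },
    { id: 2, source: "claude", metadata: {} }, // not a connector capture
  ],
  captureConnectors: new Set<string>(),
  indexes: new Set<string>(["idx_captures_source"]),
};

migrateToV3(db);
const firstRun = snapshot(db);
applyV3Steps(db); // simulate the steps being retried after a partial run
const secondRun = snapshot(db);
```

Comparing `firstRun` and `secondRun` is the property that matters here: re-applying the steps to an already-migrated state leaves it unchanged, so it does not matter how far a failed attempt got before the retry.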

Sync engine

  • tagConnectorId removed entirely — metadata no longer carries provenance
  • upsertItems now takes connectorId and runs INSERT OR IGNORE INTO capture_connectors after both INSERT and UPDATE paths, so a capture re-synced by a second connector picks up an additional M:N row instead of overwriting
  • getSourceId now resolves 'connector' instead of 'claude'
  • deleteConnectorItems switched to "drop this connector's M:N rows, then delete captures with source='connector' that have no remaining M:N attribution" — preserves shared items legitimately owned by another connector
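
The upsert and delete semantics described above can be sketched in memory. This is an illustrative model, not the code in sync-engine.ts, which issues SQL: `CaptureStore`, `upsertItem`, `countFor`, and the post ids are hypothetical, and only `deleteConnectorItems` borrows its name from the PR.

```typescript
// Hypothetical in-memory model of the M:N-aware sync engine.
type Capture = { id: number; platform: string; platformId: string; source: string };

class CaptureStore {
  private nextId = 1;
  // Keyed by "platform:platformId", standing in for the (platform, platform_id) dedup key.
  readonly captures = new Map<string, Capture>();
  // captureId -> connector ids, standing in for the capture_connectors table.
  readonly links = new Map<number, Set<string>>();

  upsertItem(connectorId: string, platform: string, platformId: string): Capture {
    const key = `${platform}:${platformId}`;
    let capture = this.captures.get(key);
    if (capture === undefined) {
      capture = { id: this.nextId++, platform, platformId, source: "connector" };
      this.captures.set(key, capture);
    }
    // Runs after both the insert and the update path. Adding an existing
    // member to a Set is a no-op, like INSERT OR IGNORE on the composite key.
    let owners = this.links.get(capture.id);
    if (owners === undefined) {
      owners = new Set<string>();
      this.links.set(capture.id, owners);
    }
    owners.add(connectorId);
    return capture;
  }

  deleteConnectorItems(connectorId: string): void {
    // Drop this connector's M:N rows first.
    this.links.forEach((owners) => owners.delete(connectorId));
    // Then delete connector captures that no connector claims any more.
    this.captures.forEach((capture, key) => {
      const owners = this.links.get(capture.id);
      const unowned = owners === undefined || owners.size === 0;
      if (capture.source === "connector" && unowned) {
        this.captures.delete(key);
        this.links.delete(capture.id);
      }
    });
  }

  countFor(connectorId: string): number {
    let n = 0;
    this.links.forEach((owners) => {
      if (owners.has(connectorId)) n++;
    });
    return n;
  }
}

const store = new CaptureStore();
const syncAll = (): void => {
  for (const id of ["s1", "shared"]) store.upsertItem("reddit-saved", "reddit", id);
  for (const id of ["u1", "u2", "u3", "shared"]) store.upsertItem("reddit-upvoted", "reddit", id);
};
syncAll();
syncAll(); // a repeated sync must not change the counts
const countsAfterSyncs = [store.countFor("reddit-saved"), store.countFor("reddit-upvoted")];

store.deleteConnectorItems("reddit-saved");
const countsAfterDelete = [store.countFor("reddit-saved"), store.countFor("reddit-upvoted")];
const remainingCaptures = store.captures.size;
```

With the same 2 saved / 4 upvoted / 1 overlap data as the bug report, the counts stay at 2 and 4 across repeated syncs, and deleting reddit-saved removes only the saved-only post while the shared post survives under reddit-upvoted.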

Six query sites updated

File | Change
core/src/connectors/sync-engine.ts | upsertItems writes M:N; deleteConnectorItems is M:N-aware
app/src/main/index.ts (uninstall) | M:N delete + orphan cleanup; removed DELETE FROM captures WHERE platform = ? fallback (would nuke shared items in multi-connector world)
app/src/main/index.ts (count) | SELECT COUNT(*) FROM capture_connectors WHERE connector_id = ?
cli/src/commands/connector-sync.ts (--reset) | Same M:N delete pattern
cli/src/commands/connector-sync.ts (final count) | M:N count
app/src/main/acp.ts | ACP prompt examples teach the assistant to JOIN through M:N

Test helpers

createTestDB in test-helpers.ts updated to mirror the new schema (extra connector source row + capture_connectors table). All 147 core tests pass.

Verification — run against live DB

Pre-migration baseline (308 captures, all carrying metadata.connectorId):

  • user_version = 2, no capture_connectors table

After dev startup (migration v3 ran):

  • user_version = 3, connector source row added
  • 308 M:N rows backfilled, per-connector breakdown identical to pre-migration grouping
  • 0 captures with metadata.connectorId residue
  • 308/308 captures repointed to source='connector'
  • 0 mismatched (M:N row but wrong source) and 0 orphan (right source but no M:N) rows
  • idx_captures_source dropped, idx_capture_connectors_connector created
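
The "mismatched" and "orphan" checks in the list above amount to a two-way consistency audit between captures and capture_connectors. A minimal sketch of that audit, assuming the rows have already been loaded into arrays; `auditProvenance` and both row types are hypothetical names, not part of the codebase:

```typescript
// Hypothetical row shapes for the two tables involved in the audit.
type CaptureRow = { id: number; source: string };
type LinkRow = { captureId: number; connectorId: string };

// mismatched: has an M:N row but its source is not 'connector'.
// orphan: source is 'connector' but no M:N row attributes it to anyone.
function auditProvenance(
  captures: CaptureRow[],
  links: LinkRow[],
): { mismatched: number[]; orphan: number[] } {
  const linkedIds = new Set<number>();
  for (const link of links) linkedIds.add(link.captureId);

  const mismatched = captures
    .filter((c) => linkedIds.has(c.id) && c.source !== "connector")
    .map((c) => c.id);
  const orphan = captures
    .filter((c) => c.source === "connector" && !linkedIds.has(c.id))
    .map((c) => c.id);
  return { mismatched, orphan };
}

// A consistent state: every connector capture is linked, and nothing else is.
const clean = auditProvenance(
  [
    { id: 1, source: "connector" },
    { id: 2, source: "claude" },
  ],
  [{ captureId: 1, connectorId: "reddit-saved" }],
);

// An inconsistent state: capture 3 is linked but still claims 'claude',
// and capture 4 claims 'connector' with no attribution.
const broken = auditProvenance(
  [
    { id: 3, source: "claude" },
    { id: 4, source: "connector" },
  ],
  [{ captureId: 3, connectorId: "reddit-upvoted" }],
);
```

A healthy post-migration database corresponds to the `clean` case, where both lists come back empty.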

Symptom B reproduction (Reddit, baseline matches the bug report exactly):

  • 2 saved, 4 upvoted, 1 overlap (t3_1skjbg8 "Dark Fantasy Realms" carries both reddit-saved and reddit-upvoted in M:N)
  • Counts identical across 3 successive syncs — no oscillation

Single-connector uninstall semantics (simulated in transaction, then rolled back):

  • reddit-saved M:N rows cleared
  • The saved-only post deleted
  • The overlap post preserved, M:N now lists only reddit-upvoted
  • reddit-upvoted count unchanged

Full npm-package uninstall (UI):

  • All Reddit M:N + captures + sync_state cleared
  • Other connectors' data untouched

FTS sanity:

  • captures/captures_fts/captures_fts_trigram row counts aligned post-migration
  • "Dark Fantasy" matches in both unicode61 and trigram FTS

Type-check + tests pass on @spool/core, @spool/cli, @spool/app.

Test plan

  • Migration runs cleanly on existing DB (308 captures backfilled with no data loss)
  • Symptom A: connector captures no longer claim source_id = claude
  • Symptom B: Reddit saved/upvoted counts stable across repeated syncs
  • Single-connector reset preserves shared items
  • Full package uninstall preserves unrelated connectors
  • FTS row counts stay aligned with captures
  • Manual smoke test on a fresh install (no pre-existing data) — recommend before release

Two long-standing bugs shared one root cause: the captures table treated
connector ownership as a single FK + a single string in metadata, but
connector→capture is fundamentally M:N (one Reddit post can be both
saved and upvoted; HN hot/saved overlap; github-stars/notifications
overlap on the same repo).

Symptom A: every connector capture had source_id=1 (claude) due to a
hardcoded workaround. The schema lied about origin.

Symptom B: per-connector item counts oscillated across syncs. The single
metadata.connectorId field was clobbered on every UPSERT, so whichever
connector synced last "won" the shared item, and the loser's count
dropped by one until it synced again.

Fix: introduce capture_connectors(capture_id, connector_id) M:N table,
add a generic 'connector' source row, drop dead idx_captures_source.
Migration v3 backfills M:N from existing metadata.connectorId, strips
the field, and repoints connector captures to the new source row.
Six query sites (sync-engine upsert/delete, main uninstall + count,
CLI reset + count, ACP prompt examples) updated to JOIN through M:N.
@graydawnc graydawnc merged commit 8d7447a into main Apr 15, 2026
3 checks passed
@graydawnc graydawnc deleted the fix/connector-provenance-m2n branch April 15, 2026 06:00