Skip to content

[integrations] Add smart ingest backend#28

Closed
alanshurafa wants to merge 221 commits into
mainfrom
codex/smart-ingest-backend
Closed

[integrations] Add smart ingest backend#28
alanshurafa wants to merge 221 commits into
mainfrom
codex/smart-ingest-backend

Conversation

@alanshurafa

Copy link
Copy Markdown
Owner

Fork-only draft PR to run CI for the smart ingest backend branch. This branch is a fork test only; upstream OB1 already contains integrations/smart-ingest, so this branch should be reviewed as superseded or reworked before any upstream submission.

Reb-Elle-Art and others added 30 commits April 17, 2026 14:56
Keep the public contribution contract unchanged but make the maintainer-local
execution layer legible to future agents, and stop tracking local-only overlays
that have no business landing upstream.

- CLAUDE.md: add Local GSD Execution Layer section pointing at .planning/
- .gitignore: add .local/, .agent/, .claude.json, __pycache__/
- dry_run now uses peekQueueItems() (read-only SELECT) instead of
  claimQueueItems(), so items stay "pending" during preview runs
- claimQueueItems() returns only rows actually claimed via .select(),
  preventing race conditions where concurrent workers see stale results
- markError() clears started_at and worker_version when resetting to
  "pending" so retryable items don't appear stale in monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why: Schema stores thoughts.id / entity_extraction_queue.thought_id as UUID
(gen_random_uuid()), not BIGINT. TypeScript types were declaring number,
which is a load-bearing lie: PostgREST returns UUIDs as strings, and any
arithmetic or Number-coerce on a consumer path would produce NaN. Updated
claimQueueItems, peekQueueItems, markComplete, markError, linkThoughtEntity
signatures to string. Entity IDs remain number (BIGSERIAL).
Why: The worker had no upper bound on LLM spend — a misconfigured cron on
a large queue could mint unbounded OpenRouter/OpenAI/Anthropic cost before
anyone noticed. Added ENTITY_EXTRACTION_MAX_CALLS env (default 10000,
0 = unlimited), a module-scoped llmCallCount counter, a pre-call gate that
throws ExtractionCostCapError when the cap is reached, and graceful abort
in the main loop that returns remaining claimed rows to 'pending' so the
next invocation can resume. Summary now reports truncated / truncated_reason
/ llm_calls so callers can observe the cap firing.
Add entities, edges, thought_entities, entity_extraction_queue, and
consolidation_log tables for automatic entity/relationship extraction
from thoughts. A trigger on the thoughts table enqueues new/updated
rows for an async worker (shipped separately in
integrations/entity-extraction-worker/).

Positioned as the extraction-side complement to recipes/ob-graph/ —
the two schemas are independent; ob-graph is a manual 2-table graph,
this is an automatic extraction pipeline with evidence-bearing links.

Part of the OB1 alpha milestone.
Base OB1 thoughts.id is UUID (gen_random_uuid()), not BIGINT.
Fixed thought_entities.thought_id, entity_extraction_queue.thought_id,
consolidation_log.survivor_id, and consolidation_log.loser_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mparison

Why: The comparison table advertised the UUID/BIGSERIAL mismatch as a
design choice, but it was the BLOCKER-1 bug -- thought_id FKs into
upstream thoughts.id (UUID) could not be BIGINT. With the UUID alignment
restored via 495a183 (cherry-picked), the comparison row is stale and
misleading; drop it.
…correct MCP posture

Why: The anon/authenticated SELECT GRANTs with no RLS exposed the
entire entity graph, evidence excerpts, queue errors, and consolidation
audit trail to anyone with the project URL -- same pattern as the Wave 1
PR-A BLOCKER-2. MCP tools on stock OB1 use the service-role key from an
Edge Function server-side, so the "MCP needs anon read access"
justification (HIGH-4) is factually wrong.

Changes:
- Drop all GRANT SELECT ... TO anon on the five tables.
- Enable RLS on entities, edges, thought_entities,
  entity_extraction_queue, consolidation_log.
- Add explicit service_role FOR ALL policies (legibility; role
  bypasses RLS anyway) and a minimum authenticated SELECT USING (true)
  policy as a multi-tenant scaffold. No anon policy.
- Keep service_role full GRANTs and narrow authenticated to SELECT.
- Rewrite README Expected Outcome to describe the actual RLS +
  service-role-only posture, removing the incorrect MCP claim.
…eader

Why: HIGH-1 -- the trigger reads NEW.content_fingerprint unconditionally,
so on a brain that skipped Step 2.6 the migration would install silently
and then crash the first INSERT INTO thoughts, breaking the primary
write path. Adding an information_schema precheck at the top of the
file hard-fails at install time with a clear remediation pointer,
which also covers MEDIUM-4 (the backfill reads the same column).
INFO-1 -- the top-of-file comment was a renaming artifact; rename
"Knowledge Graph Tables" to "Entity Extraction Tables" while we're
editing the header block.
Why: entity_extraction_queue and consolidation_log are both unbounded
with no documented retention strategy. The README positions this as a
production schema for long-running brains, which need operational
guidance. Add a "Pruning and Retention" section with terminal-state
DELETEs for the queue (safe because the trigger re-queues on edit) and
a simple age-based DELETE for the consolidation log. No schema change
-- retention stays an operator choice.
Why: The security posture of this schema depends on the RLS pattern
documented in primitives/rls/. Declaring the dependency in metadata
makes the gate's Rule 10 linkage visible and tells downstream indexers
that this schema consumes the rls primitive.
Why: Deno's global fetch has no body-read timeout on Supabase Edge. A
stuck OpenRouter / OpenAI / Anthropic upstream would hang the invocation
until the 150s platform wall-clock killed it, leaving claimed rows in
'processing' with no status update. Added fetchWithTimeout() helper that
wraps AbortController around fetch, defaulting to 60s and overridable via
FETCH_TIMEOUT_MS. All three LLM call sites in extractEntities now route
through it. Timeouts surface as thrown errors, caught by the per-item
try/catch, and flow through markError — so the standard retry-then-fail
path handles them (reset to 'pending' on attempt < 5, 'failed' on cap).
Why: Thought content was interpolated directly into the LLM prompt, giving
any captured text (emails, browser history, Slack dumps) a direct channel
to override the extraction instructions — which would then flow unescaped
into entities.canonical_name, a TEXT column rendered by dashboards and
MCP tools (stored XSS vector). Four layered defenses:

1. Wrap content in <thought_content>...</thought_content> tags and tell the
   model explicitly that content inside is untrusted data, not instructions.
2. Escape literal tag occurrences in content so an attacker can't forge a
   close-tag and break out of the wrapper.
3. Enforce response_format: { type: "json_object" } on OpenRouter (OpenAI
   already had it) so prose wrapping doesn't crash the JSON parser.
4. sanitizeEntityName() strips control chars and clips to 200 chars before
   entities land in the DB — caps the blast radius of a surviving injection.
Why: Supabase Edge Functions hard-kill at 150s. With limit=50 and LLM calls
averaging 3s each, cumulative latency alone could exceed the budget — killing
the invocation mid-loop and leaving rows stuck in 'processing' with no
recovery except the manual SQL from the README. Added startTime at
invocation, INVOCATION_BUDGET_MS = 140000 (30s headroom), and a per-item
gate that releases remaining claimed rows back to 'pending' so the next
invocation picks them up. Surfaces as summary.truncated=true with
truncated_reason='wall_clock_budget', plus elapsed_ms for monitoring.
Why: The knowledge-graph schema's queue_entity_extraction trigger re-queues
a thought when its content changes. The worker then re-runs extraction but
only upserted new thought_entities rows — it never deleted links from the
prior extraction. So editing a thought from {Alice, Bob, PostgreSQL} to
{Alice, Redis} ended up with the thought linked to all four entities. Over
time this silently corrupts the graph: edges.support_count inflates with
thoughts that no longer mention the underlying entity. Delete our own
prior links (scoped to source='entity_worker' so we don't clobber links
from other sources) before re-writing. Non-fatal if DELETE fails — we
still attempt the upserts, because missed extraction is worse than drift.
Why: The code fixes for BLOCKER-3 (ENTITY_EXTRACTION_MAX_CALLS),
BLOCKER-4 (FETCH_TIMEOUT_MS), and WARNING-2 (wall-clock budget) added
env knobs and summary fields (truncated, truncated_reason, llm_calls,
elapsed_ms) that weren't surfaced anywhere users would see. Also
documented the 'skipped' queue state per INFO-1 — the worker marks
system-generated thoughts (metadata.generated_by) as skipped, and until
now only the schema comment knew that. Added queue status reference and
a short note that dry-run leaves the queue untouched.
Scheduled script that queries the past N days of thoughts, paginates and ranks by importance, synthesizes with an LLM, and delivers to Telegram or stdout. Filters restricted/personal by default; opt in with --include-personal.
Generates per-entity markdown wiki pages by aggregating thought_entities links and synthesizing with an LLM. Three output modes (file, entity-metadata, thought) let users choose between filesystem, graph metadata, or thought store — with the pollution trade-offs of the last documented.
Adds thought_edges table (supports, contradicts, supersedes, evolved_into, depends_on, related_to) plus valid_from/valid_until/decay_weight columns on the existing entity edges table. Documents open design questions around the supersedes overlap with provenance-chains.
Classifier that populates thought_edges: Haiku filters candidate thought pairs, Opus confirms the relation. Cost-capped, batch-processed. Documents the unresolved question of whether to mirror supersedes edges back to public.thoughts.
Weekly quality audit across three cost tiers: SQL-only orphan/dup lint (free), graph-based edge-weakness lint (free), and LLM-assisted contradiction sampling (budget-capped). Read-only — produces a report, never mutates.
Also fixes P1-3: upsert dossier by entity_id, move timestamp to metadata.

Dossier thoughts now (a) compute and store an embedding so they are
retrievable via match_thoughts, matching the MCP capture flow in
server/index.ts, and (b) dedup by metadata.wiki_entity_id instead of
content fingerprint, so regenerating a wiki for the same entity refreshes
the existing row in place rather than accumulating duplicates. The
per-run timestamp now lives in metadata.generated_at, not the content
body, so upsert_thought's content-fingerprint dedup is not defeated.
…tion prereq

The README linked to schemas/entity-extraction/ and
integrations/entity-extraction-worker/, neither of which is in OB1 main.
Add a prominent warning at the top of the README and in the Prerequisites
block noting the companion PRs that must land first, remove broken
internal links to those pending paths, and add a deferred-follow-up
warning about the O(N) listBatchCandidates scalability concern for
large brains.
justfinethanku and others added 26 commits June 1, 2026 14:37
…recipe-editorial-policy

[recipes] Add editorial-policy + weekly auditor recipe
…an11/typed-edge-classifier-openrouter

[recipes/typed-edge-classifier] Add OpenRouter provider support
…aware-routing-openrouter-provider

docs(schema-aware-routing): document OpenRouter as alternative provider
…er-agent-identity

[schemas] Add per-agent identity primitive
…er/enhanced-thoughts-status-columns

[schemas] Make enhanced-thoughts self-contained for status columns
…ions/telegram-capture

[integrations] Add Telegram capture bot
…wicegood/dashboard-next-bump

[dashboards] open-brain-dashboard-next: bump next to 16.2.4
…estone/ob1-gate-v2

[docs] Refresh OB1 PR gate workflow registration
…estone/openclaw-tool-schemas

[integrations] Add OpenClaw Agent Memory tool schemas
…enclaw-memory-host-hooks

fix(openclaw-agent-memory): wire memory-host hooks for auto-recall (NateBJones-Projects#279)
…it/local-brain-no-mcp

[recipes] Add local-brain-no-mcp recipe + ob1-local-http skill
…napsynapse/canonical-landing-page

[dashboards] Canonical landing page for openbrain.fyi
@github-actions github-actions Bot added dashboard documentation Improvements or additions to documentation labels Jun 8, 2026
@alanshurafa

Copy link
Copy Markdown
Owner Author

Closing this fork-only test PR because it was dirty against alanshurafa/OB1:main. Superseded by clean fork-main-based CI simulation PR #33.

@alanshurafa alanshurafa closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dashboard documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.