[integrations] Smart ingest edge function by alanshurafa · Pull Request #10 · alanshurafa/OB1

alanshurafa · 2026-04-06T16:16:00Z

Summary

Standalone Supabase Edge Function for LLM-powered atomic thought extraction from raw text
Ported from ExoCortex production smart-ingest pipeline (1,369 completed jobs) with OB1 adaptations
Depends on schemas/smart-ingest-tables (PR [schemas] Smart ingest pipeline tables #4) for ingestion_jobs and ingestion_items tables

What It Does

Accepts raw text via HTTP POST, extracts atomic thoughts using an LLM (OpenRouter primary, OpenAI/Anthropic fallback), then deduplicates each thought against existing brain content using both SHA-256 content fingerprinting and pgvector semantic similarity. Four reconciliation actions: add, skip, append_evidence, create_revision.
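As an illustration of the fingerprinting half of the dedup step, here is a minimal sketch of a SHA-256 content fingerprint built on the Web Crypto API. The normalization rule (trim, lowercase, collapse whitespace) and both function names are assumptions for illustration; the PR description does not show how `_shared/helpers.ts` normalizes text before hashing.

```typescript
// Hypothetical normalization: the real helper may normalize differently.
function normalizeForFingerprint(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Returns the SHA-256 digest of the normalized text as a 64-char hex string.
async function contentFingerprint(text: string): Promise<string> {
  const bytes = new TextEncoder().encode(normalizeForFingerprint(text));
  const digest = await globalThis.crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}
```

Under this assumed normalization, two inputs that differ only in case or spacing produce the same fingerprint, so an exact-duplicate check can run before any embedding call is made.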

Key Features

Dry-run mode — preview extractions without writing to the database
Job execution — commit dry-run results via /execute endpoint
Quality gate — minimum 30 chars, minimum importance 3
Fingerprint + semantic dedup — 0.85 match threshold, 0.92 skip threshold
Source metadata threading — import_key session dedup, capture provenance
Text chunking — handles long documents (5000 word limit per LLM call)
Sensitivity pre-flight — blocks restricted content from cloud processing
Entity extraction trigger — optional, best-effort (non-fatal if worker not deployed)
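The quality gate and the two similarity thresholds listed above could be combined roughly as follows. This is a sketch under stated assumptions: the mapping of the 0.85 to 0.92 band to `append_evidence` is a guess, the choice between `append_evidence` and `create_revision` is not modelled, and all names here are hypothetical rather than taken from `index.ts`.

```typescript
type Action = "add" | "skip" | "append_evidence" | "create_revision";

interface Candidate {
  content: string;
  importance: number;
}

interface BestMatch {
  fingerprintMatch: boolean; // exact fingerprint hit on an existing thought
  similarity: number; // similarity to the nearest existing thought, 0..1
}

const MIN_CHARS = 30;
const MIN_IMPORTANCE = 3;
const MATCH_THRESHOLD = 0.85;
const SKIP_THRESHOLD = 0.92;

// Quality gate: drop thoughts that are too short or not important enough.
function passesQualityGate(c: Candidate): boolean {
  return c.content.trim().length >= MIN_CHARS && c.importance >= MIN_IMPORTANCE;
}

// Assumed mapping: near-identical => skip, close match => append_evidence.
// A real pipeline may also return "create_revision"; that is not modelled here.
function reconcile(match: BestMatch | null): Action {
  if (match === null) return "add";
  if (match.fingerprintMatch || match.similarity >= SKIP_THRESHOLD) return "skip";
  if (match.similarity >= MATCH_THRESHOLD) return "append_evidence";
  return "add";
}
```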

OB1 Adaptations

OpenRouter-first LLM provider order (reversed from ExoCortex)
Wildcard CORS for generic deployments
Model constants from _shared/config.ts (consistent with enhanced-mcp)
_shared/ helpers copied from enhanced-mcp (PR [integrations] Enhanced MCP server with alpha tool suite #6) for consistency

Files

All within integrations/smart-ingest/:

| File | Lines | Purpose |
| --- | --- | --- |
| `index.ts` | 1094 | Edge function with extraction, dedup, and execution logic |
| `_shared/helpers.ts` | 770 | Shared utilities (embedding, fingerprint, sensitivity, payload prep) |
| `_shared/config.ts` | 204 | Constants, types, prompts |
| `README.md` | 225 | Setup guide with prerequisites, steps, API reference, troubleshooting |
| `metadata.json` | 18 | OB1 contribution metadata |
| `deno.json` | 5 | Deno import map |

Test plan

Verify all 15 gate checks pass via gh pr checks
Validate metadata.json against .github/metadata.schema.json
Confirm README contains: "prerequisites", numbered steps, "expected outcome"
Confirm "05-tool-audit" string appears in README
Confirm all relative links resolve (../../docs/01-getting-started.md, ../../docs/05-tool-audit.md)
Confirm no files outside integrations/smart-ingest/
Deploy to test Supabase project and smoke-test dry-run + execute flow

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 661fe55dc6

chatgpt-codex-connector · 2026-04-06T16:19:59Z

+    if (!response.ok) {
+      throw new Error(`OpenRouter embedding failed (${response.status}): ${await response.text()}`);
+    }


Fall back to OpenAI when OpenRouter embedding fails

embedText advertises OpenRouter-primary/OpenAI-fallback behavior, but this branch throws immediately on any OpenRouter non-2xx response, so the OpenAI branch is never attempted when both keys are configured. In production, transient OpenRouter 429/5xx errors will cause ingestion reconciliation to fail (or lose embeddings) even though a healthy fallback provider is available; catch this failure and continue to the OpenAI path instead of hard-failing here.
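One way to get the behavior this comment asks for is to try providers in order and only fail once all of them have failed. The sketch below is an assumption about shape, not the PR's `embedText`: the provider list, `EmbedFn`, and `embedWithFallback` are hypothetical names, and the providers are injected so the logic can be exercised without network access.

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

interface EmbeddingProvider {
  name: string;
  embed: EmbedFn;
}

// Tries each provider in order; a thrown error moves on to the next one.
async function embedWithFallback(
  text: string,
  providers: EmbeddingProvider[],
): Promise<{ provider: string; embedding: number[] }> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { provider: p.name, embedding: await p.embed(text) };
    } catch (err) {
      // Record the failure and continue instead of hard-failing.
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All embedding providers failed: ${errors.join("; ")}`);
}
```

With this structure a transient 429 or 5xx from the first provider surfaces only in the collected error list, and the caller still receives an embedding if a later provider succeeds.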

chatgpt-codex-connector · 2026-04-06T16:19:59Z

+    : null;
+
+  for (const item of items) {
+    if (item.action === "skip") { skippedCount++; continue; }


Mark skipped dry-run items executed in /execute

In handleExecuteJob, skip actions are counted and immediately continued, but the corresponding ingestion_items row is never updated. Because dry-run persistence stores pending items as ready, these rows stay ready even after the job is marked complete, leaving job state inconsistent and potentially misleading any UI/automation that interprets ready as unprocessed. Update skipped rows to executed before continuing (as the immediate-execution path already does).
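The suggested fix can be sketched as a loop that closes out every row, including skipped ones. This is an in-memory illustration only: `executeItems`, `applyWrite`, and `markExecuted` are hypothetical names, and in the real function the status change would be a database update on `ingestion_items`.

```typescript
type ItemStatus = "ready" | "executed" | "failed";
type ItemAction = "add" | "skip" | "append_evidence" | "create_revision";

interface IngestionItem {
  id: string;
  action: ItemAction;
  status: ItemStatus;
}

interface ExecuteResult {
  executedCount: number;
  skippedCount: number;
}

function executeItems(
  items: IngestionItem[],
  applyWrite: (item: IngestionItem) => void,
  markExecuted: (item: IngestionItem) => void,
): ExecuteResult {
  let executedCount = 0;
  let skippedCount = 0;
  for (const item of items) {
    if (item.action === "skip") {
      markExecuted(item); // close out the row so it does not stay "ready"
      skippedCount++;
      continue;
    }
    applyWrite(item);
    markExecuted(item);
    executedCount++;
  }
  return { executedCount, skippedCount };
}
```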

Keep the public contribution contract unchanged but make the maintainer-local execution layer legible to future agents, and stop tracking local-only overlays that have no business landing upstream.
- CLAUDE.md: add Local GSD Execution Layer section pointing at .planning/
- .gitignore: add .local/, .agent/, .claude.json, __pycache__/

- dry_run now uses peekQueueItems() (read-only SELECT) instead of claimQueueItems(), so items stay "pending" during preview runs
- claimQueueItems() returns only rows actually claimed via .select(), preventing race conditions where concurrent workers see stale results
- markError() clears started_at and worker_version when resetting to "pending" so retryable items don't appear stale in monitoring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Why: Schema stores thoughts.id / entity_extraction_queue.thought_id as UUID (gen_random_uuid()), not BIGINT. TypeScript types were declaring number, which is a load-bearing lie: PostgREST returns UUIDs as strings, and any arithmetic or Number-coerce on a consumer path would produce NaN. Updated claimQueueItems, peekQueueItems, markComplete, markError, linkThoughtEntity signatures to string. Entity IDs remain number (BIGSERIAL).

Why: The worker had no upper bound on LLM spend — a misconfigured cron on a large queue could mint unbounded OpenRouter/OpenAI/Anthropic cost before anyone noticed. Added ENTITY_EXTRACTION_MAX_CALLS env (default 10000, 0 = unlimited), a module-scoped llmCallCount counter, a pre-call gate that throws ExtractionCostCapError when the cap is reached, and graceful abort in the main loop that returns remaining claimed rows to 'pending' so the next invocation can resume. Summary now reports truncated / truncated_reason / llm_calls so callers can observe the cap firing.
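A pre-call gate of the kind described might look like the sketch below. `ExtractionCostCapError`, the default of 10000, and the `0 = unlimited` rule come from the commit message; `parseMaxCalls` and `createCallGate` are hypothetical names, and the counter lives in a closure here (rather than at module scope as the message describes) only so the example is easy to test.

```typescript
const DEFAULT_MAX_CALLS = 10_000;

class ExtractionCostCapError extends Error {
  constructor(cap: number) {
    super(`LLM call cap of ${cap} reached`);
    this.name = "ExtractionCostCapError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Parses the raw env value; falls back to the default on missing or invalid input.
function parseMaxCalls(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_MAX_CALLS;
  const n = Number(raw);
  return Number.isInteger(n) && n >= 0 ? n : DEFAULT_MAX_CALLS;
}

function createCallGate(maxCalls: number) {
  let llmCallCount = 0;
  return {
    // Call immediately before each LLM request; 0 means unlimited.
    beforeCall(): void {
      if (maxCalls > 0 && llmCallCount >= maxCalls) {
        throw new ExtractionCostCapError(maxCalls);
      }
      llmCallCount++;
    },
    calls(): number {
      return llmCallCount;
    },
  };
}
```

In a worker, the value passed to `parseMaxCalls` would come from the `ENTITY_EXTRACTION_MAX_CALLS` environment variable, and the main loop would catch `ExtractionCostCapError` to return the remaining claimed rows to 'pending'.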

Add entities, edges, thought_entities, entity_extraction_queue, and consolidation_log tables for automatic entity/relationship extraction from thoughts. A trigger on the thoughts table enqueues new/updated rows for an async worker (shipped separately in integrations/entity-extraction-worker/). Positioned as the extraction-side complement to recipes/ob-graph/ — the two schemas are independent; ob-graph is a manual 2-table graph, this is an automatic extraction pipeline with evidence-bearing links. Part of the OB1 alpha milestone.

Base OB1 thoughts.id is UUID (gen_random_uuid()), not BIGINT. Fixed thought_entities.thought_id, entity_extraction_queue.thought_id, consolidation_log.survivor_id, and consolidation_log.loser_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…mparison

Why: The comparison table advertised the UUID/BIGSERIAL mismatch as a design choice, but it was the BLOCKER-1 bug: thought_id FKs into upstream thoughts.id (UUID) could not be BIGINT. With the UUID alignment restored via 495a183 (cherry-picked), the comparison row is stale and misleading; drop it.

…correct MCP posture

Why: The anon/authenticated SELECT GRANTs with no RLS exposed the entire entity graph, evidence excerpts, queue errors, and consolidation audit trail to anyone with the project URL, the same pattern as the Wave 1 PR-A BLOCKER-2. MCP tools on stock OB1 use the service-role key from an Edge Function server-side, so the "MCP needs anon read access" justification (HIGH-4) is factually wrong.

Changes:
- Drop all GRANT SELECT ... TO anon on the five tables.
- Enable RLS on entities, edges, thought_entities, entity_extraction_queue, consolidation_log.
- Add explicit service_role FOR ALL policies (legibility; role bypasses RLS anyway) and a minimum authenticated SELECT USING (true) policy as a multi-tenant scaffold. No anon policy.
- Keep service_role full GRANTs and narrow authenticated to SELECT.
- Rewrite README Expected Outcome to describe the actual RLS + service-role-only posture, removing the incorrect MCP claim.

…eader

Why: HIGH-1: the trigger reads NEW.content_fingerprint unconditionally, so on a brain that skipped Step 2.6 the migration would install silently and then crash the first INSERT INTO thoughts, breaking the primary write path. Adding an information_schema precheck at the top of the file hard-fails at install time with a clear remediation pointer, which also covers MEDIUM-4 (the backfill reads the same column).

INFO-1: the top-of-file comment was a renaming artifact; rename "Knowledge Graph Tables" to "Entity Extraction Tables" while we're editing the header block.

Why: entity_extraction_queue and consolidation_log are both unbounded with no documented retention strategy. The README positions this as a production schema for long-running brains, which need operational guidance. Add a "Pruning and Retention" section with terminal-state DELETEs for the queue (safe because the trigger re-queues on edit) and a simple age-based DELETE for the consolidation log. No schema change -- retention stays an operator choice.

Why: The security posture of this schema depends on the RLS pattern documented in primitives/rls/. Declaring the dependency in metadata makes the gate's Rule 10 linkage visible and tells downstream indexers that this schema consumes the rls primitive.

Why: Deno's global fetch has no body-read timeout on Supabase Edge. A stuck OpenRouter / OpenAI / Anthropic upstream would hang the invocation until the 150s platform wall-clock killed it, leaving claimed rows in 'processing' with no status update. Added fetchWithTimeout() helper that wraps AbortController around fetch, defaulting to 60s and overridable via FETCH_TIMEOUT_MS. All three LLM call sites in extractEntities now route through it. Timeouts surface as thrown errors, caught by the per-item try/catch, and flow through markError — so the standard retry-then-fail path handles them (reset to 'pending' on attempt < 5, 'failed' on cap).
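A timeout wrapper along these lines could be sketched as below. The commit names the helper `fetchWithTimeout()` but does not show its signature, so `fetchTextWithTimeout`, its parameters, and the injectable `fetchImpl` are assumptions. This variant reads the body inside the timed section so that a stalled body read is also aborted; reading `FETCH_TIMEOUT_MS` from the environment is left to the caller.

```typescript
interface TimedResult {
  ok: boolean;
  status: number;
  body: string;
}

async function fetchTextWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = 60_000,
  fetchImpl: typeof fetch = fetch, // injectable for testing (assumption)
): Promise<TimedResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, { ...init, signal: controller.signal });
    // Read the body before clearing the timer so the body read is bounded too.
    const body = await response.text();
    return { ok: response.ok, status: response.status, body };
  } finally {
    clearTimeout(timer);
  }
}
```

A timeout surfaces as a rejected promise, so a per-item try/catch around the call can route it into the same retry path as any other upstream failure.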

Why: Thought content was interpolated directly into the LLM prompt, giving any captured text (emails, browser history, Slack dumps) a direct channel to override the extraction instructions, which would then flow unescaped into entities.canonical_name, a TEXT column rendered by dashboards and MCP tools (stored XSS vector). Four layered defenses:
1. Wrap content in <thought_content>...</thought_content> tags and tell the model explicitly that content inside is untrusted data, not instructions.
2. Escape literal tag occurrences in content so an attacker can't forge a close-tag and break out of the wrapper.
3. Enforce response_format: { type: "json_object" } on OpenRouter (OpenAI already had it) so prose wrapping doesn't crash the JSON parser.
4. sanitizeEntityName() strips control chars and clips to 200 chars before entities land in the DB, which caps the blast radius of a surviving injection.
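The wrapping, escaping, and sanitizing defenses can be sketched as follows. `sanitizeEntityName` and the `<thought_content>` tag are named in the commit message; `escapeWrapperTags`, `wrapThought`, the exact escape format, and the control-character range are assumptions for illustration.

```typescript
const OPEN_TAG = "<thought_content>";
const CLOSE_TAG = "</thought_content>";

// Neutralize any literal wrapper tags inside untrusted content.
function escapeWrapperTags(content: string): string {
  return content.replace(
    /<\s*(\/?)\s*thought_content\s*>/gi,
    "&lt;$1thought_content&gt;",
  );
}

// Wrap untrusted content so the prompt can refer to it as data, not instructions.
function wrapThought(content: string): string {
  return `${OPEN_TAG}\n${escapeWrapperTags(content)}\n${CLOSE_TAG}`;
}

// Strip control characters and clip to 200 chars before storing a name.
function sanitizeEntityName(name: string): string {
  return name.replace(/[\u0000-\u001F\u007F-\u009F]/g, "").trim().slice(0, 200);
}
```

Because forged close-tags are escaped before wrapping, the wrapped string contains the real closing tag exactly once, at the end.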

Why: Supabase Edge Functions hard-kill at 150s. With limit=50 and LLM calls averaging 3s each, cumulative latency alone could exceed the budget, killing the invocation mid-loop and leaving rows stuck in 'processing' with no recovery except the manual SQL from the README. Added startTime at invocation, INVOCATION_BUDGET_MS = 140000 (10s headroom under the 150s limit), and a per-item gate that releases remaining claimed rows back to 'pending' so the next invocation picks them up. Surfaces as summary.truncated=true with truncated_reason='wall_clock_budget', plus elapsed_ms for monitoring.
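The per-item gate can be sketched like this. `INVOCATION_BUDGET_MS`, `startTime`, and the summary field names follow the commit message; `processWithBudget`, the injected clock, and the synchronous `handle` callback are simplifying assumptions (a real worker would await an LLM call per item).

```typescript
const INVOCATION_BUDGET_MS = 140_000;

interface RunSummary {
  processed: number;
  released: number;
  truncated: boolean;
  truncated_reason?: "wall_clock_budget";
  elapsed_ms: number;
}

function processWithBudget<T>(
  claimed: T[],
  handle: (item: T) => void,
  releaseToPending: (items: T[]) => void,
  now: () => number = Date.now,
  budgetMs: number = INVOCATION_BUDGET_MS,
): RunSummary {
  const startTime = now();
  let processed = 0;
  for (let i = 0; i < claimed.length; i++) {
    // Check the budget before starting each item, not after.
    if (now() - startTime >= budgetMs) {
      const remaining = claimed.slice(i);
      releaseToPending(remaining);
      return {
        processed,
        released: remaining.length,
        truncated: true,
        truncated_reason: "wall_clock_budget",
        elapsed_ms: now() - startTime,
      };
    }
    handle(claimed[i]);
    processed++;
  }
  return { processed, released: 0, truncated: false, elapsed_ms: now() - startTime };
}
```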

Why: The knowledge-graph schema's queue_entity_extraction trigger re-queues a thought when its content changes. The worker then re-runs extraction but only upserted new thought_entities rows — it never deleted links from the prior extraction. So editing a thought from {Alice, Bob, PostgreSQL} to {Alice, Redis} ended up with the thought linked to all four entities. Over time this silently corrupts the graph: edges.support_count inflates with thoughts that no longer mention the underlying entity. Delete our own prior links (scoped to source='entity_worker' so we don't clobber links from other sources) before re-writing. Non-fatal if DELETE fails — we still attempt the upserts, because missed extraction is worse than drift.
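The delete-then-rewrite step can be illustrated in memory with the Alice/Bob/PostgreSQL example from the message. `replaceWorkerLinks` and the `ThoughtEntityLink` shape are hypothetical; the real worker would issue a scoped DELETE followed by upserts against `thought_entities`.

```typescript
interface ThoughtEntityLink {
  thoughtId: string; // UUID of the thought
  entityId: number; // BIGSERIAL entity id
  source: string; // e.g. "entity_worker" or another producer
}

const WORKER_SOURCE = "entity_worker";

// Drops this thought's prior worker-owned links, then adds the fresh set.
// Links from other sources and other thoughts are left untouched.
function replaceWorkerLinks(
  existing: ThoughtEntityLink[],
  thoughtId: string,
  newEntityIds: number[],
): ThoughtEntityLink[] {
  const kept = existing.filter(
    (l) => !(l.thoughtId === thoughtId && l.source === WORKER_SOURCE),
  );
  const fresh = newEntityIds.map((entityId) => ({
    thoughtId,
    entityId,
    source: WORKER_SOURCE,
  }));
  return [...kept, ...fresh];
}
```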

Why: The code fixes for BLOCKER-3 (ENTITY_EXTRACTION_MAX_CALLS), BLOCKER-4 (FETCH_TIMEOUT_MS), and WARNING-2 (wall-clock budget) added env knobs and summary fields (truncated, truncated_reason, llm_calls, elapsed_ms) that weren't surfaced anywhere users would see. Also documented the 'skipped' queue state per INFO-1 — the worker marks system-generated thoughts (metadata.generated_by) as skipped, and until now only the schema comment knew that. Added queue status reference and a short note that dry-run leaves the queue untouched.

Scheduled script that queries the past N days of thoughts, paginates and ranks by importance, synthesizes with an LLM, and delivers to Telegram or stdout. Filters restricted/personal by default; opt in with --include-personal.

Generates per-entity markdown wiki pages by aggregating thought_entities links and synthesizing with an LLM. Three output modes (file, entity-metadata, thought) let users choose between filesystem, graph metadata, or thought store — with the pollution trade-offs of the last documented.

Adds thought_edges table (supports, contradicts, supersedes, evolved_into, depends_on, related_to) plus valid_from/valid_until/decay_weight columns on the existing entity edges table. Documents open design questions around the supersedes overlap with provenance-chains.

github-actions Bot added the integration label Apr 6, 2026

chatgpt-codex-connector Bot reviewed Apr 6, 2026

github-actions Bot added the documentation and recipe labels Apr 6, 2026

Reb-Elle-Art and others added 18 commits April 17, 2026 14:56

[integrations] Add Telegram capture bot

41be46e

Merge remote-tracking branch 'origin/main'

28a3dc5

[integrations] Entity extraction worker

221ad58

alanshurafa force-pushed the contrib/alanshurafa/smart-ingest branch from be2136a to 2452cb9 Compare April 18, 2026 02:44

github-actions Bot added the dashboard, extension, and primitive labels Apr 18, 2026

alanshurafa added 3 commits April 17, 2026 23:32

justfinethanku and others added 29 commits June 4, 2026 13:52

docs: add community contribution credit

3f8521e

Merge remote-tracking branch 'origin/main' into pr-321-update

0144dec

docs: add community contribution credit

2c0bfe4

Merge pull request NateBJones-Projects#273 from jjshanks/docs/schema-…

17afc50

…aware-routing-openrouter-provider docs(schema-aware-routing): document OpenRouter as alternative provider

Merge pull request NateBJones-Projects#321 from jeremylahners/codex/p…

b782065

…er-agent-identity [schemas] Add per-agent identity primitive

Merge remote-tracking branch 'origin/main' into pr-190-refresh

ede09bf

Merge remote-tracking branch 'origin/main' into pr-305-update

a1bfcb4

Merge pull request NateBJones-Projects#305 from lucifer/contrib/lucif…

14c39d2

…er/enhanced-thoughts-status-columns [schemas] Make enhanced-thoughts self-contained for status columns

Merge pull request NateBJones-Projects#190 from Reb-Elle-Art/integrat…

004951b

…ions/telegram-capture [integrations] Add Telegram capture bot

Merge branch 'main' into contrib/tswicegood/dashboard-next-bump

24d8109

Merge pull request NateBJones-Projects#247 from tswicegood/contrib/ts…

3f18e27

…wicegood/dashboard-next-bump [dashboards] open-brain-dashboard-next: bump next to 16.2.4

Merge branch 'main' into contrib/humestone/ob1-gate-v2

3f04ffa

Merge pull request NateBJones-Projects#308 from Humestone/contrib/hum…

f1b71d6

…estone/ob1-gate-v2 [docs] Refresh OB1 PR gate workflow registration

Merge pull request NateBJones-Projects#309 from Humestone/contrib/hum…

6087b12

…estone/openclaw-tool-schemas [integrations] Add OpenClaw Agent Memory tool schemas

Merge pull request NateBJones-Projects#281 from MicScalise/fix-279-op…

6acdff4

…enclaw-memory-host-hooks fix(openclaw-agent-memory): wire memory-host hooks for auto-recall (NateBJones-Projects#279)

Merge pull request NateBJones-Projects#304 from dhanjit/contrib/dhanj…

8eb8baf

…it/local-brain-no-mcp [recipes] Add local-brain-no-mcp recipe + ob1-local-http skill

Merge pull request NateBJones-Projects#258 from snapsynapse/contrib/s…

fa405fd

…napsynapse/canonical-landing-page [dashboards] Canonical landing page for openbrain.fyi

[docs] Fix Telegram README Markdownlint

74628a6

Update Alan maintainer credit

170ad6d

Merge branch 'main' into codex/telegram-markdownlint-fix

6844e13

Merge pull request NateBJones-Projects#345 from alanshurafa/codex/tel…

4ba44a3

…egram-markdownlint-fix [docs] Fix Telegram README Markdownlint

docs: add community credit to smart ingest

624e080

docs: add community credit to Hermes Agent Memory

0e8249e

Merge branch 'main' into contrib/spiritualsystems/fix-typed-edges-com…

0319a6f

…ment-syntax

Merge pull request NateBJones-Projects#227 from spiritualsystems/cont…

4731d65

…rib/spiritualsystems/fix-typed-edges-comment-syntax [schemas] Fix invalid || concatenation in typed-reasoning-edges COMMENT

docs: remove stale Hermes license reference

4216d46

Merge branch 'main' into add-hermes-agent-memory

11b6407

Merge pull request NateBJones-Projects#280 from MicScalise/add-hermes…

da47d25

…-agent-memory feat(integrations): add hermes-agent-memory native provider for OB1

Merge branch 'main' into contrib/alanshurafa/smart-ingest

472b533

alanshurafa merged commit 5d3f6ab into main Jun 12, 2026
1 check passed

alanshurafa merged 246 commits into main from contrib/alanshurafa/smart-ingest

20 participants
