
Adaptive Knowledge Ingestion Pipeline #33

@Steake

Description

GödelOS Adaptive Ingestion, Emphasized Chunking, Mid-Range CPU Efficiency, Selectable Analysis Levels

Role: Senior systems agent.
Mission: Deliver an adaptive, CPU-only ingestion pipeline optimized for mid-range hardware (≈8 cores, 16 GB RAM) that ingests large PDFs and diverse text files, stores their embeddings in an existing custom vector DB, and builds a categorized knowledge graph consumed by the frontend. Ship a redesigned, persistent Jobs UI with granular progress and predictive ETAs. Make the custom vector DB effective (no duplicated ANN/search logic in the app).


Priorities (in order)

  1. Frontend Jobs UX (highest): persistent jobs (beyond modal), highly granular progress, predictive ETAs before starting, full job management, responsive across all viewports, clean visual design.

  2. Emphasized chunking strategy: layout/sentence-aware, semantically stable chunks tuned for downstream retrieval; parameterized by user-selectable analysis levels.

  3. Mid-range CPU efficiency: autotune threads/batches/queues for 4–16 cores and ~16 GB RAM; memory-safe; sustained throughput.

  4. Custom vector DB effectiveness: embeddings live and are searched in the DB; tighten schema & APIs; avoid re-implementing search/ANN client-side.

  5. End-to-end semantic integrity: PDF in → sensible chunking/embeddings → vectors stored → graph built from vector neighbors → frontend renders categorized nodes with labeled edges.


System Overview

```mermaid
graph TD
  U[User] --> UI[Frontend: Jobs & Graph Views]
  UI <-->|REST + WS| API[Ingestion & Graph API]
  API --> SCH[Scheduler + Autotuner]
  SCH --> EX[Extractor + Chunker]
  EX --> EMB[Embedder CPU]
  EMB --> VDB[(Custom Vector DB)]
  VDB --> KGB[Graph Builder kNN → edges]
  KGB --> KG[(Knowledge Graph Store)]
  KG --> UI
```

Selectable Analysis Levels (user picks before start)

| Level | Chunk Tokens | Overlap | Model (CPU) | k (Top-K) | Dedup Threshold | Extra Processing | Typical Use |
|-------|--------------|---------|-------------|-----------|-----------------|------------------|-------------|
| Fast | 650–800 | 60–90 | all-MiniLM-L6-v2 (ONNX/Int8) | 10 | simhash ≥ 0.92 | basic metadata | quick loads |
| Balanced | 750–900 | 100–120 | all-MiniLM-L6-v2 (or MPNet if RAM > 12 GB) | 15 | simhash ≥ 0.88 | heading/keyword tags → lightweight concepts | most docs |
| Deep | 500–700 | 120–160 | all-mpnet-base-v2 (only if Autotuner OK) | 20 | simhash ≥ 0.85 | richer tagging; tighter neighbor threshold | high recall |
  • Frontend: level selector in preflight; show ETA p50/p90 per level so users can choose speed vs depth.

  • Backend: level drives chunk sizing, embedding model, dedup, and kNN parameters; a configuration sketch follows below.
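
A minimal sketch of how these presets might be carried through the backend, assuming a simple dataclass registry (the AnalysisLevel name and field set are illustrative, not the shipped schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisLevel:
    """Per-level knobs driving chunking, embedding, dedup, and kNN."""
    name: str
    chunk_tokens: tuple[int, int]    # (min, max) tokens per chunk
    overlap_tokens: tuple[int, int]  # (min, max) overlap between chunks
    model: str                       # CPU embedding model id
    top_k: int                       # neighbors fetched per chunk
    dedup_threshold: float           # simhash similarity cutoff

# Values taken from the table above. Balanced may upgrade to MPNet
# when RAM > 12 GB, per the table.
LEVELS = {
    "fast":     AnalysisLevel("fast",     (650, 800), (60, 90),   "all-MiniLM-L6-v2",  10, 0.92),
    "balanced": AnalysisLevel("balanced", (750, 900), (100, 120), "all-MiniLM-L6-v2",  15, 0.88),
    "deep":     AnalysisLevel("deep",     (500, 700), (120, 160), "all-mpnet-base-v2", 20, 0.85),
}
```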


Chunking Strategy (emphasis)

  1. Layout/sentence aware: for PDFs, prefer block/heading/paragraph segmentation; fall back to plain-text sentence windows.

  2. Token windows: apply level-specific Chunk Tokens and Overlap (table above).

  3. Stability under edits: avoid chunks that straddle heading boundaries; keep references with their paragraph.

  4. Deduplication: simhash/minhash before embedding; skip duplicates and upsert metadata instead (see the simhash sketch after the flowchart below).

  5. Quality signals: per-chunk token count, punctuation ratio, heading proximity; store as metadata.

  6. Batching: dynamic batch sizing (16→64) via Autotuner.

```mermaid
flowchart TD
  A[File] --> B[Layout & Sentence Parse]
  B --> C{Chunk Windowing level params}
  C --> D[Dedup simhash/minhash]
  D -->|unique| E[Embed CPU]
  E --> F[(Vector DB Upsert)]
  F --> G[Top-K per Chunk]
  G --> H[Graph Builder nodes/edges]
```
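
A sketch of the dedup step, assuming the classic sign-sum simhash construction over token hashes (function names are illustrative):

```python
import hashlib

def simhash(tokens: list[str], bits: int = 64) -> int:
    """Classic simhash: sign-sum of per-token hash bits."""
    weights = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def simhash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Fraction of matching bits; compared against the level's dedup threshold
    (e.g. >= 0.88 for Balanced means the chunk is treated as a duplicate)."""
    return 1.0 - bin(a ^ b).count("1") / bits
```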

Mid-Range CPU Efficiency (autotune)

  • Inputs: logical cores, OS/cgroup limits, free RAM, I/O, stage latencies.

  • Controls: workers, batch size, queue depth, spill thresholds.

  • Policy (sketched in code after this list):

    • Start num_workers = min(cores−2, 8); adjust ±1 every 5–10 s.

    • Keep the working set ≤ 12 GB; spill to disk at 85% RSS.

    • Grow/shrink batch size dynamically to maintain throughput.

    • Apply backpressure to producer queues when memory pressure is high.
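
A sketch of this policy, assuming psutil is available for RSS introspection; the class shape and tick cadence are illustrative, not the shipped Autotuner:

```python
import os
import psutil  # assumed dependency for memory introspection

class Autotuner:
    """Adjusts workers and batch size from live resource signals (policy above)."""

    def __init__(self) -> None:
        cores = os.cpu_count() or 4
        self.num_workers = min(cores - 2, 8)  # starting point from the policy
        self.batch_size = 16                  # grows toward 64 under headroom

    def tick(self, throughput_delta: float) -> None:
        """Call every 5-10 s with the recent change in chunks/sec."""
        rss_frac = psutil.Process().memory_info().rss / psutil.virtual_memory().total
        if rss_frac > 0.85:
            # Past the spill threshold: shed load before the OS swaps.
            self.batch_size = max(16, self.batch_size // 2)
            self.num_workers = max(1, self.num_workers - 1)
        elif throughput_delta >= 0:
            # Still improving: probe upward within the 16-64 batch band.
            self.batch_size = min(64, self.batch_size + 8)
        else:
            # Throughput regressed: back off one worker.
            self.num_workers = max(1, self.num_workers - 1)
```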


Custom Vector DB — make it effective

  • Contract: fixed dimension & metric validated on connect; idempotent upsert keyed on hash_sha1; batch upsert; Top-K search with filters; stats endpoints (interface sketch below).

  • Performance: memory-mapped vectors, contiguous arrays, adjustable search params, thread-pool aware.

  • No duplication: ANN/search logic stays inside the DB.
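
One way the contract could be pinned down as a typed interface. This Protocol is hypothetical, a sketch of the shape rather than the actual GödelOS vector DB client:

```python
from typing import Any, Protocol, Sequence

class VectorDB(Protocol):
    """Hypothetical client contract; method names and shapes are illustrative."""

    def connect(self, dim: int, metric: str) -> None:
        """Fail fast if the store's fixed dimension or metric disagrees."""
        ...

    def upsert(self, items: Sequence[dict[str, Any]]) -> int:
        """Batch upsert keyed on hash_sha1; re-sending a known hash updates
        metadata only (idempotent). Returns the count of newly written vectors."""
        ...

    def search(self, vector: Sequence[float], k: int,
               filters: dict[str, Any] | None = None) -> list[dict[str, Any]]:
        """Top-K similarity search with optional metadata filters. All ANN
        logic stays behind this call; the app never re-implements it."""
        ...
```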


Knowledge Graph from vector neighbors

  • Nodes: Document, Chunk, Concept.

  • Edges: CONTAINS, SIMILAR_TO, TAGGED_AS.

  • Threshold τ: adapted from the observed similarity distribution rather than fixed (see the builder sketch below).

  • Expose via: GET /api/graph/{docId}.
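
A sketch of edge construction against the hypothetical VectorDB interface above, assuming τ is taken as a high percentile of observed neighbor similarities (the 90th percentile here is an assumption, not a spec value):

```python
import numpy as np

def build_similar_edges(db, chunk_ids, vectors, k: int):
    """Derive SIMILAR_TO edges from vector-DB neighbors; `db` follows the
    hypothetical VectorDB contract and hits are {"id", "score"} dicts."""
    hits = [db.search(v, k=k + 1) for v in vectors]  # +1 so we can drop the self-hit
    sims = np.array([h["score"] for hs in hits for h in hs])
    tau = float(np.percentile(sims, 90))             # adaptive threshold
    edges = []
    for cid, hs in zip(chunk_ids, hits):
        for h in hs:
            if h["id"] != cid and h["score"] >= tau:
                edges.append((cid, "SIMILAR_TO", h["id"], h["score"]))
    return edges
```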


Frontend — Jobs UI

  • Jobs page: status pills, overall & per-stage bars, ETA, actions.

  • Job detail: preflight predictions (per level), live telemetry, outputs.

  • Responsive: grid layout ≥1024px, stacked cards <768px.

Preflight prediction

  • Tokenize sample (1–2 MB) to estimate tokens/MB and chunk count.

  • Micro-benchmark embedding to estimate chunks/sec and ETA p50/p90.

  • Present ETAs for all levels before starting (estimator sketch below).
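
A back-of-envelope form of the estimator, assuming throughput scales linearly from the micro-benchmark; the 1.4× p50→p90 spread is an assumed margin, not a measured constant:

```python
def preflight_eta(sample_bytes: int, sample_tokens: int, file_bytes: int,
                  bench_chunks_per_sec: float, chunk_tokens: int) -> tuple[float, float]:
    """Estimate (p50, p90) ETA in seconds from a 1-2 MB sample plus a short
    embedding micro-benchmark."""
    tokens_per_byte = sample_tokens / sample_bytes
    est_chunks = (file_bytes * tokens_per_byte) / chunk_tokens
    p50 = est_chunks / bench_chunks_per_sec
    return p50, p50 * 1.4  # assumed p90 spread
```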


APIs

  • POST /api/import/preflight

  • POST /api/import/jobs, with pause|resume|cancel job actions

  • GET /api/import/jobs

  • GET /api/graph/{docId}

  • GET /api/health
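
For illustration, a preflight call might look like the following; the endpoint is from the list above, but the payload and response fields are assumptions, not a finalized schema:

```python
import requests

# Hypothetical payload: a file reference plus the candidate analysis levels.
resp = requests.post(
    "http://localhost:8000/api/import/preflight",
    json={"file_id": "doc-123", "levels": ["fast", "balanced", "deep"]},
    timeout=30,
)
resp.raise_for_status()

# Assumed response shape: {"etas": {"fast": {"p50": 120, "p90": 168}, ...}}
for level, eta in resp.json().get("etas", {}).items():
    print(f"{level}: p50={eta['p50']}s p90={eta['p90']}s")
```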

Testing & Acceptance

  • Importing a ≥300 MB PDF on an 8-core/16 GB host completes without OOM.

  • Vectors stored in DB, graph built, frontend shows categorized nodes/edges.

  • Preflight ETA is accurate to within ±25% once a job has run for 2 minutes.

  • Jobs UI persists across page reloads and stays responsive across devices.

  • Vector DB is the single source of embeddings/search.

  • Semantic sanity: self-query MRR@10 ≥ 0.6 (metric sketch below); spot-check that retrieved neighbors are relevant.
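
A minimal MRR@10 sketch for the self-query check, assuming each query is derived from a chunk (e.g. its heading or lead sentence) and the source chunk is the expected hit:

```python
def mrr_at_10(ranked_ids: list[list[str]], expected_ids: list[str]) -> float:
    """Mean reciprocal rank over self-queries: for each query, score 1/rank of
    the expected chunk within the top 10 results, else 0."""
    total = 0.0
    for hits, target in zip(ranked_ids, expected_ids):
        for rank, hit in enumerate(hits[:10], start=1):
            if hit == target:
                total += 1.0 / rank
                break
    return total / len(expected_ids)
```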


Deliverables

  • Adaptive ingestion workers + Autotuner.

  • Level-driven chunking.

  • Tightened vector DB contract.

  • Graph builder using DB neighbors.

  • Persistent, responsive Jobs UI.

  • Markdown docs with mermaid diagrams.

Labels

bug (Something isn't working), documentation (Improvements or additions to documentation), enhancement (New feature or request)
