RAG integration – Data & Embedding pipeline architecture #152

@Kadajett

Design the pipeline that turns Semfora's existing outputs (toon, sqlite, jsonl) into embeddings for Retrieval‑Augmented Generation (RAG).

Goals:

  • Generate embeddings from Semfora's lightweight outputs on client machines of unknown compute power.
  • Handle massive codebases via chunking, on‑disk vector stores, and incremental updates.
  • Keep embeddings up‑to‑date when files change or re‑indexing occurs.
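To make the chunking goal concrete, here is a minimal sketch of streaming a jsonl output into size-bounded chunks, so memory use stays flat on weak machines. The record fields `id` and `text` and the `max_chars` budget are assumptions for illustration; Semfora's actual jsonl schema may differ.

```python
import json
from typing import Iterable, Iterator


def chunk_records(lines: Iterable[str], max_chars: int = 2000) -> Iterator[dict]:
    """Group JSONL records into chunks of roughly max_chars characters.

    Assumes each record has hypothetical "id" and "text" fields.
    Records are streamed one at a time, so the whole file never has
    to fit in memory.
    """
    buf: list = []   # texts collected for the current chunk
    ids: list = []   # ids of the records in the current chunk
    size = 0         # total characters of record text in buf
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        text = rec["text"]
        # Flush the current chunk if adding this record would exceed the budget.
        # A single record longer than max_chars becomes its own oversized chunk.
        if buf and size + len(text) > max_chars:
            yield {"ids": ids, "text": "\n".join(buf)}
            buf, ids, size = [], [], 0
        buf.append(text)
        ids.append(rec["id"])
        size += len(text)
    if buf:
        yield {"ids": ids, "text": "\n".join(buf)}
```

Passing an open file handle as `lines` keeps this streaming. Keeping the source record ids on each chunk is what would let a later incremental update find and replace only the affected vectors.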

Deliverables:

  • Architecture diagram (Mermaid) linking Semfora indexing, chunking, embedding model, and vector DB.
  • Recommended embedding models (open‑source sentence‑transformers, OpenAI embeddings, etc.) and fallback strategies.
  • Strategy for incremental updates (hash‑based change detection, delta indexing).
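As a starting point for the incremental-update deliverable, hash-based change detection could look roughly like the sketch below. It compares a stored manifest (path to content hash) against a freshly computed one. The manifest shape and the function names `content_hash` and `diff_manifest` are hypothetical, not part of Semfora.

```python
from __future__ import annotations

import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def diff_manifest(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: hash} manifests (hypothetical format).

    Returns the paths that were added, changed, or removed between
    the previous index run (old) and the current one (new).
    """
    added = sorted(p for p in new if p not in old)
    changed = sorted(p for p in new if p in old and old[p] != new[p])
    removed = sorted(p for p in old if p not in new)
    return {"added": added, "changed": changed, "removed": removed}
```

Under this scheme, only `added` and `changed` paths would need re-embedding, and vectors for `changed` and `removed` paths would be deleted from the store. Unchanged files cost one hash per run and no model inference.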
