Chat with your data. Forge a living, trainable corpus that makes any model smarter.
- Chat with your data. Build a living corpus. Point corpus-forge at your notes, code, PDFs, chat history, audio, and video, and you get a searchable index that grows as you (or an AI assistant) curate it. The corpus is the product — and it's the upstream of every training run.
- Training data is the deliverable. A HuggingFace-Datasets-format export of your text + chat sources, deduplicated by content-hash, ready to feed a fine-tuning run. The living corpus is the way you get there.
- Human-in-the-loop curation. Your model finds the weakest entries (low classifier confidence, thin metadata, missing labels) and you fortify them in a chat with Claude, Gemini, or OpenCode. Edits commit back through MCP, so the next training run starts from stronger data. See AGENTS.md for the vendor-neutral playbook.
- Universal multi-format ingest. Markdown, PDF (digital + VLM OCR escalation), HTML, EPUB, Office (.docx/.pptx/.xlsx), Jupyter notebooks, CSV, structured data (JSON/YAML/TOML), subtitles, 45+ source-code languages via tree-sitter, images via a VLM, and audio/video via Whisper, all behind a single filesystem source plugin.
- Content-defined chunking + classification + enrichment. Documents are classified into a 9-value content-class taxonomy (rule classifier → optional LLM escalation), chunked by class (FastCDC for prose, AST-aware for code, conversation-aware for chat), and code chunks are optionally enriched with LLM-synthesised docstrings + summaries + symbol references.
- Multi-embedder by design. Register as many text embedders as you want — local sentence-transformers, OpenAI, anything served via an OpenAI-compatible endpoint (Ollama, vLLM). Multi-modal embedders (CLIP family) cover the image lane. Backfill new embedders without re-chunking.
- Local-or-remote, end to end. Every model client (VLM, classifier, Whisper, code enricher, reranker) accepts a configurable HTTP URL — default is a local Ollama daemon, swap to a hosted endpoint with a one-line config change and no code edit.
- Predictable storage. corpus-forge estimate <path> predicts the Postgres footprint of syncing a tree before you sync. Same surface available to any MCP-connected assistant via estimate_sync_size.
- Multi-machine corpus. Run the same ingest across a desktop and a laptop: alias sources by logical_name, deduplicate by content_hash, and let stale-run reclamation handle crashes automatically. See docs/architecture.md#multi-machine-ingest.
The installer walks you through a short prompt-tree, picks the right
pip extras for the components you want, runs uv tool install, and
hands off to the corpus-forge setup wizard to render
~/.config/corpus-forge/config.toml. Works on macOS, Linux, and
Windows.
curl -sSf https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.sh | bash
# Windows (run from an elevated PowerShell if you also want the daemon service)
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass -Force; iwr -useb https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.ps1 -OutFile $env:TEMP\install.ps1; & $env:TEMP\install.ps1
CI / unattended installs: set CF_NON_INTERACTIVE=1 plus the CF_* env vars documented in corpus_forge/setup/questions.toml. The variables go on the bash side of the pipe so that the installer, not curl, sees them:
curl -sSf https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.sh \
| CF_NON_INTERACTIVE=1 CF_BACKEND=sqlite CF_MCP=yes CF_HF=yes bash
Already living in Claude Code, Claude Desktop, or another MCP client? Skip the manual steps and let the assistant do the whole install and the MCP wiring. Paste this prompt and answer the two questions it asks:
Following this repo's CLAUDE.md end to end, do the full corpus-forge install + MCP wiring for me.
1. Install corpus-forge, picking the right method for my platform.
2. Run `corpus-forge setup`, then `corpus-forge migrate`.
3. Wire it up as an MCP server in whatever client I'm running
right now — Claude Code (.mcp.json), Claude Desktop config, or the
Anthropic Agent SDK — using the exact server block from CLAUDE.md.
4. Register the `corpus-forge-search` and `corpus-curate` skills.
5. Run the first-run sanity sequence (migrate → doctor → estimate →
ingest → embed → search) on the small corpus path I give you.
If you're not a Claude client, follow AGENTS.md instead.
Ask me for my backend (postgres or sqlite) and my corpus path before
you start if I haven't already told you.
CLAUDE.md is the canonical guide the assistant follows;
AGENTS.md (any MCP client) and GEMINI.md
(Gemini CLI / Code Assist) are the equivalents for non-Claude clients.
Once a primary host is running, a second box (your beefy Linux/Windows
GPU rig, a laptop, a spare Mac mini) can join the same fleet with one
command. The installer skips its question tree, pulls the shared
scope (embedder choices, retrieval tuning, classifier chains) from the
primary's published config, registers the new host in corpus.hosts,
and renders a local config.toml. It does not run migrate — the
primary owns the schema lifecycle.
# macOS / Linux / WSL
curl -sSf https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.sh \
| bash -s -- --join 'postgresql://primary.fleet:5432/corpus'
# Windows (PowerShell), env-var form (single line, always paste-safe):
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass -Force; $env:CF_JOIN_DSN = 'postgresql://primary.fleet:5432/corpus'; iwr -useb https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.ps1 -OutFile $env:TEMP\install.ps1; & $env:TEMP\install.ps1
Or with the -Join parameter form (same chained pattern):
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass -Force; iwr -useb https://raw.githubusercontent.com/ulmentflam/corpus-forge/main/install.ps1 -OutFile $env:TEMP\install.ps1; & $env:TEMP\install.ps1 -Join 'postgresql://primary.fleet:5432/corpus'
Why -OutFile + & and not iwr | iex? Invoke-Expression doesn't reliably handle scripts with top-level param() blocks (which install.ps1 has, for -Join); its parser stumbles on the preceding comment-based-help block. Downloading to a file and running via the call operator routes the script through PowerShell's normal .ps1 loader, which handles param() and <# #> blocks correctly.
Tailnet operators can pass a ts:// DSN —
corpus-forge will resolve it via the Tailscale API so the DSN is
portable across the mesh.
What comes next on the joiner:
- corpus-forge doctor already ran (smoke check, tolerates transient failures). Re-run it once the primary is reachable if it warned.
- corpus-forge bench embed --all: records this host's embedder throughput so the fleet's claim loop knows what lanes it's best at.
- corpus-forge service install: runs the daemon as a managed service (launchd / systemd-user / Windows Service).
corpus-forge update # auto-detects channel (uv-tool / pipx / brew / docker / source / pip)
corpus-forge doctor # post-install health check (Python, system deps, config)
corpus-forge --version # prints version; daily PyPI ping surfaces newer releases
The --version ping is strictly anonymous (User-Agent
corpus-forge/<version>, no install-id) and caches the result for
24 h. Opt out with CF_NO_VERSION_CHECK=1.
- Homebrew tap: brew install ulmentflam/tap/corpus-forge (formula scaffold at packaging/distribution/corpus-forge.rb)
- Scoop bucket: manifest scaffold at packaging/distribution/corpus-forge.json
- Docker: docker run -it ghcr.io/ulmentflam/corpus-forge:latest --help (see Dockerfile; :full tag bundles every extra)
- PyPI: pip install corpus-forge or uv tool install corpus-forge
Linux
# 1. Install the package + the extras you need.
pip install 'corpus-forge[sqlite,hf]'
# common adds:
# [code] tree-sitter code chunker + 45+ language extractor
# [multi-format] PDF / HTML / EPUB / Office / Notebook / CSV + FastCDC chunker
# [ocr] VLM OCR for sparse-text PDFs + image extractor
# [whisper] audio + video transcription (faster-whisper / OpenAI / Groq)
# [mcp] Model Context Protocol stdio server for Claude / Agent SDK
# [rerank] cross-encoder reranker (BGE default)
# [eval] retrieval-evaluation harness
# 2. (Optional) Register a systemd user unit for the daemon.
bash scripts/linux/install.sh
# Writes ~/.config/systemd/user/corpus-forge.service and starts it
# via `systemctl --user enable --now corpus-forge.service`.
# 3. Configure + smoke-test.
cp config.example.toml ~/.config/corpus-forge/config.toml
cp secrets.env.example ~/.config/corpus-forge/secrets.env
corpus-forge migrate
corpus-forge ingest --once
macOS
# 1. Install (same as Linux).
pip install 'corpus-forge[sqlite,hf]'
# 2. (Optional) Register a launchd agent for the daemon.
bash scripts/macos/install.sh
# Renders ~/Library/LaunchAgents/com.${USER}.corpus-forge.plist and
# prints the `launchctl load` / `launchctl kickstart` commands.
# 3. Configure + smoke-test.
cp config.example.toml ~/.config/corpus-forge/config.toml
cp secrets.env.example ~/.config/corpus-forge/secrets.env
corpus-forge migrate
corpus-forge ingest --once
Apple Silicon: device = "mps" in the embedder config uses the GPU.
Windows
pip install corpus-forge[sqlite,hf] works under Python 3.11/3.12/3.13 on Windows. We don't ship a Windows service-installer script for beta — wrap corpus-forge daemon with NSSM or Task Scheduler:
# Example with NSSM
nssm install corpus-forge "C:\Path\To\Python\python.exe" -m corpus_forge daemon
nssm set corpus-forge AppDirectory "%USERPROFILE%\.config\corpus-forge"
nssm start corpus-forge
PostgreSQL integration tests require Docker Desktop; SQLite-only setups work natively.
git clone https://github.com/ulmentflam/corpus-forge
cd corpus-forge
make dev # uv sync --all-extras --group dev + pre-commit install
make ci # full local gate (format / lint / typecheck / tests)
pip install 'corpus-forge[sqlite,hf]'
# 1. Drop in a config (edit paths + embedder choices).
mkdir -p ~/.config/corpus-forge
cp "$(python -c "import corpus_forge, pathlib; print(pathlib.Path(corpus_forge.__file__).parent.parent / 'config.example.toml')")" \
~/.config/corpus-forge/config.toml
$EDITOR ~/.config/corpus-forge/config.toml
# 2. Initialize the database (SQLite or PostgreSQL).
corpus-forge migrate
# 3. (Optional) Drop a .corpusignore at the scan root to skip noisy
# files/dirs. A vendor-neutral starter ships at .corpusignore.example.
# User-global rules can live at ~/.config/corpus-forge/ignore.
# 4. Estimate the Postgres footprint *before* you sync. No I/O, no model calls.
corpus-forge estimate ~/Notes
# 5. Run a one-shot ingestion pass.
# No corpus of your own yet? Point a source at examples/sample-corpus/
# — a ready-made mini knowledge base — and follow its README.
corpus-forge ingest --once
# 6. Backfill embeddings for the active embedder(s).
corpus-forge embed -e qwen3_8b
# 7. (Optional) Classify documents into the 9-value content-class taxonomy.
corpus-forge classify --dry-run --json
corpus-forge classify
# 8. (Optional) Re-chunk classified prose with FastCDC + AST-aware code.
corpus-forge rechunk
# 9. Search the corpus end-to-end.
corpus-forge search "how does the SQLite lock work" --k 5
# 10. Curate weak entries with an AI assistant (Claude / Gemini / OpenCode).
# Wire the MCP server (see "For AI assistants" below), then in your chat:
# /corpus-curate → next_curation_target → chat → commit_curation
# 11. Export to HuggingFace Datasets format.
corpus-forge export chat --dataset claude-code --out ./chat.jsonl --template chatml
The headline payoff. Two views map directly to HuggingFace columns. The
Python API is the supported surface; the corpus-forge export chat CLI
covers the most common chat-side path (with chat-template + ShareGPT
shaping):
from corpus_forge.exports.huggingface import export_to_hf_dataset, push_to_hub
# Text view — one row per chunk, suitable for instruction-tuning prep.
ds = export_to_hf_dataset("corpus_text_export")
# Chat view — one row per conversation, ShareGPT-shaped `messages` list.
ds_chat = export_to_hf_dataset("corpus_chat_export")
push_to_hub(ds, "username/my-personal-corpus")
For chat exports with chat-template rendering (ChatML, Llama-3, Gemma, custom Jinja):
corpus-forge export chat --dataset claude-code \
--out ./chat.jsonl --template chatml --format jsonl
corpus-forge export feedback-pairs --dataset claude-code \
--out ./feedback.jsonl
| View | Columns |
|---|---|
| corpus_text_export | id, text, source, title, heading, role, metadata, labels |
| corpus_chat_export | id, source, title, messages (ShareGPT format), metadata |
| Platform | Backend | Embedder device |
|---|---|---|
| Linux + CUDA | postgres (pgvector) or sqlite (sqlite-vec) | device = "cuda" |
| macOS Apple Silicon | postgres or sqlite | device = "mps" |
| Linux/Windows CPU | either | device = "cpu" |
| Anywhere | sqlite-only, no GPU | device = "cpu" |
Set device = "auto" to let sentence-transformers pick.
corpus-forge is multi-embedder by design — declare several [[embedders]]
blocks and backfill them independently. The picks below are the condensed
headline of docs/embedding-models.md, which
carries the full per-lane survey, tradeoffs, and sources. These rankings are
literature-grounded (HF Hub + live MTEB/MMTEB/CoIR/ViDoRe); empirical
on-machine rankings against your corpus are pending the
corpus-forge eval embedders run, which should break any close tie.
| Lane | Default (quality) | Fast / local | API |
|---|---|---|---|
| English text | Qwen3-Embedding-8B | Qwen3-Embedding-0.6B / BGE-large-en-v1.5 | OpenAI text-embedding-3-large |
| Code | nomic-embed-code | potion-code-16M (static fast tier) | Voyage voyage-code-3 |
| Multilingual | Qwen3-Embedding-8B | multilingual-e5-large-instruct | Cohere embed-multilingual-v3 |
| Multimodal | SigLIP 2 (so400m) | nomic-embed-vision-v1.5 | Voyage voyage-multimodal-3 |
The Qwen3 family (Apache-2.0) is the safe local default for prose, code, and
100+ languages alike; potion-code-16M is a CPU-instant fast tier you wire
in front of a dense embedder ([retrieval].fast_tier_embedder_name), not a
standalone index. Watch licensing: jinaai/jina-embeddings-v3 and
jinaai/jina-clip-v2 are CC-BY-NC-4.0 (non-commercial) — flag before
shipping a product. Copy-paste blocks for the headline picks (field format
matches config.example.toml):
# English / multilingual quality default — needs a GPU (~16 GB+).
[[embedders]]
name = "qwen3_8b"
provider = "sentence_transformers"
model_id = "Qwen/Qwen3-Embedding-8B"
dimension = 4096
normalize = true
distance = "cosine"
active = true
# Static fast tier — CPU-instant, ~16 MB. Point
# `[retrieval].fast_tier_embedder_name` at this name; not a sole index.
[[embedders]]
name = "potion_code_16m"
provider = "model2vec"
model_id = "minishlab/potion-code-16M"
dimension = 256
normalize = true
distance = "cosine"
active = true
# Managed API — set OPENAI_API_KEY (or `base_url` for any OpenAI-compatible host).
[[embedders]]
name = "openai_3l"
provider = "openai"
model_id = "text-embedding-3-large"
dimension = 3072
normalize = true
distance = "cosine"
active = false
api_key_env = "OPENAI_API_KEY"
Add a block, keep existing embedders active, then
corpus-forge embed --embedder <name> encodes only the missing vectors.
Pair a text embedder with a code embedder and route chunks deterministically
by file extension — code goes to the code lane, everything else to text. Add
an extensions allow-list (leading-dot, lowercase) to any [[embedders]]
block to mark it a specialist; embedders without the field are
catchalls. The routing rule is:
- Iterate active embedders in declaration order.
- First specialist whose allow-list matches the chunk's source URI (case-insensitive endswith) wins.
- Else the first catchall claims it.
- No active catchall + any active specialist → config-load error (EmbedderRoutingError); add a catchall to fix.
[[embedders]]
name = "nomic" # text catchall — no `extensions` field
provider = "llama-cpp"
model_id = "nomic-embed-text:v1.5"
dimension = 768
active = true
[[embedders]]
name = "nomic-code" # code specialist
provider = "llama-cpp"
model_id = "nomic-embed-code:7b" # Qwen2.5-Coder-7B, Apache-2.0
dimension = 3584
active = true
extensions = [".py", ".ts", ".tsx", ".go", ".rs", ".java", ".cpp"]
A .py chunk lands in embeddings_nomic_code only; a .md chunk lands in
embeddings_nomic only. corpus-forge embed -e nomic-code filters its
pending pool to chunks the route assigns to it; the complementary
corpus-forge embed -e nomic drains the rest. Backwards-compat: when no
embedder declares extensions, every active embedder still embeds every
chunk (today's single-tower behaviour). See
config.example.toml for the full annotated block.
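The routing rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not corpus-forge's implementation: `EmbedderSpec`, `validate`, and `route` are hypothetical names, and only `EmbedderRoutingError`, the `extensions` field, and the declaration-order rule come from the text. The sketch covers the routed case only (at least one embedder declares `extensions`); the backwards-compatible "every embedder embeds every chunk" mode is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


class EmbedderRoutingError(ValueError):
    """Raised when a specialist is active but no catchall exists."""


@dataclass
class EmbedderSpec:  # hypothetical stand-in for one [[embedders]] block
    name: str
    active: bool = True
    extensions: list[str] = field(default_factory=list)  # empty list = catchall


def validate(embedders: list[EmbedderSpec]) -> None:
    """Config-load check: specialists require at least one active catchall."""
    active = [e for e in embedders if e.active]
    has_specialist = any(e.extensions for e in active)
    has_catchall = any(not e.extensions for e in active)
    if has_specialist and not has_catchall:
        raise EmbedderRoutingError("add an embedder without `extensions` as a catchall")


def route(source_uri: str, embedders: list[EmbedderSpec]) -> Optional[str]:
    """Return the name of the embedder that claims this chunk."""
    active = [e for e in embedders if e.active]
    uri = source_uri.lower()  # allow-lists are lowercase, so compare lowercased
    for e in active:  # declaration order: first matching specialist wins
        if e.extensions and any(uri.endswith(ext) for ext in e.extensions):
            return e.name
    for e in active:  # otherwise the first catchall claims the chunk
        if not e.extensions:
            return e.name
    return None


embedders = [
    EmbedderSpec("nomic"),
    EmbedderSpec("nomic-code", extensions=[".py", ".ts", ".go"]),
]
validate(embedders)
print(route("file:///repo/app/Main.PY", embedders))
print(route("file:///notes/todo.md", embedders))
```

With the two-embedder config from the example, the uppercase `.PY` path still routes to the code specialist because the comparison is case-insensitive.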
pip install 'corpus-forge[sqlite,hf,tokens,retrieval,rerank,mcp,eval,code,multi-format,ocr,whisper]'
The openai SDK is a base dependency (not an extra): corpus-forge
uses it for every OpenAI-compatible endpoint, including local Ollama at
:11434/v1, vLLM, llama.cpp's server, and LM Studio. The
[openai] extra is kept as a no-op back-compat alias so existing
install scripts don't break, but you do not need to add it.
| Extra | What it enables |
|---|---|
| [sqlite] | sqlite-vec virtual table for ANN search on SQLite. |
| [hf] | datasets library for HF export. |
| [tokens] | tiktoken for token-aware chunking. |
| [retrieval] | NumPy-backed retrieval-evaluation primitives. |
| [rerank] | sentence-transformers cross-encoder rerankers (BGE default). |
| [mcp] | Model Context Protocol stdio server for Claude / Agent SDK clients. |
| [eval] | Bundled gold-set evaluation harness (NDCG / MRR / Recall). |
| [code] | tree-sitter + tree-sitter-language-pack for the CodeChunker and language-aware code ingest. Apache-2.0 / MIT. |
| [multi-format] | PDF / HTML / EPUB / Office / Notebook / CSV / FastCDC chunker; includes AGPL-3.0 components. See Distribution / licensing. |
| [ocr] | VLM OCR HTTP clients (requests) + PDF rasterisation (pdf2image, pillow). Needs system poppler-utils (see "Distribution / licensing"). Permissive. |
| [whisper] | Audio + video transcription via faster-whisper (local) or any OpenAI-compatible /audio/transcriptions endpoint (remote). Bundles imageio-ffmpeg. Permissive. |
| [fast-tier] | Static-embedding fast tier (model2vec / minishlab/potion-code-16M) for the Phase N Wave 3 candidate-generator front-end of HybridRetriever. MIT. ~16 MB model weights downloaded on first use. Default search behaviour unchanged until the user opts in via SearchOptions.fast_tier_mode. |
| [llama-cpp] | In-process llama.cpp embeddings via llama-cpp-python for GGUF models (qwen3-embedding, nomic-embed, …). MIT. The installer auto-selects the accelerated wheel: install.sh / install.ps1 detect the host accelerator (nvidia-smi → CUDA, Apple Silicon → Metal, else CPU) and install the matching prebuilt llama-cpp-python wheel from the per-backend extra-index, so a CUDA box gets GPU offload without hand-editing CMAKE_ARGS. Override with --llama-backend {auto\|cuda\|cudaNNN\|metal\|cpu\|none} (-LlamaBackend on PowerShell) or CF_LLAMA_BACKEND; none skips this extra. The embedder=auto setup choice pulls this extra automatically (its recommended lanes are all provider="llama-cpp"). For a manual source build (e.g. a platform with no prebuilt wheel), CMAKE_ARGS="-DGGML_METAL=on" pip install 'corpus-forge[llama-cpp]' stays the escape hatch; see the llama-cpp-python install docs. GGUF weights are NOT bundled; the resolver locates a pre-installed ollama pull <model> blob or an explicit gguf_path. Added to unblock qwen3-embedding on hosts where Ollama's OpenAI-shape /v1/embeddings endpoint returns HTTP 500 with failed to encode response: json: unsupported value: NaN for ~30 % of code chunks (in-process avoids the Ollama JSON encoder entirely). Gotchas: llama-cpp-python clamps the per-sequence context as n_ctx_seq = n_ctx / n_seq_max, and the embedding-mode initialiser silently sets n_seq_max = min(n_batch, llama_max_parallel_sequences()) (often 256), so big chunks crash with decode: failed to find a memory slot. Corpus-forge ships n_seq_max = 1 by default so each chunk gets the full n_ctx window, AND pre-truncates each input client-side to n_ctx_seq tokens before the C call. Tune n_seq_max / n_batch / n_ubatch in [[embedders]]; see config.example.toml. |
| [analyze] | Phase O EDA + corpus-cleaning ML stack: scikit-learn, hdbscan, umap-learn, bertopic, datasketch, fasttext-langdetect, langdetect. All permissive (BSD-3/MIT/Apache-2.0). Lazy-imported inside corpus_forge/analyze/; does NOT widen the AGPL surface and does not affect cold-start time. |
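To make the [llama-cpp] context gotcha concrete, here is the arithmetic from that row worked through in Python. The formula n_ctx_seq = n_ctx / n_seq_max and the "often 256" figure come from the table; the n_ctx value of 8192, the use of integer division, and the helper names are assumptions chosen for illustration.

```python
def per_sequence_context(n_ctx: int, n_seq_max: int) -> int:
    """Tokens available to a single sequence: n_ctx split across n_seq_max slots."""
    return n_ctx // n_seq_max  # integer division is an assumption of this sketch


def pre_truncate(tokens: list[int], n_ctx: int, n_seq_max: int = 1) -> list[int]:
    """Clip one input to the per-sequence window before handing it to the model."""
    return tokens[: per_sequence_context(n_ctx, n_seq_max)]


N_CTX = 8192  # illustrative context size, not a corpus-forge default

silent_default = per_sequence_context(N_CTX, 256)  # n_seq_max silently set to 256
shipped_default = per_sequence_context(N_CTX, 1)   # n_seq_max = 1

print(silent_default)   # 32
print(shipped_default)  # 8192

chunk_tokens = list(range(10_000))  # stand-in for a long tokenised code chunk
print(len(pre_truncate(chunk_tokens, N_CTX)))  # 8192
```

Under these numbers a 256-way split leaves only 32 tokens per sequence, which is why long chunks fail unless n_seq_max is pinned to 1.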
Corpus-forge's core is permissively licensed (Apache-2.0), but two of the Phase D multi-format extractors depend on AGPL-3.0 libraries. The license posture of an installed copy depends on which extras you pull in:
| Install | Effective license | Notes |
|---|---|---|
| pip install corpus-forge | Apache-2.0 | Pure core. Markdown vault + chat history sources only; no PDF / EPUB / Office ingest. |
| pip install corpus-forge[code] | Apache-2.0 + MIT | Adds the CodeChunker and the CodeExtractor. Dependencies (tree-sitter, tree-sitter-language-pack) are Apache-2.0 / MIT, so no copyleft contamination. |
| pip install 'corpus-forge[multi-format]' | AGPL-3.0 (effective) | Pulls in pymupdf4llm (AGPL-3.0) for digital PDF extraction and ebooklib (AGPL-3.0) for EPUBs. AGPL's network-use clause binds your application if you redistribute or expose it as a service. |
| pip install 'corpus-forge[ocr]' | Apache-2.0 + permissive HTTP clients | Adds the Ollama / Mistral OCR HTTP clients (requests, Apache-2.0), the rasterisation step (pdf2image, MIT) and pillow (HPND). No further copyleft entanglement on top of [multi-format]. Requires a system poppler-utils install; see "System requirements for [ocr]" below. |
| pip install 'corpus-forge[whisper]' | Apache-2.0 + MIT + BSD-2 | Adds faster-whisper (MIT) for the local backend, imageio-ffmpeg (BSD-2) which bundles an ffmpeg binary invoked as a subprocess (the documented LGPL boundary), and requests for the remote OpenAI-compatible path. No AGPL widening. |
Practical guidance. If you plan to redistribute corpus-forge or a derived
application, stay on pure-core or pure-core + [code] — both are
Apache-2.0-clean. If you are using it personally or inside your organisation,
[multi-format] is fine; the AGPL surface only matters once you ship the binary
to someone else or expose it as a network service.
The [multi-format] choice was made deliberately on 2026-05-14 to keep the
quality-of-extraction story competitive (Docling for Office, pymupdf4llm for
PDFs with text layers, ebooklib for EPUBs). The obvious replacements that
might have kept the install Apache-2.0 (marker-pdf, MinerU) are GPL/AGPL
themselves, so the trade-off is not avoidable today.
The [ocr] extra adds a single non-Python system dependency:
poppler-utils (GPL-licensed, and run by pdf2image as separate command-line
tools rather than linked into the process), used to rasterise PDF pages for
the VLM OCR escalation path. Install it once per machine:
| Platform | Command |
|---|---|
| macOS (Homebrew) | brew install poppler |
| Debian / Ubuntu | sudo apt-get install -y poppler-utils |
| Fedora / RHEL | sudo dnf install -y poppler-utils |
| Windows | Download a build from the GnuWin32 page and add it to PATH. |
When poppler-utils is missing the PDF extractor degrades gracefully back to
the digital-only Tier 1 path with an ERROR-level log entry pointing here —
ingest does not break.
The [ocr] extra is intentionally light — requests (Apache-2.0),
pdf2image (MIT), pillow (HPND, permissive). It does not vendor or
bundle any model weights. Both OCR backends communicate over HTTP: the local
path talks to your Ollama daemon (e.g. qwen2.5vl:7b, pulled separately via
ollama pull), and the remote path talks to the Mistral OCR API
(MISTRAL_API_KEY in secrets.env). Adding [ocr] does not widen the AGPL
surface introduced by [multi-format].
Every model client in corpus-forge accepts an arbitrary HTTP URL via config.
The default is http://localhost:11434 (a local Ollama daemon for Ollama-shape
clients) or https://api.openai.com/v1 (for OpenAI-shape clients), but the
same backends work against any compatible endpoint — hosted Ollama, vLLM,
llama.cpp's OpenAI shim, Groq, Together, DeepInfra, Fireworks, or a self-hosted
mirror. Five clients follow this rule today:
| Surface | Config field | API shape |
|---|---|---|
| VLM (PDF Tier-2 OCR + image extractor) | vlm.ollama_url / vlm.mistral_base_url | Ollama /api/generate or Mistral /v1/ocr |
| Document classifier (LLM half) | classifier.llm_url | Ollama /api/generate |
| Whisper transcription (remote) | whisper.remote_base_url | OpenAI-compat /audio/transcriptions |
| Multi-modal embedder (remote) | constructor arg on ClipRemoteEmbedder | OpenAI-compat /v1/embeddings |
| Code enricher (remote) | code_enricher.remote_url + code_enricher.remote_api_shape | Ollama /api/generate OR OpenAI /chat/completions |
The local default keeps every ingest run self-contained; pointing at a remote URL is a one-line config change with no code edit required. Useful when classification, OCR, transcription, or enrichment should run on a beefier host than the laptop doing the ingest.
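As a rough illustration of why the swap needs no code edit, the sketch below builds the same Ollama-shape /api/generate request against two base URLs. It only constructs the request and never sends it. `build_generate_request` and the remote hostname are hypothetical; the `model`, `prompt`, and `stream` fields follow the public Ollama API, and the default URL and model tag are the ones named in this README.

```python
import json
from urllib.request import Request


def build_generate_request(base_url: str, model: str, prompt: str) -> Request:
    """Build (but do not send) an Ollama-shape /api/generate POST."""
    url = base_url.rstrip("/") + "/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Only the base URL differs between the local default and a remote host.
local = build_generate_request(
    "http://localhost:11434", "qwen2.5:7b-instruct", "Classify this document: ..."
)
remote = build_generate_request(
    "https://ollama.example.internal/",  # hypothetical hosted endpoint
    "qwen2.5:7b-instruct",
    "Classify this document: ...",
)

print(local.full_url)
print(remote.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the call; everything apart from the URL is identical, which is the property the config fields in the table rely on.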
Phase E (corpus-forge classify) walks every ingested document and attaches
a content-class strong label from a nine-value enum — code, chat,
book, textbook, paper, article, reference, note, other. The
label powers subset selection at training time ("give me all chat docs",
"hold out textbook for eval") and is persisted on corpus.document_labels
with source = 'classifier:rule' | 'classifier:llm' | 'user'.
| value | what it covers |
|---|---|
| code | source code, scripts, build files (Makefile, Dockerfile), config-as-code |
| chat | conversation transcripts (Claude Code, OpenCode, generic dialogue) |
| book | long-form non-pedagogical: fiction, memoir, popular non-fiction |
| textbook | long-form pedagogical: academic textbook, course notes, exercises |
| paper | research / academic papers (PDFs with abstract + citations) |
| article | blog posts, magazine articles, news, opinion writing |
| reference | API docs, schema specs, manifests, JSON/YAML/TOML/CSV |
| note | personal notes: Obsidian vault, markdown jottings, journals |
| other | fallback when no signal is strong enough to commit |
The default chain is ["rule", "llm"]: a stdlib rule classifier
(microseconds/doc) short-circuits high-confidence documents, and the LLM
classifier (Ollama qwen2.5:7b-instruct by default; ~5–10 s/doc on M-series)
picks up the weak / ambiguous cases. The escalation threshold defaults to
0.4 — rule confidences below that bar trigger the LLM call.
The LLM classifier follows the local-or-remote principle described
above: classifier.llm_url defaults to http://localhost:11434 and accepts
any Ollama-compatible URL. Tune the chain, threshold, and endpoint in the
[classifier] block of config.toml.
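The rule-then-LLM escalation can be pictured with a small sketch. Everything here is illustrative: the toy rules, the `classify` helper, and the stubbed LLM are hypothetical, while the 0.4 threshold, the class values, and the `classifier:rule` / `classifier:llm` source tags are taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClassLabel:
    value: str
    confidence: float
    source: str


def rule_classify(path: str, text: str) -> ClassLabel:
    """Toy stand-in for the stdlib rule classifier (rules are invented here)."""
    if path.endswith((".py", ".rs", ".go")) or path.endswith("Makefile"):
        return ClassLabel("code", 0.95, "classifier:rule")
    if path.endswith(".md") and len(text) < 2000:
        return ClassLabel("note", 0.6, "classifier:rule")
    return ClassLabel("other", 0.1, "classifier:rule")


def classify(
    path: str,
    text: str,
    llm: Callable[[str], ClassLabel],
    escalation_threshold: float = 0.4,
) -> ClassLabel:
    """Keep the rule label unless its confidence falls below the threshold."""
    label = rule_classify(path, text)
    if label.confidence >= escalation_threshold:
        return label  # short-circuit: no LLM call
    return llm(text)  # weak or ambiguous case escalates


llm_calls: list[str] = []


def fake_llm(text: str) -> ClassLabel:
    """Stub that records calls instead of contacting a model."""
    llm_calls.append(text)
    return ClassLabel("paper", 0.8, "classifier:llm")


confident = classify("src/app.py", "print('hi')", fake_llm)
ambiguous = classify("scan-0042.txt", "Abstract. We study ...", fake_llm)
print(confident.value, confident.source)
print(ambiguous.value, ambiguous.source)
print(len(llm_calls))
```

Only the second document reaches the stubbed LLM, which is the cost-saving behaviour the chain is designed for.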
corpus-forge classify --dry-run --json # preview the plan, one JSON line per doc
corpus-forge classify # apply labels
corpus-forge classify --classifier rule # bypass the LLM (rule classifier only)
The CLI prints a cost-guard preflight with a worst-case LLM-call estimate
before the run starts; --limit N and --dataset NAME are available for
quick smoke tests.
Phase F replaces positional chunk slicing for prose classes (book,
textbook, paper, article, note, other) with FastCDC
content-defined boundaries. Mid-document edits ripple at most 2-3
chunks instead of shifting every downstream boundary, and the
Phase C chunks.content_hash embedding-reuse path achieves its
design potential — most chunks survive a small edit byte-identical.
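To show why content-defined boundaries keep edits local, here is a deliberately simplified gear-hash chunker. It is not FastCDC itself (no normalised chunking, no spread-bit masks) and not corpus-forge's CDCChunker; the function name, the size parameters, and the random test data are all assumptions made for the sketch.

```python
import random

MASK64 = (1 << 64) - 1
_table_rng = random.Random(1234)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]  # one fixed random word per byte value


def cdc_chunks(
    data: bytes, min_size: int = 256, avg_bits: int = 10, max_size: int = 4096
) -> list[bytes]:
    """Cut where the rolling hash's low `avg_bits` bits are all zero."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # gear-style rolling hash
        length = i - start + 1
        # A boundary depends on recent bytes only, never on absolute position.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start : i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


original = random.Random(0).randbytes(64 * 1024)
edited = original[:32768] + b"a small mid-document edit" + original[32768:]

old, new = cdc_chunks(original), cdc_chunks(edited)
reused = set(old) & set(new)
print(f"{len(old)} chunks before, {len(new)} after, {len(reused)} byte-identical")
```

Because the cut decision looks only at the last few bytes, boundaries downstream of the insertion tend to fall back onto the same content, so the chunks around them hash identically and their embeddings can be reused. Fixed-offset slicing would instead shift every boundary after the edit.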
corpus-forge rechunk re-runs the chunker pass against documents that
already carry a class=* label (run corpus-forge classify first). The
class-mapped chunker resolves to:
| class | chunker | notes |
|---|---|---|
| code | CodeChunker | tree-sitter AST when available, byte-line fallback otherwise |
| chat | ConversationChunker | per-message or sliding-window |
| reference | PassthroughChunker | structured docs round-trip as-is |
| book / textbook / paper / article / note / other | CDCChunker | FastCDC rolling hash |
The rechunk pass is idempotent on chunk-text and chunker signature
(metadata.cdc_fingerprint, metadata.byte_range) — re-running after
a green pass is a no-op.
Phase G P0 routes .mp3/.wav/.m4a/.ogg/.flac and
.mp4/.mov/.webm/.mkv/.avi files through a Whisper-family
transcription model. Output is markdown (with timestamp anchors on the
local backend), folded into the same documents row family as any
other extractor.
Two backends ship behind the [whisper] extra:
- backend = "local" → in-process faster-whisper (tiny / base / small / medium / large; small default). Bundles imageio-ffmpeg for the audio extraction step.
- backend = "remote" → any OpenAI-compatible /audio/transcriptions endpoint (OpenAI whisper-1, Groq whisper-large-v3, self-hosted whisper.cpp via HTTP). Same local-or-remote URL principle: swap whisper.remote_base_url to a different provider with no code change.
Default backend = "none" keeps existing configs untouched: audio /
video files are silently skipped until the user opts in via the
[whisper] config block.
Phase G P1 adds a separate MultiModalEmbedder protocol alongside
the text Embedder. Image chunks (metadata.image_path or
metadata.image_b64) get vectorised into a dedicated
image_embeddings_<name> per-embedder table that mirrors the existing
text family.
# Backfill the default CLIP local embedder against image chunks.
corpus-forge embed -e clip_local --image
Two backends ship out of the box:
- ClipLocalEmbedder: sentence-transformers clip-ViT-B-32 (512 d, ~150 MB, MIT). Default.
- ClipRemoteEmbedder: any OpenAI-compatible /v1/embeddings endpoint that accepts base64 data-URL image input (Voyage AI's voyage-multimodal-3, Cohere embed-v3-multimodal, or a self-hosted CLIP service).
Cross-modal cosine similarity is pinned at ≥ 0.20 on the live e2e
suite — text and image vectors live in a shared space, so a text query
can recall image chunks via the same HybridRetriever.
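Cross-modal recall reduces to cosine similarity between a text-query vector and image vectors in the shared space. The sketch below uses hand-written three-dimensional toy vectors in place of real CLIP output; `cosine`, `recall_images`, and the file names are hypothetical, and the 0.20 floor is the e2e threshold quoted above, reused here as a filter purely for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_images(
    query_vec: list[float], image_vecs: dict[str, list[float]], floor: float = 0.20
) -> list[tuple[str, float]]:
    """Rank image chunks by similarity to a text query, dropping weak matches."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= floor]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for embedder output.
text_query = [1.0, 0.0, 0.0]
images = {
    "cat.png": [0.9, 0.1, 0.0],
    "chart.png": [0.0, 1.0, 0.0],
}

hits = recall_images(text_query, images)
print(hits[0][0])
```

Here only `cat.png` clears the floor, since `chart.png` is orthogonal to the query.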
Phase H (corpus-forge enrich) layers an LLM-generated enrichment record
onto every chunk of a class=code document. Each enrichment carries:
| field | what it is |
|---|---|
| docstring | synthesised docstring for the construct (or null when the existing docstring is adequate) |
| summary | 1–2 sentence semantic summary in domain language |
| symbols | flat list of referenced symbol names (functions / types this chunk depends on) |
| model | the model tag that produced the enrichment (used for idempotency) |
| confidence | self-reported [0.0, 1.0] |
The enrichment lands in chunks.metadata.enrichment next to the existing
{kind, name, language, byte_range} keys from Phase D's CodeChunker — no
schema change needed. Downstream retrievers can boost on enrichment text,
do natural-language code search, and surface dependency edges via the flat
symbols array.
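One way to picture the record and where it lands is the sketch below. The five field names and the `enrichment` key under chunk metadata come from the table; the `Enrichment` dataclass, the `attach` helper, and the sample values are assumptions for illustration and may not match the real types.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Enrichment:
    docstring: Optional[str]  # None when the existing docstring is adequate
    summary: str
    symbols: list[str] = field(default_factory=list)
    model: str = ""
    confidence: float = 0.0

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


def attach(metadata: dict, enrichment: Enrichment) -> dict:
    """Return a copy of the chunk metadata with the enrichment nested under one key."""
    merged = dict(metadata)
    merged["enrichment"] = asdict(enrichment)
    return merged


chunk_meta = {"kind": "function", "name": "parse", "language": "python", "byte_range": [0, 120]}
record = Enrichment(
    docstring=None,
    summary="Parses a config file into a settings object.",
    symbols=["load_toml", "Settings"],
    model="example-model-tag",  # placeholder tag, not a real default
    confidence=0.8,
)
enriched = attach(chunk_meta, record)
print(sorted(enriched))
```

Nesting the record under a single key leaves the existing CodeChunker keys untouched, which is why no schema change is required.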
The default model is qwen3.6:35b-a3b-instruct — an MoE (35B total /
~3B active) that runs ~3-8 s/chunk on M-series hardware. Phase H ships
two backends to satisfy the local-or-remote URL principle:
- backend = "local" → QwenCoderLocal against a local Ollama daemon (local_url defaults to http://localhost:11434).
- backend = "remote" → QwenCoderRemote against either a hosted Ollama endpoint (remote_api_shape = "ollama") or any OpenAI-compatible chat-completions endpoint (remote_api_shape = "openai"). Pair with the env-var name in remote_api_key_env for bearer auth.
Wire both endpoints in the [code_enricher] block of config.toml; the
default backend = "none" keeps legacy configs untouched.
corpus-forge enrich --dry-run --json # preview the plan, one JSON line per chunk
corpus-forge enrich --dataset notes -l 5 # smoke against 5 chunks of one dataset
corpus-forge enrich --backend qwen-remote # force the remote backend (bypass config)
Idempotency: chunks whose metadata.enrichment.model already matches the
configured model tag are skipped. Change the model tag (or pass
--reclassify-on-model-change) to force a full re-enrichment pass.
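The idempotency check amounts to comparing one metadata field against the configured tag. A minimal sketch, assuming chunk metadata is a plain dict; `needs_enrichment`, the chunk ids, and the tag strings are hypothetical:

```python
def needs_enrichment(metadata: dict, configured_model: str) -> bool:
    """True when the chunk has no enrichment or was enriched by a different model tag."""
    enrichment = metadata.get("enrichment") or {}
    return enrichment.get("model") != configured_model


CONFIGURED_MODEL = "model-tag-v2"  # placeholder tag for the sketch

chunks = [
    {"id": 1, "metadata": {"enrichment": {"model": "model-tag-v2", "summary": "..."}}},
    {"id": 2, "metadata": {"enrichment": {"model": "model-tag-v1", "summary": "..."}}},
    {"id": 3, "metadata": {"kind": "function"}},
]

pending = [c["id"] for c in chunks if needs_enrichment(c["metadata"], CONFIGURED_MODEL)]
print(pending)  # [2, 3]
```

Chunk 1 is skipped because its recorded tag matches; changing the configured tag would put all three back in the pending pool.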
Source ─▶ Extractor ─▶ Chunker ─▶ Backend ─▶ per-embedder tables
│
classifier (post-ingest) ◀──────────┤
enricher (post-classify) ◀──────────┤
VLM / Whisper feed Extractor
corpus-forge is composed of small protocols. Each is a plug-in seam — adding a new format, classifier, embedder, or backend means writing one new file and registering it.
| Protocol | Where | What it does |
|---|---|---|
| `Source` | `sources/base.py` | Discover + parse raw data into `RawDocument` / `RawConversation`. |
| `Extractor` | `extractors/base.py` | Read a file off disk, emit `ExtractedDocument(text, chunker_hint, metadata, labels)`. Phase D. |
| `Chunker` | `chunkers/base.py` | Split a document into TextChunks. `MarkdownChunker` / `ConversationChunker` / `CodeChunker` (Phase D) / `CDCChunker` (Phase F) / `PassthroughChunker`. |
| `Embedder` | `embedders/base.py` | Map texts → vectors. Symmetric `encode` + asymmetric `encode_query`. |
| `MultiModalEmbedder` | `embedders/multimodal.py` | Map images and text into a shared vector space. Phase G P1. |
| `StorageBackend` | `backends/base.py` | Persist chunks + vectors. Search dense + lexical. Cross-host sync. |
| `Classifier` | `classifiers/base.py` | Map a `ClassifiableDocument` to a `ClassLabel`. Ordered chain via `ClassifierRegistry`. Phase E. |
| `VLMBackend` | `vlm/base.py` | Image → text (OCR + description). Phase D P1. |
| `WhisperBackend` | `whisper/base.py` | Audio/video → text transcription. Phase G P0. |
| `CodeEnricher` | `enrichers/base.py` | Code chunk → `{docstring, summary, symbols, model, confidence}`. Phase H. |
Common machinery lives in base classes: WatchedSource (file watching +
debounce + identity + hash short-circuit), ChunkerBase (size-bounding +
overlap with forward-progress invariant), BaseEmbedder / BaseBackend.
See docs/architecture.md for the full reference.
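The "size-bounding + overlap with forward-progress invariant" idea can be shown in a few lines. The function below is a character-based sketch, not `ChunkerBase` itself: `bounded_chunks` and its parameters are hypothetical, and the real chunkers presumably split on richer boundaries. The point is only that the window must advance by at least one unit even when the overlap is misconfigured to be as large as the chunk size.

```python
def bounded_chunks(text, max_size, overlap):
    """Split `text` into windows of at most `max_size` chars sharing `overlap` chars."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    # Forward-progress invariant: always advance by at least one character,
    # otherwise overlap >= max_size would loop forever on the same window.
    step = max(1, max_size - overlap)
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_size])
        if start + max_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(bounded_chunks("abcdefghij", max_size=4, overlap=1))
```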
See config.example.toml for the full reference (every field carries an inline
comment + a commented-out remote example for every *_url). Key sections:
- `[backend]`: `kind` is `"postgres"` or `"sqlite"`; `dsn` is the Postgres connection string OR the SQLite file path. `schema = "corpus"` for Postgres; ignored on SQLite. For provisioning a fresh Postgres host see `docs/deployment/postgres.md` (bare-metal Debian/Ubuntu via the `scripts/postgres-bootstrap.sh` helper), `docs/deployment/docker.md` (self-contained pgvector Compose stack), or `docs/deployment/lxc.md` (Proxmox LXC sizing + Tailscale + backups).
- `[daemon]`: `debounce_seconds`, `log_level`, `log_format`, `sync_poll_interval_s`, `trash_dir`, `conflict_dir`, `host_id`.
- `[[datasets]]`: repeated. `name`, `kind` (`text` | `chat`), `description`, `sync_enabled` (Postgres only; SQLite rejects `sync_enabled = true` at config-load).
- `[[datasets.sources]]`: repeated. `plugin` (`markdown_vault` | `claude_code` | `opencode` | `filesystem` | `chatgpt_export` | `codex_cli` | `gemini_cli` | `jsonl_chat`), source-specific paths, `chunker`, `chunker_config`. An optional `[datasets.sources.extraction]` block tunes the Phase D extractor registry (`enable_pdf`, `enable_office`, `csv_max_rows`, `max_bytes`, `ocr_enabled`, `ocr_dpi`, …).
- `[[embedders]]`: repeated. `name`, `provider` (`sentence_transformers` | `openai` | `model2vec` | `llama-cpp`), `model_id`, `dimension`, `normalize`, `distance`, `active`, `batch_size`, `device`, `api_key_env` (OpenAI only).
- `[retrieval]`: `fusion` (`rrf` | `alpha`), `alpha`, `default_k`, `rerank_top_n`, `rerank_enabled`, `reranker.{kind, model_id, device, ...}`.
- `[vlm]`: Phase D P1. `backend ∈ {none, ollama, mistral}`, `ollama_url`, `mistral_base_url`, `timeout_s`.
- `[classifier]`: Phase E. `chain = ["rule", "llm"]`, `escalation_threshold`, `llm_model`, `llm_url`, `llm_temperature`, `llm_excerpt_chars`.
- `[whisper]`: Phase G P0. `backend ∈ {none, local, remote}`, `model`, `local_compute_type`, `remote_base_url`, `remote_api_key_env`, `language`.
- `[code_enricher]`: Phase H. `backend ∈ {none, local, remote}`, `local_url`, `remote_url`, `remote_api_shape ∈ {ollama, openai}`, `temperature`.
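A config-load check of the kind described for `sync_enabled` might look like the sketch below. It operates on an already-parsed config dictionary; `validate_config` is a hypothetical helper and the error wording is invented for illustration. Only the constraints themselves (the two backend kinds, the two dataset kinds, and SQLite rejecting `sync_enabled = true`) come from the list above.

```python
def validate_config(cfg):
    """Raise ValueError on combinations the key sections above rule out."""
    kind = cfg["backend"]["kind"]
    if kind not in {"postgres", "sqlite"}:
        raise ValueError(f"[backend] kind must be postgres or sqlite, got {kind!r}")
    for dataset in cfg.get("datasets", []):
        name = dataset.get("name", "<unnamed>")
        if dataset.get("kind") not in {"text", "chat"}:
            raise ValueError(f"dataset {name}: kind must be text or chat")
        # Cross-host sync needs the shared Postgres backend.
        if kind == "sqlite" and dataset.get("sync_enabled", False):
            raise ValueError(f"dataset {name}: sync_enabled requires the postgres backend")
    return cfg


ok = {"backend": {"kind": "sqlite", "dsn": "corpus.db"},
      "datasets": [{"name": "notes", "kind": "text"}]}
validate_config(ok)  # passes silently
```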
| OS | Script | Service manager |
|---|---|---|
| Linux | `scripts/linux/install.sh` | systemd user unit |
| macOS | `scripts/macos/install.sh` | launchd agent |
| Windows | (manual) | NSSM / Task Scheduler |
Inspect the rendered unit / plist under packaging/ for reference. make stop and make logs dispatch on uname -s.
To add an embedder to an existing corpus:
# 1. Add to config.toml — keep existing embedders active.
[[embedders]]
name = "new-embedder"
provider = "sentence_transformers"
model_id = "new/model"
dimension = 1024
active = true

# 2. Backfill just the new embedder against existing chunks.
corpus-forge embed --embedder new-embedder
# Or all active embedders in one pass:
corpus-forge embed

Chunks already have content-hashes; the backfill encodes only what's missing.
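Conceptually, "encodes only what's missing" is a set difference between the chunks that exist and the (chunk, embedder) pairs that already have vectors. The sketch below is a hypothetical in-memory version; `missing_pairs` and its arguments are illustrative names, and the real backfill presumably does this with a query against the per-embedder tables.

```python
def missing_pairs(chunk_hashes, embedded, embedders):
    """List (content_hash, embedder) pairs that still need a vector.

    chunk_hashes: content hashes of every chunk in the corpus
    embedded:     set of (content_hash, embedder) pairs already stored
    embedders:    names of the embedders selected for this run
    """
    return [(h, e) for e in embedders for h in chunk_hashes if (h, e) not in embedded]


hashes = ["h1", "h2", "h3"]
stored = {("h1", "old-embedder"), ("h2", "old-embedder"), ("h3", "old-embedder"),
          ("h1", "new-embedder")}
print(missing_pairs(hashes, stored, ["old-embedder", "new-embedder"]))
```

Only the two chunks that the new embedder has not seen are scheduled; the existing embedder's lane is left alone.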
corpus-forge runs as a fleet — many machines pointed at one shared
Postgres, draining the same embedding backlog and sharing the same
corpus-shaped config. One machine owns the schema (it ran
corpus-forge migrate); every other machine joins.
# On the new machine — one command takes it from installed to a
# registered fleet host with the shared config pulled.
corpus-forge setup --join postgresql://user@pg-host:5432/corpus
# Then add your local source roots (the join renders shared datasets
# as commented-out blocks — each machine ingests its own directories)
# and start working the backlog:
corpus-forge ingest --once
corpus-forge embed # claims chunks; multiple hosts drain safely

Distributed embedding (no babysitting). Every host's
corpus-forge embed / embed-worker claims chunks before encoding,
so N machines drain the same lane with zero duplicated GPU work —
crash-safe via lease expiry, no broker. Pin a machine to specific
lanes when its hardware suits one model:
[embed]
lanes = ["qwen3-4096"] # this CUDA box only works the big-dim lane;
# absent ⇒ all active embedders (single-host default)

Shared config that can't silently fork. Dataset names/kinds, embedder definitions, and retrieval settings are shared scope; DSNs, source roots, devices, and API-key env names stay local. Publish and pull the shared subset (comment-preserving, version-guarded):
corpus-forge config publish # push this host's shared scope
corpus-forge config pull # dry-run: show the diff
corpus-forge config pull --apply # rewrite local config.toml (backs up .bak)
corpus-forge config diff # local vs published, no writes

`publish` refuses when the corpus has a newer version than you last
pulled ("pull first"), so two machines can't clobber each other. With
[federation] enabled = true, the daemon WARNs on drift — it never
auto-applies; a human runs config pull.
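The version guard behaves like optimistic concurrency control: a publish carries the version the host last pulled, and the store rejects it if someone else has published since. The sketch below is an in-memory stand-in under that assumption; `SharedConfigStore` and `StaleConfigError` are hypothetical names, not the project's classes.

```python
class StaleConfigError(RuntimeError):
    """Raised when a host publishes on top of a version it has not pulled."""


class SharedConfigStore:
    def __init__(self):
        self.version = 0
        self.shared = {}

    def pull(self):
        """Return the current version and a copy of the shared scope."""
        return self.version, dict(self.shared)

    def publish(self, shared, last_pulled_version):
        # Refuse when the corpus moved on after this host's last pull.
        if self.version > last_pulled_version:
            raise StaleConfigError("pull first")
        self.version += 1
        self.shared = dict(shared)
        return self.version


store = SharedConfigStore()
seen_a, _ = store.pull()
seen_b, _ = store.pull()
seen_a = store.publish({"default_k": 10}, seen_a)  # host A wins the race
try:
    store.publish({"default_k": 50}, seen_b)       # host B is now stale
except StaleConfigError as err:
    print("refused:", err)
```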
Tailscale-native addressing (optional). Point every box at a
stable MagicDNS name instead of a hand-pinned 100.x IP — ts:// is
accepted anywhere a host URL/DSN appears:
[backend]
dsn = "ts://pg-host:5432/corpus"
[tailscale]
enabled = true # default false; ts:// errors clearly when off

`corpus-forge setup` offers a live-peer picker for the Postgres host
and remote embedder endpoints when Tailscale is up; `corpus-forge hosts list` shows a ●/○ online marker per fleet host; `corpus-forge doctor` resolves every ts:// name and TCP-probes its port.
Requires the postgres backend — SQLite is single-machine by
construction. See docs/ and the rfc-fleet-* design notes
under .planning/rfcs/ for internals.
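The claim-before-encode behaviour described earlier in this section (crash-safe via lease expiry, no broker) can be modelled with a small in-memory table. This is a sketch of the general lease technique only: `ClaimTable` is hypothetical, time is passed in explicitly to keep the example deterministic, and a Postgres implementation would need an atomic claim (row locking such as `SELECT ... FOR UPDATE SKIP LOCKED` is one common approach, though the source does not say which mechanism corpus-forge uses).

```python
class ClaimTable:
    """Pending chunks with optional (owner, lease_expiry) leases."""

    def __init__(self, chunk_ids):
        self._leases = {cid: None for cid in chunk_ids}

    def claim(self, host, limit, now, ttl):
        """Lease up to `limit` chunks that are unclaimed or whose lease expired."""
        claimed = []
        for cid, lease in self._leases.items():
            if len(claimed) >= limit:
                break
            if lease is None or lease[1] <= now:
                self._leases[cid] = (host, now + ttl)
                claimed.append(cid)
        return claimed

    def complete(self, host, cid, now):
        """Mark a chunk done; only the current, unexpired lease holder may do so."""
        lease = self._leases.get(cid)
        if lease is None or lease[0] != host or lease[1] <= now:
            return False
        del self._leases[cid]
        return True


table = ClaimTable(["c1", "c2", "c3"])
first = table.claim("host-a", limit=2, now=0, ttl=60)   # host-a takes c1, c2
second = table.claim("host-b", limit=2, now=10, ttl=60)  # host-b only sees c3
# host-a crashes; once its leases expire, host-b picks the work up.
rescued = table.claim("host-b", limit=5, now=61, ttl=60)
```

Because a chunk is leased before any encoding starts, two hosts never encode the same chunk while a lease is live, and a crashed host's work returns to the pool after `ttl`.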
corpus-forge ships ready-to-use setup guides for every major coding assistant. Hand one of these to your assistant (or read it yourself) and you'll be ingesting + searching + curating within a few commands:
- `CLAUDE.md`: Claude Code, Claude Desktop, Anthropic API / Managed MCP.
- `GEMINI.md`: Gemini CLI, Gemini Code Assist, Vertex AI.
- `AGENTS.md`: vendor-neutral recipe for OpenCode, Cursor, Zed, Continue, Cline, and anything else that speaks MCP.
Each guide walks an assistant from install → configure → migrate → MCP wire-up → skill registration → first-run sanity → curation-loop playbook → troubleshooting. The same canonical MCP launch block (corpus-forge mcp serve --transport stdio) works across every client.
corpus-forge ships a stdio Model Context Protocol server that exposes the following tools:
| Tool | Use | Gate |
|---|---|---|
| `search` | Hybrid (dense + lexical) search with optional rerank. Returns `{hits: [...]}` with `chunk_id`, `score`, `text`, `source_uri`, `title`, `dataset_id`. | read-only |
| `get_chunk` | Fetch a chunk by id. | read-only |
| `list_datasets` | Enumerate datasets with `chunk_count` / `document_count`. | read-only |
| `estimate_sync_size` | Predict the Postgres footprint of syncing a directory tree. No I/O, no model calls. | read-only |
| `next_curation_target` / `next_curation_batch` | Ranker-driven "what entry most needs my help right now?" Returns a `CurationTarget` (or a cohesive batch) with text, current labels, missing fields, and a score breakdown. | read-only |
| `commit_curation` | Atomic multi-write covering label adds/removes, metadata, description, and feedback, for a single chunk or a batch. Composes the lower-level write tools below. | writes_enabled |
| `add_label` / `remove_label` / `set_metadata` / `set_description` / `add_feedback` / `list_labels` | Direct curation writes. Available stand-alone or wrapped by `commit_curation`. | writes_enabled |
| `append_conversation` / `append_message` / `render_conversation` / `list_chat_templates` / `register_template` / `register_session` | Chat-corpus authoring + templated rendering for export. | writes_enabled |
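For readers wiring a custom client, a tool invocation over MCP is a JSON-RPC 2.0 `tools/call` request, sent as one newline-delimited JSON message on the stdio transport. The sketch below only builds such a message; `tool_call` is a hypothetical helper, and the `query` and `k` argument names are assumptions about the `search` tool's input schema (a real client would read the schema from `tools/list` after the `initialize` handshake).

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids within a session


def tool_call(name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names below are assumed, not taken from the server's schema.
request = tool_call("search", {"query": "how does the SQLite lock work", "k": 5})
line = json.dumps(request)  # one message per line on stdio
print(line)
```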
pip install 'corpus-forge[mcp]'
corpus-forge mcp serve # stdio transport (only transport in beta)

Drop-in MCP config snippets live under examples/mcp-config/:

- `claude-code.mcp.json`: for Claude Code (`~/.config/claude-code/mcp.json` or `.mcp.json` per-project).
- `claude-desktop.json`: for Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS).
{
"mcpServers": {
"corpus-forge": {
"command": "corpus-forge",
"args": ["mcp", "serve"],
"env": { "CORPUS_FORGE_CONFIG": "~/.config/corpus-forge/config.toml" }
}
}
}

Run ingester daemons on multiple machines against a single central Postgres.
See docs/deployment-satellite.md for the
step-by-step satellite setup guide.
Both shipped under the repo and mirrored across the three supported clients:
- `corpus-forge-search`: search-and-cite. Files: `.claude/skills/corpus-forge-search/SKILL.md`, `.opencode/command/corpus-forge-search.md`, `.gemini/agents/corpus-forge-search.md`.
- `corpus-curate`: the data-improvement chat loop. Files: `.claude/skills/corpus-curate/SKILL.md`, `.opencode/command/corpus-curate.md`, `.gemini/agents/corpus-curate.md`.
- Research-librarian subagent: `.claude/agents/corpus-forge-researcher.md`, an Anthropic Agent SDK delegate scoped to the search-and-cite tools.
- Full walkthrough: `docs/claude-integration.md`.
Rerank (rerank=true) triggers a one-time ~600 MB BAAI/bge-reranker-v2-m3 download. Opt-in only for top-of-list precision needs. The corpus-curate selector reuses the same reranker for its "elevation potential" score, so it inherits the same local-or-remote URL choice you set in [reranker].
The same retrieval surface is available as a CLI:
corpus-forge search "how does the SQLite lock work" --k 5
corpus-forge search "phase B retrieval" --dataset planning --rerank --json

The retrieval-eval harness doubles as a corpus-quality signal. Run NDCG@10 / MRR@10 / Recall@20 on a bundled gold set:
corpus-forge eval retrieval --dataset forge_self --k 10,20
corpus-forge eval corpus-quality --dataset /path/to/held-out-qa.jsonl

A drop in recall@20 on your own held-out QA pairs is an early-warning signal that your chunking / embedder config regressed before you export the corpus for training.
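For reference, the three metrics can be computed per query as in the sketch below and then averaged over the gold set. These are the textbook binary-relevance definitions, not the harness's own code; the function names are illustrative, and the ranked list is assumed to contain no duplicate ids.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant id within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "x", "b", "y"]   # hypothetical retrieved chunk ids, best first
relevant = {"a", "b", "c"}      # hypothetical gold ids for this query
print(recall_at_k(ranked, relevant, 20), mrr_at_k(ranked, relevant, 10),
      ndcg_at_k(ranked, relevant, 10))
```

Recall@20 is the most sensitive of the three to chunking regressions, because a relevant passage that is split badly tends to fall out of the candidate set entirely rather than merely drop a few ranks.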
make dev # install dev deps + pre-commit hooks
make ci # format-check + lint + typecheck + unit + fuzz + smoke
make test-unit # parallel unit tests, coverage-gated ≥ 89%
make test-integration # Docker-backed pgvector
make test-fuzz # Hypothesis property tests
make test-smoke # end-to-end happy paths

See CONTRIBUTING.md for branching + commit conventions + the PR gate.
- License: Apache 2.0
- Contributing: `CONTRIBUTING.md`
- Code of Conduct: `CODE_OF_CONDUCT.md` (Contributor Covenant 2.1)
- Security: `SECURITY.md` (do not open public issues for vulnerabilities; email `evan@jwo3.io`).
- Changelog: `CHANGELOG.md`
