A complete, reproducible blueprint for building a private, multimodal Retrieval-Augmented Generation (RAG) system over a large, heterogeneous document corpus — using Gemini Embedding 2, LanceDB, Google Cloud Run, and the Model Context Protocol (MCP) for direct integration with Claude.ai.
This repository documents an actual production deployment: ~20,000 files, 30 GB, 145 client folders, indexed in a single overnight run for ~$190 in API costs, queryable from Claude.ai with citations back to original source documents.
If you have a large, messy archive of mixed-format documents — PDFs, Word docs, spreadsheets, emails, scanned images — and you want to be able to ask natural-language questions about it from a modern chat interface with citations, this is a working pattern that gets you there.
It is not a packaged product. It is a thoroughly commented, opinionated reference implementation that you can read, understand, fork, and adapt.
- End-to-end search over a real legal corpus containing pleadings, depositions, contracts, correspondence, exhibits, and images — across 145 distinct matters.
- Multimodal embedding: PDFs go in as PDFs (preserving layout, signatures, exhibits, scanned content), not stripped to plain text first. Word docs, Excel sheets, images, and emails are similarly preserved.
- Multi-turn conversational querying from Claude.ai with the full 200K context window.
- Cited results: every answer Claude generates includes 15-minute signed URLs that link back to the exact source document. Click → original PDF opens in the browser.
- Serverless deployment: the query server scales to zero when idle. Operating cost approaches $0 when not in use; per-query cost is fractions of a cent.
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Local corpus (30 GB) │
│ ──────────────────── │
│ /My_Corpus/ │
│ ├── Client_A/ (PDFs, .doc, .docx, .xlsx, .eml, scans, etc.) │
│ ├── Client_B/ │
│ └── ...145 client folders │
│ │
└──────────────────────────┬──────────────────────────────────────────┘
│
│ Step 1: Preprocessing (scripts/convert.py)
│ Normalize formats: .doc → .docx, .msg → .eml, etc.
│ Hash files for dedup.
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Staged corpus (normalized) │
└──────────────────────────┬──────────────────────────────────────────┘
│
│ Step 2: Ingestion (scripts/ingest.py)
│ - Modality-aware chunking (PDFs by page windows,
│ DOCX by token, XLSX by sheet, images whole)
│ - Multimodal embedding via Gemini Embedding 2
│ (3072-dim vectors, native PDF/image input)
│ - Upload originals to GCS for citation links
│ - Resumable, deduplicating, cost-capped
▼
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ ┌────────────────────────┐ ┌────────────────────────────┐ │
│ │ LanceDB index │ │ GCS raw bucket │ │
│ │ (vectors + metadata) │ │ (original files by hash) │ │
│ │ on GCS │ │ │ │
│ └────────────────────────┘ └────────────────────────────┘ │
│ │
└──────────────────────────┬──────────────────────────────────────────┘
│
│ Step 3: Serving (server/mcp_server.py)
│ FastMCP server on Cloud Run.
│ Exposes `query` and `list_clients` tools.
│ Generates v4 signed URLs for citations.
▼
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Claude.ai Custom Connector │
│ ─────────────────────────── │
│ Natural-language queries → MCP tool calls → cited answers │
│ │
└─────────────────────────────────────────────────────────────────────┘
This implementation makes specific tradeoffs. Each is annotated here so you know what to keep, what to change, and why.
The killer feature of Gemini Embedding 2 is native multimodal input. You hand it a PDF page or an image and it produces a single vector. No OCR pipeline, no separate image embedder, no plain-text reduction. For a corpus with handwritten signatures, scanned exhibits, complex tables, and image attachments, this preserves information that text-only embedding pipelines silently discard.
The cost is real (~$0.0033 per chunk at 3072 dimensions), but the quality gain on legal/financial corpora — where layout and visual artifacts carry semantic weight — is dramatic.
LanceDB is an open-source columnar vector database with native GCS support. The index is just a directory of Parquet-like files. You can rsync it to GCS and read it back from Cloud Run with no separate database to provision, no monthly minimums, no connection pooling.
Compared to managed alternatives (Pinecone, Vertex Vector Search, etc.), LanceDB on GCS is:
- Cheaper at this scale — storage is GCS storage; query compute is whatever container is reading it.
- More portable — your data is just files. Move clouds, run locally, archive — no lock-in.
- Slightly less performant at extreme scale — for tens of millions of vectors you'd want something purpose-built. For < 1M, LanceDB is more than sufficient.
A naive RAG pipeline chunks everything by character or token count. That destroys structure. This implementation uses different strategies per modality:
| Modality | Chunking strategy |
|---|---|
| PDF | 6-page sliding windows with 1-page overlap, embedded as native PDF (preserves layout) |
| DOCX | ~6000-token windows with 500-token overlap (text content only) |
| XLSX | One chunk per sheet, rendered as markdown table |
| Image | Single chunk, embedded as image |
| Email | Headers + body as one chunk |
| TXT/MD/HTML | Standard token-based chunks |
The PDF strategy in particular matters for legal work: a 6-page window typically covers a complete argument, a deposition exchange, or a contract section. 1-page overlap ensures concepts spanning windows aren't lost.
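The windowing arithmetic behind the PDF and DOCX rows can be sketched as below. This is a minimal illustration, not the code in `scripts/ingest.py`: the helper name `sliding_windows` and the 14-page and 13,000-token inputs are assumptions chosen for the example.

```python
from typing import Iterator, Tuple


def sliding_windows(total: int, size: int, overlap: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) ranges covering `total` items.

    Consecutive windows share `overlap` items; the last window may be shorter.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    start = 0
    while start < total:
        end = min(start + size, total)
        yield start, end
        if end == total:
            break  # the final window already reaches the last item
        start += step


# A 14-page PDF with 6-page windows and 1-page overlap (0-based page indices).
pdf_windows = list(sliding_windows(total=14, size=6, overlap=1))

# The same helper applied to token offsets for DOCX text (hypothetical length).
docx_windows = list(sliding_windows(total=13_000, size=6_000, overlap=500))
```

For the 14-page example, `pdf_windows` is `[(0, 6), (5, 11), (10, 14)]`: pages 5 and 10 each appear in two windows, which is the 1-page overlap described above.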
The query server is genuinely stateless. Embed the query, search LanceDB, return results: nothing needs to persist between requests. Because Cloud Run scales to zero when idle, the server costs $0/hour when no one is using it. The first request after an idle period triggers a ~3 second cold start; requests that follow shortly after hit a warm instance.
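One way to picture the stateless flow is as a function whose inputs are all passed in explicitly. This is a sketch under stated assumptions: `handle_query`, `embed`, and `search` are hypothetical names, and the two callables stand in for the Gemini embedding call and the LanceDB search, neither of which is shown here.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Hit = Dict[str, object]


def handle_query(
    question: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[Hit]],
    top_k: int = 10,
) -> List[Hit]:
    """Embed the question, search the index, return the hits.

    Everything the function needs arrives as an argument and nothing is
    cached or written, so any container instance can serve any request.
    """
    vector = embed(question)
    return search(vector, top_k)
```

Because no request depends on an earlier one, a freshly started instance behaves the same as a warm one apart from startup latency.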
Versus running a VM (always-on, ~$15/month) or Kubernetes (overkill for a single service), Cloud Run is the right answer for a personal or single-team deployment.
The MCP server is reachable at:
https://<host>/<64-character-secret-token>/mcp
The secret is part of the URL path. Without the correct path, requests get a 404 — the server doesn't even acknowledge that an MCP endpoint exists. With it, requests pass through to the MCP handler.
This works around a current limitation of Claude.ai's Custom Connector UI (which exposes OAuth fields but not static bearer auth). In transit, the security properties match a bearer header: 64 random characters provide the same entropy whether they're in a path or a header, and HTTPS encrypts URL paths just as it encrypts headers. The difference is at rest: URLs are more likely than headers to be written to request logs and client histories, so treat anything that records the full URL as sensitive.
For multi-tenant or enterprise use, you would replace this with proper OAuth 2.0 + Dynamic Client Registration. For a single-user setup, this is materially equivalent and significantly less code.
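The path check can be sketched with the standard library alone. This is an illustration, not the code in `server/mcp_server.py`: `new_path_secret`, `status_for_path`, and the alphanumeric alphabet are assumptions, and a real server would apply the same check inside its routing layer.

```python
import hmac
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # assumed URL-safe alphabet


def new_path_secret(length: int = 64) -> str:
    """Return a random token drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def status_for_path(path: str, expected_secret: str) -> int:
    """Return 200 if `path` is /<secret>/mcp, otherwise 404.

    The comparison is constant-time so response timing does not reveal
    how many leading characters of a guess were correct.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[1] != "mcp":
        return 404
    supplied = parts[0].encode()
    if hmac.compare_digest(supplied, expected_secret.encode()):
        return 200
    return 404
```

With a 62-character alphabet, each character carries about 5.95 bits, so a 64-character token has roughly 381 bits of entropy. Wrong and missing secrets both return 404, which matches the behavior described above.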
The same document often appears in multiple folders (a brief filed in two related cases, a contract attached to multiple emails, etc.). Embedding it multiple times wastes API spend and creates duplicate hits in search results.
The ingest pipeline hashes every file's contents (SHA-256) and uses the hash as the primary key. Multiple folder paths can point to the same hash; the file is embedded once. Search results carry all the paths the file appears under, letting Claude give context-aware citations ("this contract appears in both the Smith and Jones matters…").
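The content-hash grouping described above can be sketched as follows. This is a minimal version under assumptions: the function names `sha256_of_file` and `group_by_content` are hypothetical, and the actual pipeline presumably stores the mapping alongside the index rather than in an in-memory dict.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file's bytes in 1 MiB blocks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def group_by_content(paths: Iterable[Path]) -> Dict[str, List[Path]]:
    """Map each content hash to every path that holds those exact bytes."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        groups[sha256_of_file(path)].append(path)
    return dict(groups)
```

Each key is embedded once, and the list of paths under it is what lets a search result report every folder the document appears in.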
.
├── README.md ← You are here
├── scripts/
│ ├── convert.py ← Step 1: normalize the local corpus
│ ├── ingest.py ← Step 2: chunk, embed, write index
│ └── requirements-scripts.txt ← Deps for the local pipeline
├── server/
│ ├── mcp_server.py ← Step 3: query server
│ ├── Dockerfile ← Cloud Run build
│ ├── requirements.txt ← Server deps
│ └── .gcloudignore ← Files to exclude from upload
├── docs/
│ ├── 01-environment-setup.md ← GCP project, buckets, secrets, IAM
│ ├── 02-corpus-preparation.md ← Running convert.py, troubleshooting
│ ├── 03-ingestion.md ← Running ingest.py, monitoring, costs
│ ├── 04-deployment.md ← Cloud Run deploy, smoke tests
│ ├── 05-claude-integration.md ← Custom Connector setup
│ └── 06-operations.md ← Costs, rotation, updates, troubleshooting
├── .env.example ← Template for required env vars (no secrets)
├── .gitignore ← Aggressively excludes secrets and bulk data
└── LICENSE
If you've already read the design and want the executive summary of what to run:
# 1. Set up GCP (see docs/01-environment-setup.md)
gcloud projects create my-rag-project
gcloud storage buckets create gs://my-rag-raw --location=us-east4
gcloud storage buckets create gs://my-rag-index --location=us-east4
# ...create service account, secrets, IAM bindings
# 2. Convert your corpus (see docs/02-corpus-preparation.md)
python scripts/convert.py
# Output: ./staged/ with normalized files + manifest.csv
# 3. Run ingestion (see docs/03-ingestion.md)
python scripts/ingest.py --dry-run # cost estimate
python scripts/ingest.py --limit 100 # validation run
python scripts/ingest.py # full run (8h, ~$190 for 17K files)
python scripts/ingest.py --sync-to-gcs # push index to GCS
# 4. Deploy the MCP server (see docs/04-deployment.md)
cd server/
gcloud run deploy oldfirm-mcp --source . ...
# 5. Add the connector to Claude.ai (see docs/05-claude-integration.md)
# Settings → Connectors → Add Custom Connector → paste URL
Honest disclaimers, since this is a reference implementation people will likely fork:
- Not a multi-tenant system. Auth is a shared secret embedded in the URL. Adequate for one user; inadequate for several.
- Not enterprise-grade. No audit logging beyond Cloud Run's defaults. No data residency controls. No automated key rotation. No backup-and-restore tooling.
- Not legal advice software. This is a search and retrieval tool. The LLM can summarize and cite, but it cannot replace human judgment on legal questions.
- Not optimized for very small corpora. If you have 50 documents, you don't need any of this — just use Claude's file upload. The complexity here pays off at thousands of files.
If your needs go beyond these limits, the components are still useful as starting points, but expect to invest more engineering on top.
For the 17,000-file, 30 GB legal corpus the original was built on:
| Item | One-time | Recurring |
|---|---|---|
| Gemini Embedding 2 (ingest) | ~$190 | — |
| GCS storage (~32 GB raw + 1 GB index) | — | ~$0.70/month |
| Cloud Run compute (idle most of the time) | — | < $1/month |
| Gemini Embedding 2 (per-query embeddings) | — | ~$0.0001/query |
| Cloud Build (deploys) | — | negligible |
Total recurring cost: under $5/month for a single-user, occasional-use system.
The ingest cost scales linearly with corpus size — figure roughly $0.01 per file. For a 100,000-file corpus expect ~$1,000 in embedding costs.
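The linear scaling can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in this README ($0.0033 per chunk, ~$190 for 17,000 files); the function names are hypothetical, and the chunks-per-file ratio is derived from those figures, not measured separately.

```python
USD_PER_CHUNK = 0.0033  # per-chunk price quoted above for 3072-dim vectors


def implied_chunks_per_file(total_usd: float, files: int,
                            usd_per_chunk: float = USD_PER_CHUNK) -> float:
    """Back out the average chunks per file from a known total spend."""
    return total_usd / usd_per_chunk / files


def estimate_ingest_cost(files: int, chunks_per_file: float,
                         usd_per_chunk: float = USD_PER_CHUNK) -> float:
    """Estimate embedding spend, assuming cost is linear in chunk count."""
    return files * chunks_per_file * usd_per_chunk


# Ratio implied by the original run: ~$190 across 17,000 files.
ratio = implied_chunks_per_file(190, 17_000)

# Projection for a 100,000-file corpus with the same chunk density.
projected = estimate_ingest_cost(100_000, ratio)
```

The implied density is about 3.4 chunks per file, and the projection comes to roughly $1,118, consistent with the ~$1,000 rule of thumb. A corpus dominated by long PDFs will have more chunks per file and cost correspondingly more.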
This implementation was built in collaboration with Claude (Opus 4.7) over several sessions. The architectural decisions, code, and documentation reflect a working dialogue, not a one-shot generation. Every script was validated end-to-end on real data.
MIT — see LICENSE.