Multimodal Legal RAG — Reference Implementation

A complete, reproducible blueprint for building a private, multimodal Retrieval-Augmented Generation (RAG) system over a large, heterogeneous document corpus — using Gemini Embedding 2, LanceDB, Google Cloud Run, and the Model Context Protocol (MCP) for direct integration with Claude.ai.

This repository documents an actual production deployment: ~20,000 files, 30 GB, 145 client folders, indexed in a single overnight run for ~$190 in API costs, queryable from Claude.ai with citations back to original source documents.

What this is

If you have a large, messy archive of mixed-format documents — PDFs, Word docs, spreadsheets, emails, scanned images — and you want to be able to ask natural-language questions about it from a modern chat interface with citations, this is a working pattern that gets you there.

It is not a packaged product. It is a thoroughly commented, opinionated reference implementation that you can read, understand, fork, and adapt.

Concrete outcomes the original deployment achieves

End-to-end search over a real legal corpus containing pleadings, depositions, contracts, correspondence, exhibits, and images — across 145 distinct matters.
Multimodal embedding: PDFs go in as PDFs (preserving layout, signatures, exhibits, scanned content), not stripped to plain text first. Word docs, Excel sheets, images, and emails are similarly preserved.
Multi-turn conversational querying from Claude.ai with the full 200K context window.
Cited results: every answer Claude generates includes 15-minute signed URLs that link back to the exact source document. Click → original PDF opens in the browser.
Serverless deployment: the query server scales to zero when idle. Operating cost approaches $0 when not in use; per-query cost is fractions of a cent.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Local corpus (30 GB)                                              │
│   ────────────────────                                              │
│   /My_Corpus/                                                       │
│     ├── Client_A/  (PDFs, .doc, .docx, .xlsx, .eml, scans, etc.)    │
│     ├── Client_B/                                                   │
│     └── ...145 client folders                                       │
│                                                                     │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           │  Step 1: Preprocessing  (scripts/convert.py)
                           │  Normalize formats: .doc → .docx, .msg → .eml, etc.
                           │  Hash files for dedup.
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│   Staged corpus (normalized)                                        │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           │  Step 2: Ingestion  (scripts/ingest.py)
                           │  - Modality-aware chunking (PDFs by page windows,
                           │    DOCX by token, XLSX by sheet, images whole)
                           │  - Multimodal embedding via Gemini Embedding 2
                           │    (3072-dim vectors, native PDF/image input)
                           │  - Upload originals to GCS for citation links
                           │  - Resumable, deduplicating, cost-capped
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   ┌────────────────────────┐      ┌────────────────────────────┐    │
│   │  LanceDB index         │      │  GCS raw bucket            │    │
│   │  (vectors + metadata)  │      │  (original files by hash)  │    │
│   │  on GCS                │      │                            │    │
│   └────────────────────────┘      └────────────────────────────┘    │
│                                                                     │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           │  Step 3: Serving  (server/mcp_server.py)
                           │  FastMCP server on Cloud Run.
                           │  Exposes `query` and `list_clients` tools.
                           │  Generates v4 signed URLs for citations.
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Claude.ai Custom Connector                                        │
│   ───────────────────────────                                       │
│   Natural-language queries → MCP tool calls → cited answers         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
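
The citation step in Step 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: `bucket` is expected to be a `google.cloud.storage.Bucket`, and `object_name` stands for whatever key the ingest step stored the original under (the hash-based naming is hypothetical here). `Blob.generate_signed_url` with `version="v4"` is the google-cloud-storage call that produces time-limited links.

```python
from datetime import timedelta


def citation_url(bucket, object_name, minutes=15):
    """Return a time-limited v4 signed URL for one stored original.

    `bucket` is expected to be a google.cloud.storage.Bucket.
    `object_name` is the key used at ingest time (assumed, not from the repo).
    """
    blob = bucket.blob(object_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),  # matches the 15-minute links
        method="GET",
    )
```

On Cloud Run the default credentials usually carry no private key, so signing may also need the `service_account_email` and `access_token` arguments; treat that as something to verify against your own deployment.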

Design decisions (and why)

This implementation makes specific tradeoffs. Each is annotated here so you know what to keep, what to change, and why.

1. Gemini Embedding 2 as the embedding model

The killer feature of Gemini Embedding 2 is native multimodal input. You hand it a PDF page or an image and it produces a single vector. No OCR pipeline, no separate image embedder, no plain-text reduction. For a corpus with handwritten signatures, scanned exhibits, complex tables, and image attachments, this preserves information that text-only embedding pipelines silently discard.

The cost is real (~$0.0033 per chunk at 3072 dimensions), but the quality gain on legal/financial corpora — where layout and visual artifacts carry semantic weight — is dramatic.

2. LanceDB for the vector store

LanceDB is an open-source columnar vector database with native GCS support. The index is just a directory of Parquet-like files. You can rsync it to GCS and read it back from Cloud Run with no separate database to provision, no monthly minimums, no connection pooling.

Compared to managed alternatives (Pinecone, Vertex Vector Search, etc.), LanceDB on GCS is:

Cheaper at this scale — storage is GCS storage; query compute is whatever container is reading it.
More portable — your data is just files. Move clouds, run locally, archive — no lock-in.
Slightly less performant at extreme scale — for tens of millions of vectors you'd want something purpose-built. For < 1M, LanceDB is more than sufficient.
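
A query against the index might look roughly like the sketch below. It assumes `table` is a LanceDB table, for example one opened with `lancedb.connect("gs://...")` followed by `open_table(...)`. The `file_hash` column name is an assumption made for illustration, and collapsing hits to one per file is one possible design, not necessarily what `mcp_server.py` does.

```python
def top_hits(table, query_vector, k=10):
    """Nearest-neighbour search, keeping only the closest chunk per file.

    `table` is expected to be a LanceDB table; `file_hash` is a
    hypothetical metadata column naming the source file of each chunk.
    """
    rows = table.search(query_vector).limit(k).to_list()
    best = {}
    for row in rows:
        key = row["file_hash"]
        # LanceDB adds a `_distance` column to results; smaller is closer.
        if key not in best or row["_distance"] < best[key]["_distance"]:
            best[key] = row
    return sorted(best.values(), key=lambda r: r["_distance"])
```

Because the index is just files, the same function works against a local copy of the directory during development.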

3. Modality-aware chunking

A naive RAG pipeline chunks everything by character or token count. That destroys structure. This implementation uses different strategies per modality:

Modality	Chunking strategy
PDF	6-page sliding windows with 1-page overlap, embedded as native PDF (preserves layout)
DOCX	~6000-token windows with 500-token overlap (text content only)
XLSX	One chunk per sheet, rendered as markdown table
Image	Single chunk, embedded as image
Email	Headers + body as one chunk
TXT/MD/HTML	Standard token-based chunks

The PDF strategy in particular matters for legal work: a 6-page window typically covers a complete argument, a deposition exchange, or a contract section. 1-page overlap ensures concepts spanning windows aren't lost.
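
The windowing arithmetic behind the PDF and DOCX rows can be sketched as below. This is a minimal illustration, not the code in `ingest.py`; the function name is hypothetical, and it only computes index ranges (0-based, end-exclusive), leaving PDF slicing and tokenization to the caller.

```python
def sliding_windows(n_items, size, overlap):
    """Return (start, end) half-open index ranges that cover n_items.

    Consecutive windows share `overlap` items; the last window may be short.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while start < n_items:
        end = min(start + size, n_items)
        windows.append((start, end))
        if end == n_items:
            break  # the tail is already covered; avoid a redundant window
        start += step
    return windows


# A 14-page PDF with 6-page windows and 1-page overlap:
print(sliding_windows(14, 6, 1))  # [(0, 6), (5, 11), (10, 14)]
```

The same function covers the DOCX case with `size=6000` and `overlap=500`, applied to token positions instead of pages.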

4. Cloud Run for serving (not a VM, not Kubernetes)

The query server is stateless: embed the query, search LanceDB, return results. Cloud Run scales to zero when idle, so the server costs $0/hour when no one is using it. The first request triggers a ~3 second cold start; subsequent requests in the same minute are warm.

Versus running a VM (always-on, ~$15/month) or Kubernetes (overkill for a single service), Cloud Run is the right answer for a personal or single-team deployment.

5. URL-embedded token for authentication

The MCP server is reachable at:

https://<host>/<64-character-secret-token>/mcp

The secret is part of the URL path. Without the correct path, requests get a 404 — the server doesn't even acknowledge that an MCP endpoint exists. With it, requests pass through to the MCP handler.

This works around a current limitation of Claude.ai's Custom Connector UI (which exposes OAuth fields but not static bearer auth). In transit, the security properties match a bearer header: 64 random characters provide the same entropy whether they're in a path or a header, and HTTPS encrypts URL paths just as it encrypts headers. The difference is at rest: URL paths are routinely recorded in request logs (Cloud Run's included), so restrict who can read those logs and rotate the token if a URL leaks.

For multi-tenant or enterprise use, you would replace this with proper OAuth 2.0 + Dynamic Client Registration. For a single-user setup, this is materially equivalent and significantly less code.
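
The path check itself is small. The sketch below shows one way to do it with the standard library only; the function names are hypothetical, the token is assumed to be 64 hex characters, and the real server may wire this into its web framework differently.

```python
import hmac
import secrets


def new_token():
    """Generate a 64-character hex token (32 random bytes, 256 bits)."""
    return secrets.token_hex(32)


def route(path, token):
    """Return the inner path if `path` starts with /<token>/, else None.

    A caller would answer None with a plain 404, so a wrong path looks
    the same as a path that never existed.
    """
    parts = path.split("/", 2)  # ["", "<candidate>", "<rest>"]
    if len(parts) < 3 or parts[0] != "":
        return None
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(parts[1].encode(), token.encode()):
        return None
    return "/" + parts[2]
```

With `token = new_token()`, `route(f"/{token}/mcp", token)` yields `"/mcp"`, while any other prefix yields `None`.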

6. File hash as the canonical identifier

The same document often appears in multiple folders (a brief filed in two related cases, a contract attached to multiple emails, etc.). Embedding it multiple times wastes API spend and creates duplicate hits in search results.

The ingest pipeline hashes every file's contents (SHA-256) and uses the hash as the primary key. Multiple folder paths can point to the same hash; the file is embedded once. Search results carry all the paths the file appears under, letting Claude give context-aware citations ("this contract appears in both the Smith and Jones matters…").
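
Content hashing and path grouping can be sketched as below. This is an illustration of the idea using only the standard library, not the pipeline's actual code; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Hash a file's contents in 1 MiB blocks so large files stay off the heap."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def group_by_hash(root):
    """Map each content hash to the sorted relative paths that share it."""
    groups = defaultdict(list)
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            groups[sha256_of(p)].append(p.relative_to(root).as_posix())
    return dict(groups)
```

Each key would be embedded once; the list of paths travels with the search result so a citation can name every matter the file appears in.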

Repository layout

.
├── README.md                         ← You are here
├── scripts/
│   ├── convert.py                    ← Step 1: normalize the local corpus
│   ├── ingest.py                     ← Step 2: chunk, embed, write index
│   └── requirements-scripts.txt      ← Deps for the local pipeline
├── server/
│   ├── mcp_server.py                 ← Step 3: query server
│   ├── Dockerfile                    ← Cloud Run build
│   ├── requirements.txt              ← Server deps
│   └── .gcloudignore                 ← Files to exclude from upload
├── docs/
│   ├── 01-environment-setup.md       ← GCP project, buckets, secrets, IAM
│   ├── 02-corpus-preparation.md      ← Running convert.py, troubleshooting
│   ├── 03-ingestion.md               ← Running ingest.py, monitoring, costs
│   ├── 04-deployment.md              ← Cloud Run deploy, smoke tests
│   ├── 05-claude-integration.md      ← Custom Connector setup
│   └── 06-operations.md              ← Costs, rotation, updates, troubleshooting
├── .env.example                      ← Template for required env vars (no secrets)
├── .gitignore                        ← Aggressively excludes secrets and bulk data
└── LICENSE

Quick start (the 10-minute version)

If you've already read the design and want the executive summary of what to run:

# 1. Set up GCP (see docs/01-environment-setup.md)
gcloud projects create my-rag-project
gcloud storage buckets create gs://my-rag-raw    --location=us-east4
gcloud storage buckets create gs://my-rag-index  --location=us-east4
# ...create service account, secrets, IAM bindings

# 2. Convert your corpus (see docs/02-corpus-preparation.md)
python scripts/convert.py
# Output: ./staged/ with normalized files + manifest.csv

# 3. Run ingestion (see docs/03-ingestion.md)
python scripts/ingest.py --dry-run         # cost estimate
python scripts/ingest.py --limit 100       # validation run
python scripts/ingest.py                   # full run (8h, ~$190 for 17K files)
python scripts/ingest.py --sync-to-gcs     # push index to GCS

# 4. Deploy the MCP server (see docs/04-deployment.md)
cd server/
gcloud run deploy oldfirm-mcp --source . ...

# 5. Add the connector to Claude.ai (see docs/05-claude-integration.md)
# Settings → Connectors → Add Custom Connector → paste URL

What this is NOT

Honest disclaimers, since this is a reference implementation people will likely fork:

Not a multi-tenant system. Auth is a shared secret embedded in the URL. Adequate for one user; inadequate for several.
Not enterprise-grade. No audit logging beyond Cloud Run's defaults. No data residency controls. No automated key rotation. No backup-and-restore tooling.
Not legal advice software. This is a search and retrieval tool. The LLM can summarize and cite, but it cannot replace human judgment on legal questions.
Not optimized for very small corpora. If you have 50 documents, you don't need any of this — just use Claude's file upload. The complexity here pays off at thousands of files.

If your needs go beyond these limits, the components are still useful as starting points, but expect to invest more engineering on top.

Costs (realistic numbers)

For the 17,000-file, 30 GB legal corpus the original was built on:

Item	One-time	Recurring
Gemini Embedding 2 (ingest)	~$190	—
GCS storage (~32 GB raw + 1 GB index)	—	~$0.70/month
Cloud Run compute (idle most of the time)	—	< $1/month
Gemini Embedding 2 (per-query embeddings)	—	~$0.0001/query
Cloud Build (deploys)	—	negligible

Total recurring cost: under $5/month for a single-user, occasional-use system.

The ingest cost scales linearly with corpus size — figure roughly $0.01 per file. For a 100,000-file corpus expect ~$1,000 in embedding costs.
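
The arithmetic behind that rule of thumb, using only figures quoted in this README, is worked through below. The implied chunk count is derived, not measured, and the linear extrapolation assumes a similar mix of file types and sizes.

```python
INGEST_COST_USD = 190.0      # reported one-time embedding spend
FILES_INGESTED = 17_000      # reported file count for that run
COST_PER_CHUNK_USD = 0.0033  # reported per-chunk price at 3072 dimensions


def per_file_cost():
    """Average embedding spend per file in the original run."""
    return INGEST_COST_USD / FILES_INGESTED


def implied_chunks():
    """Chunk count implied by total spend and per-chunk price (derived)."""
    return INGEST_COST_USD / COST_PER_CHUNK_USD


def estimate(n_files):
    """Linear extrapolation of embedding cost to a corpus of n_files."""
    return n_files * per_file_cost()


print(round(per_file_cost(), 4))  # 0.0112
print(round(implied_chunks()))    # 57576
print(round(estimate(100_000)))   # 1118
```

That works out to a little over three chunks per file on average, which is why the per-file figure sits well above the per-chunk price.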

Acknowledgments

This implementation was built in collaboration with Claude (Opus 4.7) over several sessions. The architectural decisions, code, and documentation reflect a working dialogue, not a one-shot generation. Every script was validated end-to-end on real data.

License

MIT — see LICENSE.
