DocMirror

Turn a doc site URL into a clean markdown corpus optimised for speedy and context-efficient LLM consumption.

Why

Agents need docs. They need them locally, instantly, without web fetches — and they need them clean, not bloated with nav chrome and cookie banners that waste tokens. DocMirror caches any doc site as stripped markdown optimised for LLM consumption. When you work with a tool intensively, its docs are always available for agents to grep or pass through the CLI — no fumbling on the website trying to find the right section, no noisy or incomplete WebFetch, no stale search results, no hallucinated API signatures.

The whole thing runs for free, in seconds for most doc sites. The LLM condensing step is optional and meant for cases where your agents read the docs often enough that shaving 20-30% off their size makes a difference. If you're not sure whether that's you, it isn't.

The consumer is the LLM, not you. --smart and --max-pages are escape hatches for when you only need a slice.

How It Works

Discovery — 5-stage pipeline, cheapest first:

Stage	Method	Cost
1	`/llms-full.txt`	Free (1 req) — entire docs in one shot
2	`/llms.txt`	Free (1 req) — structured URL index
3	`sitemap.xml`	Free — URL enumeration
4	Link crawl from root	Free — follows one level deep
5	Firecrawl `/map`	API credits — last resort
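The cheapest-first ordering in the table can be sketched as a loop that stops at the first stage returning any URLs. This is an illustrative sketch only: the `Stage` shape, the `discover` function, and the stub stages are hypothetical and are not taken from `discover.ts`. Stages are modelled as synchronous callbacks to keep the example short; real network stages would be async, and the `/llms-full.txt` stage returns whole docs rather than a URL list.

```typescript
// Hypothetical stage shape; not DocMirror's actual discover.ts API.
type Stage = {
  name: string;
  run: (root: string) => string[]; // returns discovered URLs (sync for brevity)
};

// Try stages in order and return the first non-empty result.
function discover(root: string, stages: Stage[]): { stage: string; urls: string[] } {
  for (const stage of stages) {
    let urls: string[] = [];
    try {
      urls = stage.run(root);
    } catch {
      // A failing stage is treated like an empty one; fall through to the next.
    }
    if (urls.length > 0) return { stage: stage.name, urls };
  }
  return { stage: "none", urls: [] };
}

// Stub stages that record which ones were actually invoked.
const invoked: string[] = [];
const stub = (name: string, urls: string[]): Stage => ({
  name,
  run: () => {
    invoked.push(name);
    return urls;
  },
});

const found = discover("https://docs.example.com", [
  stub("llms-full.txt", []),
  stub("llms.txt", []),
  stub("sitemap.xml", ["https://docs.example.com/guide"]),
  stub("link-crawl", ["https://docs.example.com/other"]),
  stub("firecrawl-map", ["https://docs.example.com/paid"]),
]);
console.log(found.stage, invoked);
```

Because the loop returns early, the paid Firecrawl stage is only reached when every free stage comes back empty.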

Fetch — per-page, 4 methods in priority order:

Content negotiation (Accept: text/markdown)
.md suffix on URL
ReadTheDocs /_sources/*.md.txt
Jina Reader (universal fallback, rate-limited to 500 RPM with an API key, 20 RPM without)
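The priority order above can be pictured as an ordered list of request attempts per page. The sketch below only builds that list; it performs no network calls. The `Attempt` type, the `fetchAttempts` function, and the `docsRoot` parameter are hypothetical, and the mapping from a page URL to a `/_sources/` path is an assumption about how Sphinx-style sites lay out their sources. The `https://r.jina.ai/` prefix is Jina Reader's public URL form.

```typescript
type Attempt = { method: string; url: string; headers?: Record<string, string> };

// Build the ordered attempts for one page (illustrative, not fetch.ts itself).
function fetchAttempts(pageUrl: string, docsRoot: string): Attempt[] {
  const page = pageUrl.replace(/\/+$/, "").replace(/\.html$/, ""); // no trailing slash or .html
  const root = docsRoot.replace(/\/+$/, "");
  // Page path relative to the docs root; assumed to mirror the _sources layout.
  const rel = page.startsWith(root + "/") ? page.slice(root.length + 1) : "index";
  return [
    { method: "content-negotiation", url: pageUrl, headers: { Accept: "text/markdown" } },
    { method: "md-suffix", url: `${page}.md` },
    { method: "rtd-sources", url: `${root}/_sources/${rel}.md.txt` },
    { method: "jina-reader", url: `https://r.jina.ai/${pageUrl}` },
  ];
}

const attempts = fetchAttempts("https://docs.example.com/guide/auth", "https://docs.example.com");
for (const a of attempts) console.log(a.method, a.url);
```

A caller would walk this list and stop at the first attempt that returns usable markdown.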

Strip — platform-detected noise removal. Platform-specific strategies for GitBook, Docusaurus, Sphinx/RTD, Mintlify, and MkDocs. Also detects VitePress, Fumadocs, and Nextra by signature but uses the generic strategy for them. Falls back to generic for unknown platforms.

Condense (optional, --condense) — LLM-powered per-page compression that removes filler prose while preserving code blocks, headings, and technical content. Three structural validators (code block count, heading count, word count) catch gross failures deterministically. Pages failing validation automatically fall back to the uncondensed version. Average reduction recorded during calibration: 20-30%. See Condense for details.

Smart (optional, --smart <query>) — BM25 lexical relevance scoring keeps only the top N pages most relevant to your query. Useful when a doc site has 500+ pages but you only need the auth section. Note: BM25 is keyword-based — it may miss pages that use synonyms or different phrasing than your query. Dropped pages are listed in stderr output.
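To make the keyword-based behaviour concrete, here is a minimal BM25 ranking sketch. It is not the code in `smart.ts`: the tokenizer, the `k1 = 1.2` and `b = 0.75` parameters, and the `bm25TopN` name are assumptions chosen because they are common defaults. The sample pages are invented for illustration.

```typescript
type Page = { url: string; text: string };

// Lowercase alphanumeric tokens; a deliberately simple tokenizer.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function bm25TopN(pages: Page[], query: string, topN: number, k1 = 1.2, b = 0.75): Page[] {
  const entries = pages.map((page) => ({ page, tokens: tokenize(page.text) }));
  const total = entries.length;
  const avgLen = entries.reduce((sum, e) => sum + e.tokens.length, 0) / Math.max(total, 1);
  const terms = Array.from(new Set(tokenize(query)));

  // Document frequency: how many pages contain each query term.
  const df = new Map<string, number>();
  for (const t of terms) df.set(t, entries.filter((e) => e.tokens.includes(t)).length);

  const scored = entries.map(({ page, tokens }) => {
    let score = 0;
    for (const t of terms) {
      const tf = tokens.filter((w) => w === t).length;
      if (tf === 0) continue;
      const n = df.get(t) ?? 0;
      const idf = Math.log(1 + (total - n + 0.5) / (n + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * tokens.length) / avgLen));
    }
    return { page, score };
  });

  return scored.sort((x, y) => y.score - x.score).slice(0, topN).map((s) => s.page);
}

const samplePages: Page[] = [
  { url: "/install", text: "Install the package with bun." },
  { url: "/auth", text: "Authentication uses API keys. Authentication tokens expire." },
  { url: "/login", text: "Sign in with your account credentials." },
];
const top = bm25TopN(samplePages, "authentication", 1);
console.log(top.map((p) => p.url));
```

Note that `/login` scores zero for the query "authentication" even though it is topically related, which is the synonym blind spot described above.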

Validate — JSON reports for cleanliness (residual noise), fidelity (content preservation), coverage (fetch completeness).

Compile — single markdown file with TOC, page separators, and token estimate. A deterministic post-compile step then deduplicates across pages, normalises headings, and builds a grouped TOC — runs in milliseconds regardless of corpus size.

Usage

# Full mirror
bun docmirror.ts https://docs.example.com

# Mirror + condense (compresses filler prose; validators check that code blocks and headings survive)
bun docmirror.ts https://docs.example.com --condense

# Smart subset — top 20 pages relevant to "authentication"
bun docmirror.ts https://docs.example.com --smart "authentication" --top 20

# Filter via Firecrawl search (requires FIRECRAWL_API_KEY)
bun docmirror.ts https://docs.example.com --filter "API reference"

# Single-run model override for Condense option (default: haiku)
bun docmirror.ts https://docs.example.com --condense --condense-model {model}

# Resume a failed run
bun docmirror.ts resume ./output/mylib-docs-20260515

# Inspect run state
bun docmirror.ts inspect ./output/mylib-docs-20260515

Options

Flag	Purpose
`--name <name>`	Override default slug name
`--smart <query>`	Select top N pages by BM25 relevance to query
`--filter <query>`	Discover pages via Firecrawl search (requires API key)
`--condense`	LLM-condense each page (requires `DOCMIRROR_LLM_API_KEY` or Claude CLI)
`--condense-model <model>`	Model override for condensing (default: `DOCMIRROR_LLM_MODEL` env, or `haiku`)
`--condense-concurrency <n>`	Parallel condense workers (default: 3)
`--fabric <pattern>`	Pipe compiled output through a Fabric pattern
`--top <n>`	Page count for --smart (default: 30)
`--lang <code>`	Language code (default: en)
`--exclude-path <paths>`	Comma-separated paths to exclude
`--force-map`	Force Firecrawl /map even if free methods work
`--max-pages <n>`	Hard cap on pages (opt-in only)

Environment

Copy .env.example to .env and fill in your keys:

Variable	Purpose
`DOCMIRROR_OUTPUT`	Output directory (default: `./output`)
`JINA_API_KEY`	Jina Reader API key. Raises the rate limit to 500 RPM (20 RPM without a key). Free tier.
`FIRECRAWL_API_KEY`	Firecrawl — used for `--filter` search and last-resort discovery
`DOCMIRROR_LLM_API_KEY`	API key for `--condense`. Works with any provider. Falls back to `ANTHROPIC_API_KEY`.
`DOCMIRROR_LLM_BASE_URL`	LLM API base URL. Default: Anthropic. Set to any OpenAI-compatible endpoint (OpenAI, Groq, ollama, Together, etc.)
`DOCMIRROR_LLM_MODEL`	Model ID for condensing. Default: `haiku`. Set once, applies to all runs. `--condense-model` overrides per-run.
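The precedence rules in the table can be summarised in a few lines. This is a sketch under assumptions: `resolveCondenseConfig` and its return shape are hypothetical, and only the variable names, defaults, and fallbacks are taken from the table above.

```typescript
type Env = Record<string, string | undefined>;

// Resolve settings the way the table describes them (illustrative only).
function resolveCondenseConfig(env: Env, cliModel?: string) {
  return {
    // --condense-model beats DOCMIRROR_LLM_MODEL, which beats the built-in default.
    model: cliModel ?? env["DOCMIRROR_LLM_MODEL"] ?? "haiku",
    // DOCMIRROR_LLM_API_KEY falls back to ANTHROPIC_API_KEY; may be undefined.
    apiKey: env["DOCMIRROR_LLM_API_KEY"] ?? env["ANTHROPIC_API_KEY"],
    outputDir: env["DOCMIRROR_OUTPUT"] ?? "./output",
  };
}

const fromEnv = resolveCondenseConfig({ DOCMIRROR_LLM_MODEL: "gpt-4o-mini", ANTHROPIC_API_KEY: "sk-ant-example" });
const fromFlag = resolveCondenseConfig({ DOCMIRROR_LLM_MODEL: "gpt-4o-mini" }, "sonnet");
console.log(fromEnv.model, fromFlag.model);
```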

Condense

The --condense flag runs each page through an LLM with a purpose-built prompt that removes filler prose while preserving technical content. Three deterministic validators run on every page:

Code block count: output must contain at least as many ``` fence pairs as the input
Heading count: output must contain at least as many ## heading lines as the input
Word count: output must not exceed 105% of the input's word count (expansion is treated as hallucination)

Pages failing any validator automatically fall back to the stripped but uncondensed version. The prompt is shipped at prompts/condense-page.md.

Important limitation: These validators check structure, not meaning. A condensed page that drops a caveat or inverts a negation ("do not use" → "use") will pass all three structural checks. Condense is a lossy compression — treat it accordingly.
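The three checks, and the limitation just described, can be illustrated with a small sketch. It is not the code in `condense.ts`: the `stats` and `validate` names, the regexes for fences and headings, and whitespace-based word counting are all assumptions. Only the three rules and the 105% threshold come from the text above.

```typescript
const FENCE = "`".repeat(3); // built at runtime to keep this example readable

function stats(md: string): { fencePairs: number; headings: number; words: number } {
  let fences = 0;
  let headings = 0;
  let inFence = false;
  for (const line of md.split("\n")) {
    if (/^\s*`{3}/.test(line)) {
      fences++;
      inFence = !inFence;
      continue;
    }
    // Count heading lines only outside code blocks (shell comments start with # too).
    if (!inFence && /^#{1,6}\s+\S/.test(line)) headings++;
  }
  const words = md.split(/\s+/).filter(Boolean).length;
  return { fencePairs: Math.floor(fences / 2), headings, words };
}

function validate(input: string, output: string): { ok: boolean; failed: string[] } {
  const before = stats(input);
  const after = stats(output);
  const failed: string[] = [];
  if (after.fencePairs < before.fencePairs) failed.push("code-blocks");
  if (after.headings < before.headings) failed.push("headings");
  if (after.words > before.words * 1.05) failed.push("word-count");
  return { ok: failed.length === 0, failed };
}

const original = `## Setup\n\nPlease note that you should basically just run this command:\n\n${FENCE}sh\nbun install\n${FENCE}\n`;
const condensed = `## Setup\n\nRun:\n\n${FENCE}sh\nbun install\n${FENCE}\n`;
const gutted = "## Setup\n\nRun bun install.\n";

console.log(validate(original, condensed)); // passes all three checks
console.log(validate(original, gutted)); // fails: the code block was dropped
console.log(validate("Do not use the legacy endpoint.", "Use the legacy endpoint.")); // passes despite inverted meaning
```

The last call shows why the checks are structural only: an inverted negation has no fences, no headings, and fewer words, so nothing flags it.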

LLM setup (one-time): Condense works with any LLM provider. Configure once in .env:

# Anthropic (default — just set the key)
DOCMIRROR_LLM_API_KEY=sk-ant-...

# OpenAI
DOCMIRROR_LLM_API_KEY=sk-...
DOCMIRROR_LLM_BASE_URL=https://api.openai.com/v1
DOCMIRROR_LLM_MODEL=gpt-4o-mini

# Local ollama
DOCMIRROR_LLM_BASE_URL=http://localhost:11434/v1
DOCMIRROR_LLM_MODEL=llama3.1

Provider is auto-detected from the base URL: anthropic.com → Anthropic Messages format, everything else → OpenAI Chat Completions format. If no API key is set, condense falls back to Claude Code CLI (claude --print). If neither is available, pages fall back to uncondensed output.
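The base-URL rule can be sketched as a hostname check. The `detectFormat` name and the `WireFormat` labels are hypothetical, and matching on the parsed hostname (rather than a substring of the whole URL) is an assumption about how the detection might reasonably be done.

```typescript
type WireFormat = "anthropic-messages" | "openai-chat-completions";

// Pick the request format from the configured base URL (illustrative only).
function detectFormat(baseUrl?: string): WireFormat {
  if (!baseUrl) return "anthropic-messages"; // per the docs, the default provider is Anthropic
  const host = new URL(baseUrl).hostname;
  return host === "anthropic.com" || host.endsWith(".anthropic.com")
    ? "anthropic-messages"
    : "openai-chat-completions";
}

console.log(detectFormat("https://api.openai.com/v1"));
console.log(detectFormat("http://localhost:11434/v1"));
console.log(detectFormat());
```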

Model selection: The eval was done on Anthropic models (see below). Haiku is the default and cheapest. --condense-model overrides per-run; DOCMIRROR_LLM_MODEL sets the default. Fallback rates depend on content and model — the structural validators catch failures regardless of provider.

Post-Compile Processing

After compilation, a deterministic post-compile step runs automatically: drops duplicate pages (exact content matches), removes duplicate content blocks across pages, normalises heading levels, and builds a grouped table of contents from URL path structure. No LLM needed — runs in milliseconds on any corpus size.
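Two of the steps above, dropping exact-duplicate pages and grouping pages by URL path, are simple enough to sketch. This is not `postcompile.ts`: the `Page` shape, the function names, trimming before comparison, and grouping by the first path segment are assumptions. A real implementation might key on a content hash instead of the full text to save memory.

```typescript
type Page = { url: string; markdown: string };

// Keep the first page for each distinct (trimmed) body.
function dropDuplicatePages(pages: Page[]): Page[] {
  const seen = new Set<string>();
  return pages.filter((p) => {
    const key = p.markdown.trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Group page URLs by their first path segment to seed a grouped TOC.
function groupByFirstSegment(pages: Page[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const p of pages) {
    const segment = new URL(p.url).pathname.split("/").filter(Boolean)[0] ?? "(root)";
    const list = groups.get(segment) ?? [];
    list.push(p.url);
    groups.set(segment, list);
  }
  return groups;
}

const corpus: Page[] = [
  { url: "https://docs.example.com/guides/a", markdown: "# A\nhello" },
  { url: "https://docs.example.com/guides/a-copy", markdown: "# A\nhello\n" },
  { url: "https://docs.example.com/api/b", markdown: "# B" },
];
const unique = dropDuplicatePages(corpus);
const toc = groupByFirstSegment(unique);
console.log(unique.length, Array.from(toc.keys()));
```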

Tested on a 251-page corpus: 99 duplicate pages dropped, 18 cross-page blocks deduped, 3131 headings normalised in 35ms.

Fabric

--fabric <pattern> pipes the final compiled output through a Fabric pattern as an optional post-processing step. Requires fabric CLI installed separately.

The original compiled file is preserved as .orig before fabric runs. If fabric fails or produces empty output, the original is restored automatically.
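The preserve-and-restore behaviour can be modelled in memory as follows. This is a simplified sketch: the real step works on files and a `.orig` copy, whereas here `transform` is a hypothetical stand-in for invoking the fabric CLI, and `applyWithRestore` is an invented name.

```typescript
// Run a transform and fall back to the original on failure or empty output.
function applyWithRestore(
  original: string,
  transform: (input: string) => string,
): { output: string; restored: boolean } {
  try {
    const out = transform(original);
    if (out.trim().length > 0) return { output: out, restored: false };
  } catch {
    // Transform failed; fall through and restore the original.
  }
  return { output: original, restored: true };
}

const compiled = "# Docs\n\nBody text.";
console.log(applyWithRestore(compiled, (s) => s.toUpperCase()).restored); // transform succeeded
console.log(applyWithRestore(compiled, () => "").restored); // empty output, original restored
```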

Note: Cross-page assembly (dedup, TOC, heading normalisation) is now handled by the built-in post-compile step — no need for an LLM-based assembly pattern. --fabric is for custom transforms you want to apply on top.

Output

output/{name}-docs-YYYYMMDD-HHMMSS/
  run.json                  ← per-page status, platform detection, timing
  {name}-docs-compiled.md   ← full corpus with TOC
  pages/                    ← raw fetched markdown
  clean/                    ← noise-stripped pages
  sections/                 ← by URL path segment
  reports/                  ← cleanliness.json, fidelity.json, coverage.json

Condense Eval Results

The condense pipeline was calibrated through a structured eval harness: 20-case structural eval, GPT-4o AI judge scoring (fidelity + hallucination detection), full 194-page corpus run, concurrency calibration, and an independent cross-vendor code audit.

Winning config: Haiku + v2.0.0-haiku prompt, concurrency 3.

Metric	Haiku v2.0.0	Sonnet v2.0.0
Structural pass (20 cases)	20/20 (100%)	19/20 (95%)
AI fidelity (GPT-4o judge)	4.90/5	4.75/5
Hallucination pass	100%	100%
Avg reduction	30%	31%

Full corpus (194 Mintlify pages, c=3): 90% pass rate, 23.4% avg reduction on passing pages. 19 fallbacks caught by validators (7 hallucination/expansion, 5 content gutting, 7 minor structural loss) — all fell back to uncondensed, none served broken. Pages under ~150 words skip condensation automatically (density gate).

Prompt evolution: v1.0.0 baseline (30% fallback) → v1.1.0 anti-consolidation rules (5%) → v2.0.0 heading preservation + self-check (0% on Haiku). The breakthrough was explicit structural preservation rules mirroring the code block fix, plus a self-check instruction that asks the model to count headings before outputting.

Concurrency: On Anthropic OAuth, c=3 gave a 2.15x speedup with no quality loss.
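A fixed number of parallel condense workers can be sketched as a small pool that pulls pages off a shared index. This is a generic pattern, not the code in `condense.ts`; `mapWithConcurrency` is a hypothetical helper, and the timed worker below merely simulates an LLM call.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function runner(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two runners share an index
      results[i] = await worker(items[i] as T, i);
    }
  }
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Simulated usage: "condense" seven pages with three workers.
const pageIds = [1, 2, 3, 4, 5, 6, 7];
const run = mapWithConcurrency(pageIds, 3, async (id) => {
  await new Promise<void>((resolve) => setTimeout(resolve, 5)); // stand-in for an LLM request
  return `page-${id}-condensed`;
});
run.then((out) => console.log(out.length));
```

The pool size corresponds to `--condense-concurrency`; raising it trades provider rate-limit headroom for wall-clock time.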

Architecture

docmirror.ts          CLI entry (Commander.js)
  ├── discover.ts     5-stage URL discovery + Firecrawl search (--filter)
  ├── fetch.ts        Content acquisition + Jina rate limiter
  ├── detect.ts       Confidence-scored platform detection
  ├── strip.ts        Strategy orchestrator + quality gate
  │   └── strategies/ Generic + GitBook, Docusaurus, Sphinx, Mintlify, MkDocs
  ├── smart.ts        BM25 relevance pruning (--smart)
  ├── condense.ts     LLM compression + structural validators (--condense)
  │   └── prompts/    Condense system prompt
  ├── validate.ts     Cleanliness, fidelity, coverage reports
  ├── compile.ts      Markdown assembly + TOC + token estimate
  ├── postcompile.ts  Cross-page dedup, heading normalisation, grouped TOC
  ├── flags.ts        Per-page quality flag detectors
  ├── types.ts        Shared type definitions
  └── state.ts        run.json manifest for resumability

License

MIT
