Turn any website's sitemap into a structured, searchable wiki — in minutes, not months.
Point it at a sitemap URL. It scrapes every page, converts to markdown, classifies content into topics, and compiles structured wiki articles with summaries, key takeaways, and cross-references.
Built for Claude Code.
| Metric | Value |
|---|---|
| Pages processed | 10,782 |
| Wiki articles created | 10,475 |
| Topics auto-classified | 11 |
| Total time | ~90 minutes |
| Manual equivalent | 898 hours (112 working days) |
| Speedup | ~600x |
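The derived rows of the table can be checked with a few lines of arithmetic. This is a quick sanity check, assuming an 8-hour working day and reading "~90 minutes" as exactly 1.5 hours; neither assumption is stated in the table itself.

```python
# Sanity-check the derived rows of the metrics table.
# Assumptions: 8-hour working day; "~90 minutes" taken as exactly 1.5 hours.
pages = 10_782
manual_hours = 898
pipeline_hours = 1.5

working_days = manual_hours / 8               # 112.25
speedup = manual_hours / pipeline_hours       # about 598.7
minutes_per_page = manual_hours * 60 / pages  # about 5.0

print(f"{working_days:.0f} working days, ~{speedup:.0f}x speedup, "
      f"~{minutes_per_page:.1f} manual minutes per page")
# prints: 112 working days, ~599x speedup, ~5.0 manual minutes per page
```

Under these assumptions, the manual estimate works out to roughly five minutes per page.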
sitemap.xml
|
v
scrape.py (concurrent, rate-limited)
|
v
raw/<domain>/*.md
|
v
/compile-wiki (parallel Haiku sub-agents)
|
v
wiki/<topic>/*.md (structured articles + indexes)
- Scrape: Fetches all pages from a sitemap XML (supports nested sitemaps). Concurrent, with rate limiting and retry logic. Converts HTML to clean markdown.
- Ingest: Converts local files (docx, pdf, txt) dropped in input/ to markdown.
- Compile: Spawns parallel Haiku sub-agents that read raw files, classify topics, and write structured wiki articles. Each article gets a summary, key takeaways, detailed content, and related concepts.
- Index: Builds a master index, per-topic indexes, and a compiled log. All indexes are idempotent and self-healing.
- Cross-link: Validates that articles reference related content.

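The nested-sitemap support mentioned in the Scrape step can be sketched as a small recursive walk. This is a minimal illustration using only the standard library, not the code in scrape.py; `collect_page_urls` and the `fetch` callable are hypothetical names introduced here.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(sitemap_url: str, fetch: Callable[[str], str],
                      seen: set[str] | None = None) -> list[str]:
    """Return page URLs from a sitemap, following nested sitemap indexes.

    `fetch` is a hypothetical callable that returns the XML text for a URL.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]

    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # Every <loc> in an index points at another sitemap: recurse into it.
        urls: list[str] = []
        for child in locs:
            urls.extend(collect_page_urls(child, fetch, seen))
        return urls
    return locs  # a plain <urlset>: every <loc> is a page


# Usage with in-memory documents standing in for real HTTP responses.
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://example.com/sitemap.xml":
        f"<sitemapindex {NS}><sitemap><loc>https://example.com/docs.xml</loc>"
        "</sitemap></sitemapindex>",
    "https://example.com/docs.xml":
        f"<urlset {NS}><url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url></urlset>",
}
print(collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__))
# prints: ['https://example.com/a', 'https://example.com/b']
```

Passing `fetch` in as a parameter keeps the traversal testable without network access; a real run would supply an HTTP-backed function instead.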
- Claude Code installed
- Python 3.12+
- uv package manager
git clone https://github.com/promptgtm/knowledge-base-builder.git
cd knowledge-base-builder
uv sync
In Claude Code:
/scrape-and-compile https://example.com/sitemap.xml
/compile-wiki raw/example.com/
Drop .md, .txt, .docx, or .pdf files into input/, then:
/scrape-and-compile
uv run python scrape.py https://example.com/sitemap.xml
# With options
uv run python scrape.py https://example.com/sitemap.xml \
  -c 50 \
  --min-delay 0.1 \
  --max-delay 0.5
uv run python ingest.py
knowledge-base-builder/
  scrape.py                  # Sitemap scraper (async, concurrent)
  ingest.py                  # File converter (docx, pdf, txt -> markdown)
  pyproject.toml             # Dependencies
  .claude/
    skills/
      compile-wiki/          # Wiki compilation skill (parallel Haiku agents)
      scrape-and-compile/    # End-to-end pipeline skill
  input/                     # Drop files here for ingestion
  raw/                       # Scraped/ingested markdown (intermediate)
  wiki/                      # Compiled knowledge base (output)
    _master-index.md         # Topic overview with article counts
    _compiled-log.md         # Processing log
    <topic>/
      _index.md              # Topic article listing
      <article>.md           # Structured wiki article
Every compiled article follows this structure:
# Descriptive Title
**Source:** [original-file.md](../../raw/example.com/original-file.md)
**Created:** 2026-04-14
**Topics:** Topic Name
## Summary
2-3 sentence overview.
## Key Takeaways
- Takeaway 1
- Takeaway 2
- Takeaway 3
## Details
Expanded content. 200-500 words. No links.
## Related
- Related concept 1
- Related concept 2

| Flag | Default | Description |
|---|---|---|
| -c, --concurrency | 10 | Max concurrent requests |
| --min-delay | 0.5 | Min delay between requests (seconds) |
| --max-delay | 2.0 | Max delay between requests (seconds) |
| --max-retries | 3 | Max retries per page |
| -o, --output-dir | raw | Output directory |
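How the concurrency, delay, and retry flags might interact can be sketched with asyncio. This is an illustration under stated assumptions, not the logic in scrape.py: `fetch_all` and `fetch` are hypothetical names, "max retries" is read as retries after the first attempt, and the exponential backoff between attempts is an assumed policy.

```python
from __future__ import annotations

import asyncio
import random
from typing import Awaitable, Callable


async def fetch_all(urls: list[str], fetch: Callable[[str], Awaitable[str]], *,
                    concurrency: int = 10, min_delay: float = 0.5,
                    max_delay: float = 2.0,
                    max_retries: int = 3) -> dict[str, str | None]:
    """Fetch every URL with bounded concurrency, jittered delays, and retries.

    Defaults mirror the flag table; `fetch` is a hypothetical coroutine.
    Failed pages map to None.
    """
    gate = asyncio.Semaphore(concurrency)  # caps in-flight requests

    async def fetch_one(url: str) -> tuple[str, str | None]:
        async with gate:
            for attempt in range(max_retries + 1):
                # Random pause so requests do not arrive in lockstep.
                await asyncio.sleep(random.uniform(min_delay, max_delay))
                try:
                    return url, await fetch(url)
                except Exception:
                    if attempt == max_retries:
                        return url, None  # give up on this page
                    # Assumed backoff policy: wait longer after each failure.
                    await asyncio.sleep(min_delay * 2 ** attempt)
        return url, None  # only reached if max_retries is negative

    return dict(await asyncio.gather(*(fetch_one(u) for u in urls)))
```

With the large-site settings shown earlier (`-c 50 --min-delay 0.1 --max-delay 0.5`), the equivalent call would be `asyncio.run(fetch_all(urls, fetch, concurrency=50, min_delay=0.1, max_delay=0.5))`.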
Edit .claude/skills/compile-wiki/SKILL.md to customize:
- Batch size — Files per sub-agent (default: ~40)
- Parallel agents — Max concurrent agents (default: 5)
- Topic taxonomy — Add/remove topic categories
- Article format — Modify the wiki article template
- Minimum file size — Skip files smaller than N bytes (default: 200)
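The batch size and minimum file size settings can be pictured as a planning step that runs before any sub-agent starts. The skill itself is configured in SKILL.md, so the function below is only a sketch of the intended behaviour; `plan_batches` is a hypothetical name, and the defaults are taken from the list above.

```python
from pathlib import Path


def plan_batches(raw_dir: Path, batch_size: int = 40,
                 min_bytes: int = 200) -> list[list[Path]]:
    """Split raw markdown files into batches, one batch per sub-agent."""
    files = sorted(
        p.resolve()  # absolute paths survive a change of working directory
        for p in raw_dir.rglob("*.md")
        if p.stat().st_size >= min_bytes  # skip files below the size threshold
    )
    # Consecutive slices of at most batch_size files each.
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

Calling `plan_batches(Path("raw/example.com"))` would return a list of batches that can be handed out to at most five agents at a time, matching the parallel agents default.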
- Use absolute paths in agent prompts. Sub-agents can lose working directory context.
- Some agents process entire directories. When given a batch of 100 dossier files, one agent compiled all 5,000+ in the directory. This is beneficial, so let greedy agents run.
- Watch for multiple file prefixes per category. integrations-action-* and integrations-data-points-* are both "integrations" but need separate batch handling.
- Indexes are self-healing. Run the index builder after any compile pass. It scans the filesystem and rebuilds from scratch.
- Increase concurrency for large sites. Default scraper concurrency (10) is conservative. For 10K+ pages, bump to 50 with reduced delays.
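The "self-healing" property of the indexes follows from rebuilding them purely from what is on disk. A minimal sketch of that idea is below; `rebuild_indexes` is a hypothetical name, the file names come from the directory layout above, and the index line format is an assumption rather than the project's actual output.

```python
from pathlib import Path


def rebuild_indexes(wiki_dir: Path) -> dict[str, int]:
    """Rewrite every _index.md and _master-index.md from the files on disk."""
    counts: dict[str, int] = {}
    for topic in sorted(p for p in wiki_dir.iterdir() if p.is_dir()):
        # Files starting with "_" are indexes or logs, not articles.
        articles = sorted(a for a in topic.glob("*.md")
                          if not a.name.startswith("_"))
        lines = [f"# {topic.name}", ""]
        lines += [f"- [{a.stem}]({a.name})" for a in articles]
        (topic / "_index.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
        counts[topic.name] = len(articles)

    master = ["# Master Index", ""]
    master += [f"- [{name}]({name}/_index.md): {n} articles"
               for name, n in counts.items()]
    (wiki_dir / "_master-index.md").write_text("\n".join(master) + "\n",
                                               encoding="utf-8")
    return counts
```

Because nothing is read from the previous index files, running the function twice in a row produces identical output, and a stale or hand-edited index is simply overwritten on the next pass.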
- httpx — Async HTTP client with HTTP/2
- beautifulsoup4 — HTML parsing
- html2text — HTML to markdown
- rich — Terminal UI
- pydantic — Config validation
- python-docx — DOCX conversion
- pymupdf — PDF extraction
MIT