doc_getter

Mirror documentation sites into Markdown, either as a local CLI run or as a FastAPI job service.

What it does

Crawls docs sites from a seed URL
Discovers pages from robots.txt and sitemap.xml
Rewrites internal links to local Markdown paths
Writes a manifest with crawl metadata
Supports a service mode with job polling, cancellation, archives, and file download
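
As an illustration of the link-rewriting step, here is a minimal sketch of how a page URL might be mapped to a local Markdown path. The function names and the mapping rules (directory URLs become index.md, .html suffixes are dropped) are assumptions made for this example, not necessarily what doc_getter does internally.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_markdown_path(url: str) -> str:
    """Map a page URL to a relative Markdown path (hypothetical scheme)."""
    path = urlsplit(url).path  # query strings and fragments are ignored
    if path == "" or path.endswith("/"):
        # Directory-style URLs become index.md inside that directory.
        path += "index"
    rel = PurePosixPath(path.lstrip("/"))
    if rel.suffix in {".html", ".htm"}:
        # Drop the HTML suffix before appending .md.
        rel = rel.with_suffix("")
    return f"{rel}.md"


def is_internal(link: str, seed: str, scope_prefix: str = "/") -> bool:
    """Treat a link as internal if it shares the seed's host and scope."""
    a, b = urlsplit(link), urlsplit(seed)
    return a.netloc == b.netloc and a.path.startswith(scope_prefix)
```

Under this scheme, only links for which is_internal returns true would be rewritten; external links would stay as absolute URLs.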

Installation

uv sync

CLI usage

uv run python main.py https://docs.commonstack.ai/ --output ./docs --workers 10

Useful flags:

--scope-prefix /docs/
--max-pages 200
--max-depth 10
--timeout 30
--delay 0.5
--no-sitemaps
--verbose
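
To make the limiting flags concrete, the sketch below shows one plausible way --scope-prefix, --max-pages, and --max-depth could bound a breadth-first crawl. It walks an in-memory link graph instead of fetching pages, and the function name, the default values, and the exact semantics of each limit are assumptions for illustration only.

```python
from collections import deque
from urllib.parse import urlsplit


def bounded_crawl_order(seed, links, scope_prefix="/", max_pages=200, max_depth=10):
    """Return URLs in breadth-first order while honoring the three limits.

    `links` maps a URL to the URLs found on that page; it stands in for
    real fetching and parsing. Only the URL path is checked against
    `scope_prefix` (host checks are omitted for brevity).
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if not urlsplit(nxt).path.startswith(scope_prefix):
                continue  # outside the scope prefix
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return visited
```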

Service mode

Run the API server:

uv run python service.py

or:

uv run uvicorn service:app --host 0.0.0.0 --port 8000

The service stores each crawl under runs/<job_id>/ by default.

Public deployment recommendation

If your goal is “let other people try it quickly”, the most practical setup is:

Frontend: Next.js on Vercel
Backend: this FastAPI service on Railway / Render / Fly.io

Why not Vercel-only for the current Python service?

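Crawls are long-running: a job can keep working well after the request that created it has returned.
Jobs are tracked in memory, so job state lives in a single long-lived process.
Output files are written to disk under runs/.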

That works well on a normal container service, but is a poor fit for stateless serverless functions.

This repo includes a minimal Next.js frontend in web/ that calls the API directly.

Backend env vars

PORT=8000
DOC_GETTER_DATA_DIR=runs
DOC_GETTER_CORS_ORIGINS=http://localhost:3000,https://your-vercel-app.vercel.app
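
For readers wiring up their own deployment, here is a small sketch of how these three variables could be read in Python. The load_settings helper and the fallback values are assumptions for illustration; the service's actual parsing may differ.

```python
import os
from pathlib import Path


def load_settings(env=os.environ):
    """Read the backend env vars into a plain dict (hypothetical helper)."""
    raw_origins = env.get("DOC_GETTER_CORS_ORIGINS", "")
    # The variable holds a comma-separated list; drop empty entries.
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    return {
        "port": int(env.get("PORT", "8000")),  # assumed fallback
        "data_dir": Path(env.get("DOC_GETTER_DATA_DIR", "runs")),
        "cors_origins": origins,
    }
```

Passing a plain dict instead of os.environ makes the helper easy to exercise without touching the real environment.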

Frontend env vars

Create web/.env.local:

NEXT_PUBLIC_API_BASE_URL=https://your-backend-url.example.com

Deploy flow

1. Deploy the root Python service to Railway/Render/Fly.io.
2. Set DOC_GETTER_CORS_ORIGINS to your Vercel domain.
3. Deploy web/ to Vercel.
4. Set NEXT_PUBLIC_API_BASE_URL to the backend URL.

Create a crawl job

curl -X POST http://127.0.0.1:8000/v1/crawls \
  -H 'Content-Type: application/json' \
  -d '{
    "start_url": "https://openrouter.ai/docs/",
    "scope_prefix": "/docs/",
    "workers": 10
  }'
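
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; build_crawl_request is a hypothetical helper name, and the response body is not documented here, so the example stops at constructing the request.

```python
import json
from urllib.request import Request


def build_crawl_request(base_url, start_url, scope_prefix, workers):
    """Build the POST /v1/crawls request with the same JSON body as the curl call."""
    payload = {
        "start_url": start_url,
        "scope_prefix": scope_prefix,
        "workers": workers,
    }
    return Request(
        url=f"{base_url.rstrip('/')}/v1/crawls",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen and decode the JSON response.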

Poll the job

curl http://127.0.0.1:8000/v1/crawls/<job_id>
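
A client will usually poll in a loop until the job finishes. The sketch below shows one way to do that; the "status" field and the terminal values are assumptions, since the job schema is not described in this README, so adjust them to match the real response.

```python
import time

# Assumed terminal states; check the actual job response for the real values.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, interval=2.0, max_polls=300, sleep=time.sleep):
    """Call `fetch_status()` until the job reports a terminal state.

    `fetch_status` is any callable returning the decoded job JSON as a dict,
    for example a function that GETs /v1/crawls/<job_id>.
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP library and easy to test.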

Download outputs

Manifest: GET /v1/crawls/<job_id>/manifest
Zip archive: GET /v1/crawls/<job_id>/archive
File list: GET /v1/crawls/<job_id>/files
One file: GET /v1/crawls/<job_id>/files/<path>
Cancel job: POST /v1/crawls/<job_id>/cancel
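
For scripted clients, the endpoint paths listed above can be collected in one place. This sketch only builds URLs from those paths; job_urls and file_url are hypothetical helper names.

```python
from urllib.parse import quote


def job_urls(base_url, job_id):
    """Return the per-job endpoint URLs listed above."""
    root = f"{base_url.rstrip('/')}/v1/crawls/{quote(job_id, safe='')}"
    return {
        "status": root,                # GET
        "manifest": f"{root}/manifest",  # GET
        "archive": f"{root}/archive",    # GET
        "files": f"{root}/files",        # GET
        "cancel": f"{root}/cancel",      # POST
    }


def file_url(base_url, job_id, rel_path):
    """URL for one output file; slashes in the path are preserved."""
    return f"{job_urls(base_url, job_id)['files']}/{quote(rel_path)}"
```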

Skill package

A reusable Codex skill is included at:

skills/doc-getter-service/SKILL.md

To install it, copy that folder into your Codex skills directory, for example:

~/.codex/skills/doc-getter-service

or on Windows:

$env:USERPROFILE\.codex\skills\doc-getter-service

Output structure

In both modes, each crawl writes Markdown files and a _mirror-manifest.json manifest to disk. The service and the CLI share the same crawl engine and artifact format.
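
Because the artifacts are plain files, a finished crawl can be inspected with a few lines of Python. This sketch assumes the manifest sits at the root of the output directory and makes no assumption about its fields; summarize_output is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_output(output_dir):
    """Load the manifest and list the mirrored Markdown files."""
    root = Path(output_dir)
    # Assumed location: the manifest lives at the top of the output directory.
    manifest_path = root / "_mirror-manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # Collect Markdown files at any depth, as POSIX-style relative paths.
    pages = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.md"))
    return manifest, pages
```

For a CLI run this would be called on the --output directory, and for a service run on runs/<job_id>/.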
