A web tool that parses man pages and explains command-line arguments by matching each argument to its help text.
- Python 3.12, Flask, SQLite, bashlex, OpenAI SDK, Google Gemini SDK, LiteLLM (fallback)
- Linting: ruff (Python), biome (JS)
- Testing: pytest (unit + doctests), JS Playwright Test (e2e)
- Dependencies: requirements.txt (main), package.json (Playwright e2e)
Before finishing any task, always:
- Run make format
- Run tests - choose the right suite based on what changed:
  - make tests-quick (lint + unit) - use when changes clearly cannot affect what the web app serves (e.g., extraction pipeline, CLI tooling, tests themselves)
  - make tests-all (lint + unit + e2e) - use when changes might affect the web serving path (rendering, matching, storage, templates, static assets, config)
  - When in doubt, run make tests-all
  - If e2e tests fail due to snapshot diffs, assess whether the diff is expected, and get user confirmation before running make e2e-update
- Update README.md if the change adds/removes/renames CLI commands, env vars, or user-facing features
- Update AGENTS.md if the change affects structure, convention, workflow, etc.
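The suite-selection rule above can be sketched as a small helper. This is only an illustration: the `NON_SERVING_PREFIXES` list and the `choose_suite` name are hypothetical, and the real boundary between serving and non-serving code remains a judgment call.

```python
# Hypothetical prefixes for paths that clearly cannot affect what the web app
# serves. This list is an assumption for illustration, not project policy.
NON_SERVING_PREFIXES: tuple[str, ...] = (
    "explainshell/extraction/",
    "tests/",
    "tools/",
)


def choose_suite(changed_paths: list[str]) -> str:
    """Return the make target to run for a set of changed repo-relative paths."""
    if not changed_paths:
        # Nothing to reason about counts as "in doubt".
        return "tests-all"
    for path in changed_paths:
        if not path.startswith(NON_SERVING_PREFIXES):
            # Anything outside the known-safe set might touch the serving path.
            return "tests-all"
    return "tests-quick"
```

For example, a change limited to explainshell/extraction/llm/prompt.py would select tests-quick under these assumptions, while any change to explainshell/matcher.py selects tests-all.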
Use the LLM eval (tests/evals/llm/llm_eval.py) to compare before/after metrics when making changes to the LLM extractor. It runs extraction on the corpus listed in tests/evals/llm/corpus.txt (paths into the manpages/ submodule) and writes summary.json plus per-page artifacts under markdown/, prompts/, and responses/ to a timestamped directory under tests/evals/llm/runs/. Summaries include git metadata, model, label, description, aggregate metrics (extracted/failed files, total options, zero-option pages, multi-chunk pages, token usage), and per-page metrics keyed by repo-relative path.
run requires --label <tag> (folded into the run dir name) and accepts -d "..." for a longer description. Always pass a meaningful label and description inferred from context — e.g. the task you're working on, or "baseline" for a pre-change run — so list and compare output stay self-explanatory.
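For ad-hoc inspection outside the compare subcommand, two summaries can be diffed by hand. The sketch below assumes the aggregate metrics live under a top-level "aggregate" key in summary.json; that key name, the metric names in the usage note, and both function names are assumptions, so check a real summary before relying on it.

```python
import json
from pathlib import Path


def load_aggregate(run_dir: Path) -> dict[str, float]:
    """Read numeric aggregate metrics from a run directory's summary.json.

    Assumes a top-level "aggregate" object (hypothetical key name).
    """
    summary = json.loads((run_dir / "summary.json").read_text())
    aggregate = summary.get("aggregate", {})
    return {
        key: value
        for key, value in aggregate.items()
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def diff_aggregates(
    before: dict[str, float], after: dict[str, float]
) -> dict[str, float]:
    """Return after - before for every metric present in both runs."""
    return {key: after[key] - before[key] for key in sorted(before.keys() & after.keys())}
```

With a baseline of {"total_options": 10} and a change run of {"total_options": 12}, diff_aggregates reports {"total_options": 2}.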
Workflow for code changes (API, prompt, chunking, post-processing):
# 1. Stash your changes to get a clean baseline
git stash push -- explainshell/extraction/llm/
# 2. Run on the old code
python tests/evals/llm/llm_eval.py run --label baseline --model codex/gpt-5.4/medium --jobs 10 -d "baseline before <short summary of change>"
# 3. Restore your changes
git stash pop
# 4. Run on the new code
python tests/evals/llm/llm_eval.py run --label change --model codex/gpt-5.4/medium --jobs 10 -d "<short summary of change>"
# 5. Compare the two run directories (oldest first)
python tests/evals/llm/llm_eval.py compare tests/evals/llm/runs/<baseline-run> tests/evals/llm/runs/<change-run>
Usage:
# Run on the default corpus, parallelizing realtime calls
python tests/evals/llm/llm_eval.py run --label smoke --model codex/gpt-5.4/medium --jobs 10
# Run on specific files (overrides --corpus)
python tests/evals/llm/llm_eval.py run --label probe --model codex/gpt-5.4/medium --jobs 10 path/to/file.1.gz
# Use --batch <size> instead of --jobs to route through the provider's batch API
# (cheaper, but minutes-to-hours of queue latency; pays off only on much larger corpora).
# Compare two run directories
python tests/evals/llm/llm_eval.py compare tests/evals/llm/runs/<baseline-run> tests/evals/llm/runs/<current-run>
# List all saved runs
python tests/evals/llm/llm_eval.py list
- Use Python type annotations on all new code (function signatures, return types, and non-obvious variables). Do not retroactively annotate existing code unless you are already modifying it.
- Python virtualenv: repo-local .venv
- CRITICAL: Every Bash tool call runs in a fresh shell with NO venv active. You MUST prefix every Python/pip/pytest/ruff/make command with source .venv/bin/activate &&. Example: source .venv/bin/activate && make tests. Never run bare python, pytest, ruff, pip, or make without activating first.
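When commands are launched from a script instead of typed by hand, the activation prefix can be applied mechanically. This is a minimal sketch; `with_venv` and `run_in_venv` are hypothetical helper names and are not part of the repo.

```python
import shlex
import subprocess

ACTIVATE = "source .venv/bin/activate"


def with_venv(command: list[str]) -> str:
    """Build a shell line that activates the repo-local venv, then runs command."""
    return f"{ACTIVATE} && {shlex.join(command)}"


def run_in_venv(command: list[str]) -> subprocess.CompletedProcess[str]:
    """Run command in a fresh bash shell with the venv activated first."""
    # `source` is a bash builtin, so the line has to go through bash explicitly.
    return subprocess.run(
        ["bash", "-c", with_venv(command)],
        text=True,
        capture_output=True,
        check=False,
    )
```

Here with_venv(["make", "tests"]) yields the same line as the example above: source .venv/bin/activate && make tests.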
# Run unit tests + doctests (excludes e2e)
make tests
# Run a single test file
pytest tests/test_matcher.py -v
# Run a single test method
pytest tests/test_matcher.py::test_matcher::test_no_options -v
# Lint
make lint
# Format
make format
# Run e2e tests (requires playwright)
make e2e
# Update e2e snapshots
make e2e-update
# Run LLM integration test (requires API key in .env)
make test-llm
# Run quick tests (lint + unit, no e2e)
make tests-quick
# Run all tests (lint + unit + e2e)
make tests-all
# Run DB integrity checks
make db-check
# Run web server locally
make serve
# Generate Ubuntu manpage archive (requires Go)
make ubuntu-archive UBUNTU_RELEASE=resolute
# Generate Arch Linux manpage archive (requires manned.org dump)
make arch-archive
# Process a man page into the database
python -m explainshell.manager extract --mode llm:codex/gpt-5.4/medium /path/to/manpage.1.gz
- explainshell/ - Main package
  - manager.py - CLI entry point for man page processing (python -m explainshell.manager <command>)
  - db_check.py - Database integrity checks (used by manager.py db-check)
  - matcher.py - Core logic: walks bash AST and matches tokens to help text
  - models.py - Core domain types (Option, ParsedManpage, RawManpage) as Pydantic/dataclass models
  - store.py - SQLite storage layer
  - caching_store.py - Read-only size-aware cached Store variant for production web serving; when DEBUG=false, the Flask app stores one per worker process in app.extensions
  - errors.py - Exception hierarchy (ProgramDoesNotExist, DuplicateManpage, InvalidSourcePath, ExtractionError, SkippedExtraction, FatalExtractionError)
  - diff.py - Man page comparison and diff formatting
  - roff_parser.py - Roff macro parser (used by roff_utils.py for detection helpers)
  - roff_utils.py - Roff source detection (dashless opts, nested cmd)
  - manpage.py - Man page reading and HTML conversion
  - help_constants.py - Shell constant definitions for help text
  - util.py - Shared utilities (group_continuous, Peekable, name_section)
  - config.py - Configuration (DB_PATH, HOST_IP, DEBUG, MANPAGE_URLS)
  - extraction/ - Man page option extraction pipeline
    - __init__.py - Public API: make_extractor(mode) factory
    - types.py - Shared types (ExtractionResult, ExtractionStats, BatchResult, ExtractorConfig, Extractor protocol)
    - runner.py - Execution orchestration (sequential, parallel, batch)
    - common.py - Shared metadata assembly for all extractors
    - prefilter.py - Pre-extraction classification of input .gz files (size, symlink, --filter-db, already-stored, content-dup)
    - postprocess.py - Extractor-agnostic option post-processing
    - llm/ - LLM-based extraction subpackage
      - extractor.py - LLM extractor orchestration
      - prompt.py - Prompt construction
      - response.py - LLM response parsing
      - text.py - Man page text preparation and chunking
      - providers/ - LLM provider implementations (OpenAI, Gemini, LiteLLM fallback)
  - web/views.py - Flask routes with URL-based distro/release routing
- tools/ - Standalone scripts
  - fetch_manned.py - Fetch man pages from manned.org weekly dump
  - mandoc-md - Custom mandoc binary with markdown output support
- tests/ - Unit tests (test_*.py), fixtures
- tests/e2e/ - Playwright e2e tests, snapshots, and dedicated e2e.db
- tests/evals/ - Manual review-oriented evals (not in make tests-all)
  - _common.py - Shared helpers (corpus reading, summary loading, metric lookup)
  - llm/ - LLM extractor eval (llm_eval.py, corpus.txt, runs/)
  - render/ - Mandoc markdown render eval (render_eval.py, corpus.txt, runs/)
- runserver.py - Flask app entry point
- manpages/ - Git submodule (explainshell-manpages)
- ubuntu-manpages-operator/ - Go pipeline that fetches Ubuntu .deb packages, extracts manpages, and converts them to markdown
manager.py orchestrates: raw .gz → parse → extract options → store in SQLite.
The CLI uses subcommands. Most commands require a database path, set via DB_PATH env var or --db <path>. Commands that don't need a database (e.g. extract --dry-run, diff extractors) work without it. Main commands:
- extract --mode <mode> [options] files... - Extract options from manpages and store in DB
- diff db --mode <mode> files... - Diff fresh extraction against the database
- diff extractors <A..B> files... - Compare two extractors head-to-head
- show {manpage,distros,sections,manpages,mappings,stats} - Query the database
- db-check - Run database integrity checks
Extraction modes (passed via --mode to extract or diff db):
- llm:<provider/model> - Sends man page text to an LLM (e.g., llm:openai/gpt-5-mini, llm:azure/my-deployment). Supports Gemini, OpenAI, Azure OpenAI, and LiteLLM (fallback) providers. For azure/..., the model suffix is the Azure deployment name and requires AZURE_OPENAI_API_KEY plus either AZURE_OPENAI_BASE_URL or AZURE_OPENAI_ENDPOINT.
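The mode string and the Azure environment requirement can be illustrated with a small parser. This is a sketch under assumptions, not the project's implementation: `LlmMode`, `parse_mode`, and `missing_azure_env` are hypothetical names, and keeping everything after the first slash as the model is a guess about how multi-part models are handled.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class LlmMode:
    provider: str
    model: str


def parse_mode(mode: str) -> LlmMode:
    """Split 'llm:<provider>/<model>' into its provider and model parts."""
    kind, colon, spec = mode.partition(":")
    if kind != "llm" or not colon:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Assumption: everything after the first slash is the model (or deployment).
    provider, slash, model = spec.partition("/")
    if not slash or not provider or not model:
        raise ValueError(f"expected llm:<provider>/<model>, got {mode!r}")
    return LlmMode(provider, model)


def missing_azure_env(env: Mapping[str, str]) -> list[str]:
    """List the Azure settings that are still unset in env."""
    missing: list[str] = []
    if not env.get("AZURE_OPENAI_API_KEY"):
        missing.append("AZURE_OPENAI_API_KEY")
    # Either of the two URL variables satisfies the requirement.
    if not (env.get("AZURE_OPENAI_BASE_URL") or env.get("AZURE_OPENAI_ENDPOINT")):
        missing.append("AZURE_OPENAI_BASE_URL or AZURE_OPENAI_ENDPOINT")
    return missing
```

Passing os.environ to missing_azure_env before an azure/... run gives an early, readable failure instead of a provider error mid-extraction.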
Extract flags: --overwrite, --filter-db <spec> (conditional overwrite; requires --overwrite; same syntax as --mode), --dry-run, --debug, --drop, -j/--jobs <int> (parallel extraction, default 1), --batch <int> (provider batch API). All run output (logs, debug artifacts, manifests) goes to logs/{timestamp}/.
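The dependency between --filter-db and --overwrite is the kind of rule argparse cannot express declaratively. The sketch below shows one way such a check could look; it covers only a subset of the flags listed above, and the parser layout is an assumption, not a copy of manager.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for an illustrative subset of the extract flags."""
    parser = argparse.ArgumentParser(prog="extract")
    parser.add_argument("--mode", required=True)
    parser.add_argument("--overwrite", action="store_true")
    parser.add_argument("--filter-db", metavar="SPEC")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("-j", "--jobs", type=int, default=1)
    parser.add_argument("--batch", type=int)
    parser.add_argument("files", nargs="+")
    return parser


def parse_extract_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and enforce that --filter-db is only used with --overwrite."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.filter_db and not args.overwrite:
        # parser.error prints usage and exits with status 2.
        parser.error("--filter-db requires --overwrite")
    return args
```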
SQLite with two tables:
- manpage - source (unique basename), name, synopsis, options (JSON), aliases, flags
- mapping - command name → manpage id lookup (many-to-one, with score for preference)
Key classes (Pydantic models in models.py):
- Option - text, short/long flag lists, has_argument, positional, prefix (literal sigil a token must start with for a positional to claim it, e.g. @ in dig's [@server]; restricted to the OPTION_PREFIX_SIGILS allowlist @, +, :), nested_cmd
- ParsedManpage - container with options/positionals/prefixed_positionals properties and find_option(flag) lookup; positionals excludes prefix-bearing options, which are exposed via prefixed_positionals (name → (prefix, text))
Uses bashlex AST visitor pattern:
- Matcher inherits from bashlex.ast.nodevisitor
- visitcommand() - looks up man page, handles multi-command (e.g., git commit)
- visitword() - matches tokens to options (exact match, then fuzzy split for combined short flags like -abc)
- Positional operands use two pools: prefixed positionals are claimed only by tokens starting with their sigil (e.g. @8.8.8.8 → dig's server); remaining tokens consume the non-prefixed positionals in order, reusing the last one (variadic). A token whose sigil no positional declares falls through to ordered consumption; if all positionals are prefixed and none match, the token is unknown.
- Produces MatchResult(start, end, text, match) where start/end are character positions in the original string
Hermetic setup: uses a dedicated tests/e2e/e2e.db and random port selection. Server is started fresh per run (reuseExistingServer: false).
The web app uses CachingStore only when DEBUG=false (production and e2e). The cached store is created lazily per worker process and stored in app.extensions. Local dev (DEBUG=true, the default for make serve) uses a per-request plain Store so DB rebuilds are visible without restarting the server. Tooling such as explainshell.manager and tests/evals/llm/llm_eval.py should continue using plain Store, not CachingStore.
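The store-selection rule can be sketched without Flask by letting a plain dict stand in for app.extensions. The classes below are empty stand-ins for the real Store and CachingStore, and the extension key and `get_store` name are hypothetical.

```python
from typing import Any


class Store:
    """Stand-in for the plain SQLite store."""

    def __init__(self, db_path: str) -> None:
        self.db_path = db_path


class CachingStore(Store):
    """Stand-in for the read-only cached variant."""


_EXTENSION_KEY = "explainshell_store"  # hypothetical key name


def get_store(extensions: dict[str, Any], db_path: str, debug: bool) -> Store:
    """Return a per-request Store in debug mode, else one lazily cached store."""
    if debug:
        # A fresh plain Store keeps DB rebuilds visible without a restart.
        return Store(db_path)
    store = extensions.get(_EXTENSION_KEY)
    if store is None:
        # Created on first use, then reused for the life of this worker process.
        store = CachingStore(db_path)
        extensions[_EXTENSION_KEY] = store
    return store
```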
The app is deployed to DigitalOcean App Platform. The SQLite database is baked into the Docker image at build time (downloaded as .zst from the GitHub release, decompressed during docker build).
Production infrastructure:
- Domain: explainshell.com → Cloudflare (orange cloud proxy) → DigitalOcean App Platform
- Cloudflare: DNS + proxy, SSL mode set to Full (Strict)
- App spec: prod/digitalocean/app.yaml (region, instance size/count, env vars, custom domain). doctl apps update --spec does a full replace, so anything configured out-of-band gets wiped on the next deploy; check it in here instead.
- Container artifacts: prod/docker/ (Dockerfile, Caddyfile, start.sh)
Deploy code changes:
Deploys are driven by CI: merging to master triggers .github/workflows/do-deploy.yml, which resolves the newest db-latest asset name, renders the spec via envsubst with DB_NAME and GIT_SHA, applies it with doctl apps update --spec, then forces a fresh build with doctl apps create-deployment --force-rebuild --wait. The force-rebuild step is load-bearing: deploy_on_push is off, so without it DO deploys from its cached (stale) branch head instead of the current commit.
Update the database:
- make upload-live-db - uploads an explainshell-{date}.db.zst asset to the db-latest release (skipped if digest matches the current newest).
- Push to master - the deploy pipeline resolves the newest asset name, passes it as the Docker DB_NAME build-arg, and the download layer cache-busts to fetch it.