feat: PageIndex-style corpus tools + MCP server + deploy hardening by emeraldtarek · Pull Request #1 · emeraldtarek/from-zero

emeraldtarek · 2026-05-11T13:19:39Z

Summary

Lifts the PageIndex pattern (table-of-contents tree + 3 navigation tools, no vector DB) into the app as a TS-native SQLite resident. Adds five new chat tools (`get_corpus_outline`, `get_section`, `search_corpus`, `list_glossary`, `list_concepts`) so Claude can navigate the whole 22-file curriculum instead of being limited to whatever page the learner is on. The same tools are also exposed as a standalone MCP server so any MCP client (Claude Code, Claude Desktop, the inspector) can navigate the corpus directly.

Plus four supporting changes:

Haiku-written summaries for every corpus node, cached by `content_hash` so re-runs are idempotent. ~$0.30 to fully populate.
FTS5 + BM25 + Porter stemming for `search_corpus` (replaces the old substring scan; `isotopes`/`isotopic` now match the same lemma).
Auto-continue on truncation for many-tool-call chat turns — stress-tested with a 70-tool-call response (one auto-continue at element ~58, all 70 landed).
`.env.example` pattern for user-data MD files — `glossary.md` / `knowledge-tracker.md` / `questions-and-answers.md` / `progress-log.md` are now gitignored runtime state, auto-bootstrapped from committed `*.example.md` templates on first run. VPS starts with a clean slate by default.

Deploy script (`deploy/deploy.sh`) now: bootstraps the user-data files, builds the corpus tree synchronously (no first-request race), and soft-calls `npm run summarize` (skips if no auth, never blocks the deploy on transient errors).

Deploy hardening

Production posture for from-zero.emeraldlake.io (Hetzner CX23):

Cloudflare Origin CA cert (RSA 2048, 15-year validity) replaces Let's Encrypt at the origin. Caddy serves it from /etc/caddy/tls/origin.{pem,key} and stops attempting ACME entirely.
Caddyfile now points at the Origin CA paths instead of doing auto-TLS, so renewals are unneeded.
deploy/install-origin-cert.sh: scps the cert + key to the box, verifies the pair via pubkey hash first, installs with correct ownership + perms, validates and reloads Caddy. Idempotent.
deploy/lock-origin-to-cloudflare.sh: replaces wide-open ufw allow 80/443 with allow-lists scoped to Cloudflare's published IPv4 + IPv6 ranges. SSH on :22 stays open. Idempotent via /var/lib/cloudflare-ufw.list state file.
Cloudflare Full (strict) + orange-cloud proxy: origin IP is hidden in DNS, visitors see Cloudflare's edge cert, edge verifies the Origin CA cert at the box.
Cloudflare Access (Zero Trust) sits in front of the app: visitors hit a one-time-PIN email login at emeraldtarek.cloudflareaccess.com before any request reaches the origin. 30-day session, allowlist scoped to a single email.
bootstrap.sh no longer installs the Caddyfile or restarts Caddy (the Origin CA paths don't exist on a fresh box). Operator installs them via the harden playbook in step 7. Also fixes the chown gap that left the deploy SSH key's authorized_keys root-owned.
deploy.sh drops sudo from the diagnostic systemctl status call (extra args didn't match sudoers, caused a password prompt mid-deploy).

New skill: `cloudflare-harden`

.claude/skills/cloudflare-harden/SKILL.md is a project-scoped Claude Code skill that captures the full 8-step hardening sequence with API calls, verification commands, rollback recipes, and the actual gotchas we hit (private-key-clobbered-by-Write, sudoers arg mismatch, NXDOMAIN cache, MCP scope quirks). Future-Claude can rerun the playbook on a fresh box without rediscovering them.

Test plan

🤖 Generated with Claude Code

…tools Build a hierarchical index over zero/**/*.md by parsing # headers and slicing each section's line range; persist 446 nodes (22 H1 / 221 H2 / 181 H3 / 22 H4) to a new `corpus_node` SQLite table. Each node carries a placeholder summary (first sentence) for outline-time rendering. Wire five new tools into the tutor-chat tool set so Claude can navigate the corpus instead of guessing or being limited to whatever page the learner is on: - `get_corpus_outline({phase_id?, max_level?})` — H1+ tree as Markdown - `get_section(slug)` — verbatim Markdown of any node, with `did_you_mean` suggestions on miss - `search_corpus(query)` — substring + token-overlap ranker - `list_glossary(prefix?)` — read-side companion to add_glossary_term - `list_concepts({phase_id?, status?})` — ditto for mark_concept_status System prompt now injects the lean H1-only outline (~2 KB) alongside the current page's full text and the available concept slugs. The model is instructed to use search/get_section for cross-phase questions rather than hand-waving. End-to-end verified: cross-phase query "how does ion exchange in water treatment relate to crown-ether LLX in lithium-6 separation?" produced a precise multi-source answer in 3 tool calls (outline → search → get_section) with citations from both phases. Bad-slug path verified: get_section('atomic-hypothesis') returns not_found + 5 did_you_mean suggestions; the model self-corrects to the real concept slug. Why: the session simulation surfaced four friction points (whole-page injection brittleness, hallucinated identifiers, no TOC reasoning, glossary blindness). Research summarized at web/docs/pageindex-research.md recommended lifting the PageIndex *pattern* (TOC tree + 3 nav tools) into a TS-native, SQLite-resident reimplementation rather than running their Python pipeline or hosted MCP. This commit ships the half-day MVP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The session simulations kept running into the same failure mode: when Claude tried to emit ~7+ tool_use blocks in a single response, the rest silently disappeared. Diagnosed two compounding causes via stop_reason logging: 1. The SDK's `maxTurns` was set to 8 — but each tool_use round-trip counts as a turn, so a 7-call batch ate the budget. Fix: bump the per-call cap to 60. 2. When the SDK ends with `error_max_turns` (or `error_max_tokens`), that's a recoverable signal, not an error. Fix: detect it on the result message and re-fire `query()` with a continuation prompt that includes the user's original ask + every tool call already made (with inputs) + any prose already streamed. Up to 3 auto-continues. Also tightened the tutor system prompt with explicit "tool calls first, prose after" guidance, since long prose followed by many tools is the exact pattern that hits the cap. Verified with two stress tests: - 18 glossary terms in one ask → completed in zero auto-continues (the bumped maxTurns alone handles this). - 70 glossary terms (one per element, H–Yb) → completed cleanly with one auto-continue at element ~58. UI shows `_(continuing… 1/3)_`. The Anthropic-API path's max_tokens was also bumped from 4096 to 16384 for parity, though we never observed truncation there in practice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t_hash The placeholder first-sentence summaries gave the outline a coherent shape but read as fragments ("None. This is the floor."). Now each node with word_count >= 50 gets a one-sentence Haiku-written summary that leads with the load-bearing claim, definition, formula, or named effect. Pipeline: - `npm run summarize` reads each node's section text, calls Haiku 4.5 via Anthropic SDK (auth: ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN with the oauth-2025-04-20 beta header), and persists the result. - Cache: new `summary_cache` table keyed by content_hash. Identical content across rebuilds = no re-call. `--force` re-runs everything. - corpus-builder now consults the cache on insert, so re-running `npm run build-index` after a markdown edit keeps the existing summary for unchanged sections and falls back to the first-sentence placeholder for genuinely new ones. Quality: - 339 summaries generated in 91s with concurrency 5; 0 errors. - 47 first-pass outputs leaked Markdown headers / preamble despite the system prompt; tightened the prompt with explicit rules + good and bad examples + a `cleanSummary()` post-processor that strips leading `#`, "Summary:", "I understand…", quote/backtick wrappers, and `**bold**`. Re-running on the cleaned subset produced clean one-liners. Outline impact: H1-only outline grew 2.2 KB → 4.3 KB. Still fits the system prompt comfortably. Each phase-3 page summary now reads like a self-contained abstract. Auth verification: confirmed the Anthropic SDK accepts CLAUDE_CODE_OAUTH_TOKEN as `authToken` plus the `anthropic-beta: oauth-2025-04-20` header against api.anthropic.com. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the in-app chat tools as a stdio Model Context Protocol server so any MCP client (Claude Code, Claude Desktop, the official inspector, custom clients) can navigate the corpus directly. Reads the same SQLite DB the Next.js app writes to, so glossary + concept-tracker state are always in sync. No re-implementation of helpers — the server is a thin wrapper over `corpus-index.ts` and `repos.ts`. - `web/scripts/mcp-server.ts` — stdio server with 5 registered tools - `web/scripts/test-mcp-server.ts` — spawns it via the SDK Client and verifies every tool with realistic args (good slug, bad slug, cross-phase search, etc.) - `web/docs/mcp-server.md` — Claude Code registration steps + tool table Smoke-tested: list_tools returns 5 tools; get_corpus_outline, get_section (valid + bad slug → did_you_mean), search_corpus ("crown ether"), list_glossary, list_concepts(status=solid) all return the expected payloads. Why: the corpus-index is genuinely useful outside the browser app — Claude Code can look up sections while editing the markdown source, or other Claude Code projects (paper drafts, slides) can read the corpus without spinning up Next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… Porter stemming The substring + token-overlap approach was adequate at 446 nodes but ranked poorly: it gave equal weight to a stray match in a slug and a strong match in a section's body, and missed morphological variants ("isotopes" didn't match "isotopic"). This commit swaps in a proper full-text index: - New `corpus_fts` virtual table (FTS5, porter+unicode61 tokenizer) with columns slug (UNINDEXED), title, summary, content. Rebuilt wholesale by `corpus-builder.ts` alongside `corpus_node` so they never drift. - `searchCorpus()` now runs `MATCH ?` with a sanitized OR-of-tokens query and ranks by `bm25(corpus_fts, 8.0, 4.0, 1.0)` (title weighted highest, then summary, then body). Score is normalized so callers can keep "higher = better". - Falls back to the old substring scan if the FTS query is empty (all stop-words/punctuation) or the virtual table is missing. Verified on representative queries: - "crown ether liquid extraction" → top hits are crown-ether LLX sections in both 02-water-treatment and 03-lithium-isotope-separation - "mercury amalgam isotope" → top hits are COLEX principle + Y-12 legacy - "fusion blanket tritium" → top hit is "The tritium breeding ratio (TBR)" - "Avogadro Brownian Einstein" → top hit is "How we *know* atoms exist" - Porter stemming check: searchCorpus("isotopes") and searchCorpus("isotopic") return identical top hits (same lemma) Tool descriptions in tutor-prompt.ts and mcp-server.ts updated to note FTS5 capabilities (stemming, phrase support) so the model picks better keywords. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Caddyfile: serve Cloudflare Origin CA cert (valid until 2041) instead of ACME. Required so Caddy stops attempting HTTP-01 renewals after the hostname is fronted by Cloudflare's proxy. - lock-origin-to-cloudflare.sh: restrict ufw 80/443 to Cloudflare IP ranges only. SSH on 22 stays open. Idempotent via /var/lib/cloudflare-ufw.list. - deploy.sh: drop sudo from the diagnostic `systemctl status` call. The extra `--no-pager --lines=10` args don't match the sudoers rule, which caused the deploy to hang on a password prompt. - bootstrap.sh: chown the lithium user's authorized_keys after the root-shell redirect (the `>>` ran as root, not as the sudo'd user). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The four tracking files (glossary, knowledge-tracker, questions-and- answers, progress-log) are per-environment runtime state — the app rewrites them on every chat turn. Keeping them in git produced noisy diffs and meant the VPS started with whatever happened to be the last local session's residue. Now mirrors the `.env` / `.env.local` pattern: - *.example.md files (committed) — the seed templates - *.md files (gitignored) — the live runtime files, auto-copied from the templates on first run by bootstrapUserData() in content-loader bootstrapUserData runs from ensureSeeded(), which already fires on every API call, so a fresh VPS picks up the templates automatically on the first request — no manual setup beyond `npm install` + auth env vars. Seed templates: empty-stub glossary + Q&A, all-todo knowledge tracker, kickoff-only progress log (drops the two simulation entries that had been committed by accident). Verified: rm-ing all four live files and hitting /api/concepts re-creates them from the templates with identical byte counts to the .example.md siblings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The deploy was missing two things: 1. `npm run ingest` only synced pages + concepts and regenerated two mirror files. It did NOT build the corpus_node + corpus_fts tree that the chat tools navigate, and did NOT bootstrap the live user-data .md files from their .example.md templates. The app technically self-healed via the lazy build in ensureSeeded(), but that fires async-fire-and-forget on the first API call, which races against incoming chat traffic. Fixes: ingest now calls bootstrapUserData() first, buildCorpusIndex() after pages/concepts, and also regenerates the Q&A mirror (not just glossary + knowledge-tracker). 2. summary_cache was never populated on the VPS. Without it, the chat outline uses placeholder first-sentence summaries instead of Haiku-written ones. Adds a soft `npm run summarize` step to deploy.sh, gated on auth presence and skippable via LITHIUM_SKIP_SUMMARIZE=1. Idempotent by content_hash — subsequent deploys re-summarize only changed sections. Auth-missing or transient errors warn but don't fail the deploy. Verified locally: `rm zero/{04-learning,05-meta}/*.md && npm run ingest` re-bootstraps all four files from their templates, populates pages + concepts (88 concepts seeded), builds 446 corpus nodes, regenerates the three live mirror files. Idempotent on second run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the production hardening sequence we just applied (from-zero.emeraldlake.io) as a reusable, project-scope Claude Code skill so it can be rerun on a fresh box without rediscovering the gotchas. - .claude/skills/cloudflare-harden/SKILL.md: 8-step playbook (Origin CA cert -> Full strict TLS -> orange-cloud proxy -> ufw lock to CF IPs -> Zero Trust Access). Auto-discoverable; triggers on "harden the box", "lock origin to cloudflare", "add cloudflare access", etc. Documents the actual bugs we hit (private-key-clobbered-by-Write, sudoers arg mismatch on systemctl status, NXDOMAIN cache, MCP scope quirks) plus rollback recipes for every step. - deploy/install-origin-cert.sh: reusable origin-side installer. Takes cert + key + IP, verifies the pair via pubkey hash before sending anything to the box (catches the file-overwrite bug), scps cert + key + the project Caddyfile, installs with caddy ownership, validates and reloads. Idempotent. - deploy/bootstrap.sh: stop copying the Caddyfile or restarting Caddy. The TLS-enabled Caddyfile references /etc/caddy/tls/origin.{pem,key} which don't exist on a fresh box, so installing it here crashed Caddy. Caddy stays on its default config until the harden playbook installs the cert + Caddyfile together. Next-steps text now points at the skill. - deploy/README.md: adds step 7 (Cloudflare hardening), updates step 1 to start in grey cloud (proxy gets flipped on as part of step 7), lists the new scripts in the file table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

emeraldtarek and others added 9 commits May 10, 2026 11:58

emeraldtarek merged commit e5f488a into main May 11, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: PageIndex-style corpus tools + MCP server + deploy hardening#1

feat: PageIndex-style corpus tools + MCP server + deploy hardening#1
emeraldtarek merged 9 commits into
mainfrom
feat/corpus-index

emeraldtarek commented May 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

emeraldtarek commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Deploy hardening

New skill: cloudflare-harden

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

emeraldtarek commented May 11, 2026 •

edited

Loading

New skill: `cloudflare-harden`