feat: PageIndex-style corpus tools + MCP server + deploy hardening#1

Merged
emeraldtarek merged 9 commits into main from feat/corpus-index on May 11, 2026

@emeraldtarek commented on May 11, 2026

Summary

Lifts the PageIndex pattern (table-of-contents tree + 3 navigation tools, no vector DB) into the app as a TS-native, SQLite-resident reimplementation. Adds five new chat tools (`get_corpus_outline`, `get_section`, `search_corpus`, `list_glossary`, `list_concepts`) so Claude can navigate the whole 22-file curriculum instead of being limited to whatever page the learner is on. The same tools are also exposed as a standalone MCP server so any MCP client (Claude Code, Claude Desktop, the inspector) can navigate the corpus directly.

Plus four supporting changes:

  • Haiku-written summaries for every corpus node, cached by `content_hash` so re-runs are idempotent. ~$0.30 to fully populate.
  • FTS5 + BM25 + Porter stemming for `search_corpus` (replaces the old substring scan; `isotopes`/`isotopic` now match the same lemma).
  • Auto-continue on truncation for many-tool-call chat turns — stress-tested with a 70-tool-call response (one auto-continue at element ~58, all 70 landed).
  • `.env.example` pattern for user-data MD files — `glossary.md` / `knowledge-tracker.md` / `questions-and-answers.md` / `progress-log.md` are now gitignored runtime state, auto-bootstrapped from committed `*.example.md` templates on first run. VPS starts with a clean slate by default.

Deploy script (`deploy/deploy.sh`) now: bootstraps the user-data files, builds the corpus tree synchronously (no first-request race), and soft-calls `npm run summarize` (skips if no auth, never blocks the deploy on transient errors).

Deploy hardening

Production posture for from-zero.emeraldlake.io (Hetzner CX23):

  • Cloudflare Origin CA cert (RSA 2048, 15-year validity) replaces Let's Encrypt at the origin. Caddy serves it from /etc/caddy/tls/origin.{pem,key} and stops attempting ACME entirely.
  • Caddyfile now points at the Origin CA paths instead of doing auto-TLS, so no certificate renewals are needed.
  • deploy/install-origin-cert.sh: scps the cert + key to the box, verifies the pair via pubkey hash first, installs with correct ownership + perms, validates and reloads Caddy. Idempotent.
  • deploy/lock-origin-to-cloudflare.sh: replaces wide-open ufw allow 80/443 with allow-lists scoped to Cloudflare's published IPv4 + IPv6 ranges. SSH on :22 stays open. Idempotent via /var/lib/cloudflare-ufw.list state file.
  • Cloudflare Full (strict) + orange-cloud proxy: origin IP is hidden in DNS, visitors see Cloudflare's edge cert, edge verifies the Origin CA cert at the box.
  • Cloudflare Access (Zero Trust) sits in front of the app: visitors hit a one-time-PIN email login at emeraldtarek.cloudflareaccess.com before any request reaches the origin. 30-day session, allowlist scoped to a single email.
  • bootstrap.sh no longer installs the Caddyfile or restarts Caddy (the Origin CA paths don't exist on a fresh box). Operator installs them via the harden playbook in step 7. Also fixes the chown gap that left the deploy SSH key's authorized_keys root-owned.
  • deploy.sh drops sudo from the diagnostic systemctl status call (extra args didn't match sudoers, caused a password prompt mid-deploy).

New skill: cloudflare-harden

.claude/skills/cloudflare-harden/SKILL.md is a project-scoped Claude Code skill that captures the full 8-step hardening sequence with API calls, verification commands, rollback recipes, and the actual gotchas we hit (private-key-clobbered-by-Write, sudoers arg mismatch, NXDOMAIN cache, MCP scope quirks). Future-Claude can rerun the playbook on a fresh box without rediscovering them.

Test plan

  • `npx tsc --noEmit` clean
  • All 11 routes return 200 (`/`, `/progress`, `/glossary`, `/qa`, `/chat`, `/settings`, reader pages, all API endpoints)
  • In-page chat works with no extra round-trips for page-resident questions
  • Cross-phase chat exercises new corpus tools — verified with "how does ion exchange in water treatment relate to crown-ether LLX in lithium-6 separation?" (calls `search_corpus` → `get_section` and cites both phases)
  • Bad-slug path returns `did_you_mean` suggestions; model self-corrects
  • Reader-page chat sidebar stays sticky at scroll y=1800
  • MCP server smoke test (`web/scripts/test-mcp-server.ts`) — all 5 tools return expected shapes
  • FTS5 ranking verified on 5 representative queries; Porter stemming confirmed
  • Bootstrap test: `rm zero/{04-learning,05-meta}/*.md`, then hit `/api/concepts` → all 4 live files reappear from their `*.example.md` templates
  • Local `npm run ingest` does the full job end-to-end (bootstrap + DB seed + corpus tree + mirror regen)

🤖 Generated with Claude Code

emeraldtarek and others added 9 commits May 10, 2026 11:58
…tools

Build a hierarchical index over zero/**/*.md by parsing # headers and
slicing each section's line range; persist 446 nodes (22 H1 / 221 H2 /
181 H3 / 22 H4) to a new `corpus_node` SQLite table. Each node carries
a placeholder summary (first sentence) for outline-time rendering.
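The header-slicing step described above can be sketched roughly as follows. This is a minimal illustration, not the project's `corpus-builder.ts`: the `CorpusNode` shape, `slugify`, and `firstSentence` are hypothetical names, and the real `corpus_node` columns are not shown in this PR.

```typescript
// Hypothetical node shape; the real `corpus_node` schema may differ.
interface CorpusNode {
  slug: string;
  level: number;     // 1 for "#", 2 for "##", and so on
  title: string;
  startLine: number; // 0-based index of the header line
  endLine: number;   // exclusive: the section spans [startLine, endLine)
  summary: string;   // placeholder: first sentence of the section body
}

const FENCE = "`".repeat(3);

function slugify(title: string): string {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

function firstSentence(text: string): string {
  const flat = text.replace(/\s+/g, " ").trim();
  const match = flat.match(/^.*?[.!?](?=\s|$)/);
  return match ? match[0] : flat;
}

function parseOutline(markdown: string): CorpusNode[] {
  const lines = markdown.split("\n");
  const nodes: CorpusNode[] = [];
  let inFence = false;
  lines.forEach((line, i) => {
    // Ignore "#" lines that sit inside fenced code blocks.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    if (inFence) return;
    const m = /^(#{1,6})\s+(.*\S)\s*$/.exec(line);
    if (!m) return;
    nodes.push({
      slug: slugify(m[2]),
      level: m[1].length,
      title: m[2],
      startLine: i,
      endLine: lines.length,
      summary: "",
    });
  });
  // A section ends at the next header of the same or a shallower level.
  nodes.forEach((node, idx) => {
    const next = nodes.slice(idx + 1).find((n) => n.level <= node.level);
    if (next) node.endLine = next.startLine;
    const body = lines
      .slice(node.startLine + 1, node.endLine)
      .filter((l) => !/^#{1,6}\s/.test(l))
      .join(" ");
    node.summary = firstSentence(body);
  });
  return nodes;
}
```

Because a parent's line range contains its children, fetching an H1 node returns the whole page while an H3 node returns only its own slice.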

Wire five new tools into the tutor-chat tool set so Claude can navigate
the corpus instead of guessing or being limited to whatever page the
learner is on:

- `get_corpus_outline({phase_id?, max_level?})` — H1+ tree as Markdown
- `get_section(slug)` — verbatim Markdown of any node, with
  `did_you_mean` suggestions on miss
- `search_corpus(query)` — substring + token-overlap ranker
- `list_glossary(prefix?)` — read-side companion to add_glossary_term
- `list_concepts({phase_id?, status?})` — ditto for mark_concept_status

System prompt now injects the lean H1-only outline (~2 KB) alongside the
current page's full text and the available concept slugs. The model is
instructed to use search/get_section for cross-phase questions rather
than hand-waving.

End-to-end verified: cross-phase query "how does ion exchange in water
treatment relate to crown-ether LLX in lithium-6 separation?" produced
a precise multi-source answer in 3 tool calls (outline → search →
get_section) with citations from both phases.

Bad-slug path verified: get_section('atomic-hypothesis') returns
not_found + 5 did_you_mean suggestions; the model self-corrects to the
real concept slug.
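One way the `not_found` + `did_you_mean` response could be produced is sketched below. The PR does not say how suggestions are ranked, so the ranking here (shared hyphen-separated tokens first, Levenshtein distance as tie-break) is an assumption, as are the `SectionResult` shape and the example slugs.

```typescript
// Hypothetical result shape for get_section.
type SectionResult =
  | { status: "ok"; slug: string; markdown: string }
  | { status: "not_found"; slug: string; did_you_mean: string[] };

// Classic two-row Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Number of hyphen-separated tokens of `a` that also appear in `b`.
function tokenOverlap(a: string, b: string): number {
  const tokensB = new Set(b.split("-"));
  return a.split("-").filter((t) => tokensB.has(t)).length;
}

function suggestSlugs(missing: string, known: string[], limit = 5): string[] {
  return known
    .map((slug) => ({
      slug,
      overlap: tokenOverlap(missing, slug),
      distance: levenshtein(missing, slug),
    }))
    .sort((x, y) => y.overlap - x.overlap || x.distance - y.distance)
    .slice(0, limit)
    .map((s) => s.slug);
}

function getSection(slug: string, sections: Map<string, string>): SectionResult {
  const markdown = sections.get(slug);
  if (markdown !== undefined) return { status: "ok", slug, markdown };
  const known = Array.from(sections.keys());
  return { status: "not_found", slug, did_you_mean: suggestSlugs(slug, known) };
}
```

Returning suggestions in the tool result, rather than throwing, gives the model something concrete to retry with in the same turn.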

Why: the session simulation surfaced four friction points (whole-page
injection brittleness, hallucinated identifiers, no TOC reasoning,
glossary blindness). Research summarized at web/docs/pageindex-research.md
recommended lifting the PageIndex *pattern* (TOC tree + 3 nav tools)
into a TS-native, SQLite-resident reimplementation rather than running
their Python pipeline or hosted MCP. This commit ships the half-day MVP.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The session simulations kept running into the same failure mode: when
Claude tried to emit ~7+ tool_use blocks in a single response, the rest
silently disappeared. Diagnosed two compounding causes via stop_reason
logging:

1. The SDK's `maxTurns` was set to 8 — but each tool_use round-trip
   counts as a turn, so a 7-call batch ate the budget. Fix: bump the
   per-call cap to 60.
2. When the SDK ends with `error_max_turns` (or `error_max_tokens`),
   that's a recoverable signal, not an error. Fix: detect it on the
   result message and re-fire `query()` with a continuation prompt that
   includes the user's original ask + every tool call already made
   (with inputs) + any prose already streamed. Up to 3 auto-continues.
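The auto-continue behavior in fix 2 can be sketched as a loop around an injected query function. This is an illustration under assumptions: `RunQuery`, `TurnResult`, and `buildContinuationPrompt` are hypothetical names standing in for the SDK's `query()` and its result message, whose real shapes are not shown here.

```typescript
interface ToolCall { name: string; input: unknown }

// Hypothetical summary of one query() run; the stop reasons follow the commit message.
interface TurnResult {
  stopReason: "success" | "error_max_turns" | "error_max_tokens";
  toolCalls: ToolCall[];
  prose: string;
}
type RunQuery = (prompt: string) => Promise<TurnResult>;

const MAX_AUTO_CONTINUES = 3;
const RECOVERABLE = new Set<string>(["error_max_turns", "error_max_tokens"]);

// The continuation prompt carries the original ask, every tool call so far, and streamed prose.
function buildContinuationPrompt(userAsk: string, toolCalls: ToolCall[], prose: string): string {
  const done = toolCalls
    .map((c, i) => `${i + 1}. ${c.name}(${JSON.stringify(c.input)})`)
    .join("\n");
  return [
    "Continue the previous response; it was cut off.",
    `Original request:\n${userAsk}`,
    `Tool calls already made (do not repeat):\n${done || "(none)"}`,
    `Prose already streamed:\n${prose || "(none)"}`,
  ].join("\n\n");
}

async function runWithAutoContinue(userAsk: string, runQuery: RunQuery) {
  const toolCalls: ToolCall[] = [];
  let prose = "";
  let prompt = userAsk;
  let continues = 0;
  for (;;) {
    const result = await runQuery(prompt);
    toolCalls.push(...result.toolCalls);
    prose += result.prose;
    // Stop on a normal finish, or once the continuation budget is spent.
    if (!RECOVERABLE.has(result.stopReason) || continues >= MAX_AUTO_CONTINUES) {
      return { toolCalls, prose, continues, stopReason: result.stopReason };
    }
    continues += 1;
    prompt = buildContinuationPrompt(userAsk, toolCalls, prose);
  }
}
```

Listing the completed tool calls with their inputs is what lets the resumed run pick up where the previous one stopped instead of replaying work.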

Also tightened the tutor system prompt with explicit "tool calls first,
prose after" guidance, since long prose followed by many tools is the
exact pattern that hits the cap.

Verified with two stress tests:
- 18 glossary terms in one ask → completed in zero auto-continues
  (the bumped maxTurns alone handles this).
- 70 glossary terms (one per element, H–Yb) → completed cleanly with
  one auto-continue at element ~58. UI shows `_(continuing… 1/3)_`.

The Anthropic-API path's max_tokens was also bumped from 4096 to
16384 for parity, though we never observed truncation there in
practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t_hash

The placeholder first-sentence summaries gave the outline a coherent
shape but read as fragments ("None. This is the floor."). Now each
node with word_count >= 50 gets a one-sentence Haiku-written summary
that leads with the load-bearing claim, definition, formula, or named
effect.

Pipeline:
- `npm run summarize` reads each node's section text, calls Haiku 4.5
  via Anthropic SDK (auth: ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN
  with the oauth-2025-04-20 beta header), and persists the result.
- Cache: new `summary_cache` table keyed by content_hash. Identical
  content across rebuilds = no re-call. `--force` re-runs everything.
- corpus-builder now consults the cache on insert, so re-running
  `npm run build-index` after a markdown edit keeps the existing
  summary for unchanged sections and falls back to the first-sentence
  placeholder for genuinely new ones.
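The caching behavior in the pipeline above might look roughly like this. It is a sketch under stated assumptions: a `Map` stands in for the `summary_cache` table, `callModel` stands in for the Haiku request, and FNV-1a stands in for the content hash, since the PR does not say how `content_hash` is computed.

```typescript
// FNV-1a (32-bit) is a stand-in; the project's real hash function is not shown.
function contentHash(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

const MIN_WORDS = 50; // nodes below this keep the first-sentence placeholder

function wordCount(text: string): number {
  const words = text.trim().split(/\s+/);
  return words[0] === "" ? 0 : words.length;
}

// Builder side: reuse a cached summary for unchanged content, else fall back to the placeholder.
function summaryOnInsert(text: string, cache: Map<string, string>, placeholder: string): string {
  return cache.get(contentHash(text)) ?? placeholder;
}

// Summarizer side: call the model only on a cache miss, or when forced.
async function summarizeNode(
  text: string,
  cache: Map<string, string>,
  callModel: (text: string) => Promise<string>,
  force = false,
): Promise<"skipped" | "cached" | "generated"> {
  if (wordCount(text) < MIN_WORDS) return "skipped";
  const key = contentHash(text);
  if (!force && cache.has(key)) return "cached";
  cache.set(key, await callModel(text));
  return "generated";
}
```

Keying on the section text rather than the slug means a renamed or moved section keeps its summary, while any edit to the body triggers exactly one new model call.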

Quality:
- 339 summaries generated in 91s with concurrency 5; 0 errors.
- 47 first-pass outputs leaked Markdown headers / preamble despite
  the system prompt; tightened the prompt with explicit rules + good
  and bad examples + a `cleanSummary()` post-processor that strips
  leading `#`, "Summary:", "I understand…", quote/backtick wrappers,
  and `**bold**`. Re-running on the cleaned subset produced clean
  one-liners.
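A post-processor along the lines of `cleanSummary()` could be sketched as below. The function name comes from the commit message; the exact rules and regular expressions are assumptions, written to cover the leak patterns listed above.

```typescript
// Strip common model "leaks" from a one-sentence summary (illustrative rules only).
function cleanSummary(raw: string): string {
  let s = raw.trim();
  s = s.replace(/^["'`]+|["'`]+$/g, "");                 // quote/backtick wrappers
  s = s.replace(/^#+\s*/, "");                           // leading Markdown header marks
  s = s.replace(/^summary:\s*/i, "");                    // "Summary:" label
  s = s.replace(/^I understand[^.!?\n]*[.!?]?\s*/i, ""); // conversational preamble sentence
  s = s.replace(/\*\*(.+?)\*\*/g, "$1");                 // **bold** becomes plain text
  return s.trim();
}
```

The rules run in a fixed order, so a header mark followed by a "Summary:" label is handled in one pass; more deeply nested wrappers would need a second pass.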

Outline impact: H1-only outline grew 2.2 KB → 4.3 KB. Still fits
the system prompt comfortably. Each phase-3 page summary now reads
like a self-contained abstract.

Auth verification: confirmed the Anthropic SDK accepts
CLAUDE_CODE_OAUTH_TOKEN as `authToken` plus the
`anthropic-beta: oauth-2025-04-20` header against api.anthropic.com.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the in-app chat tools as a stdio Model Context Protocol server so
any MCP client (Claude Code, Claude Desktop, the official inspector,
custom clients) can navigate the corpus directly.

Reads the same SQLite DB the Next.js app writes to, so glossary +
concept-tracker state are always in sync. No re-implementation of
helpers — the server is a thin wrapper over `corpus-index.ts` and
`repos.ts`.

- `web/scripts/mcp-server.ts` — stdio server with 5 registered tools
- `web/scripts/test-mcp-server.ts` — spawns it via the SDK Client and
  verifies every tool with realistic args (good slug, bad slug,
  cross-phase search, etc.)
- `web/docs/mcp-server.md` — Claude Code registration steps + tool table

Smoke-tested: list_tools returns 5 tools; get_corpus_outline,
get_section (valid + bad slug → did_you_mean), search_corpus
("crown ether"), list_glossary, list_concepts(status=solid) all return
the expected payloads.

Why: the corpus-index is genuinely useful outside the browser app —
Claude Code can look up sections while editing the markdown source,
or other Claude Code projects (paper drafts, slides) can read the
corpus without spinning up Next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… Porter stemming

The substring + token-overlap approach was adequate at 446 nodes but
ranked poorly: it gave equal weight to a stray match in a slug and
a strong match in a section's body, and missed morphological variants
("isotopes" didn't match "isotopic").

This commit swaps in a proper full-text index:

- New `corpus_fts` virtual table (FTS5, porter+unicode61 tokenizer)
  with columns slug (UNINDEXED), title, summary, content. Rebuilt
  wholesale by `corpus-builder.ts` alongside `corpus_node` so they
  never drift.
- `searchCorpus()` now runs `MATCH ?` with a sanitized OR-of-tokens
  query and ranks by `bm25(corpus_fts, 8.0, 4.0, 1.0)` (title weighted
  highest, then summary, then body). Score is normalized so callers
  can keep "higher = better".
- Falls back to the old substring scan if the FTS query is empty
  (all stop-words/punctuation) or the virtual table is missing.
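The query-building side of the list above can be sketched as follows. This is an assumption-laden illustration: `buildFtsQuery`, `normalizeScore`, and the stop-word list are hypothetical, the sanitizer keeps ASCII tokens only for brevity, and the SQL string simply restates the `bm25()` call from this commit message rather than the project's actual statement.

```typescript
// Hypothetical stop-word list; the real sanitizer's list is not shown in the PR.
const STOP_WORDS = new Set(["a", "an", "and", "in", "is", "of", "or", "the", "to"]);

// Reduce free text to an FTS5-safe OR-of-tokens MATCH expression.
function buildFtsQuery(raw: string): string {
  const tokens: string[] = raw.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const kept = Array.from(new Set(tokens)).filter((t) => !STOP_WORDS.has(t));
  // Quoting each token keeps FTS5 from reading it as an operator such as AND or NEAR.
  return kept.map((t) => `"${t}"`).join(" OR ");
}

// SQLite's bm25() returns lower (more negative) values for better matches, so negate it.
function normalizeScore(bm25Rank: number): number {
  return -bm25Rank;
}

// Parameters: the MATCH expression from buildFtsQuery(), then the row limit.
const SEARCH_SQL = `
  SELECT slug, title, bm25(corpus_fts, 8.0, 4.0, 1.0) AS raw_rank
  FROM corpus_fts
  WHERE corpus_fts MATCH ?
  ORDER BY raw_rank
  LIMIT ?`;
```

An empty string from `buildFtsQuery` is the signal to take the substring-scan fallback. One thing worth checking against the real schema: FTS5 assigns `bm25()` weights to columns in declaration order, so if `slug` is declared first, as the column list suggests, the three weights may land on slug/title/summary rather than title/summary/content unless a leading weight is added for `slug`.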

Verified on representative queries:
- "crown ether liquid extraction" → top hits are crown-ether LLX
  sections in both 02-water-treatment and 03-lithium-isotope-separation
- "mercury amalgam isotope" → top hits are COLEX principle + Y-12 legacy
- "fusion blanket tritium" → top hit is "The tritium breeding ratio (TBR)"
- "Avogadro Brownian Einstein" → top hit is "How we *know* atoms exist"
- Porter stemming check: searchCorpus("isotopes") and
  searchCorpus("isotopic") return identical top hits (same lemma)

Tool descriptions in tutor-prompt.ts and mcp-server.ts updated to
note FTS5 capabilities (stemming, phrase support) so the model picks
better keywords.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Caddyfile: serve Cloudflare Origin CA cert (valid until 2041) instead of
  ACME. Required so Caddy stops attempting HTTP-01 renewals after the
  hostname is fronted by Cloudflare's proxy.
- lock-origin-to-cloudflare.sh: restrict ufw 80/443 to Cloudflare IP ranges
  only. SSH on 22 stays open. Idempotent via /var/lib/cloudflare-ufw.list.
- deploy.sh: drop sudo from the diagnostic `systemctl status` call. The
  extra `--no-pager --lines=10` args don't match the sudoers rule, which
  caused the deploy to hang on a password prompt.
- bootstrap.sh: chown the lithium user's authorized_keys after the
  root-shell redirect (the `>>` ran as root, not as the sudo'd user).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The four tracking files (glossary, knowledge-tracker,
questions-and-answers, progress-log) are per-environment runtime state: the app
rewrites them on every chat turn. Keeping them in git produced noisy
diffs and meant the VPS started with whatever happened to be the last
local session's residue.

Now mirrors the `.env` / `.env.local` pattern:

- *.example.md files (committed) — the seed templates
- *.md files (gitignored) — the live runtime files, auto-copied from
  the templates on first run by bootstrapUserData() in content-loader

bootstrapUserData runs from ensureSeeded(), which already fires on every
API call, so a fresh VPS picks up the templates automatically on the
first request — no manual setup beyond `npm install` + auth env vars.
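The template-copy step can be sketched as below. The function name `bootstrapUserData` comes from the commit message, but this body is an assumption: it handles a single directory, uses Node's built-in `fs` module, and the demo directory and file contents are made up for illustration.

```typescript
import { copyFileSync, existsSync, mkdtempSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const TEMPLATE_SUFFIX = ".example.md";

// Copy each `<name>.example.md` to `<name>.md` unless the live file already exists.
// Returns the live files it created, so callers can log what was bootstrapped.
function bootstrapUserData(dir: string): string[] {
  const created: string[] = [];
  for (const entry of readdirSync(dir)) {
    if (!entry.endsWith(TEMPLATE_SUFFIX)) continue;
    const live = join(dir, entry.slice(0, -TEMPLATE_SUFFIX.length) + ".md");
    if (existsSync(live)) continue; // never clobber existing learner state
    copyFileSync(join(dir, entry), live);
    created.push(live);
  }
  return created;
}

// Demo in a throwaway directory: the first run creates glossary.md, the second is a no-op.
const demoDir = mkdtempSync(join(tmpdir(), "user-data-"));
writeFileSync(join(demoDir, "glossary.example.md"), "# Glossary\n");
const firstRun = bootstrapUserData(demoDir);
const secondRun = bootstrapUserData(demoDir);
const liveContent = readFileSync(join(demoDir, "glossary.md"), "utf8");
```

Skipping files that already exist is what makes it safe to call this on every API request: after the first run it only costs a directory listing.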

Seed templates: empty-stub glossary + Q&A, all-todo knowledge tracker,
kickoff-only progress log (drops the two simulation entries that had
been committed by accident).

Verified: rm-ing all four live files and hitting /api/concepts re-creates
them from the templates with identical byte counts to the .example.md
siblings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The deploy was missing two things:

1. `npm run ingest` only synced pages + concepts and regenerated two
   mirror files. It did NOT build the corpus_node + corpus_fts tree
   that the chat tools navigate, and did NOT bootstrap the live
   user-data .md files from their .example.md templates. The app
   technically self-healed via the lazy build in ensureSeeded(), but
   that build is fire-and-forget on the first API call, so it races
   against incoming chat traffic.

   Fixes: ingest now calls bootstrapUserData() first, buildCorpusIndex()
   after pages/concepts, and also regenerates the Q&A mirror (not just
   glossary + knowledge-tracker).

2. summary_cache was never populated on the VPS. Without it, the chat
   outline uses placeholder first-sentence summaries instead of
   Haiku-written ones.

   Adds a soft `npm run summarize` step to deploy.sh, gated on
   auth presence and skippable via LITHIUM_SKIP_SUMMARIZE=1. Idempotent
   by content_hash — subsequent deploys re-summarize only changed
   sections. Auth-missing or transient errors warn but don't fail the
   deploy.

Verified locally: `rm zero/{04-learning,05-meta}/*.md && npm run ingest`
re-bootstraps all four files from their templates, populates pages +
concepts (88 concepts seeded), builds 446 corpus nodes, regenerates
the three live mirror files. Idempotent on second run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Captures the production hardening sequence we just applied
(from-zero.emeraldlake.io) as a reusable, project-scope Claude Code skill
so it can be rerun on a fresh box without rediscovering the gotchas.

- .claude/skills/cloudflare-harden/SKILL.md: 8-step playbook (Origin CA
  cert -> Full strict TLS -> orange-cloud proxy -> ufw lock to CF IPs ->
  Zero Trust Access). Auto-discoverable; triggers on "harden the box",
  "lock origin to cloudflare", "add cloudflare access", etc. Documents
  the actual bugs we hit (private-key-clobbered-by-Write, sudoers arg
  mismatch on systemctl status, NXDOMAIN cache, MCP scope quirks) plus
  rollback recipes for every step.

- deploy/install-origin-cert.sh: reusable origin-side installer. Takes
  cert + key + IP, verifies the pair via pubkey hash before sending
  anything to the box (catches the file-overwrite bug), scps cert + key
  + the project Caddyfile, installs with caddy ownership, validates and
  reloads. Idempotent.

- deploy/bootstrap.sh: stop copying the Caddyfile or restarting Caddy.
  The TLS-enabled Caddyfile references /etc/caddy/tls/origin.{pem,key}
  which don't exist on a fresh box, so installing it here crashed Caddy.
  Caddy stays on its default config until the harden playbook installs
  the cert + Caddyfile together. Next-steps text now points at the skill.

- deploy/README.md: adds step 7 (Cloudflare hardening), updates step 1 to
  start in grey cloud (proxy gets flipped on as part of step 7), lists
  the new scripts in the file table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@emeraldtarek merged commit e5f488a into main on May 11, 2026
1 check passed