RFC: Optimize documentation as AI-first ground truth #54

@marc0olo

Background

Docs are authored by humans but increasingly consumed by AI agents on behalf of
developers. This RFC asks: are we optimizing for the right consumer, and what
concrete changes would close the gap?

Two converging use cases require the same foundation — accurate, structured,
discoverable content:

  1. AI that reads docs directly (retrieval): coding tools like Cursor and
    Claude Code fetch pages via llms.txt and .md endpoints
  2. AI that generates explanations on demand (generation grounding): the docs
    are the source of truth the AI grounds its answers in

We plan to integrate with Kapa.AI (as on the current docs.internetcomputer.org),
which handles the conversational AI layer: ingestion, indexing, retrieval, and
the chat interface for developers. The action items below directly serve Kapa.AI
ingestion quality, not just abstract "AI optimization."

This is not a proposal to build AI tooling. It's about ensuring this repo is the
best possible ground truth for both use cases.

Research summary

agentdocsspec.com defines 22 specific checks across 7 categories with
concrete thresholds — not just "have an llms.txt." Notable checks we likely
fail today:

  • Coverage: llms.txt must link to ≥95% of pages (stub pages may degrade this)
  • Content negotiation: serve Content-Type: text/markdown for
    Accept: text/markdown requests — currently not implemented
  • Cache hygiene: markdown endpoints need max-age < 3600 or must-revalidate
    with ETag
  • Platform truncation limits are documented: Claude Code ~100KB, MCP Fetch 5KB
    default, Claude API web_fetch ~20.7KB — relevant for page size decisions
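
Three of the checks above can be approximated locally before running the real compliance checker. The following is a minimal sketch, not the agentdocsspec.com implementation: the function names are hypothetical, the byte budgets treat 1KB as 1000 bytes (an assumption), and link extraction uses a simple regex rather than a Markdown parser.

```python
import re

# Approximate truncation budgets quoted above, in bytes.
# Treating 1KB as 1000 bytes is an assumption of this sketch.
TRUNCATION_LIMITS = {
    "Claude Code": 100_000,
    "MCP Fetch (default)": 5_000,
    "Claude API web_fetch": 20_700,
}

# Matches the URL part of inline Markdown links: [text](url)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def llms_txt_coverage(llms_txt: str, site_pages: set[str]) -> float:
    """Fraction of site pages that llms.txt links to (target: >= 0.95)."""
    if not site_pages:
        return 1.0
    linked = set(LINK_RE.findall(llms_txt))
    return len(site_pages & linked) / len(site_pages)


def truncated_by(markdown: str) -> list[str]:
    """Names of platforms whose budget this page's UTF-8 size exceeds."""
    size = len(markdown.encode("utf-8"))
    return [name for name, limit in TRUNCATION_LIMITS.items() if size > limit]


def cache_headers_ok(headers: dict[str, str]) -> bool:
    """True if max-age < 3600, or must-revalidate is paired with an ETag."""
    normalized = {k.lower(): v for k, v in headers.items()}
    cache_control = normalized.get("cache-control", "")
    directives = [d.strip().lower() for d in cache_control.split(",")]
    for directive in directives:
        name, _, value = directive.partition("=")
        if name == "max-age" and value.isdigit() and int(value) < 3600:
            return True
    return "must-revalidate" in directives and "etag" in normalized
```

Comparing paths as exact strings means the page inventory and the llms.txt links must use the same URL form (for example, both ending in .md).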

Stripe's instructions section in llms.txt is the one structurally novel
pattern worth adopting. They encode semantic directives for AI directly in
llms.txt: preferred APIs, deprecated alternatives, behavioral guidance. No
infrastructure is required, and the pattern is directly applicable here: we
could encode "always use icp CLI, never dfx", preferred patterns, and
deprecation signals. Currently only Stripe does this publicly.
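
To make the shape of such a section concrete, here is a minimal sketch of rendering an llms.txt with directives placed ahead of the link sections. The render_llms_txt function, the "## Instructions" heading, and the example paths are assumptions for illustration; they are neither Stripe's exact format nor what the existing plugin emits.

```python
def render_llms_txt(
    title: str,
    summary: str,
    instructions: list[str],
    sections: dict[str, list[tuple[str, str]]],
) -> str:
    """Render llms.txt: H1 title, blockquote summary, directives, link lists."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    if instructions:
        # Heading name and placement are assumptions of this sketch.
        lines += ["## Instructions", ""]
        lines += [f"- {directive}" for directive in instructions]
        lines.append("")
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        lines += [f"- [{name}]({url})" for name, url in links]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


example = render_llms_txt(
    title="Example Docs",
    summary="Hypothetical summary line.",
    instructions=["Always use the icp CLI, never dfx."],
    sections={"Guides": [("Quickstart", "/guides/quickstart.md")]},
)
```

Keeping the directives in a plain list that lives next to the docs source would let them be reviewed like any other content change.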

The llms.txt + .md endpoints pattern is already correct for this site's
scale. The "nested llms.txt" variant (section-level index files) is a scaling
solution for sites where the root index exceeds 50KB, which is not a concern at
the current page count. llms-full.txt (all content concatenated into one file)
is published by some framework docs sites built with Starlight or VitePress,
but it is too large for current AI fetch pipelines (see the truncation limits
above) and is primarily useful for humans manually piping docs into an LLM. It
is not an agent optimization.

Diataxis has real value for AI routing — separating concept / guide /
reference / tutorial pages gives AI a structural signal about the type of answer
a page contains. Its limit is that it was designed around human cognitive modes,
not knowledge structure. It provides no relationship signals AI would benefit
from: prerequisites, related concepts, which APIs a page covers.
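
As an illustration of the relationship signals named above, the following sketch models them as a small per-page record that serializes to JSON. The PageSignals class and all of its field names are hypothetical; nothing like it exists in the repo today.

```python
import json
from dataclasses import asdict, dataclass, field

# Page types as this RFC names them.
PAGE_TYPES = {"concept", "guide", "reference", "tutorial"}


@dataclass
class PageSignals:
    """Hypothetical per-page record; field names are illustrative only."""

    path: str
    doc_type: str  # structural signal Diataxis already provides
    prerequisites: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    apis: list[str] = field(default_factory=list)


def to_json(signals: PageSignals) -> str:
    """Serialize a record, rejecting page types outside the known set."""
    if signals.doc_type not in PAGE_TYPES:
        raise ValueError(f"unknown doc_type: {signals.doc_type!r}")
    return json.dumps(asdict(signals), indent=2, sort_keys=True)
```

All relationship fields default to empty lists, so a page with no extra metadata still produces a valid record.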

GraphRAG is cost-viable at this scale (~$1–5 one-time indexing for ~100
pages) but the query workload has to justify it. Useful for cross-cutting
questions ("how does auth work across the system"); overkill for lookup queries.
With Kapa.AI handling the retrieval layer, a separate GraphRAG implementation
would overlap significantly — revisit if Kapa.AI proves insufficient for
complex cross-cutting queries.

On-demand query-time generation is not mature. DeepWiki (Cognition)
pre-generates pages and then retrieves them; it does not generate at query
time. We found no shipping product with documented results for pure query-time
generation. The ground truth layer is the right investment now regardless of
which consumption model (retrieval or query-time generation) dominates later.

Current state

The plugin (plugins/astro-agent-docs.mjs) generates llms.txt, clean .md
endpoints, an agent-signaling blockquote in HTML, and a sitemap alias. This is a
solid foundation.

Key gaps:

  • cleanMarkdown() strips all YAML frontmatter — agents and Kapa.AI see
    only title + body, no metadata
  • No instructions section in llms.txt
  • No content negotiation (Accept: text/markdown)
  • Stub pages create dead entries in the discovery index
  • No journey-aligned ordering in llms.txt (currently site taxonomy)
  • No relationship signals (prerequisites, related pages, API surface per page)
  • No AI-optimization guidance in the content authoring workflow

Proposed action items

Tier 1 — Low effort, high impact

  • Run the agentdocsspec.com compliance checker and fix failures
  • Add content negotiation (Accept: text/markdown → Content-Type: text/markdown)
  • Add an instructions section to llms.txt with ICP-specific AI
    directives: never dfx, preferred APIs, deprecation signals
  • Selectively pass title and description through cleanMarkdown(); frontmatter
    is currently stripped entirely, so Kapa.AI and agents receive no metadata
  • Exclude stub pages from llms.txt until they have real content — dead
    entries degrade retrieval quality for both agents and Kapa.AI
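
The content negotiation item can be sketched as a small decision function. This is a simplified illustration with hypothetical function names: it serves Markdown whenever text/markdown appears in the Accept header with a non-zero q-value, and it does not rank competing media types the way full HTTP content negotiation would. How this hooks into the site's hosting layer is not specified in this RFC.

```python
def wants_markdown(accept_header: str) -> bool:
    """True if Accept lists text/markdown with a non-zero q-value."""
    for item in accept_header.split(","):
        media, *params = [part.strip() for part in item.split(";")]
        if media.lower() != "text/markdown":
            continue
        quality = 1.0  # q defaults to 1 when omitted
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # malformed q: treat as not acceptable
        return quality > 0
    return False


def negotiate(accept_header: str, html: str, markdown: str) -> tuple[dict[str, str], str]:
    """Return (response headers, body) for a docs page request."""
    # Vary: Accept tells caches that the response depends on the Accept header.
    headers = {"Vary": "Accept"}
    if wants_markdown(accept_header):
        headers["Content-Type"] = "text/markdown; charset=utf-8"
        return headers, markdown
    headers["Content-Type"] = "text/html; charset=utf-8"
    return headers, html
```

Browsers do not send text/markdown in their Accept header, so human visitors would keep receiving HTML under this rule.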

Tier 2 — Medium effort, medium term

  • Add optional agent-optimized frontmatter fields: prerequisites,
    category, entities — not required authoring overhead, but enables
    richer indexing when populated
  • Reorder llms.txt entries by developer journey rather than site
    hierarchy — ordering functions as a priority signal for models
  • Add explicit AI-optimization guidance to the content authoring workflow:
    reference pages prefer tables over prose, ≤50K characters per page,
    ≤25% generic section headers ("Overview", "Introduction")
  • Generate a per-page JSON sidecar with structured metadata alongside the
    .md endpoint — relationship signals, entities, prerequisites
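
The two numeric authoring guidelines above lend themselves to an automated lint. The following is a minimal sketch with a hypothetical lint_page function. It assumes the page title (H1) counts as a section header, treats only "Overview" and "Introduction" as generic (the two examples named above), and would also count `#` lines inside fenced code blocks, which a real linter should skip.

```python
import re

MAX_CHARS = 50_000
MAX_GENERIC_RATIO = 0.25
# Only the two examples named above; extend as needed.
GENERIC_HEADERS = {"overview", "introduction"}

# ATX-style headings: one to six '#' characters, then the heading text.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def lint_page(markdown: str) -> list[str]:
    """Return human-readable problems; an empty list means the page passes."""
    problems = []
    if len(markdown) > MAX_CHARS:
        problems.append(f"page is {len(markdown)} characters (limit {MAX_CHARS})")
    headings = [h.lower() for h in HEADING_RE.findall(markdown)]
    if headings:
        generic = sum(h in GENERIC_HEADERS for h in headings)
        if generic / len(headings) > MAX_GENERIC_RATIO:
            problems.append(f"{generic} of {len(headings)} section headers are generic")
    return problems
```

Run as a warning rather than a hard failure, a check like this would surface issues without adding authoring burden, which is consistent with the non-goals below.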

Tier 3 — Larger investment, revisit later

  • Hybrid GraphRAG layer — evaluate if Kapa.AI proves insufficient for
    cross-cutting queries once content volume is substantial
  • Check whether Starlight auto-generates llms-full.txt — if so, serve
    it passively; useful for humans manually ingesting docs into an LLM
    context window, not an agent optimization priority

Non-goals

  • Building AI tooling (generation systems, MCP server, custom retrieval)
  • Changing the Diataxis content structure or Markdown-first authoring workflow
  • Any change that increases authoring burden for content contributors

Open questions

  1. Should stub pages be excluded from llms.txt entirely until they have real
    content, or is a stub signal (explicit marker) better than absence?
  2. For the instructions section: who owns it, and what's the process for
    keeping AI directives accurate as APIs evolve?
  3. Which Kapa.AI ingestion path will be used — sitemap crawl, .md endpoints,
    or GitHub integration? This determines which Tier 1 items are highest
    priority.
  4. Does the Tier 2 per-page JSON sidecar overlap with what Kapa.AI builds
    internally, or is there a case for exposing it publicly?

Labels: enhancement
