[Mate] Add Knowledge bridge to crawl official documentation#2069

Draft
wachterjohannes wants to merge 6 commits into symfony:main from wachterjohannes:feature/mate-knowledge-bridge

wachterjohannes commented May 10, 2026

Contributor
| Q            | A   |
| ------------ | --- |
| Bug fix?     | no  |
| New feature? | yes |
| Docs?        | yes |
| Issues       | -   |
| License      | MIT |

Summary

Introduces a new Mate bridge — symfony/ai-knowledge-mate-extension — that gives agents structured access to official documentation through MCP tools (browse TOC, read pages, substring search). The goal is to replace "guessing from training data" with crawling the actual docs.

This bridge does not ship semantic / vector / RAG search. A SearcherInterface extension seam is provided so a future implementation can plug in embedding-based search without changing the tool surface.

The Symfony bridge ships a built-in SymfonyDocsProvider for symfony/symfony-docs that registers itself when the Knowledge bridge is also installed (interface_exists() guard). The docs branch to clone is auto-detected from the host application's installed Symfony version via Composer\InstalledVersions (probing symfony/framework-bundle → symfony/runtime → symfony/http-kernel → symfony/dependency-injection) and can be pinned explicitly with ai_mate_symfony.docs_branch.
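
The auto-detection described above could look roughly like the following. This is a minimal sketch, not the bridge's actual code: `docsBranchFromVersion()` and `detectDocsBranch()` are hypothetical helpers, and the rule "MAJOR.MINOR of the installed version is the docs branch" is an assumption. The version lookup is passed in as a callable so the logic can be exercised without Composer.

```php
<?php

declare(strict_types=1);

/**
 * Maps a version string such as "v7.3.1" to a docs branch such as "7.3".
 * Hypothetical helper: the bridge's real mapping rules may differ.
 */
function docsBranchFromVersion(string $prettyVersion): ?string
{
    // Accept an optional "v" prefix, keep MAJOR.MINOR, ignore the rest.
    if (1 === preg_match('/^v?(\d+)\.(\d+)/', $prettyVersion, $m)) {
        return $m[1].'.'.$m[2];
    }

    return null; // e.g. "dev-main": no numeric branch can be derived
}

/**
 * Probes packages in order and returns the branch of the first one that resolves.
 *
 * @param list<string>              $packages
 * @param callable(string): ?string $versionOf returns a version string, or null when not installed
 */
function detectDocsBranch(array $packages, callable $versionOf, ?string $pinned = null): ?string
{
    if (null !== $pinned) {
        return $pinned; // an explicit pin always wins over auto-detection
    }

    foreach ($packages as $package) {
        $version = $versionOf($package);
        $branch = null === $version ? null : docsBranchFromVersion($version);
        if (null !== $branch) {
            return $branch;
        }
    }

    return null; // nothing detected; the caller has to pick a default
}
```

In a real application the callable could wrap `Composer\InstalledVersions::isInstalled()` and `Composer\InstalledVersions::getPrettyVersion()`, and `$pinned` would carry the value of `ai_mate_symfony.docs_branch`.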

Tools

| Tool             | Purpose                                                                                |
| ---------------- | -------------------------------------------------------------------------------------- |
| knowledge-toc    | Without arguments: list providers. With a provider: browse its TOC at path (or root).  |
| knowledge-read   | Read a documentation page split into RST sections (response is capped; see below).     |
| knowledge-search | Case-insensitive substring search across a provider's chunks (limit capped at 50).     |
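
To make the search semantics concrete, here is a minimal sketch of a capped, case-insensitive substring search that reports truncation. The chunk shape (`path`, `title`, `content`), the function name `searchChunks()` and the default limit of 10 are assumptions for illustration; only the cap of 50 comes from the description above.

```php
<?php

declare(strict_types=1);

const MAX_SEARCH_LIMIT = 50; // cap stated in the tool description

/**
 * @param list<array{path: string, title: string, content: string}> $chunks hypothetical chunk shape
 *
 * @return array{results: list<array{path: string, title: string}>, truncated: bool}
 */
function searchChunks(array $chunks, string $query, int $limit = 10): array
{
    if ('' === $query) {
        return ['results' => [], 'truncated' => false];
    }

    $limit = max(1, min($limit, MAX_SEARCH_LIMIT)); // clamp to 1..50
    $results = [];
    $truncated = false;

    foreach ($chunks as $chunk) {
        // stripos() is case-insensitive for ASCII; mb_stripos() would cover multibyte text.
        if (false === stripos($chunk['title']."\n".$chunk['content'], $query)) {
            continue;
        }
        if (count($results) === $limit) {
            $truncated = true; // at least one more match exists beyond the cap
            break;
        }
        $results[] = ['path' => $chunk['path'], 'title' => $chunk['title']];
    }

    return ['results' => $results, 'truncated' => $truncated];
}
```

Returning a `truncated` flag alongside the results lets an agent tell "no more matches" apart from "more matches exist, narrow the query".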

Behavior

  • The first call for a provider clones the source repo (git clone --depth 1) into the local cache.
  • Subsequent calls hit the cache.
  • The cache auto-refreshes once it's older than ai_mate_knowledge.cache_ttl_seconds (default 24h).
  • Section-based chunking reuses Symfony\AI\Store\Document\Loader\RstLoader so chunking semantics stay aligned with the Store component.
  • Concurrent ensure() calls are safe: a per-provider flock serializes work, and the toc/chunks/metadata JSON files are written atomically (temp + rename).
  • Provider names must match ^[a-z0-9][a-z0-9_-]{0,63}$ so they can't escape the cache dir or carry shell metacharacters.
  • A metadata.json is written next to the cache artifacts (provider, synced_at, chunk_count, git revision when available) for debugging.
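
The name validation, TTL check, per-provider lock and atomic write listed above can be sketched as follows. All function names and the lock-file layout are hypothetical; this illustrates the techniques, not `KnowledgeCache` itself.

```php
<?php

declare(strict_types=1);

function assertValidProviderName(string $name): void
{
    // Pattern quoted above; the added "D" modifier stops "$" from matching before a trailing newline.
    if (1 !== preg_match('/^[a-z0-9][a-z0-9_-]{0,63}$/D', $name)) {
        throw new InvalidArgumentException(sprintf('Invalid provider name "%s".', $name));
    }
}

/** A cache that was never synced, or is older than the TTL, needs a refresh. */
function isStale(?int $syncedAt, int $ttlSeconds, int $now): bool
{
    return null === $syncedAt || ($now - $syncedAt) > $ttlSeconds;
}

function writeJsonAtomically(string $file, array $data): void
{
    // Create the temp file in the target directory so rename() stays on one filesystem.
    $tmp = tempnam(dirname($file), 'tmp');
    if (false === $tmp) {
        throw new RuntimeException('Could not create a temporary file.');
    }

    file_put_contents($tmp, json_encode($data, JSON_THROW_ON_ERROR | JSON_PRETTY_PRINT));

    // On POSIX filesystems rename() replaces the target in one step,
    // so readers see either the old file or the new one, never a partial write.
    if (!rename($tmp, $file)) {
        unlink($tmp);
        throw new RuntimeException(sprintf('Could not write "%s".', $file));
    }
}

/** Runs $work while holding an exclusive lock for one provider. */
function withProviderLock(string $cacheDir, string $name, callable $work): mixed
{
    assertValidProviderName($name);

    $handle = fopen($cacheDir.'/'.$name.'.lock', 'c'); // "c": create if missing, never truncate
    if (false === $handle) {
        throw new RuntimeException('Could not open the lock file.');
    }

    try {
        flock($handle, LOCK_EX); // blocks until other processes release the lock

        return $work();
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}
```

A caller would typically re-check `isStale()` after acquiring the lock, since another process may have refreshed the cache while this one was waiting.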

Adding a custom provider

use Symfony\AI\Mate\Bridge\Knowledge\Provider\DocsProviderInterface;
use Symfony\AI\Mate\Bridge\Knowledge\Service\GitFetcher;

final class MyDocsProvider implements DocsProviderInterface
{
    public function __construct(private GitFetcher $fetcher) {}

    public function getName(): string { return 'my-docs'; }
    public function getTitle(): string { return 'My Docs'; }
    public function getDescription(): string { return 'My project documentation'; }
    public function getFormat(): string { return 'rst'; }

    public function sync(string $cacheDir): string
    {
        $repo = $cacheDir.'/docs';
        $this->fetcher->fetch('https://github.com/me/docs.git', 'main', $repo);

        return $repo.'/index.rst';
    }
}

Tag the service with ai_mate.knowledge_provider.
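
One way to apply the tag, assuming services are declared through Symfony's PHP configuration format. The file location and the use of autowiring are assumptions; adapt them to how your Mate setup registers services.

```php
<?php

use Symfony\Component\DependencyInjection\Loader\Configurator\ContainerConfigurator;

return static function (ContainerConfigurator $container): void {
    $container->services()
        ->set(MyDocsProvider::class)
            ->autowire() // assumes GitFetcher is itself available as a service
            ->tag('ai_mate.knowledge_provider');
};
```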

Future-proofing

The chunk model (PageChunk) already matches what Symfony\AI\Store\Document\Vectorizer consumes, and a SearcherInterface seam is exposed so an embedding-based searcher can replace KeywordSearcher without changing the tool surface. An indexer seam (over the JSON cache step) would be the other half — left out for now to keep scope tight (YAGNI).
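
The idea behind the seam can be illustrated with a small sketch. The interface signature below is hypothetical (the PR description does not show the real `SearcherInterface`), and `TitleSearcher` is only a stand-in for a future alternative such as an embedding-based searcher.

```php
<?php

declare(strict_types=1);

// Hypothetical signature, redeclared here for illustration only.
interface SearcherInterface
{
    /**
     * @param list<array{title: string, content: string}> $chunks
     *
     * @return list<string> titles of matching chunks
     */
    public function search(array $chunks, string $query, int $limit): array;
}

final class KeywordSearcher implements SearcherInterface
{
    public function search(array $chunks, string $query, int $limit): array
    {
        $hits = [];
        foreach ($chunks as $chunk) {
            if ('' !== $query && false !== stripos($chunk['content'], $query)) {
                $hits[] = $chunk['title'];
            }
        }

        return array_slice($hits, 0, $limit);
    }
}

// Stand-in for a different strategy; it matches titles instead of content.
final class TitleSearcher implements SearcherInterface
{
    public function search(array $chunks, string $query, int $limit): array
    {
        $hits = [];
        foreach ($chunks as $chunk) {
            if ('' !== $query && false !== stripos($chunk['title'], $query)) {
                $hits[] = $chunk['title'];
            }
        }

        return array_slice($hits, 0, $limit);
    }
}

// The tool depends on the interface only, so swapping searchers leaves its surface unchanged.
final class SearchTool
{
    public function __construct(private SearcherInterface $searcher)
    {
    }

    public function __invoke(array $chunks, string $query, int $limit = 10): array
    {
        return $this->searcher->search($chunks, $query, min($limit, 50));
    }
}
```

Because callers only ever see `SearchTool`, replacing the injected searcher changes ranking and matching behavior without touching the tool's arguments or response shape.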

Test plan

  • Knowledge bridge: 39/39 PHPUnit tests pass (services, models, registry, tools, TTL behavior, atomic writes, metadata, provider-name validation, RST edge cases — title aliases, absolute entries, missing files, duplicate entries, glob entries)
  • Symfony bridge: 118/118 tests pass (114 pre-existing + 3 SymfonyDocsProvider unit tests covering auto-detection vs. explicit-branch override + 1 end-to-end integration test wiring SymfonyDocsProvider → ProviderRegistry → KnowledgeCache → all three MCP tools against a local bare git repo)
  • PHPStan clean on Knowledge bridge
  • doctor-rst clean on docs change
  • End-to-end: SymfonyDocsProviderIntegrationTest exercises a real git clone from a local bare repo through every tool (knowledge-toc, knowledge-read, knowledge-search) and asserts the cache artifacts (toc.json, chunks.json, metadata.json) land on disk with the expected shape.

Introduces a new Mate bridge (`symfony/ai-knowledge-mate-extension`) that
exposes pluggable documentation providers as MCP tools so agents can
crawl official docs structurally instead of guessing from training data.

Tools shipped:

 * `knowledge-toc` — without args lists registered providers; with a
   provider browses its TOC at the given path
 * `knowledge-read` — reads a documentation page split into RST sections
 * `knowledge-search` — case-insensitive substring search across a
   provider's chunks

Providers implement `DocsProviderInterface` and register as services
tagged `ai_mate.knowledge_provider`. The first call clones the source
repository (via `git clone --depth 1`) into the local Mate cache; the
cache is auto-refreshed once it is older than
`ai_mate_knowledge.cache_ttl_seconds` (default 24h).

Section-based chunking reuses `Symfony\AI\Store\Document\Loader\RstLoader`
so chunking semantics stay aligned with the Store component.

The Symfony bridge ships a built-in `SymfonyDocsProvider` for
https://github.com/symfony/symfony-docs that registers itself when the
Knowledge bridge is also installed (guarded via `interface_exists()`).
 * Make `KnowledgeCache::ensure()` safe under concurrent processes via per-provider `flock`; write JSON artifacts atomically (temp + rename)
 * Validate provider names against `^[a-z0-9][a-z0-9_-]{0,63}$` to keep them safe as cache directory components
 * Auto-detect the Symfony docs branch from the host's installed Symfony version (`Composer\InstalledVersions`); expose `ai_mate_symfony.docs_repository_url` and `ai_mate_symfony.docs_branch` (null = auto) for explicit overrides
 * Cap `knowledge-search` results at 50 and `knowledge-read` total response size; report truncation in the response payload
 * Write `metadata.json` next to the cache (provider, `synced_at`, `chunk_count`, git revision when available)
 * Add `SearcherInterface` extension seam so the substring `KeywordSearcher` can be replaced (e.g. with a future vector-search implementation) without changing the tool surface
 * Clarify wording in README/INSTRUCTIONS/composer description: "structured access to official documentation"; explicitly disclaim semantic/RAG search
 * Add edge-case TOC tests covering `Title <path>` aliases, absolute toctree entries, missing files, duplicate entries and glob patterns
 * Add end-to-end integration test wiring `SymfonyDocsProvider` -> `ProviderRegistry` -> `KnowledgeCache` -> all three MCP tools against a local bare git repo
 * Register Knowledge bridge in splitsh.json so the bridge-splitsh validator accepts the new package
 * Quote the value in `GitFetcher`'s sync-failed exception message (Fabbot)
 * Drop the explicit `../Knowledge` path repository from the Symfony bridge composer.json: `.github/build-packages.php` already wires absolute path repos for AI packages, and the relative path no longer resolves after the CI's "Isolate Bridge" step moves the bridge to `tmp/`
 * Add required bridge files (.gitignore, .gitattributes, .github/PULL_REQUEST_TEMPLATE.md, .github/workflows/close-pull-request.yml) so the bridge-files validator passes
 * Update LICENSE copyright year to 2026 (matches `CURRENT_YEAR` check for newly added LICENSE files)
 * Quote the available-providers list in `ProviderRegistry::get()`'s exception message (Fabbot)
 * Tighten `symfony/ai-store` constraint to `^0.8` so the chunk builder's `RstLoader` dependency resolves; mark the Symfony bridge integration test as skipped when `RstLoader` is unavailable so `--prefer-lowest` builds don't fail
@wachterjohannes marked this pull request as ready for review May 19, 2026 20:11
@carsonbot added the Feature, Mate, and Status: Needs Review labels May 19, 2026
Pre-existing issue on main, also tracked in symfony#2100 and symfony#2103. Inlined here
so this PR's CI goes green without waiting for either to merge first;
drop this commit if one of them lands first.
@wachterjohannes marked this pull request as draft May 19, 2026 20:21