[Mate] Add Knowledge bridge to crawl official documentation #2069
Draft
wachterjohannes wants to merge 6 commits into
Introduces a new Mate bridge (`symfony/ai-knowledge-mate-extension`) that exposes pluggable documentation providers as MCP tools so agents can crawl official docs structurally instead of guessing from training data.

Tools shipped:
* `knowledge-toc`: without args lists registered providers; with a provider browses its TOC at the given path
* `knowledge-read`: reads a documentation page split into RST sections
* `knowledge-search`: case-insensitive substring search across a provider's chunks

Providers implement `DocsProviderInterface` and register as services tagged `ai_mate.knowledge_provider`. The first call clones the source repository (via `git clone --depth 1`) into the local Mate cache; the cache is auto-refreshed once it is older than `ai_mate_knowledge.cache_ttl_seconds` (default 24h). Section-based chunking reuses `Symfony\AI\Store\Document\Loader\RstLoader` so chunking semantics stay aligned with the Store component.

The Symfony bridge ships a built-in `SymfonyDocsProvider` for https://github.com/symfony/symfony-docs that registers itself when the Knowledge bridge is also installed (guarded via `interface_exists()`).
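The refresh rule above (re-sync once the cache is older than `ai_mate_knowledge.cache_ttl_seconds`) can be illustrated with a small sketch. This is Python for illustration only, not the bridge's PHP code. The function name `is_stale`, and the assumption that the last sync time is stored as a Unix timestamp under a `synced_at` key in a JSON file, are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the 24h default named in the PR


def is_stale(metadata_path: Path,
             ttl_seconds: int = DEFAULT_TTL_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True when the cache is missing, unreadable, or older than the TTL."""
    now = time.time() if now is None else now
    try:
        # Assumption: the sync time is a Unix timestamp stored as "synced_at".
        synced_at = float(json.loads(metadata_path.read_text())["synced_at"])
    except (OSError, ValueError, KeyError, TypeError):
        return True  # no usable timestamp: behave like a first call
    return now - synced_at > ttl_seconds
```

Treating an unreadable or malformed file as stale keeps the failure mode simple: the worst case is one extra clone rather than serving a cache of unknown age.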
* Make `KnowledgeCache::ensure()` safe under concurrent processes via per-provider `flock`; write JSON artifacts atomically (temp + rename)
* Validate provider names against `^[a-z0-9][a-z0-9_-]{0,63}$` to keep them safe as cache directory components
* Auto-detect the Symfony docs branch from the host's installed Symfony version (`Composer\InstalledVersions`); expose `ai_mate_symfony.docs_repository_url` and `ai_mate_symfony.docs_branch` (null = auto) for explicit overrides
* Cap `knowledge-search` results at 50 and cap the total `knowledge-read` response size; report truncation in the response payload
* Write `metadata.json` next to the cache (provider, `synced_at`, `chunk_count`, git revision when available)
* Add `SearcherInterface` extension seam so the substring `KeywordSearcher` can be replaced (e.g. with a future vector-search implementation) without changing the tool surface
* Clarify wording in README/INSTRUCTIONS/composer description: "structured access to official documentation"; explicitly disclaim semantic/RAG search
* Add edge-case TOC tests covering `Title <path>` aliases, absolute toctree entries, missing files, duplicate entries and glob patterns
* Add end-to-end integration test wiring `SymfonyDocsProvider` -> `ProviderRegistry` -> `KnowledgeCache` -> all three MCP tools against a local bare git repo
* Register Knowledge bridge in splitsh.json so the bridge-splitsh validator accepts the new package
* Quote the value in `GitFetcher`'s sync-failed exception message (Fabbot)
* Drop the explicit `../Knowledge` path repository from the Symfony bridge composer.json: `.github/build-packages.php` already wires absolute path repos for AI packages, and the relative path no longer resolves after the CI's "Isolate Bridge" step moves the bridge to `tmp/`
* Add required bridge files (.gitignore, .gitattributes, .github/PULL_REQUEST_TEMPLATE.md, .github/workflows/close-pull-request.yml) so the bridge-files validator passes
* Update LICENSE copyright year to 2026 (matches `CURRENT_YEAR` check for newly added LICENSE files)
* Quote the available-providers list in `ProviderRegistry::get()`'s exception message (Fabbot)
* Tighten `symfony/ai-store` constraint to `^0.8` so the chunk builder's `RstLoader` dependency resolves; mark the Symfony bridge integration test as skipped when `RstLoader` is unavailable so `--prefer-lowest` builds don't fail
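For readers unfamiliar with the pattern behind the `KnowledgeCache::ensure()` hardening listed above (name validation, per-provider `flock`, temp + rename writes), here is a rough sketch. It is Python for illustration only and POSIX-only because of `fcntl`. Only the regular expression is quoted from the PR; the function names, the lock-file location and the `chunks.json` layout under a per-provider directory are assumptions.

```python
import fcntl
import json
import os
import re
import tempfile
from contextlib import contextmanager
from pathlib import Path

# Pattern quoted from the PR; everything else in this sketch is hypothetical.
PROVIDER_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")


def validate_provider_name(name: str) -> str:
    """Reject names that are unsafe as a cache directory component."""
    if not PROVIDER_NAME.fullmatch(name):
        raise ValueError(f'Invalid provider name "{name}".')
    return name


@contextmanager
def provider_lock(cache_dir: Path, name: str):
    """Hold an exclusive advisory lock for one provider."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_dir / f"{name}.lock", "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def write_json_atomically(target: Path, payload) -> None:
    """Write to a temp file in the target's directory, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
        os.replace(tmp_name, target)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def sync(cache_dir: Path, name: str, chunks: list) -> Path:
    """Validate the name, then write the provider's chunks under its lock."""
    validate_provider_name(name)
    with provider_lock(cache_dir, name):
        provider_dir = cache_dir / name
        provider_dir.mkdir(exist_ok=True)
        target = provider_dir / "chunks.json"
        write_json_atomically(target, chunks)
    return target
```

Creating the temp file in the same directory as the target matters: a rename across filesystems is not atomic, so a reader could otherwise observe a half-written file.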
Pre-existing issue on main, also tracked in symfony#2100 and symfony#2103. Inlined here so this PR's CI goes green without waiting for either to merge first; drop this commit if one of them lands first.
Summary
Introduces a new Mate bridge, `symfony/ai-knowledge-mate-extension`, that gives agents structured access to official documentation through MCP tools (browse TOC, read pages, substring search). The goal is to replace "guessing from training data" with crawling the actual docs.

This bridge does not ship semantic / vector / RAG search. A `SearcherInterface` extension seam is provided so a future implementation can plug in embedding-based search without changing the tool surface.

The Symfony bridge ships a built-in `SymfonyDocsProvider` for symfony/symfony-docs that registers itself when the Knowledge bridge is also installed (`interface_exists()` guard). The cloned docs branch is auto-detected from the host application's installed Symfony version via `Composer\InstalledVersions` (probing `symfony/framework-bundle` → `symfony/runtime` → `symfony/http-kernel` → `symfony/dependency-injection`) and can be pinned explicitly with `ai_mate_symfony.docs_branch`.

Tools
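* `knowledge-toc`: without arguments, lists registered providers; with a `provider`, browses its TOC at `path` (or root).
* `knowledge-read`: reads a documentation page split into RST sections.
* `knowledge-search`: case-insensitive substring search across a provider's chunks.

Behavior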
* The first call clones the source repository (via `git clone --depth 1`) into the local cache.
* The cache is auto-refreshed once it is older than `ai_mate_knowledge.cache_ttl_seconds` (default 24h).
* Section-based chunking reuses `Symfony\AI\Store\Document\Loader\RstLoader` so chunking semantics stay aligned with the Store component.
* Concurrent `ensure()` calls are safe: a per-provider `flock` serializes work, and the toc/chunks/metadata JSON files are written atomically (temp + rename).
* Provider names are validated against `^[a-z0-9][a-z0-9_-]{0,63}$` so they can't escape the cache dir or be used as shell metacharacters.
* `metadata.json` is written next to the cache artifacts (provider, `synced_at`, `chunk_count`, git revision when available) for debugging.

Adding a custom provider
Implement `DocsProviderInterface` and tag the service `ai_mate.knowledge_provider`.

Future-proofing
The chunk model (`PageChunk`) already matches what `Symfony\AI\Store\Document\Vectorizer` consumes, and a `SearcherInterface` seam is exposed so an embedding-based searcher can replace `KeywordSearcher` without changing the tool surface. An indexer seam (over the JSON cache step) would be the other half; it is left out for now to keep scope tight (YAGNI).

Test plan
* `SymfonyDocsProvider` unit tests covering auto-detection vs. explicit-branch override, plus 1 end-to-end integration test wiring `SymfonyDocsProvider` → `ProviderRegistry` → `KnowledgeCache` → all three MCP tools against a local bare git repo
* `SymfonyDocsProviderIntegrationTest` exercises a real `git clone` from a local bare repo through every tool (`knowledge-toc` → `knowledge-read` → `knowledge-search`) and asserts the cache artifacts (`toc.json`, `chunks.json`, `metadata.json`) land on disk with the expected shape.
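To make the searcher seam and the `knowledge-search` result cap concrete, here is a rough sketch of how a swappable searcher and a truncation-reporting tool response might fit together. It is Python for illustration only, not the bridge's PHP code. The `Chunk` fields, the `Searcher` protocol signature and the response keys (`results`, `total`, `truncated`) are hypothetical; only the cap of 50 and the case-insensitive substring behavior come from the PR.

```python
from dataclasses import dataclass
from typing import Protocol

MAX_RESULTS = 50  # cap named in the PR


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the PR's PageChunk; the real fields may differ."""
    page: str
    content: str


class Searcher(Protocol):
    """The seam: any object with this method can back the search tool."""

    def search(self, chunks: list[Chunk], query: str) -> list[Chunk]: ...


class KeywordSearcher:
    """Case-insensitive substring match over chunk contents."""

    def search(self, chunks: list[Chunk], query: str) -> list[Chunk]:
        needle = query.casefold()
        return [chunk for chunk in chunks if needle in chunk.content.casefold()]


def knowledge_search(searcher: Searcher, chunks: list[Chunk], query: str,
                     limit: int = MAX_RESULTS) -> dict:
    """Run the searcher, cap the results, and report whether any were dropped."""
    matches = searcher.search(chunks, query)
    return {
        "results": [chunk.page for chunk in matches[:limit]],
        "total": len(matches),
        "truncated": len(matches) > limit,
    }
```

Because `knowledge_search` depends only on the protocol, an embedding-based searcher could be passed in place of `KeywordSearcher` without touching the response shape, which is the property the seam is meant to preserve.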