feat(indico): scraper, MCP sidecar, and on-demand ingest tooling#564
Open
hassan11196 wants to merge 10 commits into
849f5d7 to
136def1
Compare
Adds a containerized mcp4indico server using the generic MCP sidecar mechanism (PR archi-physics#557): Dockerfile pinned to upstream commit 800c5fc3 with a small entrypoint wrapper that initializes API client globals from env vars at HTTP-app import time. Ships an indico skill, a worked example config block, and a redo.sh modeled on the existing smoke-test flow.
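A minimal sketch of what "initializes API client globals from env vars" could look like. The variable names `INDICO_BASE_URL` and `INDICO_API_TOKEN` and the helper `load_indico_settings` are hypothetical; the actual wrapper and the variables mcp4indico reads may differ.

```python
import os

# Hypothetical variable names; the real entrypoint wrapper may read different ones.
BASE_URL_VAR = "INDICO_BASE_URL"
TOKEN_VAR = "INDICO_API_TOKEN"


def load_indico_settings(env=None):
    """Read client settings from an environment mapping; fail fast if the base URL is missing."""
    env = os.environ if env is None else env
    base_url = env.get(BASE_URL_VAR, "").strip().rstrip("/")
    if not base_url:
        raise RuntimeError(f"{BASE_URL_VAR} must be set before the HTTP app is imported")
    return {"base_url": base_url, "token": env.get(TOKEN_VAR, "").strip()}


# In a wrapper module, a module-level assignment like this would run at import time:
# SETTINGS = load_indico_settings()
```

Failing at import time surfaces a misconfigured container at startup rather than on the first tool call.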
Adds support for scraping Indico events and meeting materials, alongside the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)
- Fetches event metadata, contributions, and slide attachments via the Indico REST API.
- Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips embedded <latexit> blocks that inflate chunk counts on formula-heavy slides (slide_converter.py).
- Deduplicates attachments when the same slides are uploaded in multiple formats (e.g. PDF + PPTX): keeps the higher-priority format.
- Detects SSO-protected events and authenticates via CERNSSOScraper.
- Stores speaker affiliation alongside speaker name in resource metadata.

ScraperManager integration
- collect_indico() / schedule_collect_indico() hooks.
- URL routing: explicit "indico-" prefix in weblists, plus auto-detection for URLs with "indico" in the hostname and /event/ in the path.
- Indico documents use source_type="web" (matching the existing CHECK constraint) with a "scraper": "indico" metadata field for filtering.

Vectorstore
- Prepends a one-line metadata header to each Indico chunk (event title, date, contribution, speaker, affiliation, start time, duration, session) so BM25 retrieval can match on speaker name, time of day, etc. Gated on metadata "scraper"="indico"; no other sources affected.

Config / docs / examples
- base-config.yaml: indico source block (disabled by default).
- docs/docs/data_sources.md: Indico section.
- examples/agents/indico-assistant.md: agent spec for Indico queries.
- examples/deployments/basic-agent/indico_example.list: example weblist.
- SourceRegistry: register "indico" source (depends on links).

Dependencies
- pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations
- Images and figures in slides are not extracted or described; only text content is converted to Markdown. Slides that communicate primarily through plots/diagrams will produce thin or empty chunks.
- LaTeX in slides: embedded <latexit> blocks are stripped (they are base64-encoded and useless for retrieval), but inline LaTeX notation and formula-heavy slides may still produce low-quality Markdown that chunks poorly.
- Slide context is per-page, not per-deck: each chunk comes from one page/section of the converted Markdown. There is no cross-slide summarisation, so a narrative that spans multiple slides may be split across chunks without connecting context.
- Category URLs (/category/<id>/) are handled in the code but not yet tested end-to-end; only event URLs are documented.
- SSO authentication is CERN-specific (CERNSSOScraper). Other Indico instances with different login flows would need a different auth path.
- No rate limiting or incremental scraping; large events with many attachments are processed in a single run.
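The multi-format deduplication and the per-chunk metadata header described in this commit can be sketched roughly as follows. This is an illustration only: the priority order (PDF first), the metadata key names, the `" | "` separator, and both helper names are assumptions, not taken from indico_scraper.py or the vectorstore code.

```python
from pathlib import PurePosixPath

# Assumed priority order (lower number wins); the commit only says "higher-priority format".
FORMAT_PRIORITY = {".pdf": 0, ".pptx": 1, ".ppt": 2, ".odp": 3}

# Hypothetical metadata keys for the header fields listed in the commit message.
HEADER_FIELDS = ("event_title", "date", "contribution", "speaker",
                 "affiliation", "start_time", "duration", "session")


def dedupe_attachments(filenames):
    """Keep one slide file per stem, preferring the higher-priority format."""
    best = {}
    for name in filenames:
        path = PurePosixPath(name)
        ext = path.suffix.lower()
        if ext not in FORMAT_PRIORITY:
            continue  # this sketch only considers the slide formats named above
        stem = path.stem.lower()
        kept = best.get(stem)
        if kept is None or FORMAT_PRIORITY[ext] < FORMAT_PRIORITY[PurePosixPath(kept).suffix.lower()]:
            best[stem] = name
    return list(best.values())


def with_metadata_header(chunk_text, metadata):
    """Prepend a one-line header, but only for chunks tagged scraper == "indico"."""
    if metadata.get("scraper") != "indico":
        return chunk_text
    parts = [f"{key}: {metadata[key]}" for key in HEADER_FIELDS if metadata.get(key)]
    if not parts:
        return chunk_text
    return " | ".join(parts) + "\n" + chunk_text
```

For example, `dedupe_attachments(["talk.pptx", "talk.pdf", "notes.odp"])` keeps `talk.pdf` and `notes.odp` under the assumed ordering, and a chunk whose metadata lacks `"scraper": "indico"` passes through `with_metadata_header` unchanged.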
The /document_index/upload_url endpoint unconditionally called collect_links, scraping Indico event pages as plain HTML. With PR archi-physics#550's IndicoScraper now on this branch, dispatch via _is_indico_url so a single agent-side POST of an Indico event URL ingests via the API + slide-conversion path instead of generic HTML scraping. collect_indico gains an int return for parity with collect_links.
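A rough sketch of the dispatch, assuming the auto-detection rule stated in the scraper commit ("indico" in the hostname and /event/ in the path). The names `looks_like_indico_url` and `dispatch_upload` are illustrative stand-ins; the real `_is_indico_url` and the endpoint handler may be implemented differently.

```python
from urllib.parse import urlparse


def looks_like_indico_url(url):
    """Illustrative stand-in for _is_indico_url: hostname contains "indico", path contains /event/."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return "indico" in host and "/event/" in parsed.path


def dispatch_upload(url, collect_indico, collect_links):
    """Route a URL to one of two collectors; both are expected to return an int count."""
    handler = collect_indico if looks_like_indico_url(url) else collect_links
    return handler(url)
```

Having both collectors return an int lets the endpoint report a count without caring which path handled the URL.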
Wraps the data-manager's POST /document_index/upload_url so the agent can ask archi to scrape and index a URL it has just discovered (e.g. an Indico event URL surfaced by INDICO_get_files). The endpoint dispatches Indico URLs through IndicoScraper and falls back to LinkScraper for the rest, so the tool stays URL-agnostic. Skill text instructs the agent to chain INDICO_get_files(download_files=false) -> ingest_url(event_url) -> search_vectorstore_hybrid when slide contents are needed.
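As a sketch of what the tool's HTTP call might look like: the snippet below only builds the request object and does not send it. The JSON body with a `url` field, the `Content-Type` header, and the helper name are assumptions; the endpoint's actual payload format is not stated in this PR.

```python
import json
from urllib.request import Request


def build_upload_url_request(data_manager_base, target_url):
    """Build (but do not send) a POST to /document_index/upload_url.

    Assumption: the endpoint accepts a JSON body with a "url" field.
    """
    body = json.dumps({"url": target_url}).encode("utf-8")
    return Request(
        data_manager_base.rstrip("/") + "/document_index/upload_url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(req)`; the tool stays URL-agnostic because the server side decides which scraper handles the URL.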
redo.sh is a developer-local smoke-test script and should not live in the repo. It was previously committed in the integration commit (now purged from history via filter-branch); keeping it gitignored prevents future re-additions.
- Add shared volume for Indico downloads in Dockerfile
- Implement ingest_indico_event tool for processing Indico event attachments
- Enhance CMSCompOpsAgent to utilize shared downloads directory
- Add ingest_local_path endpoint to Flask app for directory ingestion
- Update LocalFileManager to support directory ingestion from shared volumes
- Modify data manager to handle new ingestion logic and metadata
… fallback support
4125ce1 to
76d35e2
Compare
Dedicated skill for indico MCP
Keeping markitdown[pdf,pptx]>=0.1.0 in pyproject.toml is enough — the
data-manager and chat Dockerfiles install it via `pip install .`, and
only the data-manager image actually uses it (slide_converter +
indico_scraper). Having it in requirements-base.txt instead caused two
problems on this PR (head from a fork):
* build-base-images was triggered (requirements-base.txt is one of the
watched paths) and failed at the Docker Hub push step because
DOCKERHUB_* secrets are not exposed to fork PRs.
* unit-tests' Install dependencies step took ~16 min on the runner
installing the heavy markitdown transitive deps (onnxruntime,
magika, pdfplumber, ...) and exited 1.
Removing the duplicate line skips build-base-images entirely on this PR
(detect step writes changed=false) and shrinks the unit-tests install
back to its prior size.
The new Indico MCP-sidecar subsection added two relative links pointing at ../../mcp/indico/README.md. With docs_dir=docs (default) the rendered page lives under site/data_sources/, so the relative path resolves to mcp/indico/README.md outside the docs tree — mkdocs cannot find the target and `mkdocs build --config-file docs/mkdocs.yml --strict` (used by the PR-preview `Verify MkDocs build` step) exits 1. Switch the two links to absolute GitHub URLs (same convention used in docs/docs/install.md and docs/docs/troubleshooting.md) so mkdocs no longer tries to resolve them as in-tree paths.
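The path arithmetic behind the broken links can be checked in a few lines. This sketch assumes the links are resolved relative to the directory of the source page, docs/docs/data_sources.md:

```python
import posixpath

page_dir = "docs/docs"                    # directory of data_sources.md
link = "../../mcp/indico/README.md"       # relative link used in the new subsection

resolved = posixpath.normpath(posixpath.join(page_dir, link))
inside_docs_tree = resolved.startswith(page_dir + "/")

print(resolved, inside_docs_tree)  # mcp/indico/README.md False
```

The target lands at the repository root, outside the docs tree, which is why an absolute GitHub URL is the safer form for files that are not part of the documentation source.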
Summary
Adds Indico as a first-class data source, end-to-end:
- Indico scraper (`bcd4c078`, originally @livaage's work; PR "feat(data-manager): add Indico scraper integration" #550 was not merged upstream, so it ships here): fetches events/contributions/attachments via the Indico REST API, converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown, dedups multi-format uploads, authenticates SSO-protected events via `CERNSSOScraper`, and prepends per-chunk metadata headers (event/speaker/affiliation/time) to improve BM25 retrieval. Adds `markitdown[pdf,pptx]` to deps.
- `mcp4indico` sidecar (`93e994dd`): containerized Indico MCP server using the generic MCP sidecar mechanism from PR "Generic MCP sidecar + domain-knowledge skills + tool-use observability" #557 (now on main). Ships a Dockerfile pinned to upstream `800c5fc3`, an entrypoint wrapper that initializes API client globals from env vars, an Indico skill, and a worked example config block.
- On-demand ingest agent tools:
  - `ingest_url` (`b48b1554`, `76d35e20`): wraps `POST /document_index/upload_url` so the agent can scrape & index any URL it discovers. Dispatches Indico URLs through `IndicoScraper` (`288494e4`), falls back to `LinkScraper` for the rest. Adds configurable routing rules and SSO fallback in the latest revision.
  - `ingest_indico_event` (`5c1f2b9d`, `a43a819e`): ingests Indico event attachments on demand via a shared downloads volume, with an `ingest_local_path` Flask endpoint and `LocalFileManager` directory-ingestion support.

Suggested agent flow
`INDICO_get_files(download_files=false)` → `ingest_url(event_url)` (or `ingest_indico_event`) → `search_vectorstore_hybrid` once slides are indexed.

Known limitations (Indico scraper)
- `<latexit>` blocks are stripped; inline LaTeX may still chunk poorly.
- Category URLs (`/category/<id>/`) are coded but not e2e-tested.
- SSO authentication is CERN-specific (`CERNSSOScraper`); other Indico instances need a different auth path.

Example config
The PR ships an example config block in `mcp/indico/README.md`; below is a working snippet drawn from a live `base_config_indico.yaml` deployment that exercises both the sidecar and the agent-tool routing.

MCP sidecar
Declare the Indico MCP server under `mcp_servers`. The `build_context` points at this PR's `mcp/indico/` directory so the sidecar is built locally; tokens come from env (`.env` / secrets manager). The `shared_volume` is mounted read-only into the data-manager at the same path, and the template derives `INDICO_DOWNLOADS_DIR=/shared/indico-downloads` from it so `ingest_indico_event` can pick up downloaded attachments.

ingest_url routing rules + SSO fallback
`ingest_url` evaluates rules in declaration order; first match wins.
- `action: refuse`: return `message` to the agent so it switches to a better tool (used here to bounce Indico event URLs over to `ingest_indico_event`, which authenticates via the MCP sidecar instead of storing a login-redirect page).
- `action: sso_retry`: when `sso_fallback_enabled: true`, the data-manager retries via `CERNSSOScraper` if anonymous `LinkScraper` lands on a Keycloak page. Requires `data_manager.sources.sso.enabled: true` and `SSO_USERNAME` / `SSO_PASSWORD` secrets.

Data-manager SSO source (required for `sso_retry`)
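A minimal sketch of the first-match-wins routing described above, together with a check of the `sso_retry` prerequisites. The `Rule` shape, regex-based matching, and both helper names are assumptions made for illustration; only the action names, `data_manager.sources.sso.enabled`, and the `SSO_USERNAME` / `SSO_PASSWORD` secrets come from the description.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str       # assumed: a regex searched against the URL
    action: str        # "refuse" or "sso_retry", as named in the description
    message: str = ""  # returned to the agent when action == "refuse"


def first_matching_rule(url, rules):
    """Evaluate rules in declaration order and return the first match, or None."""
    for rule in rules:
        if re.search(rule.pattern, url):
            return rule
    return None


def sso_retry_ready(config, env):
    """True only if the SSO source is enabled and both SSO secrets are present."""
    sso = config.get("data_manager", {}).get("sources", {}).get("sso", {})
    has_secrets = bool(env.get("SSO_USERNAME")) and bool(env.get("SSO_PASSWORD"))
    return bool(sso.get("enabled")) and has_secrets


# Hypothetical rule set: the order matters because the second pattern also matches Indico URLs.
rules = [
    Rule(r"indico\.cern\.ch/event/", "refuse", "use ingest_indico_event instead"),
    Rule(r"\.cern\.ch/", "sso_retry"),
]
```

With these rules, an Indico event URL is refused before the broader `sso_retry` rule is ever considered, which is the behaviour "first match wins" implies.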