feat(indico): scraper, MCP sidecar, and on-demand ingest tooling by hassan11196 · Pull Request #564 · archi-physics/archi

hassan11196 · 2026-05-11T13:08:43Z

Summary

Adds Indico as a first-class data source, end-to-end:

- Indico scraper (bcd4c078; originally @livaage's work from PR #550, "feat(data-manager): add Indico scraper integration", which was not merged upstream, so it ships here): fetches events, contributions, and attachments via the Indico REST API; converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; deduplicates multi-format uploads; authenticates SSO-protected events via CERNSSOScraper; and prepends per-chunk metadata headers (event/speaker/affiliation/time) to improve BM25 retrieval. Adds markitdown[pdf,pptx] to deps.
- mcp4indico sidecar (93e994dd): containerized Indico MCP server using the generic MCP sidecar mechanism from PR #557, "Generic MCP sidecar + domain-knowledge skills + tool-use observability" (now on main). Ships a Dockerfile pinned to upstream 800c5fc3, an entrypoint wrapper that initializes API client globals from env vars, an Indico skill, and a worked example config block.
- On-demand ingest agent tools:
  - ingest_url (b48b1554, 76d35e20): wraps POST /document_index/upload_url so the agent can scrape and index any URL it discovers. The endpoint dispatches Indico URLs through IndicoScraper (288494e4) and falls back to LinkScraper for the rest. The latest revision adds configurable routing rules and SSO fallback.
  - ingest_indico_event (5c1f2b9d, a43a819e): ingests Indico event attachments on demand via a shared downloads volume, with an ingest_local_path Flask endpoint and LocalFileManager directory-ingestion support.
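
The multi-format deduplication mentioned in the summary can be illustrated with a minimal sketch. The priority order, the grouping by filename stem, and the names `FORMAT_PRIORITY`, `format_rank`, and `dedup_attachments` are all assumptions made for illustration; the PR only states that the higher-priority format is kept.

```python
import os

# Assumed priority order (lower rank wins). The PR does not state the actual order.
FORMAT_PRIORITY = {".pdf": 0, ".pptx": 1, ".ppt": 2, ".odp": 3}


def format_rank(filename: str) -> int:
    """Rank a file by extension; unknown extensions rank last."""
    ext = os.path.splitext(filename)[1].lower()
    return FORMAT_PRIORITY.get(ext, len(FORMAT_PRIORITY))


def dedup_attachments(filenames: list[str]) -> list[str]:
    """Keep one file per case-insensitive stem, preferring the best-ranked format."""
    best: dict[str, str] = {}
    for name in filenames:
        stem = os.path.splitext(name)[0].lower()
        if stem not in best or format_rank(name) < format_rank(best[stem]):
            best[stem] = name
    return list(best.values())
```

Under these assumptions, `dedup_attachments(["talk.pptx", "talk.pdf", "notes.odp"])` keeps `talk.pdf` and `notes.odp`.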

Suggested agent flow

INDICO_get_files(download_files=false) → ingest_url(event_url) (or ingest_indico_event) → search_vectorstore_hybrid once slides are indexed.

Known limitations (Indico scraper)

Images/diagrams in slides aren't extracted (text-only conversion).
<latexit> blocks are stripped; inline LaTeX may still chunk poorly.
Chunks are per-page, not per-deck — no cross-slide summarization.
Category URLs (/category/<id>/) are coded but not e2e-tested.
SSO is CERN-specific (CERNSSOScraper); other Indico instances need a different auth path.
No rate limiting / incremental scraping.
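
For the `<latexit>` limitation above, a stripping step could look roughly like the following. This is a sketch only: it assumes the blocks appear in the converted Markdown as literal `<latexit ...>...</latexit>` elements, and `strip_latexit` is a hypothetical name, not the function in slide_converter.py.

```python
import re

# Assumption: latexit payloads appear as <latexit ...>...</latexit> and may span lines.
LATEXIT_BLOCK = re.compile(r"<latexit\b[^>]*>.*?</latexit>", re.DOTALL | re.IGNORECASE)


def strip_latexit(markdown: str) -> str:
    """Remove embedded latexit blocks, leaving the surrounding text untouched."""
    return LATEXIT_BLOCK.sub("", markdown)
```

Inline LaTeX such as `$E = mc^2$` passes through unchanged, which matches the caveat that it may still chunk poorly.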

Example config

The PR ships an example config block in mcp/indico/README.md; below is a working snippet drawn from a live base_config_indico.yaml deployment that exercises both the sidecar and the agent-tool routing.

MCP sidecar

Declare the Indico MCP server under mcp_servers. The build_context points at this PR's mcp/indico/ directory so the sidecar is built locally; tokens come from env (.env / secrets manager). The shared_volume is mounted read-only into the data-manager at the same path, and the template derives INDICO_DOWNLOADS_DIR=/shared/indico-downloads from it so ingest_indico_event can pick up downloaded attachments.

mcp_servers:
  indico:
    transport: streamable_http
    url: http://localhost:8012/mcp        # use http://indico-mcp:8012/mcp on bridge networking
    build_context: ./mcp/indico           # path to this PR's sidecar
    env:
      INDICO_BASE_URL: https://indico.cern.ch
      BEARER_TOKEN: ${INDICO_BEARER_TOKEN}
      API_KEY: ${INDICO_API_KEY}
      API_SECRET: ${INDICO_API_SECRET}
    shared_volume: indico-downloads
    skill: indico

`ingest_url` routing rules + SSO fallback

ingest_url evaluates rules in declaration order; the first match wins. Two actions appear in this example:

- `action: refuse` returns `message` to the agent so it switches to a better tool (used here to bounce Indico event URLs over to ingest_indico_event, which authenticates via the MCP sidecar instead of storing a login-redirect page).
- `action: sso_retry` applies when `sso_fallback_enabled: true`: the data-manager retries via CERNSSOScraper if the anonymous LinkScraper lands on a Keycloak page. Requires `data_manager.sources.sso.enabled: true` and the SSO_USERNAME / SSO_PASSWORD secrets.

services:
  chat_app:
    tools:
      ingest_url:
        sso_fallback_enabled: true
        routing_rules:
          # 1) Indico event pages -> ingest_indico_event (auth'd via MCP, no Selenium)
          - pattern: '^https?://[^/]*indico\.[^/]*/event/(?P<event_id>\d+)'
            action: refuse
            scraper: indico_mcp
            message: |
              Error: this URL is an Indico event page (event_id={event_id}).
              `ingest_url` cannot authenticate against CERN SSO and would store the
              login redirect page. Call `ingest_indico_event(event_id="{event_id}")`
              instead — it drives the bearer-authenticated Indico MCP server.
          - pattern: '^https?://[^/]*indico\.[^/]*/export/event/(?P<event_id>\d+)\.json'
            action: refuse
            scraper: indico_mcp
            message: |
              Error: Indico API export (event_id={event_id}). Call
              `ingest_indico_event(event_id="{event_id}")` instead.
          # 2) Any other CERN host -> eligible for Selenium-driven SSO fallback
          - pattern: '^https?://[^/]*\.cern\.ch(/|$)'
            action: sso_retry
            scraper: sso
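
The first-match-wins evaluation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PR's implementation: `ROUTING_RULES` and `route_url` are hypothetical names, the rule dictionaries carry only a subset of the config keys, and the message is shortened. The two patterns are copied from the config above.

```python
import re
from typing import Optional

# Rules are checked in declaration order, so the Indico rule shadows the CERN-wide rule.
ROUTING_RULES = [
    {
        "pattern": r"^https?://[^/]*indico\.[^/]*/event/(?P<event_id>\d+)",
        "action": "refuse",
        "message": 'Call ingest_indico_event(event_id="{event_id}") instead.',
    },
    {
        "pattern": r"^https?://[^/]*\.cern\.ch(/|$)",
        "action": "sso_retry",
    },
]


def route_url(url: str, rules: list[dict] = ROUTING_RULES) -> Optional[dict]:
    """Return the decision of the first matching rule, or None if nothing matches."""
    for rule in rules:
        match = re.search(rule["pattern"], url)
        if match is None:
            continue
        decision = {"action": rule["action"]}
        if "message" in rule:
            # Named groups captured by the pattern fill the message placeholders.
            decision["message"] = rule["message"].format(**match.groupdict())
        return decision
    return None
```

With these rules, `https://indico.cern.ch/event/12345/` is refused with the event ID substituted into the message, even though it would also match the `.cern.ch` pattern, because the Indico rule is declared first.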

Data-manager SSO source (required for `sso_retry`)

data_manager:
  sources:
    links:
      selenium_scraper:
        enabled: true
        use_for_scraping: true
        selenium_class: CERNSSOScraper
        selenium_class_map:
          CERNSSOScraper:
            class: CERNSSOScraper
            kwargs:
              headless: true
              max_depth: 5
    sso:
      enabled: true

Adds a containerized mcp4indico server using the generic MCP sidecar mechanism (PR archi-physics#557): Dockerfile pinned to upstream commit 800c5fc3 with a small entrypoint wrapper that initializes API client globals from env vars at HTTP-app import time. Ships an indico skill, a worked example config block, and a redo.sh modeled on the existing smoke-test flow.

Adds support for scraping Indico events and meeting materials, alongside the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)
- Fetches event metadata, contributions, and slide attachments via the Indico REST API.
- Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips embedded <latexit> blocks that inflate chunk counts on formula-heavy slides (slide_converter.py).
- Deduplicates attachments when the same slides are uploaded in multiple formats (e.g. PDF + PPTX): keeps the higher-priority format.
- Detects SSO-protected events and authenticates via CERNSSOScraper.
- Stores speaker affiliation alongside speaker name in resource metadata.

ScraperManager integration
- collect_indico() / schedule_collect_indico() hooks.
- URL routing: explicit "indico-" prefix in weblists, plus auto-detection for URLs with "indico" in the hostname and /event/ in the path.
- Indico documents use source_type="web" (matching the existing CHECK constraint) with a "scraper": "indico" metadata field for filtering.

Vectorstore
- Prepends a one-line metadata header to each Indico chunk (event title, date, contribution, speaker, affiliation, start time, duration, session) so BM25 retrieval can match on speaker name, time of day, etc. Gated on metadata "scraper"="indico"; no other sources affected.

Config / docs / examples
- base-config.yaml: indico source block (disabled by default).
- docs/docs/data_sources.md: Indico section.
- examples/agents/indico-assistant.md: agent spec for Indico queries.
- examples/deployments/basic-agent/indico_example.list: example weblist.
- SourceRegistry: register "indico" source (depends on links).

Dependencies
- pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations
- Images and figures in slides are not extracted or described; only text content is converted to Markdown. Slides that communicate primarily through plots/diagrams will produce thin or empty chunks.
- LaTeX in slides: embedded <latexit> blocks are stripped (they are base64-encoded and useless for retrieval), but inline LaTeX notation and formula-heavy slides may still produce low-quality Markdown that chunks poorly.
- Slide context is per-page, not per-deck: each chunk comes from one page/section of the converted Markdown. There is no cross-slide summarisation, so a narrative that spans multiple slides may be split across chunks without connecting context.
- Category URLs (/category/<id>/) are handled in the code but not yet tested end-to-end; only event URLs are documented.
- SSO authentication is CERN-specific (CERNSSOScraper). Other Indico instances with different login flows would need a different auth path.
- No rate limiting or incremental scraping; large events with many attachments are processed in a single run.
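
The per-chunk metadata header described in this commit message could be built along these lines. The metadata key names, the header layout, and the function name `add_metadata_header` are assumptions for illustration; only the list of fields and the gate on "scraper"="indico" come from the commit message.

```python
# Hypothetical metadata keys, one per field listed in the commit message.
HEADER_FIELDS = (
    "event_title", "date", "contribution", "speaker",
    "affiliation", "start_time", "duration", "session",
)


def add_metadata_header(chunk_text: str, metadata: dict) -> str:
    """Prepend a one-line header to Indico chunks; leave every other source untouched."""
    if metadata.get("scraper") != "indico":
        return chunk_text
    # Skip fields that are missing or empty so the header stays compact.
    parts = [f"{key}: {metadata[key]}" for key in HEADER_FIELDS if metadata.get(key)]
    if not parts:
        return chunk_text
    return "[" + " | ".join(parts) + "]\n" + chunk_text
```

Because the header is plain text inside the chunk, a lexical retriever such as BM25 can match a query term like a speaker's name against it without any change to the index schema.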

The /document_index/upload_url endpoint unconditionally called collect_links, scraping Indico event pages as plain HTML. With PR archi-physics#550's IndicoScraper now on this branch, dispatch via _is_indico_url so a single agent-side POST of an Indico event URL ingests via the API + slide-conversion path instead of generic HTML scraping. collect_indico gains an int return for parity with collect_links.
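
The dispatch decision can be sketched as follows, using the auto-detection rule stated elsewhere in this PR ("indico" in the hostname and /event/ in the path). The names `looks_like_indico_url` and `choose_collector` are hypothetical stand-ins; the actual `_is_indico_url` may differ.

```python
from urllib.parse import urlparse


def looks_like_indico_url(url: str) -> bool:
    """Heuristic: 'indico' appears in the hostname and '/event/' appears in the path."""
    parsed = urlparse(url)
    hostname = (parsed.hostname or "").lower()
    return "indico" in hostname and "/event/" in parsed.path


def choose_collector(url: str) -> str:
    """Name the collector a URL would be routed to in this sketch."""
    return "collect_indico" if looks_like_indico_url(url) else "collect_links"
```

Checking the parsed hostname rather than the raw string keeps a URL such as `https://example.org/event/indico-notes` on the generic link path.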

Wraps the data-manager's POST /document_index/upload_url so the agent can ask archi to scrape and index a URL it has just discovered (e.g. an Indico event URL surfaced by INDICO_get_files). The endpoint dispatches Indico URLs through IndicoScraper and falls back to LinkScraper for the rest, so the tool stays URL-agnostic. Skill text instructs the agent to chain INDICO_get_files(download_files=false) -> ingest_url(event_url) -> search_vectorstore_hybrid when slide contents are needed.

redo.sh is a developer-local smoke-test script and should not live in the repo. It was previously committed in the integration commit (now purged from history via filter-branch); keeping it gitignored prevents future re-additions.

- Add shared volume for Indico downloads in Dockerfile
- Implement ingest_indico_event tool for processing Indico event attachments
- Enhance CMSCompOpsAgent to utilize shared downloads directory
- Add ingest_local_path endpoint to Flask app for directory ingestion
- Update LocalFileManager to support directory ingestion from shared volumes
- Modify data manager to handle new ingestion logic and metadata


hassan11196 · 2026-05-11T14:19:40Z

Dedicated skill for indico MCP
https://gitlab.cern.ch/archi/cms-compops/-/blob/hassan-dev-vocms248/config/comp_ops/skills/indico.md?ref_type=heads

Keeping markitdown[pdf,pptx]>=0.1.0 in pyproject.toml is enough: the data-manager and chat Dockerfiles install it via `pip install .`, and only the data-manager image actually uses it (slide_converter + indico_scraper). Having it in requirements-base.txt instead caused two problems on this PR (head from a fork):

* build-base-images was triggered (requirements-base.txt is one of the watched paths) and failed at the Docker Hub push step because DOCKERHUB_* secrets are not exposed to fork PRs.
* unit-tests' Install dependencies step took ~16 min on the runner installing the heavy markitdown transitive deps (onnxruntime, magika, pdfplumber, ...) and exited 1.

Removing the duplicate line skips build-base-images entirely on this PR (detect step writes changed=false) and shrinks the unit-tests install back to its prior size.

The new Indico MCP-sidecar subsection added two relative links pointing at ../../mcp/indico/README.md. With docs_dir=docs (default) the rendered page lives under site/data_sources/, so the relative path resolves to mcp/indico/README.md outside the docs tree — mkdocs cannot find the target and `mkdocs build --config-file docs/mkdocs.yml --strict` (used by the PR-preview `Verify MkDocs build` step) exits 1. Switch the two links to absolute GitHub URLs (same convention used in docs/docs/install.md and docs/docs/troubleshooting.md) so mkdocs no longer tries to resolve them as in-tree paths.

hassan11196 marked this pull request as ready for review May 11, 2026 13:08

hassan11196 force-pushed the feat-indico-mcp branch from 849f5d7 to 136def1 Compare May 11, 2026 13:37

hassan11196 and others added 8 commits May 11, 2026 15:53

Ignore local redo.sh

96bea38


feat: enhance Indico MCP integration with on-demand event attachment ingestion

a43a819

feat: enhance ingest_url tool with configurable routing rules and SSO fallback support

76d35e2

hassan11196 force-pushed the feat-indico-mcp branch from 4125ce1 to 76d35e2 Compare May 11, 2026 14:04

hassan11196 changed the title ~~feat: Indico scraper, MCP sidecar, and on-demand ingest agent tools~~ feat(indico): scraper, MCP sidecar, and on-demand ingest tooling May 11, 2026

hassan11196 added 2 commits May 11, 2026 17:32

juanpablosalas requested review from juanpablosalas May 12, 2026 16:51

hassan11196 wants to merge 10 commits into archi-physics:main from hassan11196:feat-indico-mcp
