feat(indico): scraper, MCP sidecar, and on-demand ingest tooling#564

Open
hassan11196 wants to merge 10 commits into archi-physics:main from hassan11196:feat-indico-mcp

hassan11196 (Collaborator) commented May 11, 2026

Summary

Adds Indico as a first-class data source, end-to-end:

  1. Indico scraper (bcd4c078, originally @livaage's work; PR #550, "feat(data-manager): add Indico scraper integration", was not merged upstream, so it ships here): fetches events, contributions, and attachments via the Indico REST API, converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown, deduplicates multi-format uploads, authenticates SSO-protected events via CERNSSOScraper, and prepends per-chunk metadata headers (event/speaker/affiliation/time) to improve BM25 retrieval. Adds markitdown[pdf,pptx] to deps.

  2. mcp4indico sidecar (93e994dd): containerized Indico MCP server using the generic MCP sidecar mechanism from PR #557, "Generic MCP sidecar + domain-knowledge skills + tool-use observability" (now on main). Ships a Dockerfile pinned to upstream 800c5fc3, an entrypoint wrapper that initializes API client globals from env vars, an Indico skill, and a worked example config block.

  3. On-demand ingest agent tools

    • ingest_url (b48b1554, 76d35e20): wraps POST /document_index/upload_url so the agent can scrape & index any URL it discovers. Dispatches Indico URLs through IndicoScraper (288494e4), falls back to LinkScraper for the rest. Adds configurable routing rules and SSO fallback in the latest revision.
    • ingest_indico_event (5c1f2b9d, a43a819e): ingests Indico event attachments on demand via a shared downloads volume, with ingest_local_path Flask endpoint and LocalFileManager directory-ingestion support.
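
As a rough illustration of what "wraps POST /document_index/upload_url" could look like from the agent side, here is a minimal sketch. The endpoint path comes from this PR; the base URL, the JSON field name `url`, and the function names are assumptions made for the example and may not match the actual tool.

```python
import json
import urllib.request

# Hypothetical value: the real data-manager host/port is not stated in this PR.
DATA_MANAGER_URL = "http://localhost:7871"


def build_upload_request(
    target_url: str, base_url: str = DATA_MANAGER_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST to /document_index/upload_url."""
    # Assumed payload shape: a JSON object with a single "url" field.
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/document_index/upload_url",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest_url(target_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    request = build_upload_request(target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from sending keeps the payload easy to inspect without a running data-manager.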

Suggested agent flow

INDICO_get_files(download_files=false) → ingest_url(event_url) (or ingest_indico_event) → search_vectorstore_hybrid once slides are indexed.

Known limitations (Indico scraper)

  • Images/diagrams in slides aren't extracted (text-only conversion).
  • <latexit> blocks are stripped; inline LaTeX may still chunk poorly.
  • Chunks are per-page, not per-deck — no cross-slide summarization.
  • Category URLs (/category/<id>/) are coded but not e2e-tested.
  • SSO is CERN-specific (CERNSSOScraper); other Indico instances need a different auth path.
  • No rate limiting / incremental scraping.
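
To make the <latexit> limitation concrete, the stripping step might look roughly like the sketch below. The regular expression and the function name `strip_latexit` are illustrative assumptions, not the code in slide_converter.py.

```python
import re

# Matches a whole <latexit ...>...</latexit> element, including payloads
# that span several lines (DOTALL lets "." cross newlines).
_LATEXIT_RE = re.compile(r"<latexit\b[^>]*>.*?</latexit>", re.DOTALL | re.IGNORECASE)


def strip_latexit(markdown: str) -> str:
    """Remove embedded <latexit> blocks from converted slide Markdown."""
    return _LATEXIT_RE.sub("", markdown)
```

Only the tagged blocks are removed; inline LaTeX written directly in the slide text is left untouched, which is why it can still chunk poorly.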

Example config

The PR ships an example config block in mcp/indico/README.md; below is a working snippet drawn from a live base_config_indico.yaml deployment that exercises both the sidecar and the agent-tool routing.

MCP sidecar

Declare the Indico MCP server under mcp_servers. The build_context points at this PR's mcp/indico/ directory so the sidecar is built locally; tokens come from env (.env / secrets manager). The shared_volume is mounted read-only into the data-manager at the same path, and the template derives INDICO_DOWNLOADS_DIR=/shared/indico-downloads from it so ingest_indico_event can pick up downloaded attachments.

mcp_servers:
  indico:
    transport: streamable_http
    url: http://localhost:8012/mcp        # use http://indico-mcp:8012/mcp on bridge networking
    build_context: ./mcp/indico           # path to this PR's sidecar
    env:
      INDICO_BASE_URL: https://indico.cern.ch
      BEARER_TOKEN: ${INDICO_BEARER_TOKEN}
      API_KEY: ${INDICO_API_KEY}
      API_SECRET: ${INDICO_API_SECRET}
    shared_volume: indico-downloads
    skill: indico

ingest_url routing rules + SSO fallback

ingest_url evaluates rules in declaration order; first match wins.

  • action: refuse. Returns the message to the agent so it switches to a better tool (used here to bounce Indico event URLs over to ingest_indico_event, which authenticates via the MCP sidecar instead of storing a login-redirect page).
  • action: sso_retry. When sso_fallback_enabled: true, the data-manager retries via CERNSSOScraper if the anonymous LinkScraper lands on a Keycloak page. Requires data_manager.sources.sso.enabled: true and SSO_USERNAME / SSO_PASSWORD secrets.

services:
  chat_app:
    tools:
      ingest_url:
        sso_fallback_enabled: true
        routing_rules:
          # 1) Indico event pages -> ingest_indico_event (auth'd via MCP, no Selenium)
          - pattern: '^https?://[^/]*indico\.[^/]*/event/(?P<event_id>\d+)'
            action: refuse
            scraper: indico_mcp
            message: |
              Error: this URL is an Indico event page (event_id={event_id}).
              `ingest_url` cannot authenticate against CERN SSO and would store the
              login redirect page. Call `ingest_indico_event(event_id="{event_id}")`
              instead — it drives the bearer-authenticated Indico MCP server.
          - pattern: '^https?://[^/]*indico\.[^/]*/export/event/(?P<event_id>\d+)\.json'
            action: refuse
            scraper: indico_mcp
            message: |
              Error: Indico API export (event_id={event_id}). Call
              `ingest_indico_event(event_id="{event_id}")` instead.
          # 2) Any other CERN host -> eligible for Selenium-driven SSO fallback
          - pattern: '^https?://[^/]*\.cern\.ch(/|$)'
            action: sso_retry
            scraper: sso
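
The "declaration order, first match wins" evaluation can be sketched as follows. This is a simplified model under stated assumptions: `RoutingRule`, `RoutingDecision`, and `route_url` are hypothetical names, and the real tool may match and format messages differently. The patterns are taken from the config above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingRule:
    pattern: str
    action: str
    scraper: str = ""
    message: str = ""


@dataclass
class RoutingDecision:
    action: str
    scraper: str
    message: str


def route_url(url: str, rules: list[RoutingRule]) -> Optional[RoutingDecision]:
    """Return the decision of the first rule whose pattern matches, else None."""
    for rule in rules:
        match = re.search(rule.pattern, url)
        if match is None:
            continue
        # Fill {event_id}-style placeholders from the pattern's named groups.
        message = rule.message.format(**match.groupdict()) if rule.message else ""
        return RoutingDecision(rule.action, rule.scraper, message)
    return None


RULES = [
    RoutingRule(
        pattern=r"^https?://[^/]*indico\.[^/]*/event/(?P<event_id>\d+)",
        action="refuse",
        scraper="indico_mcp",
        message='Call ingest_indico_event(event_id="{event_id}") instead.',
    ),
    RoutingRule(
        pattern=r"^https?://[^/]*\.cern\.ch(/|$)",
        action="sso_retry",
        scraper="sso",
    ),
]
```

Because indico.cern.ch also matches the generic CERN pattern, the Indico rules have to be declared first for the refusal to take effect.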

Data-manager SSO source (required for sso_retry)

data_manager:
  sources:
    links:
      selenium_scraper:
        enabled: true
        use_for_scraping: true
        selenium_class: CERNSSOScraper
        selenium_class_map:
          CERNSSOScraper:
            class: CERNSSOScraper
            kwargs:
              headless: true
              max_depth: 5
    sso:
      enabled: true

hassan11196 marked this pull request as ready for review May 11, 2026 13:08
hassan11196 and others added 8 commits May 11, 2026 15:53
Adds a containerized mcp4indico server using the generic MCP sidecar
mechanism (PR archi-physics#557): Dockerfile pinned to upstream commit 800c5fc3 with a
small entrypoint wrapper that initializes API client globals from env
vars at HTTP-app import time. Ships an indico skill, a worked example
config block, and a redo.sh modeled on the existing smoke-test flow.

Adds support for scraping Indico events and meeting materials, alongside
the existing link/git/sso/elog scrapers.

Scraper (indico_scraper.py)
- Fetches event metadata, contributions, and slide attachments via the
  Indico REST API.
- Converts PDF/PPTX/PPT/ODP slides to Markdown via MarkItDown; strips
  embedded <latexit> blocks that inflate chunk counts on formula-heavy
  slides (slide_converter.py).
- Deduplicates attachments when the same slides are uploaded in multiple
  formats (e.g. PDF + PPTX): keeps the higher-priority format.
- Detects SSO-protected events and authenticates via CERNSSOScraper.
- Stores speaker affiliation alongside speaker name in resource metadata.
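
A minimal sketch of the multi-format deduplication described above. The priority order and the choice to group attachments by case-insensitive file stem are assumptions for illustration; the commit only states that the higher-priority format is kept.

```python
from pathlib import PurePosixPath

# Assumed priority order (earlier wins); not confirmed by the PR.
FORMAT_PRIORITY = [".pdf", ".pptx", ".ppt", ".odp"]


def _rank(filename: str) -> int:
    """Lower rank means higher priority; unknown formats rank last."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in FORMAT_PRIORITY:
        return FORMAT_PRIORITY.index(suffix)
    return len(FORMAT_PRIORITY)


def dedup_attachments(filenames: list[str]) -> list[str]:
    """Keep one file per stem, preferring the higher-priority format."""
    best: dict[str, str] = {}
    for name in filenames:
        key = PurePosixPath(name).stem.lower()
        current = best.get(key)
        if current is None or _rank(name) < _rank(current):
            best[key] = name
    return list(best.values())
```

For example, `dedup_attachments(["talk.pptx", "talk.pdf", "notes.odp"])` keeps `talk.pdf` and `notes.odp`.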

ScraperManager integration
- collect_indico() / schedule_collect_indico() hooks.
- URL routing: explicit "indico-" prefix in weblists, plus auto-detection
  for URLs with "indico" in the hostname and /event/ in the path.
- Indico documents use source_type="web" (matching the existing CHECK
  constraint) with a "scraper": "indico" metadata field for filtering.
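
The routing heuristic above (explicit "indico-" prefix, or "indico" in the hostname plus /event/ in the path) could be expressed roughly like this. `is_indico_url` here is a hypothetical stand-in written for this example, not the branch's `_is_indico_url`.

```python
from urllib.parse import urlparse

INDICO_PREFIX = "indico-"


def is_indico_url(entry: str) -> bool:
    """Decide whether a weblist entry should go to the Indico scraper."""
    # Explicit opt-in: entries written as "indico-<url>" in a weblist.
    if entry.startswith(INDICO_PREFIX):
        return True
    # Auto-detection: "indico" in the hostname and /event/ in the path.
    parsed = urlparse(entry)
    hostname = (parsed.hostname or "").lower()
    return "indico" in hostname and "/event/" in parsed.path
```

Under this reading, a bare category URL is not auto-detected and would need the explicit prefix.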

Vectorstore
- Prepends a one-line metadata header to each Indico chunk (event title,
  date, contribution, speaker, affiliation, start time, duration, session)
  so BM25 retrieval can match on speaker name, time of day, etc.
  Gated on metadata "scraper"="indico"; no other sources affected.
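
The header prepending can be pictured with the sketch below. The metadata keys, their order, and the bracketed one-line layout are assumptions; the commit names the kinds of fields and the "scraper"="indico" gate but not the exact format.

```python
from typing import Mapping

# Assumed key names and order, chosen to mirror the fields listed above.
HEADER_FIELDS = [
    "event_title", "date", "contribution", "speaker",
    "affiliation", "start_time", "duration", "session",
]


def prepend_metadata_header(chunk: str, metadata: Mapping[str, str]) -> str:
    """Prefix Indico chunks with a one-line header; leave other sources as-is."""
    if metadata.get("scraper") != "indico":
        return chunk
    parts = [f"{field}: {metadata[field]}" for field in HEADER_FIELDS if metadata.get(field)]
    if not parts:
        return chunk
    return "[" + " | ".join(parts) + "]\n" + chunk
```

Putting the speaker and time into the chunk text itself is what lets a keyword scorer such as BM25 match a query on them.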

Config / docs / examples
- base-config.yaml: indico source block (disabled by default).
- docs/docs/data_sources.md: Indico section.
- examples/agents/indico-assistant.md: agent spec for Indico queries.
- examples/deployments/basic-agent/indico_example.list: example weblist.
- SourceRegistry: register "indico" source (depends on links).

Dependencies
- pyproject.toml + requirements-base.txt: add markitdown[pdf,pptx].

Known limitations
- Images and figures in slides are not extracted or described; only text
  content is converted to Markdown. Slides that communicate primarily
  through plots/diagrams will produce thin or empty chunks.
- LaTeX in slides: embedded <latexit> blocks are stripped (they are
  base64-encoded and useless for retrieval), but inline LaTeX notation
  and formula-heavy slides may still produce low-quality Markdown that
  chunks poorly.
- Slide context is per-page, not per-deck: each chunk comes from one
  page/section of the converted Markdown. There is no cross-slide
  summarisation, so a narrative that spans multiple slides may be split
  across chunks without connecting context.
- Category URLs (/category/<id>/) are handled in the code but not yet
  tested end-to-end; only event URLs are documented.
- SSO authentication is CERN-specific (CERNSSOScraper). Other Indico
  instances with different login flows would need a different auth path.
- No rate limiting or incremental scraping; large events with many
  attachments are processed in a single run.

The /document_index/upload_url endpoint unconditionally called
collect_links, scraping Indico event pages as plain HTML. With PR archi-physics#550's
IndicoScraper now on this branch, dispatch via _is_indico_url so a
single agent-side POST of an Indico event URL ingests via the API +
slide-conversion path instead of generic HTML scraping. collect_indico
gains an int return for parity with collect_links.

Wraps the data-manager's POST /document_index/upload_url so the agent
can ask archi to scrape and index a URL it has just discovered (e.g.
an Indico event URL surfaced by INDICO_get_files). The endpoint dispatches
Indico URLs through IndicoScraper and falls back to LinkScraper for the
rest, so the tool stays URL-agnostic. Skill text instructs the agent to
chain INDICO_get_files(download_files=false) -> ingest_url(event_url) ->
search_vectorstore_hybrid when slide contents are needed.

redo.sh is a developer-local smoke-test script and should not live in
the repo. It was previously committed in the integration commit (now
purged from history via filter-branch); keeping it gitignored prevents
future re-additions.

- Add shared volume for Indico downloads in Dockerfile
- Implement ingest_indico_event tool for processing Indico event attachments
- Enhance CMSCompOpsAgent to utilize shared downloads directory
- Add ingest_local_path endpoint to Flask app for directory ingestion
- Update LocalFileManager to support directory ingestion from shared volumes
- Modify data manager to handle new ingestion logic and metadata
hassan11196 changed the title from "feat: Indico scraper, MCP sidecar, and on-demand ingest agent tools" to "feat(indico): scraper, MCP sidecar, and on-demand ingest tooling" May 11, 2026
@hassan11196
Collaborator Author

Keeping markitdown[pdf,pptx]>=0.1.0 in pyproject.toml is enough — the
data-manager and chat Dockerfiles install it via `pip install .`, and
only the data-manager image actually uses it (slide_converter +
indico_scraper). Having it in requirements-base.txt instead caused two
problems on this PR (head from a fork):

  * build-base-images was triggered (requirements-base.txt is one of the
    watched paths) and failed at the Docker Hub push step because
    DOCKERHUB_* secrets are not exposed to fork PRs.
  * unit-tests' Install dependencies step took ~16 min on the runner
    installing the heavy markitdown transitive deps (onnxruntime,
    magika, pdfplumber, ...) and exited 1.

Removing the duplicate line skips build-base-images entirely on this PR
(detect step writes changed=false) and shrinks the unit-tests install
back to its prior size.

The new Indico MCP-sidecar subsection added two relative links pointing
at ../../mcp/indico/README.md. With docs_dir=docs (default) the rendered
page lives under site/data_sources/, so the relative path resolves to
mcp/indico/README.md outside the docs tree — mkdocs cannot find the
target and `mkdocs build --config-file docs/mkdocs.yml --strict` (used
by the PR-preview `Verify MkDocs build` step) exits 1.

Switch the two links to absolute GitHub URLs (same convention used in
docs/docs/install.md and docs/docs/troubleshooting.md) so mkdocs no
longer tries to resolve them as in-tree paths.
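
A small sketch of why the relative links fall outside the docs tree. It resolves the link against the source page at repository level (docs/docs/data_sources.md, as listed in the scraper commit); the helper names are hypothetical and this is not how mkdocs itself performs the check.

```python
import posixpath


def resolve_relative_link(source_page: str, link: str) -> str:
    """Resolve a Markdown link relative to the directory of its source page."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(source_page), link))


def is_inside(path: str, root: str) -> bool:
    """True if path lies under root (both repo-relative POSIX paths)."""
    return path.startswith(root.rstrip("/") + "/")


target = resolve_relative_link("docs/docs/data_sources.md", "../../mcp/indico/README.md")
# target is "mcp/indico/README.md", which is not under the docs tree "docs/docs".
outside_docs = not is_inside(target, "docs/docs")
```

An absolute GitHub URL avoids the problem because it is not resolved as an in-tree path at all.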