Phase 3a follow-up: propose_selectors for non-CSS extraction algorithms #2

@gregoryfoster

Migrated from CannObserv/watcher#148 during the Archiver service extraction (CannObserv/watcher#149) on 2026-05-07. Paths in the body have been updated to reflect the new archiver repo layout.

Context

Phase 3a's `propose_selectors` (src/core/tools/propose_selectors.py) returns CSS selectors only. The InfoSpec v1 schema accepts `extraction.algorithm: enum [css, xpath, jsonpath, regex, full_page]` — for non-CSS algorithms, an LLM agent following the authoring loop (`fetch_and_render` → `propose_selectors` → `preview_extraction` → `create_info_spec`) currently has no proposer for the selector-shaped field.

What to build

Per-algorithm proposers, gated on a request `algorithm` parameter (default `css`):

  • xpath: same DOM walk and same scoring as the CSS proposer; emit XPath expressions instead of CSS selectors. Volatility heuristics carry over (e.g. penalizing hash-looking attribute values).
  • jsonpath: applies when the target is JSON (sniff via `Content-Type: application/json` from `fetch_and_render`). Walk the JSON tree and propose paths whose `str(value)` contains the description.
  • regex: heuristic regex synthesis from the description plus 1-2 surrounding tokens. Lower confidence than the DOM-based proposers; flag with a lower base `stability_score`.
  • full_page: trivial; returns one candidate with an empty selector and score 1.0 (full_page extracts everything).
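
A rough sketch of how the `algorithm` gate and the two simplest proposers (jsonpath, full_page) might fit together. Everything named here is hypothetical and not taken from the existing `propose_selectors` code: the `Candidate` shape, the function names, the `body`/`description` parameters, and the 0.8 placeholder score are assumptions for illustration only.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterator, List, Tuple


@dataclass
class Candidate:
    # Hypothetical result shape; the real propose_selectors output may differ.
    selector: str
    stability_score: float


def _walk(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf_value) for every leaf of a parsed JSON tree.

    Simplification: keys are appended in dot notation, so keys containing
    dots or spaces would need bracket quoting in a real implementation.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            yield from _walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from _walk(value, f"{path}[{index}]")
    else:
        yield path, node


def propose_jsonpath(body: str, description: str) -> List[Candidate]:
    """Propose paths whose str(value) contains the description."""
    return [
        Candidate(selector=path, stability_score=0.8)  # placeholder score (assumption)
        for path, value in _walk(json.loads(body))
        if description in str(value)
    ]


def propose_full_page() -> List[Candidate]:
    """full_page extracts everything: one candidate, empty selector, score 1.0."""
    return [Candidate(selector="", stability_score=1.0)]


def propose(algorithm: str = "css", *, body: str = "", description: str = "") -> List[Candidate]:
    """Dispatch on the request's algorithm parameter (default css)."""
    if algorithm == "jsonpath":
        return propose_jsonpath(body, description)
    if algorithm == "full_page":
        return propose_full_page()
    # css is handled by the existing proposer; xpath and regex are not sketched here.
    raise NotImplementedError(f"no proposer sketched for algorithm={algorithm!r}")


if __name__ == "__main__":
    sample = '{"store": {"items": [{"title": "Blue Dream"}, {"title": "OG Kush"}]}}'
    print(propose("jsonpath", body=sample, description="Kush"))
    print(propose("full_page"))
```

Exact substring matching follows the wording above; whether matching should be case-insensitive, or how candidates should be ranked when several paths match, is left open.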

Why deferred

CSS covers >90% of currently-modelled targets in the wild. The other algorithms are real but each adds its own heuristic stack and test fixtures; bundling them into Phase 3a would have ballooned the slice.

Decision criteria

Land in priority order:

  1. xpath when an operator hits a target where CSS specificity is insufficient (deeply-nested or attribute-driven content).
  2. jsonpath when an InfoItem on a JSON API is needed (likely Phase 3b — Archive may consume JSON sources).
  3. regex / full_page as opportunistic adds.
