JSON extractor: application/json falls back to HTML (#168 follow-up) #212

Description

@gregoryfoster

Background

#168 (slice 2) made content_media_type drive extractor dispatch via ServiceRegistry.get_extractor (src/core/registry.py). Dispatch is total: any essence (the media type with its parameters stripped) that is not explicitly mapped falls back to the HTML extractor.

application/json has no dedicated extractor, so JSON targets are currently run through HtmlExtractor. Archiver's declared content-kind family already includes json (jsonpath → json), but Watcher has no JSON extractor to match.
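To make the fallback concrete, here is a minimal sketch of total dispatch. It is not the code in src/core/registry.py: the `_EXTRACTORS` table, the `essence` helper, the module-level `get_extractor` function, and the empty `HtmlExtractor` stand-in are all assumptions for illustration.

```python
from typing import Dict, Type


class HtmlExtractor:
    """Stand-in for the real HTML extractor (assumed shape)."""


# Hypothetical mapping table; the real registry's contents are not shown in this issue.
_EXTRACTORS: Dict[str, Type] = {"text/html": HtmlExtractor}


def essence(content_media_type: str) -> str:
    """Drop parameters and normalize case, e.g. 'Text/HTML; charset=utf-8' -> 'text/html'."""
    return content_media_type.split(";", 1)[0].strip().lower()


def get_extractor(content_media_type: str):
    # Total dispatch: an unmapped essence never raises, it falls back to HTML.
    extractor_cls = _EXTRACTORS.get(essence(content_media_type), HtmlExtractor)
    return extractor_cls()
```

Under these assumptions, `get_extractor("application/json; charset=utf-8")` returns an `HtmlExtractor`, which is the behavior this issue asks to change.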

Ask

Add a JSON extractor (e.g. canonicalize/pretty-print + structural chunking, or JSONPath-aware extraction aligned with Archiver's jsonpath algorithm) and register it:

  • src/core/extractors/ — new JsonExtractor (mirror to /home/exedev/archiver/src/core/extractors/ per the mirrored-content-acquisition policy).
  • src/core/registry.py — map application/json (and likely application/*+json) → JsonExtractor.
  • Tests: routing (application/json → JsonExtractor) + extraction behavior.
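
As a starting point for discussion, the first option (canonicalize plus structural chunking) might look roughly like the sketch below. The method names `canonicalize` and `chunks`, the `is_json_essence` helper, and the path labels are all hypothetical; the actual extractor interface and Archiver's jsonpath algorithm are not shown in this issue, so this does not claim to match either.

```python
import json
from typing import List, Tuple


class JsonExtractor:
    """Hypothetical sketch: canonical text form plus one chunk per top-level member."""

    def canonicalize(self, raw: str) -> str:
        # Sorted keys and fixed indentation give a stable text form for comparison.
        return json.dumps(json.loads(raw), sort_keys=True, indent=2, ensure_ascii=False)

    def chunks(self, raw: str) -> List[Tuple[str, str]]:
        """Return (label, compact JSON text) pairs for each top-level member.

        Labels are JSONPath-style strings for readability only; keys are not
        escaped, so they are not guaranteed to be valid JSONPath.
        """
        doc = json.loads(raw)
        if isinstance(doc, dict):
            items = [(f"$.{key}", value) for key, value in sorted(doc.items())]
        elif isinstance(doc, list):
            items = [(f"$[{index}]", value) for index, value in enumerate(doc)]
        else:
            # Scalars (string, number, bool, null) form a single chunk.
            items = [("$", doc)]
        return [
            (label, json.dumps(value, sort_keys=True, ensure_ascii=False))
            for label, value in items
        ]


def is_json_essence(essence: str) -> bool:
    """Match application/json and application/*+json (e.g. application/ld+json)."""
    return essence == "application/json" or (
        essence.startswith("application/") and essence.endswith("+json")
    )
```

Because application/*+json is a pattern rather than a single key, the registry would need either a predicate like `is_json_essence` checked before the HTML fallback or an explicit list of suffix types; which fits better depends on how the existing mapping is structured.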

Notes
