Background
#168 (slice 2) made content_media_type drive extractor dispatch via ServiceRegistry.get_extractor (src/core/registry.py). Dispatch is total: any essence not explicitly mapped falls back to the HTML extractor.
application/json has no dedicated extractor, so JSON targets are currently run through HtmlExtractor. Archiver's declared content-kind family already includes json (jsonpath → json), but Watcher has no JSON extractor to match.
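For illustration, the fallback behavior described above can be sketched as follows. This is a simplified stand-in, not the actual contents of src/core/registry.py: the internal mapping, the constructor, and the essence normalization are all assumptions; only the names ServiceRegistry, get_extractor, and HtmlExtractor come from this issue.

```python
from typing import Dict


class HtmlExtractor:
    """Stand-in for the real HTML extractor (assumed to take no arguments)."""


class ServiceRegistry:
    """Hypothetical, simplified registry; not the real src/core/registry.py."""

    def __init__(self) -> None:
        # Essence (media type without parameters) -> extractor class.
        self._extractors: Dict[str, type] = {"text/html": HtmlExtractor}

    def get_extractor(self, content_media_type: str):
        # Drop parameters such as "; charset=utf-8" and normalize case.
        essence = content_media_type.split(";", 1)[0].strip().lower()
        # Total dispatch: any unmapped essence falls back to HtmlExtractor.
        return self._extractors.get(essence, HtmlExtractor)()


registry = ServiceRegistry()
# application/json is not mapped, so it is routed to the HTML extractor.
json_extractor = registry.get_extractor("application/json; charset=utf-8")
```

Under these assumptions, json_extractor is an HtmlExtractor instance, which is the behavior this issue asks to change.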
Ask
Add a JSON extractor (e.g. canonicalize/pretty-print + structural chunking, or JSONPath-aware extraction aligned with Archiver's jsonpath algorithm) and register it:
- src/core/extractors/: new JsonExtractor (mirror to /home/exedev/archiver/src/core/extractors/ per the mirrored-content-acquisition policy).
- src/core/registry.py: map application/json (and likely application/*+json) → JsonExtractor.
- Tests: routing (application/json → JsonExtractor) + extraction behavior.
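A minimal sketch of the "canonicalize/pretty-print + structural chunking" option, to make the ask concrete. The method names (canonicalize, extract), the return types, and the dotted path format are assumptions for illustration; they are not taken from the existing extractor interface, and the path syntax is only JSONPath-like, not Archiver's jsonpath algorithm.

```python
import json
from typing import Any, Iterator, List, Tuple


class JsonExtractor:
    """Hypothetical sketch; the real extractor interface may differ."""

    def canonicalize(self, raw: str) -> str:
        # Re-serialize with sorted keys and fixed indentation so that key
        # order and whitespace differences do not change the output.
        return json.dumps(json.loads(raw), sort_keys=True, indent=2, ensure_ascii=False)

    def _walk(self, node: Any, path: str) -> Iterator[Tuple[str, Any]]:
        # Yield (path, leaf) pairs; empty containers are treated as leaves.
        if isinstance(node, dict) and node:
            for key in sorted(node):
                yield from self._walk(node[key], f"{path}.{key}")
        elif isinstance(node, list) and node:
            for index, item in enumerate(node):
                yield from self._walk(item, f"{path}[{index}]")
        else:
            yield path, node

    def extract(self, raw: str) -> List[str]:
        # One chunk per leaf: "<path> = <JSON-encoded value>".
        return [f"{path} = {json.dumps(value)}" for path, value in self._walk(json.loads(raw), "$")]


chunks = JsonExtractor().extract('{"b": [1, {"c": null}], "a": "x"}')
# chunks == ['$.a = "x"', '$.b[0] = 1', '$.b[1].c = null']
```

Per-leaf chunks keep a change in one field from touching unrelated chunks. Keys containing dots or brackets would need escaping, which this sketch does not attempt.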
Notes
Consider application/*+json (vendor JSON) essence handling at the same time.
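One possible way to recognize vendor JSON essences, sketched under assumptions: the helper name is_json_essence is hypothetical, and matching on the "+json" structured syntax suffix is one option, not a decision recorded in this issue.

```python
def is_json_essence(content_media_type: str) -> bool:
    """Return True for application/json and application/<something>+json."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = content_media_type.split(";", 1)[0].strip().lower()
    if essence == "application/json":
        return True
    type_, _, subtype = essence.partition("/")
    # Require a non-empty name before the "+json" suffix.
    return (
        type_ == "application"
        and subtype.endswith("+json")
        and len(subtype) > len("+json")
    )
```

A registry could call a predicate like this before falling back to HtmlExtractor, so that essences such as application/vnd.api+json or application/ld+json route to the JSON extractor without being listed one by one.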