diff --git a/docs/CONFIG_REFERENCE.md b/docs/CONFIG_REFERENCE.md
index 5468e06..8bfd2a8 100644
--- a/docs/CONFIG_REFERENCE.md
+++ b/docs/CONFIG_REFERENCE.md
@@ -133,7 +133,7 @@ Default download policies used when library-specific override is not enabled.
 
 ## `settings.network.libraries.`
 
-Libraries supported: `gallica`, `vaticana`, `bodleian`, `institut_de_france`, `unknown`.
+Libraries supported: `gallica`, `vaticana`, `bodleian`, `institut_de_france`, `internet_culturale` (BETA), `unknown`.
 
 **HTTPClient Integration**: These settings are used by the centralized `HTTPClient` class for per-library network policies (rate limiting, retry, backoff, concurrency).
 
@@ -145,7 +145,7 @@ Global-only fields (never overridden by library):
 
 Library override fields (used only when `use_custom_policy=true`):
 - `enabled` (`bool`, default: `true`)
-- `use_custom_policy` (`bool`, default: `true` for `gallica`, otherwise `false`)
+- `use_custom_policy` (`bool`, default: `true` for `gallica` and `internet_culturale`, otherwise `false`)
   - When `true`, library-specific settings override global defaults
   - When `false`, global defaults from `settings.network.download.*` are used
 - `workers_per_job` (`int`, `1..8`)
@@ -432,8 +432,9 @@ Discovery search configuration. Editable from Settings > Discovery tab in the we
 - `max_results_per_provider` (`int`, default: `20`)
   - Maximum number of results returned by each search provider per query.
   - Clamped to [1, 50] at runtime and on save.
-  - For paginatable providers (Archive.org, Harvard, LOC, Gallica), additional results can be loaded via the "Carica altri risultati" button.
+  - For paginatable providers (Archive.org, Harvard, LOC, Gallica, Internet Culturale (BETA)), additional results can be loaded via the "Carica altri risultati" button.
   - Non-paginatable providers (Vatican, Bodleian, Cambridge, Heidelberg, Institut, e-codices) return at most this many results from a single API call.
+  - For Internet Culturale (BETA) the upstream page size is fixed at 20 regardless of `max_results_per_provider`; the "has more" check relies on the authoritative `totalPages` parsed from the HTML instead of the result cap.
 
 ## Migration Notes
 
diff --git a/docs/guides/discovery-and-library.md b/docs/guides/discovery-and-library.md
index a9c3e01..760e18a 100644
--- a/docs/guides/discovery-and-library.md
+++ b/docs/guides/discovery-and-library.md
@@ -57,6 +57,8 @@ Discovery also reflects provider-specific result behavior. Some providers can ex
 
 The practical posture is to treat Discovery as a normalized gateway, not as proof that every library offers the same search ergonomics.
 
+Internet Culturale **(BETA)** is a special case worth calling out explicitly. It sits at the bottom of the provider select because the integration is experimental: useful when ICCU is the only channel to reach an Italian record, but less reliable than any native IIIF provider. It is an aggregator that fronts around fifty Italian libraries (Laurenziana, Marciana, BNCF, BNCR, Estense, and many smaller partners) and it routinely returns thousands of results for a single keyword. Scriptoria shows the upstream total as "Mostrati X di Y risultati" so the size of the result set is visible, and "Carica altri risultati" walks through the remaining pages twenty at a time. Because the upstream does not expose a IIIF manifest directly, the manifest used internally is converted on-the-fly from ICCU's MAG/XML document; partial records (those declaring more pages than the server actually serves) are still saved as partial scans rather than failing outright, but expect occasional teaser records where only the frontispiece is really available.
+
 ## What Library Does
 
 Library is the local catalog of manuscript records and their current working state.
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index 01203b0..97ae8c1 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -78,6 +78,10 @@ It is split into three layers:
 - `settings.network.download.*` for default document download behavior;
 - `settings.network.libraries..*` for provider-specific overrides.
 
+The supported provider keys under `settings.network.libraries.*` are `gallica`, `vaticana`, `bodleian`, `institut_de_france`, `internet_culturale` **(BETA)**, and `unknown`. Setting `use_custom_policy: false` on a library makes it inherit the `settings.network.download.*` defaults; `true` activates the per-library override fields.
+
+`internet_culturale` (BETA) ships with a conservative default policy (2 workers per job, 1.0–3.0s delay, 300s cooldown on 403/429, 40 requests per 60s burst window) because the ICCU aggregator is a shared infrastructure and is noticeably less tolerant than large IIIF-native providers.
+
 You touch this family when:
 
 - a provider rate-limits too aggressively;
@@ -85,7 +89,7 @@ You touch this family when:
 - one library needs stricter policy than the global default;
 - you want reproducible network behavior across machines.
 
-This family is directly reflected in the Settings `Network` pane.
+This family is directly reflected in the Settings `Network & Libraries` pane, which exposes per-library override cards for each supported provider key.
 
 ### `settings.images.*`
 
diff --git a/docs/reference/provider-support.md b/docs/reference/provider-support.md
index ac644d3..54569cf 100644
--- a/docs/reference/provider-support.md
+++ b/docs/reference/provider-support.md
@@ -16,6 +16,7 @@ The shared provider registry currently exposes these providers:
 - Harvard University
 - Library of Congress
 - Internet Archive
+- Internet Culturale (ICCU) **[BETA]**
 - generic direct IIIF manifest URL
 
 These entries come from the runtime provider registry in `src/universal_iiif_core/providers.py`, which is the source used by both UI and shared resolution logic.
@@ -56,6 +57,7 @@ The provider has a stronger search-first experience and can reasonably be used f
 | Harvard | DRS-bearing item URL | `fallback` | Usually best treated as URL-driven |
 | Library of Congress | public item URL | `fallback` | Prefer `loc.gov/item/...` URLs |
 | Internet Archive | item URL or text query | `search_first` | Good discovery-first behavior for many cases |
+| Internet Culturale **[BETA]** | text query, OAI ID, or magparser/viewresource URL | `search_first` | Gateway for ~50 Italian libraries (Laurenziana, Marciana, BNCF/BNCR, Estense, Marucelliana, Ambrosiana partners, etc.). Integration is experimental: many upstream records are incomplete and image quality is variable. Use only when ICCU is the only channel available |
 | Generic / direct manifest | exact manifest URL | `direct` | Use only when you already have a valid IIIF manifest URL |
 
 ## Per-Provider Notes
@@ -100,6 +102,23 @@ Library of Congress is best approached with the public `loc.gov/item/...` page a
 
 Internet Archive supports a more discovery-first workflow than many other providers in the registry. It is usually comfortable both for direct item URLs and for broad text search.
 
+### Internet Culturale (ICCU) **[BETA]**
+
+Internet Culturale is the Italian national aggregator run by ICCU.
The integration is currently **BETA**: it is good enough to reach content that is otherwise unreachable from Scriptoria, but far from the reliability of native IIIF providers. Treat it as a last-resort channel when no other provider covers the item.
+
+Unlike the other providers in the registry, ICCU does not expose a native IIIF Presentation manifest. Instead, Scriptoria fetches the upstream MAG/XML document (the `jmms/magparser` endpoint) and converts it to a IIIF v2 manifest on the fly. Canvas image URLs come from the real `src` attribute of each `` element — the `/jmms/thumbnail?page=N` endpoint ignores the page parameter and must not be used.
+
+Search is HTML scraping over the advanced search page, paginated with `pag=N` (not `paginate_pageNum`, which the server silently ignores). The parser extracts the total result count and total pages so the UI can show "Mostrati X di Y risultati" and enable "Carica altri". Typical result set sizes are in the thousands.
+
+Known BETA limitations:
+
+- Many ICCU records are "teaser" entries: the MAG XML declares several pages but only the first image is actually served upstream. The downloader applies a partial-finalize mode for ICCU manifests so partial downloads still land correctly in `scans/`, but the user-visible experience is still "you asked for N pages and only got M".
+- Image quality and resolution vary widely between teche and between records in the same teca.
+- The external viewer path used by Scriptoria is the canonical `/jmms/iccuviewer/iccu.jsp?id=...&mode=all&teca=...`. The older `viewresource` URL renders as a blank page for some teche (BNCF in particular).
+- For Mirador-based local reading Scriptoria exposes an internal proxy endpoint, `/api/iccu/manifest?url=...`, that serves the converted manifest as JSON with CORS-friendly headers.
+- The ICCU Image API v2.1 does exist at `internetculturale.it/iiif/image/2.1/{id_b64}/...` but is level 0 only (no tile server, no zoom).
Static fullsize is the only tier available regardless of which download path is chosen.
+- For native IIIF access to Biblioteca Estense records, prefer a dedicated Estense provider (Jarvis backend) when available, rather than going through ICCU.
+
 ### Generic Direct Manifest
 
 This path is intentionally simple. It exists for the case where the source is IIIF-compatible but not covered by one of the dedicated resolvers. Scriptoria expects a valid direct manifest URL and does not try to infer provider-specific behavior beyond that.
@@ -112,9 +131,12 @@ In both cases, the registry metadata already assumes that browser-assisted searc
 
 ## Provider Filters
 
-The current provider registry exposes one dedicated provider filter: the `Gallica` material type filter.
+The current provider registry exposes two dedicated provider filters:
+
+- `Gallica` — material type (all, manuscripts, printed books).
+- `Internet Culturale` **[BETA]** — material type (all, `Manoscritto`, `Libro moderno`, `Musica`, `Fotografia`).
 
-It lets users narrow Gallica results to all materials, manuscripts, or printed books. More provider-specific filters can be added later, but only when the upstream service and the user workflow justify them.
+Both filters map directly to server-side parameters and survive pagination, so "Carica altri" preserves the selected material type. More provider-specific filters can be added later, but only when the upstream service and the user workflow justify them.
## How To Choose The Right Input diff --git a/src/studio_ui/components/discovery_results.py b/src/studio_ui/components/discovery_results.py index 04c10a3..d507622 100644 --- a/src/studio_ui/components/discovery_results.py +++ b/src/studio_ui/components/discovery_results.py @@ -8,7 +8,7 @@ from fasthtml.common import H3, A, Button, Div, Img, P, Span -def _provider_viewer_fallback(library: str, doc_id: str, ark: str = "") -> str: +def _provider_viewer_fallback(library: str, doc_id: str, ark: str = "", manifest_url: str = "") -> str: if library == "Gallica" and ark: return f"https://gallica.bnf.fr/{ark}" if library == "Gallica" and doc_id: @@ -21,6 +21,12 @@ def _provider_viewer_fallback(library: str, doc_id: str, ark: str = "") -> str: return f"https://digital.bodleian.ox.ac.uk/objects/{doc_id}" if library == "Archive.org" and doc_id: return f"https://archive.org/details/{doc_id}" + if library == "Internet Culturale" and manifest_url: + from universal_iiif_core.resolvers.mag_parser import build_viewer_url, extract_oai_and_teca_from_url + + oai, teca = extract_oai_and_teca_from_url(manifest_url) + if oai and teca: + return build_viewer_url(oai, teca) return "" @@ -30,12 +36,15 @@ def _resolve_viewer_url(data: dict) -> str: raw = data.get("raw") if not viewer_url and isinstance(raw, dict): viewer_url = str(raw.get("viewer_url") or "").strip() + if not viewer_url: + viewer_url = str(data.get("source_detail_url") or "").strip() if viewer_url: return viewer_url return _provider_viewer_fallback( str(data.get("library") or ""), str(data.get("id") or ""), str(data.get("ark") or ""), + str(data.get("url") or data.get("manifest") or ""), ) @@ -72,6 +81,7 @@ def _render_load_more_section(pagination: dict | None) -> Div | str: "library": pagination["library"], "shelfmark": pagination["shelfmark"], "gallica_type": pagination.get("gallica_type", "all"), + "ic_type": pagination.get("ic_type", "all"), "page": page + 1, } ) @@ -249,14 +259,34 @@ def _build_result_cards(results: list) 
-> list: return cards +def _results_header_text(results: list, pagination: dict | None) -> str: + """Build the 'Trovati N risultati' header, using total search size when known.""" + shown = len(results) + total = 0 + if results: + raw = results[0].get("raw") if isinstance(results[0], dict) else None + if isinstance(raw, dict): + try: + total = int(raw.get("_search_total_results") or 0) + except (TypeError, ValueError): + total = 0 + page = int((pagination or {}).get("page") or 1) + per_page = shown if shown else 0 + if total and per_page: + seen = min(page * per_page, total) + return f"Mostrati {seen} di {total} risultati" + return f"Trovati {shown} risultati" + + def render_search_results_list(results: list, *, pagination: dict | None = None) -> Div: """Render list of search results aligned with global app theme.""" cards = _build_result_cards(results) load_more = _render_load_more_section(pagination) + header_text = _results_header_text(results, pagination) return Div( Div( - H3(f"Trovati {len(results)} risultati", cls="text-lg font-semibold text-slate-900 dark:text-slate-100"), + H3(header_text, cls="text-lg font-semibold text-slate-900 dark:text-slate-100"), Span( "Seleziona un risultato per aggiungerlo in Libreria o avviare il download.", cls="text-xs text-slate-500", diff --git a/src/studio_ui/components/library_stats.py b/src/studio_ui/components/library_stats.py index 7d66980..a594886 100644 --- a/src/studio_ui/components/library_stats.py +++ b/src/studio_ui/components/library_stats.py @@ -236,11 +236,15 @@ def render_stats_page_content(manuscripts: list[dict]) -> Div: ) recent = manuscripts[:6] - recent_panel = Div( - P("Ultimi aggiornati", cls="text-xs uppercase tracking-widest text-slate-500 dark:text-slate-400 mb-2"), - Ul(*[_recent_activity_row(m) for m in recent], cls="divide-y divide-slate-100 dark:divide-slate-800"), - cls=_CARD_CLS + " mb-6", - ) if recent else Div() + recent_panel = ( + Div( + P("Ultimi aggiornati", cls="text-xs uppercase 
tracking-widest text-slate-500 dark:text-slate-400 mb-2"), + Ul(*[_recent_activity_row(m) for m in recent], cls="divide-y divide-slate-100 dark:divide-slate-800"), + cls=_CARD_CLS + " mb-6", + ) + if recent + else Div() + ) detail_placeholder = Div( id="stats-detail-panel", diff --git a/src/studio_ui/components/settings/panes/network.py b/src/studio_ui/components/settings/panes/network.py index fb1f3c6..bc1ab95 100644 --- a/src/studio_ui/components/settings/panes/network.py +++ b/src/studio_ui/components/settings/panes/network.py @@ -490,6 +490,20 @@ def _build_network_pane(cm, s): **{"data-network-tab-pane": "institut_de_france"}, ) + internet_culturale_section = Div( + _build_network_library_card( + title="Internet Culturale (ICCU) [BETA]", + policy_key="internet_culturale", + policy_cfg=libraries_cfg.get( + "internet_culturale", + defaults["libraries"]["internet_culturale"], + ), + global_cfg=global_cfg, + ), + cls="hidden", + **{"data-network-tab-pane": "internet_culturale"}, + ) + return Div( Div(H3("Network & Libraries", cls="text-lg font-bold text-slate-800 dark:text-slate-100 mb-3")), P( @@ -527,6 +541,12 @@ def _build_network_pane(cm, s): cls="app-btn app-btn-neutral", **{"data-network-tab-btn": "institut_de_france"}, ), + Button( + "Internet Culturale [BETA]", + type="button", + cls="app-btn app-btn-neutral", + **{"data-network-tab-btn": "internet_culturale"}, + ), cls="flex items-center flex-wrap gap-2 mb-4", ), global_section, @@ -534,6 +554,7 @@ def _build_network_pane(cm, s): vaticana_section, bodleian_section, institut_section, + internet_culturale_section, _network_subtabs_script(), cls="p-4", data_pane="network", diff --git a/src/studio_ui/routes/_studio/manifest_helpers.py b/src/studio_ui/routes/_studio/manifest_helpers.py index a3b2071..489a58e 100644 --- a/src/studio_ui/routes/_studio/manifest_helpers.py +++ b/src/studio_ui/routes/_studio/manifest_helpers.py @@ -9,9 +9,9 @@ from fasthtml.common import Div from universal_iiif_core.config_manager 
import get_config_manager -from universal_iiif_core.http_client import get_http_client from universal_iiif_core.iiif_logic import total_canvases as manifest_total_canvases from universal_iiif_core.logger import get_logger +from universal_iiif_core.resolvers.manifest_fetch import fetch_manifest_dict from universal_iiif_core.services.storage.vault_manager import VaultManager from .ui_utils import _with_toast @@ -96,7 +96,7 @@ def _load_studio_manifest_context( return manifest_json, initial_canvas, True if remote_manifest_url: - remote_manifest = get_http_client().get_json(remote_manifest_url, retries=2) or {} + remote_manifest = fetch_manifest_dict(remote_manifest_url, retries=2) or {} if isinstance(remote_manifest, dict) and remote_manifest: return remote_manifest, _resolve_initial_canvas(remote_manifest, page), False @@ -126,7 +126,7 @@ def _resolve_manifest_for_selected_source( ) -> tuple[dict, str | None, bool, str, str]: manifest_exists_local = manifest_path.exists() if read_source_mode == "remote" and remote_manifest_url: - remote_manifest = get_http_client().get_json(remote_manifest_url, retries=2) or {} + remote_manifest = fetch_manifest_dict(remote_manifest_url, retries=2) or {} if isinstance(remote_manifest, dict) and remote_manifest: return ( remote_manifest, diff --git a/src/studio_ui/routes/_studio/workspace.py b/src/studio_ui/routes/_studio/workspace.py index da1693f..32bc05b 100644 --- a/src/studio_ui/routes/_studio/workspace.py +++ b/src/studio_ui/routes/_studio/workspace.py @@ -12,6 +12,7 @@ from studio_ui.components.studio.tabs import render_studio_tabs from universal_iiif_core.config_manager import get_config_manager from universal_iiif_core.logger import get_logger +from universal_iiif_core.resolvers.mag_parser import is_iccu_magparser_url from universal_iiif_core.services.ocr.storage import OCRStorage from universal_iiif_core.services.storage.vault_manager import VaultManager from universal_iiif_core.utils import load_json @@ -197,6 +198,8 @@ def 
_resolve_workspace_manifest_context( local_pages_count=int(inventory.local_pages_count), manifest_exists_local=manifest_exists_local, ) + if is_iccu_magparser_url(manifest_url): + manifest_url = f"/api/iccu/manifest?url={quote(manifest_url, safe='')}" return { "manifest_url": manifest_url, "manifest_json": manifest_json, diff --git a/src/studio_ui/routes/discovery.py b/src/studio_ui/routes/discovery.py index 5a1b52c..ff42721 100644 --- a/src/studio_ui/routes/discovery.py +++ b/src/studio_ui/routes/discovery.py @@ -24,6 +24,7 @@ def setup_discovery_routes(app): app.post("/api/discovery/load_more")(discovery_handlers.load_more_results) app.post("/api/library/add_prefetch_light")(discovery_handlers.add_to_library) app.get("/api/discovery/pdf_capability")(discovery_handlers.pdf_capability) + app.get("/api/iccu/manifest")(discovery_handlers.serve_iccu_manifest) app.post("/api/start_download")(discovery_handlers.start_download) app.get("/api/download_status/{download_id}")(discovery_handlers.get_download_status) app.post("/api/cancel_download/{download_id}")(discovery_handlers.cancel_download) diff --git a/src/studio_ui/routes/discovery_handlers.py b/src/studio_ui/routes/discovery_handlers.py index 24c01eb..4c4afa2 100644 --- a/src/studio_ui/routes/discovery_handlers.py +++ b/src/studio_ui/routes/discovery_handlers.py @@ -29,12 +29,12 @@ upsert_saved_entry, ) from universal_iiif_core.config_manager import get_config_manager -from universal_iiif_core.http_client import get_http_client from universal_iiif_core.iiif_logic import total_canvases from universal_iiif_core.jobs import job_manager from universal_iiif_core.logger import get_logger from universal_iiif_core.providers import is_known_provider from universal_iiif_core.resolvers.discovery import resolve_provider_input +from universal_iiif_core.resolvers.manifest_fetch import fetch_manifest_dict from universal_iiif_core.services.storage.vault_manager import VaultManager logger = get_logger(__name__) @@ -135,6 +135,8 @@ 
def _build_item_preview_data(item: dict, library: str, pages: int = 0) -> dict: "pages": pages, "thumbnail": item.get("thumbnail"), "has_native_pdf": item.get("has_native_pdf"), + "viewer_url": item.get("viewer_url", ""), + "raw": item.get("raw"), } @@ -151,6 +153,7 @@ def _build_manifest_preview_data(manifest_info: dict, manifest_url: str, doc_id: "pages": manifest_info.get("pages", 0), "thumbnail": manifest_info.get("thumbnail"), "has_native_pdf": manifest_info.get("has_native_pdf"), + "source_detail_url": manifest_info.get("source_detail_url", ""), } @@ -211,13 +214,13 @@ def _quick_manifest_has_native_pdf(manifest_url: str) -> bool: if cached and cached[0] > now: return bool(cached[1]) - manifest = get_http_client().get_json(clean_url, retries=1) + manifest = fetch_manifest_dict(clean_url, retries=1) has_pdf = bool(isinstance(manifest, dict) and _has_native_pdf_rendering(manifest)) _pdf_capability_cache[clean_url] = (now + _PDF_CAPABILITY_TTL_SECONDS, has_pdf) return has_pdf -def resolve_manifest(library: str, shelfmark: str, gallica_type: str = "all"): +def resolve_manifest(library: str, shelfmark: str, gallica_type: str = "all", ic_type: str = "all"): """Resolve a shelfmark or URL and return a preview fragment.""" try: if not shelfmark or not shelfmark.strip(): @@ -231,7 +234,9 @@ def resolve_manifest(library: str, shelfmark: str, gallica_type: str = "all"): max_results = cm.data.get("settings", {}).get("discovery", {}).get("max_results_per_provider", 20) resolution = resolve_provider_input( - library, shelfmark, filters={"gallica_type": gallica_type, "max_results": max_results} + library, + shelfmark, + filters={"gallica_type": gallica_type, "ic_type": ic_type, "max_results": max_results}, ) provider = resolution.provider @@ -243,13 +248,14 @@ def resolve_manifest(library: str, shelfmark: str, gallica_type: str = "all"): pages = 0 if is_direct else _page_count_from_result(first) return render_preview(_build_item_preview_data(first, provider.key, 
pages=pages)) - has_more = _provider_supports_pagination(provider) and len(resolution.results) >= max_results + has_more = _compute_has_more(provider, resolution.results, 1, max_results) return render_search_results_list( resolution.results, pagination={ "library": library, "shelfmark": shelfmark, "gallica_type": gallica_type, + "ic_type": ic_type, "page": 1, "has_more": has_more, }, @@ -290,6 +296,7 @@ def probe_manifest(manifest_url: str, result_id: str = ""): from fasthtml.common import Div, Span from universal_iiif_core.resolvers.discovery import archive_manifest_is_usable + from universal_iiif_core.resolvers.mag_parser import is_iccu_magparser_url, probe_magparser_url manifest_url = (manifest_url or "").strip() if not manifest_url: @@ -298,7 +305,10 @@ def probe_manifest(manifest_url: str, result_id: str = ""): id=f"probe-{result_id}", ) - ok = archive_manifest_is_usable(manifest_url) + if is_iccu_magparser_url(manifest_url): + ok = probe_magparser_url(manifest_url) + else: + ok = archive_manifest_is_usable(manifest_url) if ok: return Div( Span("✓ Manifesto disponibile", cls="text-xs text-emerald-600 dark:text-emerald-400 font-medium"), @@ -311,14 +321,41 @@ def probe_manifest(manifest_url: str, result_id: str = ""): # Providers whose external API supports page/offset pagination. -_PAGINATABLE_STRATEGIES = frozenset({"archive_org", "loc", "harvard", "gallica"}) +_PAGINATABLE_STRATEGIES = frozenset({"archive_org", "loc", "harvard", "gallica", "internetculturale"}) def _provider_supports_pagination(provider) -> bool: return (provider.search_strategy or "") in _PAGINATABLE_STRATEGIES -def load_more_results(library: str, shelfmark: str, page: int = 2, gallica_type: str = "all"): +def _compute_has_more(provider, results: list, current_page: int, max_results: int) -> bool: + """Return True when the provider exposes more pages beyond ``current_page``. 
+ + Prefers a server-reported ``_search_total_pages`` when the search handler + populated it (ICCU, and future providers that can extract an authoritative + count). Falls back to the heuristic ``len >= max_results`` when no upstream + total is available. + """ + if not _provider_supports_pagination(provider) or not results: + return False + first_raw = results[0].get("raw") if isinstance(results[0], dict) else None + if isinstance(first_raw, dict): + try: + total_pages = int(first_raw.get("_search_total_pages") or 0) + except (TypeError, ValueError): + total_pages = 0 + if total_pages > 0: + return int(current_page) < total_pages + return len(results) >= int(max_results) + + +def load_more_results( + library: str, + shelfmark: str, + page: int = 2, + gallica_type: str = "all", + ic_type: str = "all", +): """Return the next page of search results as an HTMX fragment.""" from studio_ui.components.discovery import render_load_more_fragment @@ -333,13 +370,18 @@ def load_more_results(library: str, shelfmark: str, page: int = 2, gallica_type: resolution = resolve_provider_input( library, shelfmark, - filters={"gallica_type": gallica_type, "max_results": max_results, "page": page}, + filters={ + "gallica_type": gallica_type, + "ic_type": ic_type, + "max_results": max_results, + "page": page, + }, ) if resolution.status != "results" or not resolution.results: return render_load_more_fragment([], has_more=False) - has_more = _provider_supports_pagination(resolution.provider) and len(resolution.results) >= max_results + has_more = _compute_has_more(resolution.provider, resolution.results, page, max_results) return render_load_more_fragment( resolution.results, has_more=has_more, @@ -347,6 +389,7 @@ def load_more_results(library: str, shelfmark: str, page: int = 2, gallica_type: "library": library, "shelfmark": shelfmark, "gallica_type": gallica_type, + "ic_type": ic_type, "page": page, }, ) @@ -379,7 +422,7 @@ def add_to_library(manifest_url: str, doc_id: str, library: str, 
result_title: s description=str(info.get("description") or ""), pages=int(info.get("pages", 0) or 0), thumbnail_url=str(info.get("thumbnail") or ""), - get_json_fn=get_http_client().get_json, + get_json_fn=fetch_manifest_dict, ) upsert_saved_entry( manifest_url, @@ -451,7 +494,7 @@ def add_and_download(manifest_url: str, doc_id: str, library: str, result_title: description=str(info.get("description") or ""), pages=int(info.get("pages", 0) or 0), thumbnail_url=str(info.get("thumbnail") or ""), - get_json_fn=get_http_client().get_json, + get_json_fn=fetch_manifest_dict, ) upsert_saved_entry( manifest_url, @@ -707,3 +750,27 @@ def pdf_capability(manifest_url: str): return render_pdf_capability_badge(has_pdf) except Exception: return render_pdf_capability_badge(False) + + +def serve_iccu_manifest(url: str): + """Serve the ICCU MAG document as a IIIF v2 JSON manifest (Mirador-friendly).""" + import json as _json + + from fasthtml.common import Response + + from universal_iiif_core.resolvers.mag_parser import fetch_and_convert, is_iccu_magparser_url + + clean = unquote(url or "").strip() + if not clean or not is_iccu_magparser_url(clean): + return Response("{}", status_code=400, media_type="application/json") + try: + manifest = fetch_and_convert(clean) + except Exception: + logger.exception("ICCU manifest proxy failed for %s", clean) + return Response("{}", status_code=502, media_type="application/json") + body = _json.dumps(manifest, ensure_ascii=False) + return Response( + body, + media_type="application/json", + headers={"Access-Control-Allow-Origin": "*", "Cache-Control": "public, max-age=300"}, + ) diff --git a/src/studio_ui/routes/discovery_helpers.py b/src/studio_ui/routes/discovery_helpers.py index 774831e..2c2e8ac 100644 --- a/src/studio_ui/routes/discovery_helpers.py +++ b/src/studio_ui/routes/discovery_helpers.py @@ -11,12 +11,12 @@ from urllib.parse import unquote from universal_iiif_core.config_manager import get_config_manager -from 
universal_iiif_core.http_client import get_http_client from universal_iiif_core.jobs import job_manager from universal_iiif_core.library_catalog import parse_manifest_catalog from universal_iiif_core.logger import get_logger from universal_iiif_core.logic.downloader import IIIFDownloader from universal_iiif_core.network_policy import resolve_library_network_policy +from universal_iiif_core.resolvers.manifest_fetch import fetch_manifest_dict from universal_iiif_core.resolvers.parsers import IIIFManifestParser from universal_iiif_core.services.storage.vault_manager import VaultManager from universal_iiif_core.utils import generate_job_id @@ -37,7 +37,7 @@ def analyze_manifest(manifest_url: str) -> dict[str, Any]: Returns a dict with keys: label, description, pages. Raises exceptions on network / parsing errors so callers can handle them. """ - manifest_data = get_http_client().get_json(manifest_url) + manifest_data = fetch_manifest_dict(manifest_url) if not manifest_data: raise ValueError("Manifest vuoto o irraggiungibile") diff --git a/src/studio_ui/routes/library_handlers.py b/src/studio_ui/routes/library_handlers.py index 134f518..e198077 100644 --- a/src/studio_ui/routes/library_handlers.py +++ b/src/studio_ui/routes/library_handlers.py @@ -22,9 +22,9 @@ _safe_catalog_title, ) from universal_iiif_core.config_manager import get_config_manager -from universal_iiif_core.http_client import get_http_client from universal_iiif_core.library_catalog import infer_item_type, normalize_item_type, parse_manifest_catalog from universal_iiif_core.logger import get_logger +from universal_iiif_core.resolvers.manifest_fetch import fetch_manifest_dict from universal_iiif_core.services.ocr.storage import OCRStorage from universal_iiif_core.services.storage.vault_manager import VaultManager @@ -105,7 +105,7 @@ def _refresh_response(*, message: str, tone: str = "info", **filters): def _update_catalog_metadata(doc_id: str, manifest_url: str) -> dict: - manifest = 
get_http_client().get_json(manifest_url) + manifest = fetch_manifest_dict(manifest_url) if not manifest: raise ValueError("Manifest non accessibile") catalog = parse_manifest_catalog( diff --git a/src/universal_iiif_core/logic/downloader.py b/src/universal_iiif_core/logic/downloader.py index bffe79c..e74ce7a 100644 --- a/src/universal_iiif_core/logic/downloader.py +++ b/src/universal_iiif_core/logic/downloader.py @@ -24,6 +24,7 @@ from ..logger import get_download_logger from ..network_policy import resolve_library_network_policy from ..pdf_utils import convert_pdf_to_images # noqa: F401 - preserved for monkeypatch compatibility in tests +from ..resolvers.mag_parser import fetch_and_convert, is_iccu_magparser_url from ..services.storage.vault_manager import VaultManager from ..utils import DEFAULT_HEADERS, ensure_dir, save_json from .download_helpers import derive_identifier @@ -102,22 +103,44 @@ def __init__(self, downloader: IIIFDownloader, canvas: dict[str, Any], index: in self.final_filename = Path(scans_dir) / f"pag_{index:04d}.jpg" self.cm = downloader.cm self.base_url = CanvasServiceLocator.locate(canvas) + self.direct_image_url = None if self.base_url else self._locate_direct_image_url(canvas) + + @staticmethod + def _locate_direct_image_url(canvas: Any) -> str | None: + """Return a direct image URL from canvas.images[*].resource.@id (non-IIIF).""" + if not isinstance(canvas, dict): + return None + for image in canvas.get("images") or []: + resource = image.get("resource") if isinstance(image, dict) else None + if not isinstance(resource, dict): + continue + url = resource.get("@id") or resource.get("id") + if isinstance(url, str) and url.startswith(("http://", "https://")): + return url + return None def fetch(self, should_cancel: Callable[[], bool] | None = None) -> tuple[str, dict[str, Any]] | None: """Download (or resume) the requested canvas.""" if should_cancel and should_cancel(): return None + if not self.base_url and self.direct_image_url: + return 
self._fetch_direct(should_cancel=should_cancel) if not self.base_url: return None - base_url = self.base_url if should_cancel and should_cancel(): return None if resumed := self.resume_cached(): return resumed + return self._fetch_iiif(should_cancel=should_cancel) + + def _fetch_iiif(self, should_cancel: Callable[[], bool] | None = None) -> tuple[str, dict[str, Any]] | None: + """Download a canvas backed by a IIIF image service.""" if should_cancel and should_cancel(): return None + base_url = self.base_url + assert base_url is not None iiif_q = str(self.cm.get_setting("images.iiif_quality", "default") or "default") stitch_mode = self._get_stitch_mode() if stitch_mode != "stitch_only": @@ -143,6 +166,26 @@ def fetch(self, should_cancel: Callable[[], bool] | None = None) -> tuple[str, d self.cm, self.filename, base_url, self.canvas, self.index, iiif_q, should_cancel=should_cancel ) + def _fetch_direct(self, should_cancel: Callable[[], bool] | None = None) -> tuple[str, dict[str, Any]] | None: + """Download a canvas whose image resource is a direct URL (no IIIF service).""" + direct_url = self.direct_image_url + if not direct_url: + return None + if should_cancel and should_cancel(): + return None + if not bool(getattr(self.downloader, "force_redownload", False)): + resumed = self._resume_existing_scan(direct_url) + if resumed: + return resumed + return self.downloader._download_with_retries( + [direct_url], + self.filename, + self.canvas, + self.index, + direct_url, + should_cancel=should_cancel, + ) + def _get_stitch_mode(self) -> str: return normalize_stitch_mode( getattr(self.downloader, "stitch_mode", "") @@ -210,9 +253,10 @@ def resume_cached(self) -> tuple[str, dict[str, Any]] | None: """Expose resume logic so callers can avoid a full download.""" if bool(getattr(self.downloader, "force_redownload", False)): return None - if not self.base_url: + source = self.base_url or self.direct_image_url + if not source: return None - return 
self._resume_existing_scan(self.base_url) + return self._resume_existing_scan(source) class IIIFDownloader: @@ -256,7 +300,13 @@ def __init__( ) # load manifest and derive human label (for display, NOT for storage) - self.manifest: dict[str, Any] = get_http_client().get_json(manifest_url) or {} + if is_iccu_magparser_url(manifest_url): + self.manifest = fetch_and_convert(manifest_url) + else: + self.manifest = get_http_client().get_json(manifest_url) or {} + # ICCU records often declare more pages than are actually served — + # finalize whatever pages responded instead of discarding them. + self.allow_partial_finalize = bool(self.manifest.get("_iccu")) self.label = self.manifest.get("label", "unknown_manuscript") if isinstance(self.label, list): self.label = self.label[0] if self.label else "unknown_manuscript" diff --git a/src/universal_iiif_core/logic/downloader_runtime.py b/src/universal_iiif_core/logic/downloader_runtime.py index 2d894e0..a4be4ad 100644 --- a/src/universal_iiif_core/logic/downloader_runtime.py +++ b/src/universal_iiif_core/logic/downloader_runtime.py @@ -331,9 +331,16 @@ def _finalize_downloads(self, valid): known_pages = _page_numbers_in_dir(self.scans_dir) | validated_pages expected_pages = set(range(1, total_expected + 1)) if total_expected > 0 else set() allow_partial_overwrite = bool(getattr(self, "overwrite_existing_scans", False) and validated_pages) - - # Keep staged files in temp until the full manuscript is available. - if total_expected > 0 and not expected_pages.issubset(known_pages) and not allow_partial_overwrite: + allow_partial_finalize = bool(getattr(self, "allow_partial_finalize", False) and validated_pages) + + # Keep staged files in temp until the full manuscript is available, + # unless the provider is known to declare more pages than are served. 
+ if ( + total_expected > 0 + and not expected_pages.issubset(known_pages) + and not allow_partial_overwrite + and not allow_partial_finalize + ): return [] for staged_file in sorted(set(validated_staged)): diff --git a/src/universal_iiif_core/network_policy.py b/src/universal_iiif_core/network_policy.py index d0695f3..565dc9b 100644 --- a/src/universal_iiif_core/network_policy.py +++ b/src/universal_iiif_core/network_policy.py @@ -12,6 +12,10 @@ "bodleian (oxford)": "bodleian", "institut de france": "institut_de_france", "institut de france (bibnum)": "institut_de_france", + "internet culturale": "internet_culturale", + "internet culturale (iccu)": "internet_culturale", + "iccu": "internet_culturale", + "internetculturale": "internet_culturale", "unknown": "unknown", } @@ -20,6 +24,7 @@ "vaticana", "bodleian", "institut_de_france", + "internet_culturale", "unknown", ) @@ -123,6 +128,25 @@ "send_referer_header": True, "send_origin_header": False, }, + "internet_culturale": { + "enabled": True, + "use_custom_policy": True, + "workers_per_job": 2, + "per_host_concurrency": 2, + "min_delay_s": 1.0, + "max_delay_s": 3.0, + "retry_max_attempts": 4, + "backoff_base_s": 15.0, + "backoff_cap_s": 300.0, + "cooldown_on_403_s": 300, + "cooldown_on_429_s": 300, + "burst_window_s": 60, + "burst_max_requests": 40, + "respect_retry_after": True, + "prewarm_viewer": False, + "send_referer_header": True, + "send_origin_header": False, + }, "unknown": { "enabled": True, "use_custom_policy": False, diff --git a/src/universal_iiif_core/providers.py b/src/universal_iiif_core/providers.py index 9fe8bb8..8a9d403 100644 --- a/src/universal_iiif_core/providers.py +++ b/src/universal_iiif_core/providers.py @@ -15,6 +15,7 @@ from universal_iiif_core.resolvers.harvard import HarvardResolver from universal_iiif_core.resolvers.heidelberg import HeidelbergResolver from universal_iiif_core.resolvers.institut import InstitutResolver +from universal_iiif_core.resolvers.internetculturale import 
InternetCulturaleResolver from universal_iiif_core.resolvers.loc import LOCResolver from universal_iiif_core.resolvers.models import SearchResult from universal_iiif_core.resolvers.oxford import OxfordResolver @@ -31,6 +32,7 @@ "harvard", "heidelberg", "institut", + "internetculturale", "loc", "vatican", ] @@ -95,8 +97,56 @@ def supports_direct_resolution(self) -> bool: ), ) +_IC_FILTER = ProviderFilter( + key="ic_type", + label="Tipo (Internet Culturale [BETA])", + options=( + ProviderFilterOption("Tutti i materiali", "all"), + ProviderFilterOption("Solo manoscritti", "Manoscritto"), + ProviderFilterOption("Solo libri a stampa", "Libro moderno"), + ProviderFilterOption("Solo musica", "Musica"), + ProviderFilterOption("Solo fotografie", "Fotografia"), + ), +) PROVIDERS: tuple[IIIFProvider, ...] = ( + IIIFProvider( + key="Internet Culturale", + label="Internet Culturale (ICCU) [BETA]", + aliases=( + "internet culturale", + "internet culturale (iccu)", + "internet culturale (iccu) [beta]", + "iccu", + "internetculturale", + "bml", + "laurenziana", + "marciana", + "bncf", + "bncr", + "manoscritti italiani", + ), + resolver_cls=InternetCulturaleResolver, + search_strategy="internetculturale", + search_fn="search_internetculturale", + search_mode="search_first", + filters=(_IC_FILTER,), + not_found_hint=( + "[BETA] Incolla un URL internetculturale.it/it/16/search/viewresource?id=oai:...&teca=... " + "oppure cerca per titolo, autore o segnatura. Alcuni record hanno solo parte delle pagine " + "digitalizzate effettivamente disponibili." + ), + placeholder="es. Pluteus 40.26 oppure Dante Commedia", + sort_order=98, + metadata={ + "helper_text": ( + "[BETA] Integrazione sperimentale. ICCU aggrega ~50 biblioteche italiane " + "(Laurenziana, Marciana, BNCF, BNCR, Estense, altre): la ricerca può dare migliaia di risultati, " + "ma molti record sono incompleti upstream e la qualità immagini è variabile. " + "Usa 'Carica altri' per scorrere le pagine." 
+ ), + }, + ), IIIFProvider( key="Vaticana", label="Vaticana (BAV)", @@ -349,6 +399,20 @@ def _adapter(query: str, payload: dict[str, Any]) -> list[SearchResult]: return _adapter +def _make_ic_adapter(fn: Callable[..., list[SearchResult]]) -> SearchHandlerFn: + """Wrap the IC search fn, forwarding ic_type_filter from payload.""" + + def _adapter(query: str, payload: dict[str, Any]) -> list[SearchResult]: + return fn( + query, + _max_results_from_payload(payload), + _page_from_payload(payload), + ic_type_filter=str(payload.get("ic_type") or "all"), + ) + + return _adapter + + _search_handlers_cache: types.MappingProxyType[str, SearchHandlerFn] | None = None @@ -381,6 +445,8 @@ def get_search_handlers() -> types.MappingProxyType[str, SearchHandlerFn]: continue if provider.search_strategy == "gallica": handlers[provider.search_strategy] = _make_gallica_adapter(raw_fn) + elif provider.search_strategy == "internetculturale": + handlers[provider.search_strategy] = _make_ic_adapter(raw_fn) else: handlers[provider.search_strategy] = _make_standard_adapter(raw_fn) diff --git a/src/universal_iiif_core/resolvers/discovery.py b/src/universal_iiif_core/resolvers/discovery.py index 4bbecce..2f844d9 100644 --- a/src/universal_iiif_core/resolvers/discovery.py +++ b/src/universal_iiif_core/resolvers/discovery.py @@ -31,6 +31,7 @@ search_harvard, search_heidelberg, search_institut, + search_internetculturale, search_loc, search_vatican, ) @@ -171,6 +172,7 @@ def get_manifest_details(manifest_url: str) -> SearchResult | None: "search_harvard", "search_heidelberg", "search_institut", + "search_internetculturale", "search_loc", "search_vatican", "smart_search", diff --git a/src/universal_iiif_core/resolvers/internetculturale.py b/src/universal_iiif_core/resolvers/internetculturale.py new file mode 100644 index 0000000..8d625cc --- /dev/null +++ b/src/universal_iiif_core/resolvers/internetculturale.py @@ -0,0 +1,113 @@ +"""Resolver for Internet Culturale (ICCU) — Italian national 
digital library aggregator. + +Covers: + - Biblioteca Medicea Laurenziana (Firenze) + - Biblioteca Nazionale Marciana (Venezia) + - BNCF (Firenze), BNCR (Roma) + - ~50+ partner institutions via ICCU + +Accepted inputs: + - Full IC viewer URL: https://www.internetculturale.it/it/16/search/viewresource?id=oai:...&teca=... + - Magparser URL: https://www.internetculturale.it/jmms/magparser?id=...&teca=... + - OAI ID + teca: oai:193.206.197.121:18:VE0049:CNMD0000299115 (requires teca separately) + - Raw OAI string: "oai:teca.bmlonline.it:21:XXXX:Plutei:..." + +When only an OAI ID is provided (no teca), the resolver attempts to infer the teca +from the known OAI prefix → teca mapping table. +""" + +from __future__ import annotations + +import re + +from .base import BaseResolver +from .mag_parser import build_magparser_url, extract_oai_and_teca_from_url + +# Known OAI prefix → teca mappings (discoverable from IC search results). +# Key: substring that uniquely identifies the OAI host/path prefix. +# Value: teca identifier used in IC API calls. 
+_OAI_PREFIX_TO_TECA: dict[str, str] = { + "193.206.197.121:18:VE0049": "marciana", # Biblioteca Nazionale Marciana + "teca.bmlonline.it": "Laurenziana - FI", # Biblioteca Medicea Laurenziana + "oai.bmlonline.it": "Laurenziana - FI", + "www.internetculturale.sbn.it/Teca": "MagTeca - ICCU", # Generic ICCU MagTeca + "www.internetculturale.sbn.it": "MagTeca - ICCU", +} + +_OAI_PREFIX_RE = re.compile(r"^oai:", re.IGNORECASE) +_IC_DOMAIN = "internetculturale.it" + + +def _infer_teca(oai_id: str) -> str | None: + """Infer the teca identifier from the OAI ID prefix.""" + for prefix, teca in _OAI_PREFIX_TO_TECA.items(): + if prefix in oai_id: + return teca + return None + + +class InternetCulturaleResolver(BaseResolver): + """Resolver for Internet Culturale (ICCU) MAG-based digital collections.""" + + def can_resolve(self, url_or_id: str) -> bool: + """Return True when the input is an IC URL or a known OAI identifier.""" + s = (url_or_id or "").strip() + if not s: + return False + if _IC_DOMAIN in s: + return True + if _OAI_PREFIX_RE.match(s): + return bool(_infer_teca(s)) + return False + + def get_manifest_url(self, url_or_id: str) -> tuple[str | None, str | None]: + """Return (magparser_url, doc_id) for the given input. + + The magparser URL is used by IccuMagParser to fetch and convert the + MAG XML document into a IIIF v2 manifest in the download pipeline. + """ + s = (url_or_id or "").strip() + if not s: + return None, None + + oai_id, teca = extract_oai_and_teca_from_url(s) + + # If teca not in URL, try to infer from OAI ID + if oai_id and not teca: + teca = _infer_teca(oai_id) + + if not oai_id or not teca: + return None, None + + manifest_url = build_magparser_url(oai_id, teca) + doc_id = _make_doc_id(oai_id) + return manifest_url, doc_id + + +def _make_doc_id(oai_id: str) -> str: + """Build a short, filesystem-safe doc_id from an OAI identifier. 
+ + Example: + "oai:193.206.197.121:18:VE0049:CNMD0000299115" → "VE0049_CNMD0000299115" + "oai:teca.bmlonline.it:21:XXXX:Plutei:IT:FI0100_Plutei_40.26_0004" → "bml_Plutei_40.26_0004" + """ + # Strip leading "oai:" prefix + stripped = re.sub(r"^oai:", "", oai_id, flags=re.IGNORECASE) + + # BML pattern: teca.bmlonline.it:21:XXXX:Plutei:IT%3AFI0100_Plutei_40.26_0004 + if "bmlonline" in stripped: + m = re.search(r"(?:Plutei|Ashburn|Acq|Conv|Conv_Soppr)[^:]*(?::[^:]+)?$", stripped, re.IGNORECASE) + if m: + return "bml_" + re.sub(r"[^a-zA-Z0-9._-]", "_", m.group(0))[:50] + return "bml_" + re.sub(r"[^a-zA-Z0-9._-]", "_", stripped[-30:]) + + # Marciana / SBN pattern: 193.206.197.121:18:VE0049:CNMD0000299115 + parts = stripped.split(":") + if len(parts) >= 2: + sbn_part = next((p for p in parts if re.match(r"[A-Z]{2}\d{4}", p)), None) + last_part = parts[-1] + if sbn_part and last_part != sbn_part: + return f"{sbn_part}_{last_part}"[:60] + return last_part[:60] + + return re.sub(r"[^a-zA-Z0-9._-]", "_", stripped)[:60] diff --git a/src/universal_iiif_core/resolvers/mag_parser.py b/src/universal_iiif_core/resolvers/mag_parser.py new file mode 100644 index 0000000..3016aea --- /dev/null +++ b/src/universal_iiif_core/resolvers/mag_parser.py @@ -0,0 +1,466 @@ +"""Parser for Internet Culturale MAG/XML API. + +Converts the ICCU MAG XML format (from /jmms/magparser) into a IIIF v2-compatible +manifest dict so the existing downloader pipeline can handle it without changes. + +MAG = Metadati Amministrativi e Gestionali (ICCU standard for Italian digital libraries).
+ +Image URL pattern (verified live): + GET /jmms/thumbnail?type=normal&id={oai_id}&teca={teca}&page={1-based-n} +""" + +from __future__ import annotations + +import re +import xml.etree.ElementTree as ET +from dataclasses import dataclass, field +from typing import Any +from urllib.parse import parse_qs, urlencode, urlparse + +import defusedxml.ElementTree as SafeET +import requests + +from ..logger import get_logger + +logger = get_logger(__name__) + +_IC_BASE = "https://www.internetculturale.it" +_MAGPARSER_PATH = "/jmms/magparser" +_THUMBNAIL_PATH = "/jmms/thumbnail" +_VIEWER_PATH = "/jmms/iccuviewer/iccu.jsp" + +# Max pages to request in a single call; covers virtually all manuscripts. +_MAX_PAGES_PER_CALL = 2000 + +# Namespace used in MAG XML +_MAG_NS = {"mag": "urn:meta:internetculturale"} + +# Pattern: "Biblioteca XYZ - Città - IT-XX0000" +_LOCALIZATION_RE = re.compile(r"^(?P<library>.+?)\s+-\s+(?P<city>[^-]+?)\s+-\s+(?P<sbn>IT-[A-Z]{2}\d+)\s*$") + +# Pattern for BML-style shelfmark: IT:FI0100_Plutei_40.26_0004 +_BML_IDENT_RE = re.compile(r"^[A-Z]{2}:[A-Z]{2}\d+_(?P<shelf>.+?)(?:_\d+)?$") + +# Pattern for SBN-prefixed shelfmark: VE0049_It_09_0127_06278 +_SBN_SHELF_RE = re.compile(r"^[A-Z]{2}\d{4}_(?P<shelf>.+)$") + + +@dataclass +class IccuMetadata: + """Parsed bibliographic metadata from a MAG XML document.""" + + title: str = "" + authors: list[str] = field(default_factory=list) + date: str = "" + library: str = "" + city: str = "" + sbn_code: str = "" + shelfmark: str = "" + oai_id: str = "" + teca: str = "" + raw_identificativo: list[str] = field(default_factory=list) + page_count: int = 0 + + @property + def library_label(self) -> str: + """Human-readable library + city label.""" + if self.city and self.city not in self.library: + return f"{self.library} ({self.city})" + return self.library + + @property + def full_reference(self) -> str: + """Full manuscript reference for display.""" + parts = [self.library_label] + if self.shelfmark: + parts.append(self.shelfmark) + return ",
".join(parts) + + +def build_magparser_url(oai_id: str, teca: str, max_pages: int = _MAX_PAGES_PER_CALL) -> str: + """Build the magparser API URL for a given OAI ID and teca identifier.""" + params = urlencode( + { + "id": oai_id, + "teca": teca, + "mode": "all", + "offset": "0", + "pag": str(max_pages), + } + ) + return f"{_IC_BASE}{_MAGPARSER_PATH}?{params}" + + +def build_viewer_url(oai_id: str, teca: str) -> str: + """Build the public Internet Culturale JSP viewer URL for a document.""" + params = urlencode({"id": oai_id, "mode": "all", "teca": teca}) + return f"{_IC_BASE}{_VIEWER_PATH}?{params}" + + +def build_thumbnail_url(oai_id: str, teca: str, page_1based: int, quality: str = "normal") -> str: + """Build the image URL for a specific page via the IC thumbnail endpoint. + + Args: + oai_id: OAI identifier of the document. + teca: Teca identifier (provider ID within IC). + page_1based: Page number, 1-based (page 1 = first image). + quality: "normal" (full-res), "preview" (medium), "web" (small). + """ + params = urlencode( + { + "type": quality, + "id": oai_id, + "teca": teca, + "page": str(page_1based), + } + ) + return f"{_IC_BASE}{_THUMBNAIL_PATH}?{params}" + + +def extract_oai_and_teca_from_url(url: str) -> tuple[str | None, str | None]: + """Extract OAI ID and teca from a magparser or IC viewer URL. + + Handles: + - magparser URLs: /jmms/magparser?id={oai}&teca={teca} + - IC viewer URLs: /it/16/search/viewresource?id={oai}&teca={teca} + - Raw OAI IDs passed directly + """ + if url.lower().startswith("oai:") and "?" not in url: + return url, None + + parsed = urlparse(url) + if not parsed.scheme and "?" 
not in url: + # Raw non-OAI identifier — no teca extractable + return url, None + + qs = parse_qs(parsed.query) + oai_id = (qs.get("id") or [None])[0] + teca = (qs.get("teca") or [None])[0] + if not oai_id: + teca_val = (qs.get("descSourceLevel2") or [None])[0] + if teca_val: + teca = teca_val + return oai_id, teca + + +def _parse_localization(raw: str) -> tuple[str, str, str]: + """Parse a localization string into (library, city, sbn_code). + + Example: "Biblioteca Medicea Laurenziana - Firenze - IT-FI0100" + Returns: ("Biblioteca Medicea Laurenziana", "Firenze", "IT-FI0100") + """ + m = _LOCALIZATION_RE.match(raw.strip()) + if m: + return m.group("library").strip(), m.group("city").strip(), m.group("sbn").strip() + # Fallback: split on " - " + parts = [p.strip() for p in raw.split(" - ")] + library = parts[0] if parts else raw + city = parts[1] if len(parts) > 1 else "" + sbn = parts[2] if len(parts) > 2 else "" + return library, city, sbn + + +def _extract_shelfmark_from_title(title: str, library: str) -> str | None: + """Extract shelfmark from titles like 'Venezia, Biblioteca ..., It. IX 127 (=6278)'.""" + if not title or not library: + return None + # Check if title contains the library name + lib_short = library.split()[0] if library else "" + if lib_short and lib_short.lower() in title.lower(): + # Find the last comma — everything after is the shelfmark + parts = title.rsplit(",", 1) + if len(parts) == 2: + candidate = parts[1].strip() + # Reject if too long (likely not a shelfmark) + if 2 <= len(candidate) <= 80: + return candidate + return None + + +def _extract_shelfmark_from_identificativo(identificativi: list[str], sbn_code: str) -> str | None: + """Try to extract a human-readable shelfmark from raw ICCU identifiers. 
+ + Handles patterns like: + - "IT:FI0100_Plutei_40.26_0004" → "Plutei 40.26" + - "VE0049_It_09_0127_06278" → raw (complex decode needed) + - "CNMD0000299115 VE0049_It_09_0127_06278 ARM0000580" → try the SBN-prefixed part + """ + sbn_short = sbn_code.replace("IT-", "") if sbn_code else "" + + for raw in identificativi: + # Multi-token: split and try each + tokens = raw.split() if " " in raw else [raw] + for token in tokens: + # BML-style: IT:FI0100_Plutei_40.26_0004 + m = _BML_IDENT_RE.match(token) + if m: + shelf = m.group("shelf").replace("_", " ") + return shelf + + # SBN-prefixed: VE0049_It_09_0127_06278, only if it matches our institution + if sbn_short and token.startswith(sbn_short + "_"): + m2 = _SBN_SHELF_RE.match(token) + if m2: + return m2.group("shelf").replace("_", " ") + + return None + + +def _apply_info_field(meta: IccuMetadata, key: str, values: list[str]) -> None: + """Apply a single MAG key/values pair to a metadata object.""" + if key == "Titolo": + meta.title = values[0] + elif key == "Autore": + meta.authors = values + elif key == "Data di pubblicazione": + meta.date = values[0] + elif key == "Localizzazione": + meta.library, meta.city, meta.sbn_code = _parse_localization(values[0]) + elif key == "Identificativo": + meta.raw_identificativo = values + + +def _parse_bibinfo(bibinfo: ET.Element) -> IccuMetadata: + """Extract metadata fields from a MAG <bibinfo> element.""" + meta = IccuMetadata() + + # OAI ID and teca + tecaid_el = bibinfo.find("tecaid") + if tecaid_el is not None and tecaid_el.text: + meta.oai_id = tecaid_el.text.strip() + + provider_el = bibinfo.find("providerid") + if provider_el is not None and provider_el.text: + meta.teca = provider_el.text.strip() + + for info in bibinfo.findall("infos/info"): + key = (info.get("key") or "").strip() + values = [v.text.strip() for v in info.findall("value") if v.text] + if values: + _apply_info_field(meta, key, values) + + shelfmark = _extract_shelfmark_from_title(meta.title, meta.library) + if not
shelfmark: + shelfmark = _extract_shelfmark_from_identificativo(meta.raw_identificativo, meta.sbn_code) + meta.shelfmark = shelfmark or "" + + return meta + + +def _build_iiif_v2_manifest(meta: IccuMetadata, pages: list[dict[str, Any]]) -> dict[str, Any]: + """Assemble a IIIF Presentation v2 manifest dict from ICCU metadata and pages.""" + manifest_id = build_magparser_url(meta.oai_id, meta.teca) + + iiif_metadata = [ + {"label": "Titolo", "value": meta.title}, + {"label": "Biblioteca", "value": meta.library}, + {"label": "Città", "value": meta.city}, + {"label": "Codice SBN", "value": meta.sbn_code}, + {"label": "Segnatura", "value": meta.shelfmark}, + {"label": "Data", "value": meta.date}, + {"label": "OAI ID", "value": meta.oai_id}, + {"label": "Provider ICCU", "value": meta.teca}, + ] + if meta.authors: + iiif_metadata.append({"label": "Autore", "value": "; ".join(meta.authors)}) + + canvases = [] + for page in pages: + idx = page["idx"] + label = page.get("name") or f"Pagina {idx + 1}" + w = page.get("w", 1000) + h = page.get("h", 1000) + src = str(page.get("src") or "").strip() + if src: + image_url = f"{_IC_BASE}/jmms/{src.lstrip('/')}" + else: + image_url = build_thumbnail_url(meta.oai_id, meta.teca, idx + 1) + canvas_id = f"{manifest_id}/canvas/{idx}" + + canvases.append( + { + "@id": canvas_id, + "@type": "sc:Canvas", + "label": label, + "width": w, + "height": h, + "images": [ + { + "@type": "oa:Annotation", + "motivation": "sc:painting", + "resource": { + "@id": image_url, + "@type": "dctypes:Image", + "format": "image/jpeg", + "width": w, + "height": h, + }, + "on": canvas_id, + } + ], + } + ) + + viewer_url = build_viewer_url(meta.oai_id, meta.teca) + return { + "@context": "http://iiif.io/api/presentation/2/context.json", + "@type": "sc:Manifest", + "@id": manifest_id, + "label": meta.title or meta.full_reference or "Documento ICCU", + "attribution": f"Internet Culturale / ICCU — {meta.library_label}", + "metadata": [m for m in iiif_metadata if 
m["value"]], + "related": { + "@id": viewer_url, + "format": "text/html", + "label": "Apri su Internet Culturale", + }, + "_iccu": { + "oai_id": meta.oai_id, + "teca": meta.teca, + "library": meta.library, + "city": meta.city, + "sbn_code": meta.sbn_code, + "shelfmark": meta.shelfmark, + "viewer_url": viewer_url, + }, + "sequences": [ + { + "@type": "sc:Sequence", + "canvases": canvases, + } + ], + } + + +def parse_mag_xml(xml_bytes: bytes) -> dict[str, Any]: + """Parse MAG XML bytes and return a IIIF v2 manifest dict. + + Raises: + ValueError: if the XML is malformed or missing required structure. + """ + try: + root = SafeET.fromstring(xml_bytes) + except ET.ParseError as exc: + raise ValueError(f"MAG XML parse error: {exc}") from exc + + # Strip namespace for easier access + for el in root.iter(): + if "}" in el.tag: + el.tag = el.tag.split("}", 1)[1] + + bibinfo = root.find("bibinfo") + if bibinfo is None: + raise ValueError("MAG XML missing <bibinfo> element") + + meta = _parse_bibinfo(bibinfo) + + pages: list[dict[str, Any]] = [] + for media in root.findall("medias/media"): + for page in media.findall("pages/page"): + try: + pages.append( + { + "idx": int(page.get("idx", 0)), + "name": page.get("name", ""), + "w": int(page.get("w", 0)) or 1000, + "h": int(page.get("h", 0)) or 1000, + "src": page.get("src", ""), + } + ) + except (ValueError, TypeError): + continue + + pages.sort(key=lambda p: p["idx"]) + meta.page_count = len(pages) + + logger.debug("ICCU MAG parsed: library=%r shelfmark=%r pages=%d", meta.library, meta.shelfmark, meta.page_count) + + return _build_iiif_v2_manifest(meta, pages) + + +def fetch_and_convert(magparser_url: str, session: requests.Session | None = None) -> dict[str, Any]: + """Fetch a MAG XML document from Internet Culturale and convert to IIIF v2 manifest. + + Args: + magparser_url: Full URL to the IC magparser endpoint. + session: Optional requests session for connection reuse. + + Returns: + IIIF v2 manifest dict.
+ + Raises: + requests.RequestException: on network failure. + ValueError: on invalid XML or missing required fields. + """ + headers = { + "User-Agent": ( + "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " + "AppleWebKit/537.36 (KHTML, like Gecko) " + "Chrome/120.0.0.0 Safari/537.36" + ), + "Accept": "text/xml,application/xml,*/*", + "Referer": _IC_BASE, + } + + requester = session or requests + resp = requester.get(magparser_url, headers=headers, timeout=(10, 30)) + resp.raise_for_status() + + return parse_mag_xml(resp.content) + + +def is_iccu_magparser_url(url: str) -> bool: + """Return True if the URL points to the IC magparser endpoint.""" + return "internetculturale.it" in url and "magparser" in url + + +def probe_magparser_url(url: str, session: requests.Session | None = None) -> bool: + """Return True when the magparser URL returns a valid MAG XML document. + + Performs a lightweight GET (pag=1) and checks for a <bibinfo> element. Used + by the Discovery probe flow, which otherwise assumes JSON manifests.
+ """ + parsed = urlparse(url) + qs = parse_qs(parsed.query) + oai_id = (qs.get("id") or [None])[0] + teca = (qs.get("teca") or [None])[0] + if not oai_id or not teca: + return False + + probe_url = build_magparser_url(oai_id, teca, max_pages=1) + headers = { + "User-Agent": "Mozilla/5.0", + "Accept": "text/xml,application/xml,*/*", + "Referer": _IC_BASE, + } + requester = session or requests + try: + resp = requester.get(probe_url, headers=headers, timeout=(5, 8)) + resp.raise_for_status() + except requests.RequestException as exc: + logger.debug("ICCU magparser probe failed for %s: %s", url, exc) + return False + + try: + root = SafeET.fromstring(resp.content) + except ET.ParseError: + return False + + for el in root.iter(): + tag = el.tag.split("}", 1)[-1] if "}" in el.tag else el.tag + if tag == "bibinfo": + return True + return False + + +__all__ = [ + "IccuMetadata", + "build_magparser_url", + "build_thumbnail_url", + "build_viewer_url", + "extract_oai_and_teca_from_url", + "fetch_and_convert", + "is_iccu_magparser_url", + "parse_mag_xml", + "probe_magparser_url", +] diff --git a/src/universal_iiif_core/resolvers/manifest_fetch.py b/src/universal_iiif_core/resolvers/manifest_fetch.py new file mode 100644 index 0000000..bc81886 --- /dev/null +++ b/src/universal_iiif_core/resolvers/manifest_fetch.py @@ -0,0 +1,41 @@ +"""Centralized manifest-to-dict fetcher. + +Handles both native IIIF JSON manifests and the ICCU MAG/XML endpoint, which +must be converted to a IIIF v2 dict before UI/library code can inspect it. +""" + +from __future__ import annotations + +from typing import Any + +from ..http_client import get_http_client +from ..logger import get_logger +from .mag_parser import fetch_and_convert, is_iccu_magparser_url + +logger = get_logger(__name__) + + +def fetch_manifest_dict(url: str, **kwargs: Any) -> dict[str, Any] | None: + """Return a manifest as a dict regardless of the source format. + + - ICCU magparser URLs go through the MAG→IIIF v2 converter. 
+ - Everything else uses the shared HTTPClient JSON getter. + + Extra kwargs (e.g. ``retries``) are forwarded to ``get_json`` and ignored + for the MAG path, which has its own retry-free synchronous fetch. + """ + clean = str(url or "").strip() + if not clean: + return None + + if is_iccu_magparser_url(clean): + try: + return fetch_and_convert(clean) + except Exception as exc: + logger.debug("MAG→IIIF conversion failed for %s: %s", clean, exc) + return None + + return get_http_client().get_json(clean, **kwargs) + + +__all__ = ["fetch_manifest_dict"] diff --git a/src/universal_iiif_core/resolvers/search/__init__.py b/src/universal_iiif_core/resolvers/search/__init__.py index 3821f62..3505ecf 100644 --- a/src/universal_iiif_core/resolvers/search/__init__.py +++ b/src/universal_iiif_core/resolvers/search/__init__.py @@ -14,6 +14,7 @@ from .harvard import search_harvard from .heidelberg import search_heidelberg from .institut import search_institut +from .internetculturale import search_internetculturale from .loc import search_loc from .vatican import search_vatican @@ -28,6 +29,7 @@ "search_harvard", "search_heidelberg", "search_institut", + "search_internetculturale", "search_loc", "search_vatican", ] diff --git a/src/universal_iiif_core/resolvers/search/internetculturale.py b/src/universal_iiif_core/resolvers/search/internetculturale.py new file mode 100644 index 0000000..ff2acab --- /dev/null +++ b/src/universal_iiif_core/resolvers/search/internetculturale.py @@ -0,0 +1,243 @@ +"""Internet Culturale (ICCU) search via HTML scraping. + +Searches the IC manuscript catalog at: + https://www.internetculturale.it/it/16/search?q={query}&instance=magindice + &searchType=avanzato&channel__typeTipo=Manoscritto + +Returns SearchResult entries with full metadata including library, shelfmark, +OAI ID and teca — all required to resolve and download the document. 
+""" + +from __future__ import annotations + +import re +from html import unescape +from typing import Final +from urllib.parse import quote_plus, urlencode + +import requests + +from universal_iiif_core.logger import get_logger +from universal_iiif_core.resolvers.mag_parser import build_magparser_url, build_viewer_url +from universal_iiif_core.resolvers.models import SearchResult + +from ._common import DISCOVERY_TIMEOUT, REAL_BROWSER_HEADERS, get_search_http_client + +logger = get_logger(__name__) + +_IC_SEARCH_URL: Final = "https://www.internetculturale.it/it/16/search" + +# Extracts: id=oai%3A... and teca=... from viewresource href +_OAI_ID_RE = re.compile(r"id=(oai%3A[^&\"']+)", re.IGNORECASE) +_TECA_RE = re.compile(r"[?&]teca=([^&\"']+)", re.IGNORECASE) +_DESC_SOURCE_RE = re.compile(r"descSourceLevel2=([^&\"']+)", re.IGNORECASE) + +# Extract title from

following the image link +_TITLE_BLOCK_RE = re.compile( + r'viewresource\?[^"\']*?id=(oai%3A[^"\'&]+)[^"\']*?teca=([^"\'&]+)[^>]*?>.*?.*?' + r"]*>(.*?)

.*?" + r"Rilevanza:\s*([\d.]+)", + re.DOTALL | re.IGNORECASE, +) + +# Extract date/description from text block after h2 +_DATE_RE = re.compile(r"\[?\d{3,4}\]?(?:\s*[-–]\s*\[?\d{3,4}\]?)?|\d{4}\s*sec\.", re.IGNORECASE) + +# Matches "Biblioteca XYZ - descSourceLevel2=..." to get library from descSourceLevel2 +_DESC_LEVEL2_CLEAN_RE = re.compile(r"[+%]20|%2B", re.IGNORECASE) + + +def _decode_url_component(s: str) -> str: + """URL-decode a component, replacing %XX and + with their characters.""" + from urllib.parse import unquote_plus + + return unquote_plus(s) + + +def _clean_html(s: str) -> str: + """Strip HTML tags and unescape HTML entities.""" + s = re.sub(r"<[^>]+>", " ", s) + s = unescape(s) + return re.sub(r"\s+", " ", s).strip() + + +def _extract_date(text: str) -> str: + m = _DATE_RE.search(text) + return m.group(0).strip() if m else "" + + +def _parse_result_block(block: str) -> SearchResult | None: + """Parse a single `block-item-search-result` HTML block into a SearchResult.""" + # OAI ID: prefer dc_id span (raw, not URL-encoded); fallback to URL param + m_dc_id = re.search(r"]+dc_id[^>]*>(oai:[^<]+)", block, re.IGNORECASE) + if m_dc_id: + oai_id = m_dc_id.group(1).strip() + else: + m_oai = _OAI_ID_RE.search(block) + if not m_oai: + return None + oai_id = _decode_url_component(m_oai.group(1)) + + # teca from descSourceLevel2 URL param in the thumbnail img src + m_thumb_src = re.search(r'/jmms/thumbnail\?[^"\']*?teca=([^"\'& ]+)', block) + if m_thumb_src: + teca = _decode_url_component(m_thumb_src.group(1)) + else: + # fallback: descSourceLevel2 in viewresource URL + m_desc = _DESC_SOURCE_RE.search(block) + teca = _decode_url_component(m_desc.group(1)) if m_desc else "" + + if not oai_id or not teca: + return None + + # Title: h2.dc_title (distinct from h2.dc_creator which is the author) + m_title = re.search(r"]+dc_title[^>]*>(.*?)", block, re.DOTALL | re.IGNORECASE) + title = _clean_html(m_title.group(1)) if m_title else "" + + # Author: h2.dc_creator + 
m_creator = re.search(r"<h2[^>]+dc_creator[^>]*>(.*?)</h2>", block, re.DOTALL | re.IGNORECASE) + authors: list[str] = [] + if m_creator: + for m_a in re.finditer(r"]+>(.*?)", m_creator.group(1), re.IGNORECASE): + name = _clean_html(m_a.group(1)).strip(" ;") + if name: + authors.append(name) + + # Library: dc_descSourceLevel2 span + m_lib = re.search(r"<span[^>]+dc_descSourceLevel2[^>]*>(.*?)</span>", block, re.IGNORECASE) + library = _clean_html(m_lib.group(1)) if m_lib else teca.replace("+", " ") + + # Date: dc_issued span + m_date = re.search(r"<span[^>]+dc_issued[^>]*>(.*?)</span>", block, re.IGNORECASE) + date = _clean_html(m_date.group(1)) if m_date else "" + + # Material type: first dc_type span + m_type = re.search(r"<span[^>]+dc_type[^>]*>(.*?)</span>", block, re.IGNORECASE) + mat_type = _clean_html(m_type.group(1)) if m_type else "" + + # Thumbnail URL (reconstruct from src path using canonical format) + thumb_url = "" + if m_thumb_src: + thumb_url = ( + f"https://www.internetculturale.it/jmms/thumbnail" + f"?type=preview&id={quote_plus(oai_id)}&teca={quote_plus(teca)}" + ) + + manifest_url = build_magparser_url(oai_id, teca) + viewer_url = build_viewer_url(oai_id, teca) + + description = f"{mat_type} – {library}" if mat_type else library + + return SearchResult( + id=oai_id, + title=title, + author="; ".join(authors), + date=date, + description=description, + library=library, + thumbnail=thumb_url, + thumb=thumb_url, + manifest=manifest_url, + manifest_status="pending", + viewer_url=viewer_url, + raw={"oai_id": oai_id, "teca": teca, "type": mat_type}, + ) + + +_TOTAL_RESULTS_RE = re.compile(r"Pagina\s+\d+\s+di\s+(\d+)\s*\(\s*([\d.,]+)\s+risultati", re.IGNORECASE) + + +def _parse_total_results(html: str) -> tuple[int, int]: + """Return (total_pages, total_results) parsed from the IC search header, or (0, 0).""" + m = _TOTAL_RESULTS_RE.search(html) + if not m: + return 0, 0 + try: + total_pages = int(m.group(1)) + except ValueError: + total_pages = 0 + try: + total_results = int(m.group(2).replace(".", "").replace(",", "")) + except ValueError: + total_results = 0 + return total_pages, total_results + + +def _parse_search_html(html: str) -> list[SearchResult]: + """Parse IC search results page HTML into SearchResult list.""" + results: list[SearchResult] = [] + blocks = re.split(r"(?=]+block-item-search-result)", html) + for block in blocks: + result = _parse_result_block(block) + if result is not None: + results.append(result) + return results + + +def search_internetculturale( + query: str, + max_results: int = 20, + page: int = 1, + ic_type_filter: str = "all", +) -> list[SearchResult]: + """Search Internet Culturale digital catalog. + + Args: + query: Free-text search query. + max_results: Maximum results to return (the upstream page size is fixed at 20). + page: Result page (1-based). + ic_type_filter: Material type filter key: "all", "Manoscritto", "Libro moderno", etc. + "all" means no filter (search all material types). + + Returns: + List of SearchResult with library, OAI ID, teca, manifest URL.
+ """ + if not query or not query.strip(): + return [] + + params: dict[str, str] = { + "q": query.strip(), + "instance": "magindice", + "searchType": "avanzato", + } + if ic_type_filter and ic_type_filter != "all": + params["channel__typeTipo"] = ic_type_filter + + if page > 1: + params["pag"] = str(page) + + url = f"{_IC_SEARCH_URL}?{urlencode(params)}" + + try: + resp = get_search_http_client().get( + url, + headers=REAL_BROWSER_HEADERS, + timeout=DISCOVERY_TIMEOUT, + library_name="internetculturale", + ) + resp.raise_for_status() + except requests.RequestException as exc: + logger.error("Internet Culturale search failed for %r: %s", query, exc) + return [] + + results = _parse_search_html(resp.text) + total_pages, total_results = _parse_total_results(resp.text) + logger.debug( + "IC search %r → %d raw results (page %d of %d; total=%d)", + query, + len(results), + page, + total_pages, + total_results, + ) + + # Inject total counts into the first result's raw payload so the UI can + # display "X di Y risultati" without changing the shared search signature. + if results and total_results: + raw = dict(results[0].get("raw") or {}) + raw["_search_total_results"] = total_results + raw["_search_total_pages"] = total_pages + raw["_search_page"] = page + results[0]["raw"] = raw + + return results[:max_results] diff --git a/tests/fixtures/iccu_mag_sample.xml b/tests/fixtures/iccu_mag_sample.xml new file mode 100644 index 0000000..76028c7 --- /dev/null +++ b/tests/fixtures/iccu_mag_sample.xml @@ -0,0 +1,33 @@ + + + + oai:teca.bmlonline.it:21:XXXX:Plutei:IT%3AFI0100_Plutei_40.26_0003 + Laurenziana - FI + + + III. Capitolo di Jacopo Alighieri + + + Alighieri, Jacopo + + + 1350 ca. 
+ + + Biblioteca Medicea Laurenziana - Firenze - IT-FI0100 + + + IT:FI0100_Plutei_40.26_0003 + + + + + + + + + + + + + diff --git a/tests/test_cli_unit.py b/tests/test_cli_unit.py index 06ab13a..4e45bfc 100644 --- a/tests/test_cli_unit.py +++ b/tests/test_cli_unit.py @@ -10,6 +10,7 @@ # --- _status_icon --- + def test_status_icon_complete(): assert _status_icon("complete") == "✅" @@ -29,6 +30,7 @@ def test_status_icon_unknown_returns_circle(): # --- _build_parser --- + def test_build_parser_returns_parser(): parser = _build_parser() assert isinstance(parser, argparse.ArgumentParser) @@ -120,6 +122,7 @@ def test_build_parser_delete_job(): # --- _resolve_manifest --- + def test_resolve_manifest_uses_library_aware_resolver(monkeypatch): """CLI manifest resolution should preserve the detected library name.""" monkeypatch.setattr( @@ -151,6 +154,7 @@ def test_resolve_manifest_keeps_direct_manifest_urls(): # --- _handle_db_commands --- + def test_handle_db_commands_list(monkeypatch): mock_list = MagicMock() monkeypatch.setattr(cli, "_handle_list", mock_list) @@ -198,13 +202,21 @@ def test_handle_db_commands_no_command(): # --- _resolve_download_args --- + def test_resolve_download_args_with_url(): from universal_iiif_cli.cli import _resolve_download_args - args = _build_parser().parse_args([ - "http://x.com/m.json", "-w", "8", "--prefer-images", - "--ocr", "model.mlmodel", "--create-pdf", - ]) + args = _build_parser().parse_args( + [ + "http://x.com/m.json", + "-w", + "8", + "--prefer-images", + "--ocr", + "model.mlmodel", + "--create-pdf", + ] + ) result = _resolve_download_args(args) assert result == ("http://x.com/m.json", None, 8, False, True, "model.mlmodel", True) diff --git a/tests/test_downloader_pdf_unit.py b/tests/test_downloader_pdf_unit.py index dff36ea..a2b6e25 100644 --- a/tests/test_downloader_pdf_unit.py +++ b/tests/test_downloader_pdf_unit.py @@ -24,9 +24,7 @@ def _make_downloader_stub(tmp_path: Path, **overrides): pdf_dir.mkdir(parents=True, 
exist_ok=True) output_path = tmp_path / "output.pdf" - cm = SimpleNamespace( - get_setting=lambda key, default=None: overrides.get(f"setting.{key}", default) - ) + cm = SimpleNamespace(get_setting=lambda key, default=None: overrides.get(f"setting.{key}", default)) stub = SimpleNamespace( scans_dir=scans_dir, diff --git a/tests/test_downloader_runtime_unit.py b/tests/test_downloader_runtime_unit.py index 3ec65dc..89a9166 100644 --- a/tests/test_downloader_runtime_unit.py +++ b/tests/test_downloader_runtime_unit.py @@ -12,6 +12,7 @@ # --- _build_canvas_plan --- + class TestBuildCanvasPlan: def test_no_target_pages_returns_all(self): canvases = [{"id": "c0"}, {"id": "c1"}, {"id": "c2"}] @@ -47,6 +48,7 @@ def test_empty_canvases(self): # --- _page_number_from_filename --- + class TestPageNumberFromFilename: def test_standard_filename(self): assert _page_number_from_filename("pag_0000.jpg") == 1 @@ -70,6 +72,7 @@ def test_extension_agnostic(self): # --- _emit_canvas_progress (needs self stub) --- + def test_emit_canvas_progress_no_callback(): """No callback should not crash.""" from universal_iiif_core.logic.downloader_runtime import _emit_canvas_progress @@ -100,6 +103,7 @@ def test_emit_canvas_progress_swallows_callback_error(): # --- _store_page_stats --- + def test_store_page_stats_merges_with_existing(tmp_path: Path): """New stats should merge with existing page stats by page_index.""" from types import SimpleNamespace @@ -134,6 +138,7 @@ def test_store_page_stats_empty_is_noop(tmp_path: Path): # --- _collect_finalized_scan_files --- + def test_collect_finalized_scan_files(tmp_path: Path): """Should return sorted scan paths for expected pages.""" from PIL import Image @@ -173,6 +178,7 @@ def test_collect_finalized_scan_files_missing_pages(tmp_path: Path): # --- _page_numbers_in_dir --- + def test_page_numbers_in_dir(tmp_path: Path): """Should find 1-indexed page numbers from pag_XXXX.jpg files.""" from PIL import Image diff --git a/tests/test_downloader_unit.py 
b/tests/test_downloader_unit.py index 534b155..2edc0ba 100644 --- a/tests/test_downloader_unit.py +++ b/tests/test_downloader_unit.py @@ -12,6 +12,7 @@ # --- CanvasServiceLocator --- + class TestCanvasServiceLocator: def test_locate_direct_service_id(self): canvas = {"service": {"@id": "https://img.example.com/svc"}} @@ -59,6 +60,7 @@ def test_locate_handles_cyclic_references(self): # --- PageDownloader._format_dimension --- + class TestFormatDimension: def test_empty_returns_max(self): assert PageDownloader._format_dimension("") == "max" @@ -81,6 +83,7 @@ def test_non_digit_passthrough(self): # --- IIIFDownloader.get_pdf_url (needs manifest stub) --- + def _make_downloader_stub(manifest: dict): """Minimal object with .manifest for get_pdf_url / get_canvases / _get_thumbnail_url.""" from universal_iiif_core.logic.downloader import IIIFDownloader @@ -93,15 +96,11 @@ def _make_downloader_stub(manifest: dict): class TestGetPdfUrl: def test_finds_pdf_by_format(self): - dl = _make_downloader_stub({ - "rendering": [{"format": "application/pdf", "@id": "https://x.com/doc.pdf"}] - }) + dl = _make_downloader_stub({"rendering": [{"format": "application/pdf", "@id": "https://x.com/doc.pdf"}]}) assert dl.get_pdf_url() == "https://x.com/doc.pdf" def test_finds_pdf_by_url_extension(self): - dl = _make_downloader_stub({ - "rendering": [{"id": "https://x.com/output.pdf"}] - }) + dl = _make_downloader_stub({"rendering": [{"id": "https://x.com/output.pdf"}]}) assert dl.get_pdf_url() == "https://x.com/output.pdf" def test_no_rendering_returns_none(self): @@ -113,29 +112,21 @@ def test_empty_rendering_list(self): assert dl.get_pdf_url() is None def test_rendering_as_dict(self): - dl = _make_downloader_stub({ - "rendering": {"format": "application/pdf", "@id": "https://x.com/p.pdf"} - }) + dl = _make_downloader_stub({"rendering": {"format": "application/pdf", "@id": "https://x.com/p.pdf"}}) assert dl.get_pdf_url() == "https://x.com/p.pdf" def test_non_pdf_rendering_skipped(self): - dl 
= _make_downloader_stub({ - "rendering": [{"format": "text/plain", "@id": "https://x.com/t.txt"}] - }) + dl = _make_downloader_stub({"rendering": [{"format": "text/plain", "@id": "https://x.com/t.txt"}]}) assert dl.get_pdf_url() is None class TestGetCanvases: def test_v2_sequences(self): - dl = _make_downloader_stub({ - "sequences": [{"canvases": [{"id": "c1"}, {"id": "c2"}]}] - }) + dl = _make_downloader_stub({"sequences": [{"canvases": [{"id": "c1"}, {"id": "c2"}]}]}) assert dl.get_canvases() == [{"id": "c1"}, {"id": "c2"}] def test_v3_items(self): - dl = _make_downloader_stub({ - "items": [{"id": "c1"}, {"id": "c2"}] - }) + dl = _make_downloader_stub({"items": [{"id": "c1"}, {"id": "c2"}]}) assert dl.get_canvases() == [{"id": "c1"}, {"id": "c2"}] def test_no_canvases(self): @@ -143,10 +134,7 @@ def test_no_canvases(self): assert dl.get_canvases() == [] def test_v2_takes_priority_over_v3(self): - dl = _make_downloader_stub({ - "sequences": [{"canvases": [{"id": "v2"}]}], - "items": [{"id": "v3"}] - }) + dl = _make_downloader_stub({"sequences": [{"canvases": [{"id": "v2"}]}], "items": [{"id": "v3"}]}) assert dl.get_canvases() == [{"id": "v2"}] diff --git a/tests/test_handlers_helpers_unit.py b/tests/test_handlers_helpers_unit.py index be83857..dd785a1 100644 --- a/tests/test_handlers_helpers_unit.py +++ b/tests/test_handlers_helpers_unit.py @@ -17,6 +17,7 @@ # --- _toast_text --- + def test_toast_text_with_detail(): assert _toast_text("Title", "some detail") == "Title: some detail" @@ -29,6 +30,7 @@ def test_toast_text_without_detail(): # --- _build_item_preview_data --- + def test_build_item_preview_data_all_fields(): item = { "id": "abc", @@ -57,6 +59,7 @@ def test_build_item_preview_data_defaults(): # --- _build_manifest_preview_data --- + def test_build_manifest_preview_data(): info = { "catalog_title": "Catalog Title", @@ -89,6 +92,7 @@ def test_build_manifest_preview_data_no_label(): # --- _page_count_from_result --- + def test_page_count_explicit(): assert 
_page_count_from_result({"raw": {"page_count": 42}}) == 42 @@ -112,6 +116,7 @@ def test_page_count_invalid_page_count_falls_through(): # --- _has_native_pdf_rendering --- + def test_has_pdf_by_format(): manifest = {"rendering": [{"format": "application/pdf", "@id": "https://x.com/doc"}]} assert _has_native_pdf_rendering(manifest) is True @@ -143,6 +148,7 @@ def test_rendering_with_non_dict_entries(): # --- _pause_guard_response --- + def test_pause_guard_already_paused(): assert _pause_guard_response("paused") is not None @@ -165,6 +171,7 @@ def test_pause_guard_queued_returns_none(): # --- _provider_supports_pagination --- + def test_provider_supports_pagination_true(): provider = SimpleNamespace(search_strategy="archive_org") assert _provider_supports_pagination(provider) is True @@ -186,6 +193,7 @@ def test_provider_supports_pagination_none(): # --- _parse_ranges --- + def test_parse_ranges_simple(): assert _parse_ranges("1,2,3") == {1, 2, 3} diff --git a/tests/test_http_client_unit.py b/tests/test_http_client_unit.py index 5b1d732..a6d1395 100644 --- a/tests/test_http_client_unit.py +++ b/tests/test_http_client_unit.py @@ -14,6 +14,7 @@ # --- HTTPMetrics --- + class TestHTTPMetrics: def test_avg_response_time_empty(self): m = HTTPMetrics() @@ -45,6 +46,7 @@ def test_to_dict_avg_rounded(self): # --- _resolve_policy --- + class TestResolvePolicy: def _make_client(self, **overrides) -> HTTPClient: policy = { @@ -61,30 +63,25 @@ def test_global_defaults(self): assert policy.get("retries") == 3 def test_explicit_library_name(self): - client = self._make_client( - libraries={"gallica": {"use_custom_policy": True, "timeout_s": 60, "retries": 5}} - ) + client = self._make_client(libraries={"gallica": {"use_custom_policy": True, "timeout_s": 60, "retries": 5}}) policy = client._resolve_policy("https://gallica.bnf.fr/manifest", library_name="Gallica") assert policy["timeout_s"] == 60 assert policy["retries"] == 5 def test_library_without_custom_policy_uses_global(self): 
- client = self._make_client( - libraries={"oxford": {"use_custom_policy": False, "timeout_s": 99}} - ) + client = self._make_client(libraries={"oxford": {"use_custom_policy": False, "timeout_s": 99}}) policy = client._resolve_policy("https://iiif.bodleian.ox.ac.uk/manifest", library_name="Oxford") assert policy["timeout_s"] == 30 # Global, not oxford's 99 def test_hostname_fallback(self): - client = self._make_client( - libraries={"gallica": {"use_custom_policy": True, "timeout_s": 45}} - ) + client = self._make_client(libraries={"gallica": {"use_custom_policy": True, "timeout_s": 45}}) policy = client._resolve_policy("https://gallica.bnf.fr/iiif/manifest") assert policy["timeout_s"] == 45 # --- _compute_backoff --- + class TestComputeBackoff: def _make_client(self) -> HTTPClient: return HTTPClient(network_policy={"global": {}, "download": {}, "libraries": {}}) @@ -137,6 +134,7 @@ def test_429_sets_cooldown(self, mock_limiter_fn): # --- _is_retriable_error --- + class TestIsRetriableError: def _make_client(self) -> HTTPClient: return HTTPClient(network_policy={"global": {}, "download": {}, "libraries": {}}) @@ -173,6 +171,7 @@ def test_no_response_no_exception(self): # --- _handle_json_fallback --- + class TestHandleJsonFallback: def _make_client(self) -> HTTPClient: return HTTPClient(network_policy={"global": {}, "download": {}, "libraries": {}}) diff --git a/tests/test_iccu_unit.py b/tests/test_iccu_unit.py new file mode 100644 index 0000000..e8f0a24 --- /dev/null +++ b/tests/test_iccu_unit.py @@ -0,0 +1,150 @@ +"""Unit tests for the Internet Culturale (ICCU) provider.""" + +from pathlib import Path + +from universal_iiif_core.resolvers.internetculturale import InternetCulturaleResolver +from universal_iiif_core.resolvers.mag_parser import ( + build_magparser_url, + build_thumbnail_url, + extract_oai_and_teca_from_url, + is_iccu_magparser_url, + parse_mag_xml, + probe_magparser_url, +) + + +def _fixture_bytes() -> bytes: + return (Path(__file__).parent / 
"fixtures" / "iccu_mag_sample.xml").read_bytes() + + +def test_parse_mag_xml_builds_iiif_v2_manifest(): + manifest = parse_mag_xml(_fixture_bytes()) + assert manifest["@context"] == "http://iiif.io/api/presentation/2/context.json" + assert manifest["@type"] == "sc:Manifest" + assert manifest["label"].startswith("III.") + canvases = manifest["sequences"][0]["canvases"] + assert len(canvases) == 3 + first = canvases[0] + assert first["@type"] == "sc:Canvas" + assert first["width"] == 1600 + assert first["height"] == 2100 + image_url = first["images"][0]["resource"]["@id"] + assert image_url.startswith("https://www.internetculturale.it/jmms/cacheman/") + assert image_url.endswith("/1.jpg") + second_url = canvases[1]["images"][0]["resource"]["@id"] + assert second_url.endswith("/2.jpg") + assert image_url != second_url + + +def test_parse_mag_xml_extracts_metadata_block(): + manifest = parse_mag_xml(_fixture_bytes()) + labels = {m["label"]: m["value"] for m in manifest["metadata"]} + assert labels["Biblioteca"] == "Biblioteca Medicea Laurenziana" + assert labels["Città"] == "Firenze" + assert labels["Codice SBN"] == "IT-FI0100" + assert labels["Autore"] == "Alighieri, Jacopo" + iccu = manifest["_iccu"] + assert iccu["teca"] == "Laurenziana - FI" + assert iccu["oai_id"].startswith("oai:teca.bmlonline.it") + + +def test_build_magparser_url_roundtrip(): + url = build_magparser_url("oai:x:y", "marciana", max_pages=10) + oai, teca = extract_oai_and_teca_from_url(url) + assert oai == "oai:x:y" + assert teca == "marciana" + assert "pag=10" in url + assert is_iccu_magparser_url(url) + + +def test_build_thumbnail_url_includes_page(): + url = build_thumbnail_url("oai:a:b", "marciana", page_1based=7) + assert "page=7" in url + assert "teca=marciana" in url + + +def test_resolver_can_resolve_iccu_urls_and_oai_ids(): + r = InternetCulturaleResolver() + assert r.can_resolve("https://www.internetculturale.it/it/16/search/viewresource?id=oai:x&teca=marciana") + assert 
r.can_resolve("oai:teca.bmlonline.it:21:XXXX:Plutei:IT:FI0100_Plutei_40.26_0003") + assert not r.can_resolve("https://gallica.bnf.fr/ark:/12148/btv1b84260335") + assert not r.can_resolve("oai:unknown:host:1234") + assert not r.can_resolve("") + + +def test_resolver_infers_teca_from_known_prefix(): + r = InternetCulturaleResolver() + manifest_url, doc_id = r.get_manifest_url("oai:193.206.197.121:18:VE0049:CNMD0000299115") + assert manifest_url is not None + assert "teca=marciana" in manifest_url + assert doc_id == "VE0049_CNMD0000299115" + + +def test_resolver_returns_none_when_teca_unknown(): + r = InternetCulturaleResolver() + manifest_url, doc_id = r.get_manifest_url("oai:unknown:host:1234") + assert manifest_url is None + assert doc_id is None + + +class _StubSession: + def __init__(self, status: int, content: bytes): + self._status = status + self._content = content + + def get(self, url, **_kwargs): + outer = self + + class _Resp: + content = outer._content + status_code = outer._status + + def raise_for_status(self): + if outer._status >= 400: + import requests + + raise requests.HTTPError(f"status {outer._status}") + + return _Resp() + + +def test_probe_magparser_url_true_on_valid_xml(): + session = _StubSession(200, _fixture_bytes()) + url = build_magparser_url("oai:x:y", "Laurenziana - FI") + assert probe_magparser_url(url, session=session) is True + + +def test_probe_magparser_url_false_on_empty_response(): + session = _StubSession(200, b"") + url = build_magparser_url("oai:x:y", "marciana") + assert probe_magparser_url(url, session=session) is False + + +def test_probe_magparser_url_false_without_oai_or_teca(): + assert probe_magparser_url("https://www.internetculturale.it/jmms/magparser?foo=1") is False + + +def test_page_downloader_locates_direct_image_url(): + from universal_iiif_core.logic.downloader import PageDownloader + + canvas = { + "@id": "x/canvas/0", + "images": [ + { + "resource": { + "@id": 
"https://www.internetculturale.it/jmms/thumbnail?id=x&teca=y&page=1", + "@type": "dctypes:Image", + } + } + ], + } + assert PageDownloader._locate_direct_image_url(canvas) == ( + "https://www.internetculturale.it/jmms/thumbnail?id=x&teca=y&page=1" + ) + + +def test_page_downloader_returns_none_when_no_resource(): + from universal_iiif_core.logic.downloader import PageDownloader + + assert PageDownloader._locate_direct_image_url({"images": []}) is None + assert PageDownloader._locate_direct_image_url("not a dict") is None diff --git a/tests/test_iiif_tiles_unit.py b/tests/test_iiif_tiles_unit.py index f9256f6..ae9f1c2 100644 --- a/tests/test_iiif_tiles_unit.py +++ b/tests/test_iiif_tiles_unit.py @@ -103,8 +103,11 @@ def test_tile_plan_out_dimensions(): """Plan output dimensions should equal full dims at scale_factor=1.""" plan = IIIFTilePlan( base_url="https://example.org", - full_width=4000, full_height=3000, - tile_width=512, tile_height=512, scale_factor=1, + full_width=4000, + full_height=3000, + tile_width=512, + tile_height=512, + scale_factor=1, ) assert plan.out_width == 4000 assert plan.out_height == 3000 @@ -114,8 +117,11 @@ def test_tile_regions_covers_full_image(): """Tile regions should cover the entire image without gaps.""" plan = IIIFTilePlan( base_url="https://example.org", - full_width=1024, full_height=768, - tile_width=512, tile_height=512, scale_factor=1, + full_width=1024, + full_height=768, + tile_width=512, + tile_height=512, + scale_factor=1, ) regions = list(_tile_regions(plan)) # 2x2 grid for 1024x768 with 512px tiles diff --git a/tests/test_jobs_unit.py b/tests/test_jobs_unit.py index e2bb88b..e8fbf5e 100644 --- a/tests/test_jobs_unit.py +++ b/tests/test_jobs_unit.py @@ -52,6 +52,7 @@ def _seed_job(jm: JobManager, job_id: str, **overrides) -> dict: # --- update_job --- + class TestUpdateJob: def test_update_status(self): jm = _fresh_job_manager() @@ -78,6 +79,7 @@ def test_update_nonexistent_is_noop(self): # --- list_jobs --- + class 
TestListJobs: def test_list_all(self): jm = _fresh_job_manager() @@ -100,6 +102,7 @@ def test_list_active_only(self): # --- is_cancel_requested / is_stop_requested --- + class TestCancelStopFlags: def test_cancel_not_requested(self): jm = _fresh_job_manager() @@ -134,6 +137,7 @@ def test_nonexistent_job_is_false(self): # --- request_cancel --- + class TestRequestCancel: def test_cancel_direct_match(self): jm = _fresh_job_manager() @@ -165,6 +169,7 @@ def test_cancel_nonexistent_returns_false(self): # --- _mark_running / _mark_success / _mark_stopped / _mark_failure --- + class TestMarkMethods: def test_mark_running(self): jm = _fresh_job_manager() @@ -219,6 +224,7 @@ def test_mark_failure(self): # --- _is_within --- + class TestIsWithin: def test_within(self, tmp_path): child = tmp_path / "sub" / "file.txt" @@ -233,6 +239,7 @@ def test_not_within(self, tmp_path): # --- prioritize_download --- + class TestPrioritizeDownload: def test_prioritize_moves_to_front(self): jm = _fresh_job_manager() diff --git a/tests/test_providers.py b/tests/test_providers.py index 529431c..ce6bed1 100644 --- a/tests/test_providers.py +++ b/tests/test_providers.py @@ -59,9 +59,21 @@ def test_heidelberg_provider_exposes_browser_handoff_metadata(): def test_iter_providers_respects_explicit_sort_order(): - """UI/CLI ordering should follow provider metadata rather than tuple declaration luck.""" + """UI/CLI ordering should follow provider metadata rather than tuple declaration luck. + + Vaticana is the first option (default in the Discovery select) and + Internet Culturale sits at the end of the non-generic list because the + integration is flagged as BETA. 
+ """ ordered_keys = [provider.key for provider in iter_providers()] - assert ordered_keys[:5] == ["Vaticana", "Gallica", "Institut de France", "Bodleian", "Heidelberg"] + assert ordered_keys[:5] == [ + "Vaticana", + "Gallica", + "Institut de France", + "Bodleian", + "Heidelberg", + ] + assert ordered_keys[-2] == "Internet Culturale" assert ordered_keys[-1] == "Unknown" diff --git a/tests/test_thumbnail_utils_extended.py b/tests/test_thumbnail_utils_extended.py index d378bba..debc9b0 100644 --- a/tests/test_thumbnail_utils_extended.py +++ b/tests/test_thumbnail_utils_extended.py @@ -52,12 +52,8 @@ def test_ensure_hover_preview_creates_larger_image(tmp_path: Path): thumbs = tmp_path / "thumbs" _create_scan(scans, 0, size=(3000, 2000)) - thumb = ensure_thumbnail( - scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=320 - ) - hover = ensure_hover_preview( - scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=900 - ) + thumb = ensure_thumbnail(scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=320) + hover = ensure_hover_preview(scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=900) assert thumb is not None and hover is not None assert thumb.exists() and hover.exists() @@ -98,9 +94,7 @@ def test_ensure_thumbnail_small_source_no_resize(tmp_path: Path): thumbs = tmp_path / "thumbs" _create_scan(scans, 0, size=(200, 150)) - result = ensure_thumbnail( - scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=320 - ) + result = ensure_thumbnail(scans_dir=scans, thumbnails_dir=thumbs, page_num_1_based=1, max_long_edge_px=320) assert result is not None with PILImage.open(str(result)) as img: assert max(img.size) == 200