diff --git a/docs/explanation/architecture.md b/docs/explanation/architecture.md index e7167c5..9030809 100644 --- a/docs/explanation/architecture.md +++ b/docs/explanation/architecture.md @@ -1,25 +1,97 @@ # Architecture Summary -Scriptoria separates UI orchestration from reusable core services. +Scriptoria is built as a Python application with a strict separation between the web UI shell, the reusable core services, and the CLI. The split is not stylistic. It is the load-bearing rule that makes the same product work as a FastHTML application, a CLI tool, and a research workbench against unstable upstream providers. -## Main Layers +## Three Layers, One Direction Of Dependency -- `studio_ui/` renders the FastHTML and HTMX application. -- `universal_iiif_core/` owns provider resolution, storage, downloads, export, OCR, and runtime policy. -- `universal_iiif_cli/` exposes CLI behavior on top of the same core layer. +The repository is organized around three packages. -## Important Boundary +- `studio_ui/` renders the FastHTML and HTMX interface. It owns routes, components, and presentation glue. +- `universal_iiif_core/` owns provider resolution, storage, download orchestration, export logic, OCR services, networking, and runtime policy. +- `universal_iiif_cli/` exposes the CLI entry point on top of the same core services. -The UI depends on the core layer. The core layer does not depend on UI modules. +The dependency direction is one-way. The UI and CLI both depend on the core layer. The core layer does not import from `studio_ui/` or `universal_iiif_cli/`. That rule is what allows the web and CLI surfaces to share resolution, download, and export behavior without diverging implementations. -## Why This Matters +## Routes Orchestrate, Core Implements -This boundary lets Scriptoria: +Inside `studio_ui/`, the route modules in `studio_ui/routes/` exist to register HTTP endpoints and wire user actions to core services. They orchestrate. 
They do not contain business logic. + +The same applies to handlers and helpers under `studio_ui/routes/_studio/` and the panes under `studio_ui/components/`. These exist to keep presentation focused and small. Anything that resolves manuscripts, validates pages, manages jobs, or persists state belongs in `universal_iiif_core/`. + +This is the boundary that makes refactors safe. When a route module grows complex, the fix is usually to move logic into a core service, not to keep adding presentation-side conditionals. + +## Why The Layering Is The Important Part + +Without this separation, the application would still work, but it would behave differently in the CLI than in the web UI, and provider quirks would have to be patched in two places. With the separation, Scriptoria can: - share resolution and download behavior between web and CLI; - keep runtime policy in one place; -- reduce duplication in storage, networking, and export behavior. +- make CLI scripts a first-class operational tool rather than a thin demo; +- reduce duplication in storage, networking, and export logic; +- evolve the UI surface without rewriting core behavior. + +## Core Service Areas + +The core layer is intentionally subdivided so that each service area is replaceable without touching the others. + +### Discovery + +Discovery uses a shared provider registry and a typed orchestration layer in `universal_iiif_core.discovery`. It classifies user input, runs direct resolution against the appropriate provider, and routes free-text queries to provider-specific search adapters. Results are normalized into a shared search contract before they reach the UI. This is what makes Discovery look like one feature even though every provider behaves differently. + +### Downloading + +Download orchestration is handled by the core services together with a job manager backed by local state. 
The runtime prefers native PDF when one is advertised and configured, falls back to IIIF image acquisition otherwise, validates pages in staging, and only promotes them into the local scan set when the configured policy allows it. Resume and retry safety across partial runs is part of the model, not a bolt-on. + +### Storage + +`VaultManager` and the storage services in `universal_iiif_core/services/storage/` keep track of manuscript records, download and export jobs, UI preferences, and snippet/OCR-related rows. The vault is local SQLite. Runtime files live under managed directories resolved through `ConfigManager`. + +### Export + +Export is a dedicated service rather than a side effect of the download path. It owns PDF inventory discovery, profile-driven export jobs, local and temporary remote high-resolution image sourcing, and cleanup and retention policies. The capability model is explicit so the UI can advertise roadmap surfaces without faking them. + +### OCR + +OCR services abstract local Kraken and remote OpenAI/Anthropic engines behind a common workflow, while the UI handles asynchronous job feedback. The core is engine-agnostic so that the UI does not have to know which backend produced a transcription. + +### Networking + +The centralized HTTP client (`universal_iiif_core.http_client`) is the only sanctioned transport layer for runtime network operations. It owns retry and backoff policy, per-library customization, concurrency and rate limiting, and request hygiene around hostile or fragile upstream services. New code must not create ad-hoc `requests` sessions; doing so bypasses the entire network policy layer. + +## Runtime Configuration As An Architectural Concern + +Configuration in Scriptoria is not cosmetic. `ConfigManager` is the single source of truth for runtime paths and policy. Hardcoded paths are forbidden because they break test isolation, packaging, and per-user installation. 
Hardcoded network behavior is also forbidden because providers behave too differently to be served by one fixed policy. + +When a feature needs a path or a policy value, it asks `ConfigManager`. This keeps the rest of the codebase free of environmental assumptions. + +## End-To-End Flow + +The high-level flow across the three layers looks like this. + +1. **Discovery to Library**: the user submits a URL, identifier, shelfmark, or query; Discovery resolves or searches through the provider registry; results are normalized; `Add item` writes local metadata without forcing a download. +2. **Download to local working state**: a download job is enqueued through the job manager; the runtime acquires native PDF or IIIF images; pages are validated in staging and promoted into local scans when policy allows. +3. **Library to Studio**: the user opens a local item; Studio builds a workspace context from local records and manifest state; the viewer chooses local or remote mode based on coverage and policy. +4. **Output and export**: the Output tab reads PDF inventory and page state; the user picks a profile or a page-level repair action; export jobs run through the storage-backed job system; artifacts are persisted under managed runtime paths. + +Every step in that sequence touches multiple core services but stays inside the rules of the dependency layout: UI orchestrates, core implements, storage persists, networking transports. + +## Design Rules That Hold This Together + +- `docs/` is the documentation source of truth; wiki pages are derived publish targets. +- Runtime paths are resolved through `ConfigManager`, not hardcoded strings. +- `scans/` is the operational image source for local study workflows. +- Staging and retry behavior must remain safe for partial and resumed downloads. +- UI package structure should reflect responsibility boundaries, not just file size limits. +- The core layer never imports from UI or CLI packages. 
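The last rule in this list can be checked mechanically. The following is a minimal sketch, not Scriptoria's actual tooling. It assumes a hypothetical helper, `find_forbidden_imports`, that parses a module's source with Python's standard `ast` module and reports any absolute import of `studio_ui` or `universal_iiif_cli`.

```python
import ast

# Packages the core layer must not import from (taken from the rule above).
FORBIDDEN_ROOTS = frozenset({"studio_ui", "universal_iiif_cli"})


def find_forbidden_imports(
    source: str, forbidden: frozenset[str] = FORBIDDEN_ROOTS
) -> list[str]:
    """Return the dotted names of forbidden modules imported by `source`.

    Hypothetical helper for illustration; it only inspects absolute imports.
    """
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        # Compare only the top-level package of each imported module.
        found.extend(name for name in names if name.split(".")[0] in forbidden)
    return found
```

A test suite could run a check like this over every file under `universal_iiif_core/` and fail when the returned list is not empty.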
## Deep Dive -For the more detailed component breakdown, see [Project Architecture](../ARCHITECTURE.md). +For the more detailed component breakdown, including current UI package structure and contributor hotspots, see [Project Architecture](../ARCHITECTURE.md). + +## Related Docs + +- [Storage Model](storage-model.md) +- [Job Lifecycle](job-lifecycle.md) +- [Discovery And Provider Model](discovery-provider-model.md) +- [Export And PDF Model](export-and-pdf-model.md) +- [Security And Path Safety](security-and-path-safety.md) diff --git a/docs/explanation/job-lifecycle.md b/docs/explanation/job-lifecycle.md index 6f9ace9..543131f 100644 --- a/docs/explanation/job-lifecycle.md +++ b/docs/explanation/job-lifecycle.md @@ -1,26 +1,106 @@ # Job Lifecycle -Scriptoria treats download and export work as tracked jobs backed by local state. +Scriptoria treats long-running work as tracked jobs backed by local state. Downloads and exports are not fire-and-forget calls: they have an identity, a status row in the vault, and a defined set of transitions. This is what makes acquisition and export reliable across providers that behave inconsistently and across sessions that can be interrupted. -## Download Jobs +## Why The Job Layer Exists -A typical download job: +Manuscript acquisition is large, slow, and failure-prone. Pages can fail individually, providers rate-limit aggressively, and a long download can be interrupted at any point. A naive script-style approach would lose work whenever something went wrong. -1. is created from discovery or library actions; -2. records progress in local state; -3. stages validated pages first; -4. promotes them according to storage policy; -5. can be paused, resumed, retried, or cancelled. +The job layer is the safety mechanism that prevents that. It records the work in progress, exposes pause and resume operations, lets the user retry only what failed, and survives application restarts without losing the partial state that was already on disk. 
+ +## The Job Manager + +The download and export job manager is a process-wide singleton implemented in `universal_iiif_core/jobs.py`. It owns: + +- a registry of in-flight job records keyed by short job ids; +- a download queue that admits jobs up to a configured concurrency limit; +- the threading boundaries that isolate worker exceptions from the UI; +- the bridge between in-memory job state and the persistent `download_jobs` table in the local vault. + +The concurrency cap is taken from network policy. The default value of `max_concurrent_download_jobs` is `2`, which is intentionally conservative: most upstream providers prefer a few well-paced clients over many aggressive ones. + +## Status Values + +The vault recognizes a small, fixed set of statuses for download and export jobs. + +### Transitional States + +A job is in a transitional state when something is actively happening or being requested: + +- `queued`: the job has been created and is waiting for an execution slot; +- `running`: a worker thread is actively processing the job; +- `cancelling`: the user requested cancellation while the job was running and the worker is winding down; +- `pausing`: the user requested a pause and the worker is winding down to a paused state. + +### Terminal States + +Terminal states describe a job that is no longer doing work: + +- `paused`: the worker stopped at a clean point and the job can be resumed later; +- `cancelled`: the worker stopped after a cancel request and the job will not resume automatically; +- `completed`: the job finished successfully; +- `error`: the job stopped because of an error, and the failure reason is recorded in `error_message`. + +The vault enforces terminality. Once a row is in `paused`, `cancelled`, `completed`, or `error`, transitional updates that would overwrite that state are ignored. This prevents late worker callbacks from undoing a user-driven decision. 
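The terminality rule above can be sketched as a small guard. This is an illustration under stated assumptions, not the code in the vault: the function name `next_status` is hypothetical, and the sketch covers only worker callbacks. An explicit user-driven resume is assumed to go through a separate path.

```python
# Status names are taken from the lists above.
TRANSITIONAL = frozenset({"queued", "running", "cancelling", "pausing"})
TERMINAL = frozenset({"paused", "cancelled", "completed", "error"})


def next_status(current: str, requested: str) -> str:
    """Return the status a job row should hold after a worker callback.

    Hypothetical helper: a transitional update never overwrites a terminal row.
    """
    if requested not in TRANSITIONAL | TERMINAL:
        raise ValueError(f"unknown status: {requested!r}")
    if current in TERMINAL and requested in TRANSITIONAL:
        # Late callback from a worker that has not noticed the stop yet.
        return current
    return requested
```

With this shape, a late `running` update that arrives after the user cancelled leaves the row at `cancelled`, while the normal `pausing` to `paused` transition still goes through.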
+ +## Lifecycle Of A Download Job + +A typical download job goes through this sequence. + +1. The route layer calls into the job manager to enqueue the job. A row is created in `download_jobs` with status `queued` and a `job_origin` such as `library_download`. +2. When a slot frees up, the job is promoted to `running`, `started_at` is set, and the worker thread starts acquiring pages. +3. Pages are written to a staging area first. They are validated as image files before being considered acceptable for promotion. +4. As the worker progresses, `current_page` and `total_pages` are updated. +5. On a clean finish, the job moves to `completed`, `finished_at` is recorded, and validated pages are promoted into the local scan set according to the configured promotion policy. +6. On a fatal error, the job moves to `error` and the failure reason is stored. +7. If the user pauses or cancels, the worker first goes through `pausing` or `cancelling`, then settles into the corresponding terminal state. + +The pause and cancel transitions exist as their own states because the worker cannot stop instantaneously. Acknowledging the request and the actual stop are different events, and the lifecycle reflects that. + +## Why Staging Comes Before Promotion + +Staged pages are not the same thing as local scans. The job writes to a temporary directory under the configured temp root, validates each image, and only moves the result into the manuscript's `scans/` directory once promotion is allowed. + +The promotion policy is governed by the `storage.partial_promotion_mode` setting: + +- `never`: only fully completed runs promote staged pages into `scans/`; +- `on_pause`: a clean pause also promotes the staged pages it managed to validate. + +This separation is what allows partial work to survive a restart without polluting the local scan set with half-validated images. 
It is also what lets `Retry missing` and `Retry range` make sense as targeted operations rather than always implying a full redownload. + +## Resume Safety + +Resume is a first-class operation, not a side effect. When a paused or interrupted job is resumed, the manager: + +- reuses the existing vault row instead of creating a duplicate; +- skips pages that already exist in the staging or scan directory; +- continues acquisition from where the previous run stopped; +- transitions the row back through `queued` and `running` like any new job. + +This is why Library exposes `Retry missing` and `Retry range` separately from `Download full`: they all rely on the same resume-safe job model, but they scope the work differently. ## Export Jobs -A typical export job: +Export jobs follow the same overall lifecycle but live in their own job records and run through the export service. Each export job stores scope type, document ids, library identity, export format, output kind, page-selection mode, destination, progress counters, the final output path, and any terminal error. + +The route layer creates the job entry first and only then spawns the worker thread. That order is important: it lets the UI poll status, cancel active jobs, and retain history after completion, without depending on whether the worker has had time to start. + +On startup, the application also marks any export rows that were left in transitional states by a previous crashed run as `error`, so that stale jobs do not appear to be still running. -1. starts from a profile or page-level action; -2. records progress in local state; -3. uses local or temporary remote assets depending on the profile; -4. persists output artifacts under managed paths. +## Job Origin + +Download jobs carry a `job_origin` field. Common values include `library_download`, `discovery_add_and_download`, and similar markers indicating where the job was triggered from. 
This is mostly diagnostic, but it lets the system distinguish between user-initiated downloads and chained operations when a problem needs to be traced. ## Why The Model Matters -The job layer is the safety mechanism that keeps partial work understandable and recoverable. +The job layer is the safety mechanism that keeps partial work understandable and recoverable. Without it, the application would silently lose progress on every interruption, retries would be all-or-nothing, and pause would either not exist or would corrupt the local scan set. + +With it, Scriptoria can run long acquisitions on flaky upstream providers, survive restarts cleanly, and let the user reason about their workspace in terms of states like `partial` and `complete` instead of just "did the download finish". + +## Related Docs + +- [Storage Model](storage-model.md) +- [First Manuscript Workflow](../guides/first-manuscript-workflow.md) +- [Discovery And Library](../guides/discovery-and-library.md) +- [Export And PDF Model](export-and-pdf-model.md) +- [Configuration Reference](../CONFIG_REFERENCE.md) diff --git a/docs/explanation/security-and-path-safety.md b/docs/explanation/security-and-path-safety.md index 6f1f348..6df668c 100644 --- a/docs/explanation/security-and-path-safety.md +++ b/docs/explanation/security-and-path-safety.md @@ -1,15 +1,91 @@ # Security And Path Safety -The documentation layer needs to reflect several project-level safety rules. +Scriptoria is a local-first application that nonetheless touches user input, the file system, external HTTP services, and a local SQLite vault. The security posture is shaped by a small set of project-level rules. They are not bureaucratic compliance items; they exist because each one is enforced in real code and protects a concrete failure mode. 
-## Main Rules +## Project Rules -- do not hardcode secrets or tokens; -- validate user input at system boundaries; -- enforce path safety for file read, write, download, delete, and optimization operations; -- avoid leaking sensitive details in user-facing errors; -- keep permissive CORS limited to explicit local development scenarios. +The baseline rules apply across UI, CLI, and core services. + +- Never hardcode secrets or tokens in source code. Provider credentials, API keys, and any user-bound tokens belong in configuration or environment variables, never in tracked files. +- Validate input at system boundaries. Routes, CLI argument parsing, and external payload deserialization are the points where validation must happen, not somewhere later in the pipeline. +- Enforce path safety for every operation that reads, writes, downloads, deletes, or optimizes a file on disk. +- Use parameterized access patterns for database operations. Never build SQL strings from user input. +- Do not leak sensitive details in user-facing error messages. The full reason can go to logs; the message shown to the user should not expose internal paths or credentials. +- Keep permissive CORS limited to explicit local development scenarios. Default to specific origins in any non-trivial deployment. + +## Path Safety In Practice + +Path safety is the rule with the most concrete enforcement in the codebase. Several flows could in principle be tricked into reading or writing outside the managed runtime tree, so they all share the same containment pattern. + +The canonical helper looks like this in `universal_iiif_core/jobs.py`: + +```python +@staticmethod +def _is_within(candidate: Path, root: Path) -> bool: + try: + return candidate.resolve().is_relative_to(root.resolve()) + except Exception: + return False +``` + +The shape is intentional. Both paths are resolved before comparison so that symlinks and `..` segments cannot escape the root. 
Any operation that touches the file system on behalf of a manuscript is expected to use this kind of containment check before the real read or write happens. + +The same pattern is used in scan optimization (`services/scan_optimize.py`) and in download runtime cleanup (`logic/downloader_runtime.py`), where the relevant root is the configured downloads or temp directory. + +## Roots Always Come From ConfigManager + +Path safety is meaningful only when the "root" is itself trustworthy. Scriptoria takes that root from `ConfigManager`, never from user input or from a hardcoded constant. + +The current root families are the configured downloads, exports, temp, models, logs, and snippets directories. These are resolved once and used as the comparison anchor for every containment check. If the user reconfigures `data/local/downloads` to a different absolute path, the safety checks transparently follow. + +This is why `docs/AGENTS.md` and the project rules forbid hardcoded runtime paths. It is not only about portability: hardcoded paths bypass the only mechanism that lets path safety work in user-customized installations. + +## Filename Sanitization + +User-visible folder and file names also need to be safe across operating systems. The `sanitize_filename` helper in `universal_iiif_core/utils.py` strips characters that are illegal on Windows or unwise on POSIX (`/ \ : * ? " < > |`), removes ASCII control characters, and collapses whitespace. + +This is applied when building the per-manuscript directory name from the provider, manuscript id, and optional title, so that arbitrary upstream metadata cannot inject path separators or hidden characters into the local layout. + +## The Centralized HTTP Client + +External network behavior goes through `universal_iiif_core.http_client`. New code is required to use that module rather than instantiate `requests.Session()` directly. + +This rule is partly architectural and partly a security measure. 
The centralized client is where retry policy, per-library backoff, rate limiting, and request hygiene live. Bypassing it does not just duplicate code: it disables the controls that prevent Scriptoria from hammering an upstream provider after a 403 or 429 response, and it removes the place where header policy and timeouts are enforced. + +The companion `network_policy.py` module declares per-library policy: cooldowns on `403` and `429`, burst windows, retry-after handling, and per-host concurrency limits. A handful of providers (Gallica is the most visible example) explicitly need slow, polite traffic, and the policy file is how that knowledge stays consistent across the application. + +## Vault Access + +Local persistence goes through `VaultManager` and the related modules under `services/storage/`. All SQL is parameterized at the call site. New code that needs to query or update local state must use the existing helper methods or follow the same parameterized pattern. String-built SQL is never acceptable, even for "internal" tables. + +The vault file itself is part of the user's local data and should be excluded from any sharing or backup that is not under the user's explicit control. + +## Input Validation At The Boundary + +The boundary for the web app is the FastHTML route handlers under `studio_ui/routes/`. The boundary for the CLI is `argparse` in `universal_iiif_cli/cli.py`. The boundary for external payloads is the resolver and discovery layer in `universal_iiif_core/resolvers/` and `universal_iiif_core/discovery/`. + +Each of these is responsible for rejecting or normalizing untrusted input before it reaches the rest of the system. Internal modules deeper in the stack are allowed to assume that paths, identifiers, and URLs they receive have already been validated. + +This is the principle that allows the rest of the codebase to stay readable: validation is concentrated where input enters the application, not scattered defensively throughout every function. 
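As a concrete illustration of validating once at the boundary, the sketch below turns untrusted text into a typed value that deeper code can trust. Both `PageRange` and `parse_page_range` are hypothetical names invented for this example. They do not describe Scriptoria's real route handlers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRange:
    """A validated, 1-based inclusive page interval."""

    first: int
    last: int


def parse_page_range(raw: str, total_pages: int) -> PageRange:
    """Validate untrusted text such as "3-10" where it enters the system.

    Hypothetical helper: raises ValueError on malformed or out-of-bounds
    input, so callers deeper in the stack can rely on the returned value.
    """
    first_text, sep, last_text = raw.strip().partition("-")
    first_text, last_text = first_text.strip(), last_text.strip()
    if not sep or not first_text.isdecimal() or not last_text.isdecimal():
        raise ValueError("page range must look like 'first-last'")
    first, last = int(first_text), int(last_text)
    if not 1 <= first <= last <= total_pages:
        raise ValueError("page range is outside the document")
    return PageRange(first, last)
```

The error messages are deliberately short and free of internal detail, in line with the rule that user-facing errors should not expose paths or configuration.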
+ +## Error Messages + +Errors visible to the user should help them understand what went wrong without exposing internals that have no business being shown. Stack traces, full filesystem paths, raw HTTP response bodies from upstream providers, and configuration values belong in logs. + +The route layer is responsible for catching exceptions from core services and rendering a user-appropriate message. The CLI follows the same pattern by printing a short error and a hint, while the full traceback goes to the logger. + +## CORS Posture + +Permissive CORS is acceptable only for explicit local development. The default deployment posture should restrict origins. If a route or middleware introduces broader CORS than the local-development case, that is an architectural change and should be reviewed against this rule before merging. ## Documentation Implication -Operational guides should explain safe behavior without exposing sensitive internals. Technical docs should point to the path-safety and validation model, especially around export, cleanup, and optimization flows. +Operational guides should explain safe behavior without exposing sensitive internals. Technical docs should point to the path-safety, validation, and network-policy model, especially around export, cleanup, optimization, and download flows. When a doc page describes a path or a file operation, it should describe the path family and the controlling configuration, not encourage readers to assume a fixed absolute path. 
+ +## Related Docs + +- [Architecture Summary](architecture.md) +- [Storage Model](storage-model.md) +- [Job Lifecycle](job-lifecycle.md) +- [Runtime Paths](../reference/runtime-paths.md) +- [Configuration Reference](../CONFIG_REFERENCE.md) diff --git a/docs/guides/discovery-and-library.md b/docs/guides/discovery-and-library.md index e09ce91..a9c3e01 100644 --- a/docs/guides/discovery-and-library.md +++ b/docs/guides/discovery-and-library.md @@ -19,6 +19,8 @@ Internally, Discovery supports both direct resolution and provider-specific sear Some providers are strong at both. Others are mostly direct-resolution providers with limited search value. This is why the type of input you paste matters. +Each supported provider declares whether it supports search through a `supports_search()` capability, and the active search adapter is selected per provider. A provider without a search adapter is usable as a resolver but will not return free-text results. + ## Discovery Input Strategy For the most stable results, use this order: @@ -41,6 +43,14 @@ This is one of the most important product rules: Because those are separate, you can build a serious shortlist in Library without immediately consuming network time or disk space. +If you already know that you want both the record and the scans in one step, Discovery also exposes `Add and download`, which creates the local record and immediately enqueues a download job. Use the chained action when the decision is already made; use `Add item` when you are still curating. + +## Probing Before You Commit + +Discovery exposes a `Probe manifest` action that fetches lightweight information from the candidate without registering it locally. Use it when you want to verify that the manifest actually points where you expect before adding the item to your catalog. + +There is also a `PDF capability` check that reports whether the provider exposes a native PDF for the item. 
This matters because export behavior in Output later depends on whether a native PDF is available, and knowing this in advance is useful when planning an acquisition. + ## Search Pagination And Provider Behavior Discovery also reflects provider-specific result behavior. Some providers can expose `Load more` because they offer real pagination. Others behave much better as resolver-first systems, where URLs and identifiers are more reliable than broad text queries. In a few cases the product deliberately points you toward browser-assisted discovery because the upstream search surface is not strong enough to justify pretending otherwise. @@ -55,7 +65,9 @@ In practical terms, Library is where you: - reopen known items; - inspect whether an item is only saved, partially local, or complete enough for local work; -- start or retry downloads; +- start, retry, or scope downloads; +- annotate and classify items inside your own catalog; +- refresh or reclassify entries when the upstream record or the provider registry changes; - clean partial runtime data; - delete an item and its related local workspace. @@ -85,6 +97,29 @@ The local workspace has enough material for the configured local reading and exp Those distinctions matter later in Studio. They are not only labels for the Library page. +## Acquisition Actions + +Library exposes acquisition at more than one granularity. This is intentional because real-world IIIF acquisition often fails unevenly. + +- `Download full` runs a complete acquisition for the item. +- `Download range` acquires a specific page interval. +- `Retry missing` fills in only the pages that the local workspace does not yet have. +- `Retry range` re-acquires a specific interval, useful when a portion came down weak. +- `Cleanup partial` clears inconsistent staged data so the next attempt starts from a clean slate. 
+ +Each of these is dispatched as a tracked download job with the standard pause, resume, retry, cancel, prioritize, and remove operations available in the download manager. + +## Catalog Maintenance Actions + +The other surface in Library is catalog-side: actions that change how an item is described or classified locally without re-downloading anything. + +- `Set type` records the manuscript type inside your own catalog. +- `Update notes` stores free-form annotations on the entry. +- `Refresh metadata` re-fetches normalized metadata from the upstream provider when the source record has changed. +- `Reclassify` re-runs provider classification for one item; `Reclassify all` and `Normalize states` are bulk passes used after registry or schema upgrades. + +These actions are cheap, local-state operations. Use them to keep your catalog coherent without paying acquisition costs. + ## Why Discovery And Library Must Stay Separate The separation is deliberate for three reasons. Providers are inconsistent, and the local catalog should not inherit the instability of upstream discovery surfaces. Local state also has to remain legible: a manuscript may be known locally long before it becomes a complete local asset set. Finally, the workflow is incremental by design. Scriptoria is built for shortlisting, staged download, partial repair, and later export, not only for all-or-nothing acquisition. 
@@ -101,3 +136,4 @@ If Scriptoria already knows the manuscript and you are deciding what to do with - [Provider Support](../reference/provider-support.md) - [Runtime Paths](../reference/runtime-paths.md) - [Discovery And Provider Model](../explanation/discovery-provider-model.md) +- [Job Lifecycle](../explanation/job-lifecycle.md) diff --git a/docs/guides/first-manuscript-workflow.md b/docs/guides/first-manuscript-workflow.md index bdb17cc..1166567 100644 --- a/docs/guides/first-manuscript-workflow.md +++ b/docs/guides/first-manuscript-workflow.md @@ -8,7 +8,7 @@ Start in `Discovery` with the strongest reference you have available. Best case When the preview looks correct, use `Add item`. -That action is intentionally limited. It stores the local manuscript record and normalized metadata, but it does not start a heavy asset acquisition pipeline on your behalf. +That action is intentionally limited. It stores the local manuscript record and normalized metadata, but it does not start a heavy asset acquisition pipeline on your behalf. If you already know that you want both the record and the scans, use `Add and download` instead, which chains the two operations. ## 2. Register The Item In Library @@ -16,6 +16,15 @@ Once the item is in `Library`, it becomes part of your local workspace model. At that point you decide whether the item should remain a metadata-only local record, move immediately into a full download, or resume a previous partial acquisition. This is the moment where an external manuscript becomes a managed local object. The workspace may still be thin, but the identity and current state are now under Scriptoria's control. 
+Library is also where you act on local-side properties of the item without touching upstream: + +- `Set type` to classify the manuscript inside your own catalog; +- `Update notes` for personal annotations on the entry; +- `Refresh metadata` when the upstream record has changed and you want to re-fetch normalized metadata; +- `Reclassify` to re-run provider classification for one item, or `Reclassify all` and `Normalize states` for catalog-wide consistency passes after upgrades. + +These actions are catalog-level. They do not start downloads. + ## 3. Understand The Initial State Before opening Studio, understand the three states that matter operationally: @@ -26,7 +35,27 @@ Before opening Studio, understand the three states that matter operationally: These are not decorative labels. They drive later behavior in Studio and Output. -## 4. Open The Item In Studio +## 4. Acquire Scans With The Right Granularity + +Scriptoria does not assume that a download is always a single all-or-nothing operation. From Library, you can: + +- start a `Download full` to acquire every page; +- request a `Download range` to fetch only a specific page interval; +- run `Retry missing` to fill in gaps left by a previous interrupted run; +- use `Retry range` to re-acquire a specific interval that came down weak; +- run `Cleanup partial` when a previous attempt left inconsistent staged data and you want a clean slate. + +Each of these is a real download job and is tracked in the download manager. + +## 5. Track Long Jobs In The Download Manager + +Acquisition is treated as a tracked job, not a fire-and-forget script. + +The download manager exposes the standard operations you would expect on a queued job: pause, resume, retry, cancel, prioritize, and remove. If a job dies because of an upstream rate limit or a transient network error, you do not lose the partial work: you can resume it from where it stopped, and the vault keeps the existing staged pages. 
+ +This is what makes incremental acquisition viable across providers that behave inconsistently. + +## 6. Open The Item In Studio Open the manuscript from `Library` into `Studio`. @@ -39,7 +68,7 @@ At this point Scriptoria resolves several things at once: This is why the correct question is not "does the item exist in Library?" but "what local coverage does the item currently have?" -## 5. Inspect And Repair Only What Needs Attention +## 7. Inspect And Repair Only What Needs Attention If the manuscript is incomplete or some pages are weak, do not assume you need to redownload everything. @@ -47,13 +76,13 @@ Use `Output` for: - thumbnail-based page inspection; - page selection; -- `Scarica`, `Hi-res`, and `Opt` actions on individual pages; +- per-page actions: `Scarica` to acquire the page, `Hi-res` to fetch a higher-resolution version, `Opt` to optimize an existing local scan; - PDF inventory review; - export job creation. -This is one of Scriptoria's strengths: page-level repair exists because real-world manuscript pipelines often fail unevenly rather than uniformly. +This is one of Scriptoria's strengths: page-level repair exists because real-world manuscript pipelines often fail unevenly rather than uniformly. Repairing one weak page is cheaper and more honest than re-running the whole acquisition. -## 6. Export Deliberately +## 8. Export Deliberately When the manuscript is ready, create the export from `Output`. 
@@ -71,7 +100,7 @@ That is why Scriptoria treats export as a tracked workflow instead of a one-clic ## Common Failure Modes -If the first workflow feels wrong, the cause is usually one of a small set of predictable cases: the provider input was too vague and should be replaced with a direct URL or identifier, the item is only `saved` and not yet locally complete, Studio is correctly staying in remote mode because coverage is still incomplete, staged pages have not yet been promoted because storage policy is conservative, or the provider itself is better handled through URL-driven or browser-assisted discovery. +If the first workflow feels wrong, the cause is usually one of a small set of predictable cases: the provider input was too vague and should be replaced with a direct URL or identifier, the item is only `saved` and not yet locally complete, Studio is correctly staying in remote mode because coverage is still incomplete, staged pages have not yet been promoted because storage policy is conservative, a previous run left partial state that should be resolved with `Cleanup partial` before retrying, or the provider itself is better handled through URL-driven or browser-assisted discovery. ## Related Docs @@ -79,3 +108,4 @@ If the first workflow feels wrong, the cause is usually one of a small set of pr - [Studio Workflow](studio-workflow.md) - [PDF Export](pdf-export.md) - [Troubleshooting](troubleshooting.md) +- [Job Lifecycle](../explanation/job-lifecycle.md) diff --git a/docs/index.md b/docs/index.md index 311211d..e2e77cb 100644 --- a/docs/index.md +++ b/docs/index.md @@ -2,10 +2,10 @@
-Scriptoria is a local-first IIIF manuscript workbench. It helps you move from provider discovery to a managed local workspace where reading, repair, transcription, and export remain under your control.
-The project exists for situations where a normal viewer is not enough: inconsistent provider search, uneven image delivery, partial local downloads, page-by-page repair needs, and reproducible export requirements.
+Scriptoria is a local-first IIIF manuscript workbench. It takes you from provider discovery to a managed local workspace where reading, repair, transcription, and export stay under your control.
+It exists for the cases where a standard viewer is not enough: inconsistent provider search, uneven image delivery, partial downloads, page-level repair needs, reproducible export.
- Scriptoria header graphic + Morte al tamburo — Scriptoria symbol
@@ -27,19 +27,21 @@
-Scriptoria is a local research workbench for IIIF manuscripts. It is built for people who need more than a generic viewer: scholars who must move from catalog discovery to close reading, librarians or digital curators who need reproducible local working copies, and advanced users who want controlled export, provenance retention, and page-level repair tools. +Scriptoria is a local research workbench for IIIF manuscripts. It is built for people who need more than a generic viewer: scholars moving from catalog discovery to close reading, librarians or digital curators who need reproducible local working copies, and technical users who want controlled export, provenance retention, and page-level repair. -The product is designed around one practical idea: remote IIIF sources are valuable but inconsistent. Search behavior differs from provider to provider, manifests are not equally clean, image delivery is uneven, and PDF availability is highly variable. Scriptoria gives you one local workspace where those differences remain visible and manageable instead of being hidden behind a fake notion of uniformity. +The core idea is simple: remote IIIF sources are valuable but unreliable. Search behavior varies by provider, manifests are not equally clean, image delivery is inconsistent, and PDF availability is a lottery. Scriptoria gives you one local workspace where those differences stay visible and manageable — not hidden behind a false sense of uniformity. :::info Why "Scriptoria"? -In the history of the book, a *scriptorium* was a place where manuscripts were copied, annotated, corrected, and prepared for circulation. The name fits the product well: Scriptoria is not only a reader, but a working environment for acquiring, inspecting, correcting, and exporting manuscript material. +The *scriptorium* (plural *scriptoria*) was the room in a medieval monastery set aside for copying, illuminating, and preserving manuscripts. 
Kept quiet and well-lit, it sat close to the library and was one of the few places where Greek and Latin culture survived the early Middle Ages. The monks who worked there — scribes, or *amanuenses* — produced their texts by hand on parchment, one page at a time. + +The name fits: Scriptoria is not just a reader. It is a working environment for acquiring, inspecting, correcting, and exporting manuscript material. ::: ## What Scriptoria Is For -Use Scriptoria when you need to resolve a manuscript from a supported IIIF provider, preserve a stable local record before a full download exists, work from local scans instead of trusting upstream availability, repair weak pages selectively, and export PDF or image bundles with explicit source and quality choices. +Use Scriptoria when you need to resolve a manuscript from a supported IIIF provider, keep a stable local record before a full download exists, work from local scans instead of depending on upstream availability, repair weak pages selectively, and export PDF or image bundles with explicit source and quality choices. -Just as important, Scriptoria keeps the operational boundary between remote source, local workspace, and final export artifact visible all the way through the workflow. It is not a public-facing digital library frontend, not a cloud collaboration suite, and not a generic IIIF demo viewer. It is a local-first technical workspace for manuscript-heavy research and curation work. +Scriptoria keeps the boundary between remote source, local workspace, and final export artifact visible throughout the workflow. It is not a public-facing digital library frontend, not a cloud collaboration tool, and not a generic IIIF demo. It is a local-first technical workspace for manuscript-heavy research and curation. 
## Main Surfaces diff --git a/docs/intro/getting-started.md b/docs/intro/getting-started.md index c411e14..00485fd 100644 --- a/docs/intro/getting-started.md +++ b/docs/intro/getting-started.md @@ -6,7 +6,7 @@ Scriptoria exposes two entry points. `scriptoria` starts the web application and ## Prerequisites -You need Python 3.10 or newer, a local virtual environment, and the project installed in editable mode. +You need Python 3.10 or newer, a local virtual environment, and the project installed in editable mode. No system-level services are required for a first run: Scriptoria stores its catalog in a local SQLite vault and writes runtime data under a managed directory tree. ## Install @@ -18,13 +18,30 @@ source .venv/bin/activate pip install -e . ``` +After install, verify the binaries are on your `PATH`: + +```bash +scriptoria --version +scriptoria-cli --version +``` + +Both should print the same version. The legacy aliases `iiif-studio` and `iiif-cli` are still installed and point to the same entry points, so older scripts and bookmarks continue to work. + ## Start The Web Application ```bash scriptoria ``` -Then open `http://127.0.0.1:8000`. +Then open `http://127.0.0.1:8000`. The default port is `8000` and is not currently configurable from the command line; if it is in use, stop the process that holds it before starting Scriptoria. + +For active development, use the watcher mode: + +```bash +scriptoria --reload +``` + +The watcher reloads on changes to `*.py` and `*.html` files and ignores runtime data directories so it does not restart on every download. At first start, expect a local-first application rather than a public website. Even when you work against remote IIIF sources, Scriptoria is already building a managed local record of the item and preparing its runtime workspace. 
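The post-install check in the hunk above (both entry points should report the same version) can also be scripted. The following is a minimal sketch, not part of the project: the helper names `extract_version`, `installed_version`, and `versions_agree` are hypothetical, and it assumes that `--version` prints a dotted version number somewhere in its output, possibly preceded by the program name.

```python
from __future__ import annotations

import re
import subprocess

# Assumption: the version appears as a dotted number such as "1.4.2" or "0.9rc1".
_VERSION_RE = re.compile(r"\d+(?:\.\d+)+\S*")


def extract_version(text: str) -> str | None:
    """Return the first dotted version token found in text, or None."""
    match = _VERSION_RE.search(text)
    return match.group(0) if match else None


def installed_version(binary: str) -> str | None:
    """Run `<binary> --version` and return the reported version, if any."""
    try:
        proc = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # The binary is not on PATH.
        return None
    return extract_version(proc.stdout + proc.stderr)


def versions_agree(binaries=("scriptoria", "scriptoria-cli")) -> bool:
    """True when every binary is found and all report the same version."""
    versions = {installed_version(name) for name in binaries}
    return None not in versions and len(versions) == 1
```

Calling `versions_agree()` inside the activated virtual environment returns `False` when either binary is missing or the two disagree; passing `("iiif-studio", "iiif-cli")` runs the same check against the legacy aliases.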
@@ -71,12 +88,34 @@ Example: scriptoria-cli "https://digi.vatlib.it/iiif/MSS_Urb.lat.1779/manifest.json" ``` +If you run `scriptoria-cli` with no positional argument, it enters an interactive wizard that asks for the URL, an optional output filename, and an optional OCR model. The wizard is intentionally minimal; for anything more advanced use explicit flags. + The CLI is a good fit for: - direct acquisition of known items; - shell-based workflows; - scripted processing; -- environments where you do not need the full Studio and Output surfaces. +- environments where you do not need the full Studio and Output surfaces; +- inspecting or repairing local vault state without opening the web app. + +See [CLI Reference](../reference/cli.md) for the complete flag list. + +## Configuration On First Run + +Scriptoria reads its runtime configuration from `config.json`, which controls network policy, image acquisition, viewer defaults, export behavior, storage retention, and test behavior. On first run a default configuration is written if one is not present, and runtime directories are created under `data/local/` (downloads, exports, logs, temp images, models, snippets). + +You do not need to touch configuration for a first session. Once you start working seriously across providers, read [Configuration Overview](../reference/configuration.md) and, when you need exact behavior, [Detailed Configuration Reference](../CONFIG_REFERENCE.md). + +## When Something Does Not Work + +Most first-run friction comes from a small set of predictable cases: + +- the input pasted into Discovery is too vague for the chosen provider; +- the manuscript is `saved` but not yet downloaded, so Studio opens in remote mode and looks slower than expected; +- a partial download was interrupted and Library shows the item in a mid-state; +- the upstream provider rate-limited a fast acquisition. 
+ +Before assuming a bug, read [Troubleshooting](../guides/troubleshooting.md) and check `Provider Support` for any provider-specific caveats. ## What To Read Next diff --git a/docs/reference/cli.md b/docs/reference/cli.md index 3b916e9..4000f1c 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -1,6 +1,8 @@ # CLI Reference -The CLI lives in `src/universal_iiif_cli/cli.py` and exposes both direct download flows and local database utilities. +The CLI lives in `src/universal_iiif_cli/cli.py` and is exposed by the `scriptoria-cli` entry point. It shares the same provider registry, resolver layer, and local vault used by the web application, so anything resolved or stored from the CLI shows up in the same Library that Studio reads. + +The CLI exists for two situations: direct acquisition when you already know the manuscript you want, and quick inspection or repair of local state without opening the web app. ## Basic Usage @@ -8,48 +10,102 @@ The CLI lives in `src/universal_iiif_cli/cli.py` and exposes both direct downloa scriptoria-cli "" ``` -If no URL is provided, the CLI enters an interactive wizard. +If the input is a known provider URL, shelfmark, or supported identifier, it is normalized to a manifest URL through the same resolver chain used by Discovery. If the input cannot be classified and is not an HTTP URL, the CLI exits with an error and points you toward pasting a direct `manifest.json` link. + +If you call `scriptoria-cli` with no positional argument, it enters an interactive wizard. + +## Wizard Mode + +Wizard mode is intentionally minimal. It asks for a manuscript or viewer URL, an optional output filename, and an optional OCR model name. It is meant for one-off downloads where you do not want to remember flag names. Anything more advanced should use explicit flags. + +```text +🌍 UNIVERSAL IIIF DOWNLOADER 🌍 + +Paste the URL (Manifest or Viewer link): ... +Output filename (optional, press Enter for auto): ... +OCR Model (optional, e.g. 
'kraken', press Enter to skip): ... +``` -## Main Options +## Download Options -### Download Options +These flags control the acquisition run started by a positional URL or by the wizard. - `-o, --output` - - Output PDF filename. + - Output PDF filename. Without this flag, Scriptoria picks a name from the manuscript identifier. - `-w, --workers` - - Concurrent downloads for the current run. + - Concurrent downloads for the current run. Default `4`. Increase only if both your network and the upstream provider can absorb it without rate-limiting penalties. - `--clean-cache` - - Clean cache before running. + - Clear cached working state before running. Use when a previous attempt left inconsistent staged data and you want a fresh acquisition. - `--prefer-images` - - Force image download even if a native PDF exists. -- `--ocr` - - Run OCR after download using the provided model name. + - Force per-page image download even if the provider exposes a native PDF. The default is to use a native PDF when one is advertised, because that path is usually faster and produces a more faithful artifact. +- `--ocr MODEL` + - Run OCR after download using the given Kraken model filename. Only meaningful when the model is reachable from the configured local model directory. - `--create-pdf` - - Explicitly build a PDF from downloaded images. + - Explicitly build a PDF from the downloaded images at the end of the run. Use this when the provider has no native PDF and you still want a final PDF artifact. -### Database And Local State +## Database And Local State + +These flags do not start a download. They read or modify the local vault directly through `VaultManager`. - `--list` - - List local manuscripts in the database. + - List local manuscripts in the database. The output shows manuscript id, status, page progress, and provider library, with a status icon: ✅ complete, ⏳ downloading, ❌ error, ⚪ other. - `--info ID` - - Show detailed info for a manuscript. 
+ - Show stored fields for one manuscript (provider identity, status, paths, progress, manifest URL, and related metadata). - `--delete ID` - - Delete a manuscript record. + - Delete a manuscript record from the vault. This removes the local catalog entry; runtime files on disk are handled by separate cleanup flows. - `--delete-job JOB_ID` - - Delete a download job record. + - Remove a single download job row from the internal `download_jobs` table. Mostly useful during development or when stray records survive a crash. - `--set-status ID STATUS` - - Force update the stored status. + - Force the stored status for a manuscript. Standard values are `pending`, `downloading`, `complete`, and `error`. Other strings are accepted with a warning, but the rest of the system reasons in terms of the standard set. + +## Other Options + +- `--version` + - Print the installed Scriptoria version and exit. + +## Where Files End Up + +The CLI does not invent paths. Output and runtime locations come from `ConfigManager` exactly as in the web application: + +- downloaded scans go under the configured downloads directory; +- temporary working files go under the temp directory; +- logs go under the configured log directory. + +If you need to change those locations, edit `config.json` rather than passing path overrides on the command line. See [Runtime Paths](runtime-paths.md) and [Configuration Overview](configuration.md). ## Operational Notes -- Resolution and provider classification use the same core registry used by the web UI. -- Local state is backed by `VaultManager`. -- The CLI is useful for direct download workflows and for inspecting local runtime state without opening the web app. +- Resolution and provider classification use the same registry as the web UI. If a URL resolves in the CLI, it will resolve the same way in Discovery. +- Local state is shared with Studio. 
A manuscript downloaded from the CLI is immediately visible in Library and openable in Studio without further import. +- The CLI is the right surface for shell pipelines, scripted batch acquisition, headless environments, and local-state inspection. +- The legacy entry points `iiif-cli` and `iiif-studio` are still installed as aliases for `scriptoria-cli` and `scriptoria` to avoid breaking older scripts. New work should use the `scriptoria` names. ## Examples +Download a manuscript by direct manifest URL: + ```bash scriptoria-cli "https://digi.vatlib.it/iiif/MSS_Urb.lat.1779/manifest.json" +``` + +Force image-based download and build the PDF explicitly, with eight workers: + +```bash +scriptoria-cli "https://gallica.bnf.fr/ark:/12148/btv1b8470209j" \ + --prefer-images --create-pdf --workers 8 +``` + +Inspect and repair local state without launching the web app: + +```bash scriptoria-cli --list scriptoria-cli --info MSS_Urb.lat.1779 +scriptoria-cli --set-status MSS_Urb.lat.1779 complete ``` + +## Related Docs + +- [Getting Started](../intro/getting-started.md) +- [Configuration Overview](configuration.md) +- [Runtime Paths](runtime-paths.md) +- [Provider Support](provider-support.md) diff --git a/static/img/morte_tamburo.png b/static/img/morte_tamburo.png new file mode 100644 index 0000000..53d6cb2 Binary files /dev/null and b/static/img/morte_tamburo.png differ
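The CLI reference above names scripted batch acquisition as one of the CLI's intended uses. A minimal Python sketch of that idea follows. It uses only flags documented in the reference (`--output`, `--workers`, `--prefer-images`, `--create-pdf`, `--ocr`); the helper names `build_command` and `acquire_all` are hypothetical, and treating a non-zero exit code as failure is an assumption about the CLI, not documented behavior.

```python
from __future__ import annotations

import subprocess
from typing import Callable, Iterable, Sequence


def build_command(
    url: str,
    *,
    output: str | None = None,
    workers: int | None = None,
    prefer_images: bool = False,
    create_pdf: bool = False,
    ocr_model: str | None = None,
) -> list[str]:
    """Build the argv list for one scriptoria-cli acquisition run."""
    if workers is not None and workers < 1:
        raise ValueError("workers must be a positive integer")
    cmd = ["scriptoria-cli", url]
    if output:
        cmd += ["--output", output]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if prefer_images:
        cmd.append("--prefer-images")
    if create_pdf:
        cmd.append("--create-pdf")
    if ocr_model:
        cmd += ["--ocr", ocr_model]
    return cmd


def _run(cmd: Sequence[str]) -> int:
    """Execute one command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


def acquire_all(
    urls: Iterable[str],
    runner: Callable[[Sequence[str]], int] = _run,
    **options,
) -> dict[str, int]:
    """Run one CLI invocation per URL, sequentially, and map URL to exit code."""
    return {url: runner(build_command(url, **options)) for url in urls}
```

Runs are sequential on purpose: `--workers` already parallelizes page downloads inside a single run, so launching several runs at once would multiply the load on the upstream provider. The `runner` parameter exists so the command construction can be exercised without invoking the real binary.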