bleu · brunota20 · Jun 17, 2026
diff --git a/docs/deployment.md b/docs/deployment.md
@@ -0,0 +1,253 @@
+# Deploying Shepherd
+
+This guide covers the **operator** side — running a `nexum-engine`
+instance against a fleet of WASM modules. For module-author topics
+(building a module from scratch, writing tests, packaging) see the
+[SDK overview](./sdk.md) and the [first-module
+tutorial](./tutorial-first-module.md).
+
+## What an operator runs
+
+A Shepherd deployment is one or more `nexum-engine` processes, each
+pointed at:
+
+1. an `engine.toml` describing the local environment (chain RPCs,
+   resource caps, where state lives);
+2. one or more `[[modules]]` entries listing `.wasm` artefacts and
+   their `module.toml` manifests;
+3. a `state_dir` the engine creates / owns (the redb local-store
+   database).
+
+Modules are statically declared in `engine.toml`. The engine does
+not pull them from a registry today; you ship the `.wasm` files
+alongside the binary and reference them by path.
+
+## `engine.toml` reference
+
+```toml
+[engine]
+# Directory the local-store redb file (and future engine artefacts)
+# will be created under. Created automatically at boot.
+state_dir = "./data"
+
+# `tracing_subscriber::EnvFilter`-compatible directive. `RUST_LOG`
+# overrides at process start.
+log_level = "info"
+
+# Resource caps applied to every module store at instantiation.
+# wasmtime traps a module that overruns either; the supervisor then
+# logs and continues on the next event.
+[engine.limits]
+# Fuel budget granted before every `on_event` invocation.
+# 1 unit ~ 1 wasm instruction. 1 billion ~ ~1 second of pure compute.
+fuel_per_event = 1_000_000_000
+# Linear-memory ceiling per module, in bytes. Default 64 MiB.
+memory_bytes = 67_108_864
+
+# One [chains.<id>] table per chain the engine should be able to
+# reach. Chain ids are EVM decimal.
+#
+#   ws:// + wss:// — alloy pubsub transport (REQUIRED for the
+#                    eth_subscribe-backed [[subscription]] kinds:
+#                    `block`, `log`).
+#   http:// + https:// — HTTP transport; request/response only,
+#                        no subscriptions.
+#
+# Mix and match: a chain used only for eth_call (e.g. a Chainlink
+# oracle module) can be HTTP; chains carrying log subscriptions
+# need WebSocket.
+
+[chains.1]
+rpc_url = "https://ethereum-rpc.publicnode.com"
+
+[chains.100]
+rpc_url = "https://rpc.gnosischain.com"
+
+[chains.11155111]
+rpc_url = "wss://ethereum-sepolia-rpc.publicnode.com"
+
+[chains.42161]
+rpc_url = "https://arb1.arbitrum.io/rpc"
+```
+
+### `[[modules]]` entries
+
+> 0.2 takes the module path + manifest as positional CLI args (a
+> single module per engine process). The multi-module
+> `[[modules]]` array is shipped by the supervisor work in BLEU-818
+> (nullislabs/shepherd PR #9).
+
+Once the supervisor PR lands, the syntax is:
+
+```toml
+[[modules]]
+name = "twap-monitor"
+wasm = "modules/twap-monitor.wasm"
+manifest = "modules/twap-monitor/module.toml"
+
+[[modules]]
+name = "ethflow-watcher"
+wasm = "modules/ethflow-watcher.wasm"
+manifest = "modules/ethflow-watcher/module.toml"
+```
+
+## Building module `.wasm` artefacts
+
+Modules compile to the `wasm32-wasip2` target. Add the target once
+per dev machine:
+
+```sh
+rustup target add wasm32-wasip2
+```
+
+Then build release artefacts from the workspace root:
+
+```sh
+cargo build --target wasm32-wasip2 --release \
+  -p twap-monitor -p ethflow-watcher
+```
+
+The `.wasm` files land in
+`target/wasm32-wasip2/release/{twap_monitor,ethflow_watcher}.wasm`.
+Copy them to wherever your `engine.toml` points (typical:
+`./modules/` next to the binary).
+
+Size sanity check after a build (CI guards regression):
+
+```sh
+ls -lh target/wasm32-wasip2/release/*.wasm
+```
+
+The M2 modules sit at 270–310 KB optimised. Sudden +10× growth
+usually means a fresh dependency landed in the wasm graph — review
+`cargo tree -p <module> --target wasm32-wasip2` to confirm.
+
+## Single-binary local runs
+
+The 0.2 engine ships as the `nexum-engine` binary. From the
+workspace root, dispatch a module against a test event:
+
+```sh
+cargo run -p nexum-engine -- \
+  target/wasm32-wasip2/release/twap_monitor.wasm \
+  modules/twap-monitor/module.toml
+```
+
+On a fresh checkout, the engine creates `./data/local-store.redb`,
+opens RPC providers for the chains in `engine.toml`, loads the
+component, calls `init`, and dispatches a synthetic block event.
+Console output is `tracing` JSON (or pretty if you set
+`RUST_LOG=info,nexum_engine=debug`).
+
+For systemd-style production runs, see BLEU-850 (Production
+deployment guide) once it lands.
+
+## Docker
+
+A reference Dockerfile + Compose file is tracked as BLEU-X18 (M5).
+Until that lands, build manually:
+
+```dockerfile
+# (sketch — full Dockerfile lands in BLEU-X18)
+FROM rust:1.91 as build
+COPY . /src
+WORKDIR /src
+RUN cargo build --release -p nexum-engine
+RUN rustup target add wasm32-wasip2 \
+ && cargo build --target wasm32-wasip2 --release \
+      -p twap-monitor -p ethflow-watcher
+
+FROM gcr.io/distroless/cc-debian12
+COPY --from=build /src/target/release/nexum-engine /usr/local/bin/
+COPY --from=build /src/target/wasm32-wasip2/release/*.wasm /modules/
+COPY engine.toml /etc/shepherd/engine.toml
+ENTRYPOINT ["/usr/local/bin/nexum-engine"]
+```
+
+Mount the `state_dir` as a volume so the redb file survives container
+restarts.
+
+## Observability
+
+### Logs
+
+Every host backend logs through `tracing`. Set `RUST_LOG` to filter:
+
+```sh
+RUST_LOG=info,nexum_engine=debug,nexum_engine::host::cow_orderbook=trace \
+  cargo run -p nexum-engine -- ...
+```
+
+Recommended baseline for production:
+
+```
+RUST_LOG=info,nexum_engine::host=debug
+```
+
+The structured-logging audit (BLEU-X13) consolidates the field set
+across every dispatch / state change / submission path so a single
+JSON grep reconstructs each order's timeline.
+
+### Prometheus metrics
+
+BLEU-X14 wires a `metrics-exporter-prometheus` endpoint at
+`engine.toml::[engine.metrics].bind_addr` (default
+`127.0.0.1:9100`). Once it lands, scrape with:
+
+```yaml
+scrape_configs:
+  - job_name: shepherd
+    static_configs:
+      - targets: ['shepherd-host:9100']
+```
+
+Suggested Grafana panels (BLEU-X15 ships the dashboard JSON):
+
+- Module uptime — `shepherd_module_uptime_seconds{module}`
+- Event latency p50 / p95 / p99 —
+  `shepherd_event_latency_seconds{module}`
+- Submit success rate —
+  `rate(shepherd_cow_api_submit_total{outcome="success"}[5m])`
+  /
+  `rate(shepherd_cow_api_submit_total[5m])`
+- Fuel headroom —
+  `1 - (shepherd_fuel_consumed / 1_000_000_000)`
+- Memory pressure —
+  `shepherd_memory_peak_bytes / 67_108_864`
+
+## Backups
+
+`state_dir/local-store.redb` is the only durable state the engine
+holds. redb's WAL means a file-level snapshot taken while the
+engine is running is consistent; for safety, either:
+
+- Pause the engine (`systemctl stop shepherd`), copy the file, then
+  restart. Sub-second downtime on a small store.
+- Use `redb::Database::backup` from a sidecar (BLEU-X16 documents
+  the helper).
+
+The store is per-module-namespaced (32-byte keccak prefix per
+`module.name`), so a fresh deployment can re-import partial backups
+without cross-module bleed.
+
+## Troubleshooting
+
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| `init failed: unsupported` | Module imports a capability that needs a chain RPC not configured. | Add the missing `[chains.<id>]` entry to `engine.toml`. |
+| `unknown chain ... (no engine.toml RPC entry)` | Module dispatched `chain::request` for a chain not in `engine.toml`. | Same — add the chain. |
+| `OutOfFuel` trap, immediate restart loop | Module's `on_event` exceeds `[engine.limits].fuel_per_event`. | Bump `fuel_per_event`, or audit the module's loop bounds. |
+| `MemoryOutOfBounds` trap | Module's linear-memory growth exceeds `[engine.limits].memory_bytes`. | Bump `memory_bytes`; profile the module for runaway allocations. |
+| `submit failed (... InvalidAppData)` | Module sent an `OrderCreation` with a non-empty app-data hash but `app_data = "{}"`. | Out of M2 scope — modules currently only support `EMPTY_APP_DATA_JSON`. Patch is on the M3 follow-up board. |
+
+## Reference
+
+- [SDK overview](./sdk.md)
+- [First-module tutorial](./tutorial-first-module.md) (BLEU-848)
+- ADR-0001 (`docs/adr/0001-engine-toml-separate-from-nexum-toml.md`)
+  — why `engine.toml` and `module.toml` are split.
+- ADR-0003 (`docs/adr/0003-local-store-namespacing.md`) — how the
+  `state_dir/local-store.redb` file partitions across modules.
+- ADR-0005 (`docs/adr/0005-cow-api-via-cached-orderbookapi.md`) —
+  how the `cow-api` host backend caches per-chain `OrderBookApi`
+  clients.