diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6a2ad97e1e..42a271bd26 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Multistatic fusion guard interval is now operator-configurable — fixes permanent trust demotion with WiFi-synced ESP32 nodes (#1049).** Two independently-clocked ESP32-S3 boards on ESP-NOW sync drift 10–150 ms (typ. ~70 ms) — the 100 ms beacon + WiFi-MAC jitter cannot hold them within the published 60 ms default guard, so the governed-trust cycle permanently demoted to `Restricted`, suppressed all pose output, and spun the error counter to 200k+ with **no escape hatch but a container restart**. Added a **direct `WDP_GUARD_INTERVAL_US` override** (+ optional `WDP_SOFT_GUARD_US`) to `multistatic_guard_config_from_env`, so a deployment can lift the hard guard past its measured spread (e.g. `WDP_GUARD_INTERVAL_US=200000`) without having to know its exact TDM schedule. Precedence is most-specific-wins: a direct override beats the existing `WDP_TDM_SLOTS`+`WDP_TDM_SLOT_US` schedule-derived guard, which beats the 60 ms/20 ms default; the override is applied on top of whichever base is selected, the soft band is always clamped strictly below the hard guard, and a malformed/zero value is ignored (falls back to the base rather than breaking fusion). The effective guard is now logged at startup. Pinned by 6 new tests (`multistatic_guard_config_tests`): direct-override-wins / beats-TDM-derived / soft-clamped-below-hard / lowering-hard-pulls-soft-down / malformed-or-zero-falls-back / default-when-unset. `wifi-densepose-sensing-server` bin tests **449 → 455**, 0 failed; Python proof VERDICT PASS, hash unchanged (off the signal proof path).
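The precedence rule in the guard-interval entry above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in `multistatic_guard_config_from_env`: `GuardConfig`, `resolve_guard`, `resolve_guard_from_env`, and `parse_positive_us` are hypothetical names, the TDM-derived base is passed in already computed because its derivation from `WDP_TDM_SLOTS`/`WDP_TDM_SLOT_US` is not spelled out in the entry, and clamping the soft band to `hard - 1` is just one assumed way of keeping it strictly below the hard guard.

```rust
/// Guard thresholds in microseconds (hypothetical struct for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GuardConfig {
    pub hard_us: u64,
    pub soft_us: u64,
}

/// Defaults quoted in the changelog entry: 60 ms hard guard, 20 ms soft band.
pub const DEFAULT_GUARD: GuardConfig = GuardConfig { hard_us: 60_000, soft_us: 20_000 };

/// Parse a raw env value. Malformed or zero input yields `None`, so the caller
/// falls back to the base instead of breaking fusion.
fn parse_positive_us(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok()).filter(|&v| v > 0)
}

/// Most-specific-wins: direct override > TDM-derived base > built-in default.
/// `tdm_derived` is assumed to be computed elsewhere from the TDM schedule.
pub fn resolve_guard(
    tdm_derived: Option<GuardConfig>,
    hard_override: Option<&str>,
    soft_override: Option<&str>,
) -> GuardConfig {
    // Pick the base first, then layer the overrides on top of it.
    let base = tdm_derived.unwrap_or(DEFAULT_GUARD);
    let hard_us = parse_positive_us(hard_override).unwrap_or(base.hard_us);
    let soft_us = parse_positive_us(soft_override).unwrap_or(base.soft_us);
    // Assumed clamp: keep the soft band strictly below the hard guard.
    let soft_us = soft_us.min(hard_us.saturating_sub(1));
    GuardConfig { hard_us, soft_us }
}

/// Thin wrapper that reads the two override variables named in the entry.
#[allow(dead_code)]
pub fn resolve_guard_from_env(tdm_derived: Option<GuardConfig>) -> GuardConfig {
    let hard = std::env::var("WDP_GUARD_INTERVAL_US").ok();
    let soft = std::env::var("WDP_SOFT_GUARD_US").ok();
    resolve_guard(tdm_derived, hard.as_deref(), soft.as_deref())
}
```

With no variables set, `resolve_guard(None, None, None)` returns the 60 ms/20 ms default; passing `Some("200000")` as the hard override lifts only the hard guard and leaves the soft band at the base value.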
 ### Security
+- **Docker image `ruvnet/wifi-densepose:latest` — runtime base moved from `debian:bookworm-slim` to distroless `gcr.io/distroless/cc-debian12:debug`, shrinking the OS attack surface (#1140).** Docker Scout flagged **37 base-image CVEs** (1 Critical / 2 High / 4 Medium / 28 Low / 2 unspecified) against the Debian base packages (`perl`, `tar`, `pam`, `glibc`, `systemd`, `util-linux`, …). Analysis confirmed every CVE is real but **none is reachable in this image**: the runtime executes a single Rust binary with no Perl/Python/compiler, the 1 Critical + 2 High are all `perl` and 6 of the 8 perl CVEs (incl. both Highs) live in modules (`IO::Compress`/`IO::Uncompress`/`HTTP::Tiny`) that aren't even installed in the slim base, and the remainder is the standard Debian baseline long-tail — all with **no fix available in Debian 12**, so re-pulling the slim base would clear nothing. The fix swaps the runtime stage to distroless `cc-debian12`, which ships only glibc + libgcc/libstdc++ + libssl + the CA bundle (everything a glibc-linked Rust binary needs) and drops `perl`, `apt`, `tar`, `gnupg`, `pam`, `shadow`, `systemd` libs, etc. at the source — eliminating the large majority of the 37 findings. The `:debug` variant is used because the entrypoint is a `/bin/sh` script carrying real security logic (#864 fail-closed auth + cog-ha-matter/homecore routing); it bundles a static busybox shell at `/busybox/sh`, through which the entrypoint is invoked explicitly. The `#520/#514` UI-asset + binary regression check moved from the (now shell-less) runtime stage into the builder stage.
**MEASURED (linux/arm64, Docker 29.5.3):** image builds clean (80.3 MB); the busybox entrypoint executes the #864 fail-closed logic (default 0.0.0.0 + no token → `exit 64`); the Rust binary boots in distroless (glibc + libssl3 resolve, no missing-shared-library errors) and serves `GET /ui/index.html` → 200 and live `GET /api/v1/sensing/latest` JSON under `CSI_SOURCE=simulated`; the runtime's tracked dpkg set drops to **10 packages** (`base-files, gcc-12-base, libc6, libgcc-s1, libgomp1, libssl3, libstdc++6, media-types, netbase, tzdata`) — `perl`/`apt`/`tar`/`gnupg`/`pam`/`shadow`/`systemd`/`util-linux`/`coreutils` all absent, eliminating the 1 Critical + 2 High + 4 Medium and ~21 of the 28 Low findings; only the irreducible glibc/openssl/gcc-base baseline Lows remain. The base-image swap does not touch CSI ingestion, so on-hardware ESP32 behaviour is unchanged from the prior image. - **`wifi-densepose-occworld-candle` — beyond-SOTA security + correctness review (Milestone #9, crate 4/4).** (1) **HIGH (MEASURED) — checkpoint-load crash on any int32 tensor** (`model.rs::safetensor_dtype_to_candle`). `safetensors::Dtype::I32` was mapped to `candle_core::DType::I64` and the raw int32 byte buffer (4 bytes/elem) was then handed to `Tensor::from_raw_buffer(.., I64, shape, ..)`. Candle derives `elem_count = data.len() / dtype.size_in_bytes()`, so the I64 path halved the element count while keeping the *original* shape — yielding a tensor whose declared shape claims twice as many elements as its backing storage holds. Reading it **panics** (`range end index 6 out of range for slice of length 3` — slice OOB inside candle-core) on any attacker-supplied or PyTorch-exported checkpoint containing an int32 tensor (common: index/buffer tensors). Fixed by mapping `I32 → DType::I32` (and `I16 → DType::I16`), both first-class candle dtypes. 
Reproduction recorded on old code; pinned by `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new) plus F32/I64/corrupt-file control cases. (2) **LOW (MEASURED) — `predict()` lacked frame/batch validation at the input boundary** (`inference.rs`). It validated H/W/D but not the externally-supplied frame count; an `f_in > num_frames*2` over-indexed the temporal positional embedding deep in the transformer and surfaced as a cryptic candle "gather" `InvalidIndex` (returned error, not a panic — candle bounds-checks), and a zero frame/batch dim fed a zero-element tensor into the pipeline. Now rejected at the boundary with a clear `ShapeMismatch`. Pinned by `predict_rejects_zero_frames` / `predict_rejects_too_many_frames` / `predict_accepts_frame_count_at_capacity`. (3) **LOW (MEASURED) — divide-by-zero panic on a degenerate input to the public `VQCodebook::encode`** (`vqvae.rs`): a rank-0 / empty-last-dim tensor made `last == 0` and panicked on `elem_count() / last`. Now fails closed with a clear error. Pinned by `encode_rejects_scalar_without_panicking`. **Dimensions confirmed CLEAN with evidence:** panic surface — zero `unwrap()`/`expect()`/`panic!`/`unreachable!` in production code paths (grep evidence; all error handling via `?`/`map_err`); NaN-state-poisoning — N/A (engine is stateless between `predict` calls, input is `u8` class indices so non-finite input is structurally impossible, no persistent world-model buffer to latch into); unbounded-alloc / shape-data mismatch from malformed weights — defended upstream by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared byte range, rejected before reaching candle); secrets — none (grep clean, only `token_h`/`token_w` config fields match). `unsafe_code = forbid` in the crate manifest. 
**Build/validation status (MEASURED on Windows):** crate builds and tests under `cargo test -p wifi-densepose-occworld-candle --no-default-features` — **29/29 pass** (20 unit + 4 checkpoint_loading + 3 predict_honesty + 2 doc) after fixes; `cargo test --workspace --no-default-features` = 0 failed across all crates (lone `wifi-densepose-desktop` `api_integration` failure was a Windows "Access is denied (os error 5)" file-lock flake — re-ran in isolation **21/21 pass**); Python proof VERDICT PASS, hash `f8e76f21…446f7a` unchanged. *Warrants ADR slot 179 (parent to author).* - **`wifi-densepose-wasm-edge` beyond-SOTA closing review — boundary NaN-state-poisoning guard + clean-with-evidence attestation (ADR-040 edge crate, ~70 modules).** Closing pass of the security campaign over the last untouched sizeable crate. **One real finding fixed (LOW / source-analysis + reproduced):** the two WASM↔host frame boundaries (`lib.rs::on_frame`/`on_timer` and `bin/ghost_hunter.rs::on_frame`) read raw IEEE-754 `f32` from the `csi_get_phase`/`csi_get_amplitude`/`csi_get_variance`/`csi_get_motion_energy` host imports **without any finiteness check** — the entire crate had **zero** `is_finite`/`is_nan` guards, and the in-crate `clamp` helpers propagate NaN (`NaN < lo` and `NaN > hi` are both false). A single non-finite value (firmware DSP bug, uninitialised buffer, or hostile host) latches NaN into the long-lived per-module accumulators (EMA, Welford, phasor sums, anomaly baselines); once latched, every downstream comparison evaluates `false`, so detectors fail **degraded** (stuck gate state, silently-disabled anomaly checks) — silent corruption, not a crash (WASM `panic=abort` is *not* tripped: no indexing/`unwrap` on the poisoned value). Threat model is a **semi-trusted** boundary (the Tier-2 DSP firmware supplies the imports, not direct network/JS), hence LOW severity / defense-in-depth. 
**Fix:** added `sanitize_host_f32()` (maps non-finite→`0.0`, `core`-only so it holds in `no_std`) applied at every `host_get_*` float read — a single chokepoint covering all ~70 downstream modules, mirroring the existing M-01 negative-`n_subcarriers` boundary clamp. **Pinned by** `boundary_tests::{sanitize_passes_finite_values_through, sanitize_maps_non_finite_to_zero, coherence_monitor_nan_latches_without_sanitize_but_not_with}` — the last asserts on the *current* `CoherenceMonitor` that a raw NaN frame latches the smoothed score (documents the hazard) while the boundary-sanitized path stays finite. **Dimensions attested CLEAN with evidence (source-analysis):** (a) **panic-on-input** — every non-test `unwrap()`/`expect()` is either `#[cfg(test)]` or in the `std`-gated RVF *builder* host tool writing to an in-memory `Vec` (infallible); no `panic!`/`unreachable!`/`todo!`/`get_unchecked` in any hot path. (b) **shape/bounds** — all frame-buffer access is `min()`-clamped (`MAX_SC=32`, `DTW_MAX_LEN`, `LCS_WINDOW`, `PATTERN_LEN`), all index-by-cast sites (`feature_id as usize`, `conclusion_id`, `minute_counter`, `plan_step`) are either compile-time-const-bounded or `if idx <`/`%`-guarded; negative `n_subcarriers` already mapped to 0 (M-01). (c) **memory/leak** — no `move ||` closures, no `mem::forget`/`Box::leak`/`.leak()`; the only `Box::new` is in the `std`-gated `skill_registry` (one-time init, bounded). (d) **secrets** — none (grep clean). **MEASURED build/test evidence:** host `cargo test --features std,medical-experimental` = **672 passed / 0 failed** (was 669 pre-fix; +3 new tests); the real deployment artifacts all build clean on the actual target — `cargo build --target wasm32-unknown-unknown --release` (no_std/panic=abort default lib), `--bin ghost_hunter --no-default-features --features standalone-bin`, and `--features medical-experimental` (toolchain 1.89 per `rust-toolchain.toml`). 
No ADR slot needed — a single LOW defense-in-depth boundary fix; CHANGELOG attestation suffices. - **ADR-131 HOMECORE-UI BFF gateway — public-PR review fixes (PR #1082).** (1) **HIGH — path-traversal / confused-deputy SSRF closed in the `/api/cal/*` reverse-proxy** (`homecore-server/src/gateway.rs`). The wildcard proxy path was interpolated straight into the upstream URL while `proxy()` attaches the server-side calibration bearer, so `/api/cal/v1/../../x` (and percent-encoded `..%2f`, `%2e%2e`, leading `/`, backslash, double-encoded `%252e`) could escape the `…/api/` scope **with the privileged token**. Now `validate_proxy_path()` decode-then-checks and rejects absolute/backslash/dot-segment/encoded-traversal paths with a typed **400 BEFORE the URL is built** (applies to GET **and** POST); legit `v1/...` paths still pass. Pinned by `cal_proxy_rejects_traversal_with_400_before_upstream` (fails on old code) + `validate_proxy_path_rejects_traversal_variants`. (2) **CORS + request-tracing now cover the gateway routes.** `/api/homecore/*` and `/api/cal/*` were `.merge()`d **outside** the layers `homecore-api::router()` applies, leaving them with no CORS allowlist and untraced; the audited `build_cors_layer()` (HC-05) + `TraceLayer` are now applied to the whole merged surface in `main.rs`. Pinned by `gateway_routes_are_cors_covered_after_merge` (Vite-dev-origin preflight succeeds on a gateway route). (3) **Fabricated-data honesty (§6 invariant 3):** the gateway no longer injects a hardcoded `anomaly.threshold: 0.5` — it passes through the REAL upstream threshold or emits `null` (withheld); the dashboard renders a not-available `—` instead of `"null%"`/`"null°C"` for null appliance metrics; the COG panel's Hailo-worker pill reflects the real appliance probe instead of a hardcoded `"connected"`; `rooms.js` treats a null anomaly threshold as withheld, not a fake `0.8` default. 
(4) **Robustness:** a forwarded `hef` that is a string (not an array) no longer throws in the COG panel; the calibration wizard guards `frames/target` against `NaN%`/`Infinity%` and clears its baseline poll timer on Restart / panel teardown (leaked `setTimeout` loop fixed). (5) **Perf:** per-bank RoomState fetches and the appliance service probes now run concurrently (`futures::join_all`; async `tokio::net::TcpStream` + `timeout` replaces the blocking `connect_timeout` that parked a worker per probe); the mock fixture module is now a dynamic `import()` gated on demo mode so production never bundles it. **Note (workspace-wide, not fixed here):** `homecore-server` requests `reqwest`'s `rustls-tls` only, but cargo feature-unification means a sibling crate enabling the default `native-tls` re-introduces OpenSSL into the final binary regardless — a true "no OpenSSL on the appliance" guarantee requires aligning every reqwest-pulling crate on rustls-only. **Note (pre-existing, out of scope):** DEV-mode `allow_any_non_empty()` bearer auth when `HOMECORE_TOKENS` is unset on `0.0.0.0` is unchanged; the loud `warn!` at boot is retained — provision real tokens before network exposure. **Verified:** `cargo test -p homecore-server --no-default-features` = **18/18 pass**, `cargo build -p homecore-server` clean, UI suite (`node tests`) all green, Python proof VERDICT PASS (hash unchanged). 
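The decode-then-check idea behind the `/api/cal/*` fix in item (1) of the ADR-131 entry can be illustrated with a std-only sketch. It is not the `validate_proxy_path()` in `homecore-server/src/gateway.rs`: `check_proxy_path`, `PathRejection`, `percent_decode_once`, and `hex_val` are hypothetical names, and refusing any path that still contains `%` after one decoding pass is an assumed (deliberately strict) way to catch double-encoded input such as `%252e`. A caller would map any `Err` to a 400 before building the upstream URL.

```rust
/// Why a path was refused. Hypothetical type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PathRejection {
    MalformedEscape,
    DoubleEncoded,
    Absolute,
    Backslash,
    DotSegment,
}

/// Value of one ASCII hex digit, or `None` if the byte is not a hex digit.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode `%XX` escapes exactly once. Returns `None` on a truncated or
/// non-hex escape, or if the decoded bytes are not valid UTF-8.
fn percent_decode_once(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            let hi = hex_val(*bytes.get(i + 1)?)?;
            let lo = hex_val(*bytes.get(i + 2)?)?;
            out.push((hi << 4) | lo);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}

/// Decode first, then run every check on the decoded form, so an encoded
/// `..%2f` is judged the same way as a literal `../`.
pub fn check_proxy_path(raw: &str) -> Result<(), PathRejection> {
    let decoded = percent_decode_once(raw).ok_or(PathRejection::MalformedEscape)?;
    if decoded.contains('%') {
        // Something was encoded twice (e.g. `%252e` became `%2e`).
        return Err(PathRejection::DoubleEncoded);
    }
    if decoded.starts_with('/') {
        return Err(PathRejection::Absolute);
    }
    if decoded.contains('\\') {
        return Err(PathRejection::Backslash);
    }
    if decoded.split('/').any(|seg| seg == "." || seg == "..") {
        return Err(PathRejection::DotSegment);
    }
    Ok(())
}
```

Running the checks on the decoded string is the point of the design: a filter that inspects only the raw string would pass `%2e%2e/x`, which an upstream server may later decode into `../x`.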
diff --git a/docker/Dockerfile.rust b/docker/Dockerfile.rust
index 14c02efc42..07f518cc28 100644
--- a/docker/Dockerfile.rust
+++ b/docker/Dockerfile.rust
@@ -34,12 +34,42 @@ RUN cargo build --release -p wifi-densepose-sensing-server --features mqtt 2>&1
     && cargo build --release -p homecore-server 2>&1 \
     && strip target/release/sensing-server target/release/cog-ha-matter target/release/homecore-server
 
-# Stage 2: Runtime
-FROM debian:bookworm-slim
+# Copy the UI assets into the builder and sanity-check the full set the runtime
+# serves (regression guard for #520/#514 — the published image must include the
+# observatory and pose-fusion dashboards, not just the legacy `index.html` set,
+# plus the three release binaries). This runs here in the builder because the
+# runtime stage is now distroless (#1140) and has no shell to RUN checks in. A
+# missing asset or non-executable binary fails the build, so a stale image can't
+# be silently pushed.
+COPY ui/ /build/ui/
+RUN set -e; \
+    for f in /build/ui/index.html /build/ui/observatory.html /build/ui/pose-fusion.html /build/ui/viz.html; do \
+        test -f "$f" || { echo "FATAL: missing UI asset $f"; exit 1; }; \
+    done; \
+    for d in /build/ui/observatory /build/ui/pose-fusion /build/ui/components /build/ui/services; do \
+        test -d "$d" || { echo "FATAL: missing UI directory $d"; exit 1; }; \
+    done; \
+    test -x /build/target/release/sensing-server || { echo "FATAL: sensing-server is not executable"; exit 1; }; \
+    test -x /build/target/release/cog-ha-matter || { echo "FATAL: cog-ha-matter is not executable"; exit 1; }; \
+    test -x /build/target/release/homecore-server || { echo "FATAL: homecore-server is not executable"; exit 1; }; \
+    echo "image assets OK"
 
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    ca-certificates \
-    && rm -rf /var/lib/apt/lists/*
+# Stage 2: Runtime — distroless (Issue #1140).
+#
+# Previously debian:bookworm-slim, which carries the full Debian base package set
+# (perl-base, apt, tar, gnupg, pam, shadow, systemd libs, ...). Docker Scout
+# flagged 37 base-image CVEs against those source packages. None are reachable —
+# the container runs a single Rust binary, with no Perl/Python/compiler — but
+# they pad every scan, and none have a fix in Debian 12, so re-pulling the slim
+# base clears nothing. Distroless cc-debian12 ships only glibc + libgcc/libstdc++
+# + libssl + the CA bundle (everything a glibc-linked Rust binary needs) and
+# drops the large majority of those findings at the source.
+#
+# We use the :debug variant on purpose: the entrypoint is a /bin/sh script with
+# real security logic (#864 fail-closed auth, cog-ha-matter/homecore routing).
+# :debug bundles a static busybox shell at /busybox/sh, through which the
+# entrypoint is invoked explicitly below.
+FROM gcr.io/distroless/cc-debian12:debug
 
 WORKDIR /app
 
@@ -51,22 +81,6 @@ COPY --from=builder /build/target/release/homecore-server /app/homecore-server
 # Copy UI assets
 COPY ui/ /app/ui/
 
-# Sanity-check the assets the runtime actually serves (regression guard for
-# #520/#514 — the published image must include the observatory and pose-fusion
-# dashboards, not just the legacy `index.html` set). Build fails if any of
-# these are missing, so a stale image can't be silently pushed.
-RUN set -e; \
-    for f in /app/ui/index.html /app/ui/observatory.html /app/ui/pose-fusion.html /app/ui/viz.html; do \
-        test -f "$f" || { echo "FATAL: missing UI asset $f"; exit 1; }; \
-    done; \
-    for d in /app/ui/observatory /app/ui/pose-fusion /app/ui/components /app/ui/services; do \
-        test -d "$d" || { echo "FATAL: missing UI directory $d"; exit 1; }; \
-    done; \
-    test -x /app/sensing-server || { echo "FATAL: /app/sensing-server is not executable"; exit 1; }; \
-    test -x /app/cog-ha-matter || { echo "FATAL: /app/cog-ha-matter is not executable"; exit 1; }; \
-    test -x /app/homecore-server || { echo "FATAL: /app/homecore-server is not executable"; exit 1; }; \
-    echo "image assets OK"
-
 # Optional bearer-token auth on /api/v1/*: leave unset for LAN-mode (default),
 # set to enforce `Authorization: Bearer ` (see bearer_auth module, #443).
 # docker run -e RUVIEW_API_TOKEN=$(openssl rand -hex 32) ...
@@ -103,5 +117,9 @@ COPY docker/docker-entrypoint.sh /app/docker-entrypoint.sh
 # Exec-form ENTRYPOINT so Docker appends user arguments correctly.
 # Pass flags directly: docker run --source esp32 --tick-ms 500
 # Or use env vars: docker run -e CSI_SOURCE=esp32
-ENTRYPOINT ["/app/docker-entrypoint.sh"]
+#
+# Invoked through the distroless busybox shell (#1140): the runtime base has no
+# /bin/sh, so the entrypoint script is run explicitly via /busybox/sh rather than
+# relying on its `#!/bin/sh` shebang.
+ENTRYPOINT ["/busybox/sh", "/app/docker-entrypoint.sh"]
 CMD []