1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Multistatic fusion guard interval is now operator-configurable — fixes permanent trust demotion with WiFi-synced ESP32 nodes (#1049).** Two independently-clocked ESP32-S3 boards on ESP-NOW sync drift 10–150 ms (typ. ~70 ms) — the 100 ms beacon + WiFi-MAC jitter cannot hold them within the published 60 ms default guard, so the governed-trust cycle permanently demoted to `Restricted`, suppressed all pose output, and spun the error counter to 200k+ with **no escape hatch but a container restart**. Added a **direct `WDP_GUARD_INTERVAL_US` override** (+ optional `WDP_SOFT_GUARD_US`) to `multistatic_guard_config_from_env`, so a deployment can lift the hard guard past its measured spread (e.g. `WDP_GUARD_INTERVAL_US=200000`) without having to know its exact TDM schedule. Precedence is most-specific-wins: a direct override beats the existing `WDP_TDM_SLOTS`+`WDP_TDM_SLOT_US` schedule-derived guard, which beats the 60 ms/20 ms default; the override is applied on top of whichever base is selected, the soft band is always clamped strictly below the hard guard, and a malformed/zero value is ignored (falls back to the base rather than breaking fusion). The effective guard is now logged at startup. Pinned by 6 new tests (`multistatic_guard_config_tests`): direct-override-wins / beats-TDM-derived / soft-clamped-below-hard / lowering-hard-pulls-soft-down / malformed-or-zero-falls-back / default-when-unset. `wifi-densepose-sensing-server` bin tests **449 → 455**, 0 failed; Python proof VERDICT PASS, hash unchanged (off the signal proof path).
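The precedence rules in the entry above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in `multistatic_guard_config_from_env`: `GuardConfig`, `resolve_guard`, and `parse_nonzero` are hypothetical names, the TDM-derived base (slot count times slot width, soft band at one third of it) is an assumed placeholder for the real derivation, and the string arguments stand in for the raw `WDP_GUARD_INTERVAL_US`, `WDP_SOFT_GUARD_US`, `WDP_TDM_SLOTS`, and `WDP_TDM_SLOT_US` values.

```rust
/// Hypothetical stand-in for the real guard config; both values in microseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GuardConfig {
    pub hard_us: u64,
    pub soft_us: u64,
}

/// Published defaults from the changelog entry: 60 ms hard, 20 ms soft.
const DEFAULT_HARD_US: u64 = 60_000;
const DEFAULT_SOFT_US: u64 = 20_000;

/// Malformed or zero values are treated as "unset" so they fall back to the base.
fn parse_nonzero(raw: Option<&str>) -> Option<u64> {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&v| v > 0)
}

/// Most-specific-wins: direct override > TDM-derived > default.
pub fn resolve_guard(
    direct_hard: Option<&str>,
    direct_soft: Option<&str>,
    tdm_slots: Option<&str>,
    tdm_slot_us: Option<&str>,
) -> GuardConfig {
    // Pick the base: schedule-derived when both TDM values are usable, else default.
    let mut cfg = match (parse_nonzero(tdm_slots), parse_nonzero(tdm_slot_us)) {
        (Some(slots), Some(slot_us)) => {
            // ASSUMPTION: placeholder derivation, not the project's formula.
            let hard = slots.saturating_mul(slot_us);
            GuardConfig { hard_us: hard, soft_us: hard / 3 }
        }
        _ => GuardConfig { hard_us: DEFAULT_HARD_US, soft_us: DEFAULT_SOFT_US },
    };

    // Apply the direct overrides on top of whichever base was selected.
    if let Some(hard) = parse_nonzero(direct_hard) {
        cfg.hard_us = hard;
    }
    if let Some(soft) = parse_nonzero(direct_soft) {
        cfg.soft_us = soft;
    }

    // The soft band must stay strictly below the hard guard.
    if cfg.soft_us >= cfg.hard_us {
        cfg.soft_us = cfg.hard_us.saturating_sub(1);
    }
    cfg
}
```

With `WDP_GUARD_INTERVAL_US=200000` and nothing else set, this sketch yields a 200 ms hard guard over the default 20 ms soft band; lowering the hard guard below the soft band pulls the soft band down with it.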

### Security
- **Docker image `ruvnet/wifi-densepose:latest` — runtime base moved from `debian:bookworm-slim` to distroless `gcr.io/distroless/cc-debian12:debug`, shrinking the OS attack surface (#1140).** Docker Scout flagged **37 base-image CVEs** (1 Critical / 2 High / 4 Medium / 28 Low / 2 unspecified) against the Debian base packages (`perl`, `tar`, `pam`, `glibc`, `systemd`, `util-linux`, …). Analysis confirmed every CVE is real but **none is reachable in this image**: the runtime executes a single Rust binary with no Perl/Python/compiler, the 1 Critical + 2 High are all `perl` and 6 of the 8 perl CVEs (incl. both Highs) live in modules (`IO::Compress`/`IO::Uncompress`/`HTTP::Tiny`) that aren't even installed in the slim base, and the remainder is the standard Debian baseline long-tail — all with **no fix available in Debian 12**, so re-pulling the slim base would clear nothing. The fix swaps the runtime stage to distroless `cc-debian12`, which ships only glibc + libgcc/libstdc++ + libssl + the CA bundle (everything a glibc-linked Rust binary needs) and drops `perl`, `apt`, `tar`, `gnupg`, `pam`, `shadow`, `systemd` libs, etc. at the source — eliminating the large majority of the 37 findings. The `:debug` variant is used because the entrypoint is a `/bin/sh` script carrying real security logic (#864 fail-closed auth + cog-ha-matter/homecore routing); it bundles a static busybox shell at `/busybox/sh`, through which the entrypoint is invoked explicitly. The `#520/#514` UI-asset + binary regression check moved from the (now shell-less) runtime stage into the builder stage. 
**MEASURED (linux/arm64, Docker 29.5.3):** image builds clean (80.3 MB); the busybox entrypoint executes the #864 fail-closed logic (default 0.0.0.0 + no token → `exit 64`); the Rust binary boots in distroless (glibc + libssl3 resolve, no missing-shared-library errors) and serves `GET /ui/index.html` → 200 and live `GET /api/v1/sensing/latest` JSON under `CSI_SOURCE=simulated`; the runtime's tracked dpkg set drops to **10 packages** (`base-files, gcc-12-base, libc6, libgcc-s1, libgomp1, libssl3, libstdc++6, media-types, netbase, tzdata`) — `perl`/`apt`/`tar`/`gnupg`/`pam`/`shadow`/`systemd`/`util-linux`/`coreutils` all absent, eliminating the 1 Critical + 2 High + 4 Medium and ~21 of the 28 Low findings; only the irreducible glibc/openssl/gcc-base baseline Lows remain. The base-image swap does not touch CSI ingestion, so on-hardware ESP32 behaviour is unchanged from the prior image.
- **`wifi-densepose-occworld-candle` — beyond-SOTA security + correctness review (Milestone #9, crate 4/4).** (1) **HIGH (MEASURED) — checkpoint-load crash on any int32 tensor** (`model.rs::safetensor_dtype_to_candle`). `safetensors::Dtype::I32` was mapped to `candle_core::DType::I64` and the raw int32 byte buffer (4 bytes/elem) was then handed to `Tensor::from_raw_buffer(.., I64, shape, ..)`. Candle derives `elem_count = data.len() / dtype.size_in_bytes()`, so the I64 path halved the element count while keeping the *original* shape — yielding a tensor whose declared shape claims twice as many elements as its backing storage holds. Reading it **panics** (`range end index 6 out of range for slice of length 3` — slice OOB inside candle-core) on any attacker-supplied or PyTorch-exported checkpoint containing an int32 tensor (common: index/buffer tensors). Fixed by mapping `I32 → DType::I32` (and `I16 → DType::I16`), both first-class candle dtypes. Reproduction recorded on old code; pinned by `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new) plus F32/I64/corrupt-file control cases. (2) **LOW (MEASURED) — `predict()` lacked frame/batch validation at the input boundary** (`inference.rs`). It validated H/W/D but not the externally-supplied frame count; an `f_in > num_frames*2` over-indexed the temporal positional embedding deep in the transformer and surfaced as a cryptic candle "gather" `InvalidIndex` (returned error, not a panic — candle bounds-checks), and a zero frame/batch dim fed a zero-element tensor into the pipeline. Now rejected at the boundary with a clear `ShapeMismatch`. Pinned by `predict_rejects_zero_frames` / `predict_rejects_too_many_frames` / `predict_accepts_frame_count_at_capacity`. (3) **LOW (MEASURED) — divide-by-zero panic on a degenerate input to the public `VQCodebook::encode`** (`vqvae.rs`): a rank-0 / empty-last-dim tensor made `last == 0` and panicked on `elem_count() / last`. 
Now fails closed with a clear error. Pinned by `encode_rejects_scalar_without_panicking`. **Dimensions confirmed CLEAN with evidence:** panic surface — zero `unwrap()`/`expect()`/`panic!`/`unreachable!` in production code paths (grep evidence; all error handling via `?`/`map_err`); NaN-state-poisoning — N/A (engine is stateless between `predict` calls, input is `u8` class indices so non-finite input is structurally impossible, no persistent world-model buffer to latch into); unbounded-alloc / shape-data mismatch from malformed weights — defended upstream by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared byte range, rejected before reaching candle); secrets — none (grep clean, only `token_h`/`token_w` config fields match). `unsafe_code = forbid` in the crate manifest. **Build/validation status (MEASURED on Windows):** crate builds and tests under `cargo test -p wifi-densepose-occworld-candle --no-default-features` — **29/29 pass** (20 unit + 4 checkpoint_loading + 3 predict_honesty + 2 doc) after fixes; `cargo test --workspace --no-default-features` = 0 failed across all crates (lone `wifi-densepose-desktop` `api_integration` failure was a Windows "Access is denied (os error 5)" file-lock flake — re-ran in isolation **21/21 pass**); Python proof VERDICT PASS, hash `f8e76f21…446f7a` unchanged. *Warrants ADR slot 179 (parent to author).*
- **`wifi-densepose-wasm-edge` beyond-SOTA closing review — boundary NaN-state-poisoning guard + clean-with-evidence attestation (ADR-040 edge crate, ~70 modules).** Closing pass of the security campaign over the last untouched sizeable crate. **One real finding fixed (LOW / source-analysis + reproduced):** the two WASM↔host frame boundaries (`lib.rs::on_frame`/`on_timer` and `bin/ghost_hunter.rs::on_frame`) read raw IEEE-754 `f32` from the `csi_get_phase`/`csi_get_amplitude`/`csi_get_variance`/`csi_get_motion_energy` host imports **without any finiteness check** — the entire crate had **zero** `is_finite`/`is_nan` guards, and the in-crate `clamp` helpers propagate NaN (`NaN < lo` and `NaN > hi` are both false). A single non-finite value (firmware DSP bug, uninitialised buffer, or hostile host) latches NaN into the long-lived per-module accumulators (EMA, Welford, phasor sums, anomaly baselines); once latched, every downstream comparison evaluates `false`, so detectors fail **degraded** (stuck gate state, silently-disabled anomaly checks) — silent corruption, not a crash (WASM `panic=abort` is *not* tripped: no indexing/`unwrap` on the poisoned value). Threat model is a **semi-trusted** boundary (the Tier-2 DSP firmware supplies the imports, not direct network/JS), hence LOW severity / defense-in-depth. **Fix:** added `sanitize_host_f32()` (maps non-finite→`0.0`, `core`-only so it holds in `no_std`) applied at every `host_get_*` float read — a single chokepoint covering all ~70 downstream modules, mirroring the existing M-01 negative-`n_subcarriers` boundary clamp. **Pinned by** `boundary_tests::{sanitize_passes_finite_values_through, sanitize_maps_non_finite_to_zero, coherence_monitor_nan_latches_without_sanitize_but_not_with}` — the last asserts on the *current* `CoherenceMonitor` that a raw NaN frame latches the smoothed score (documents the hazard) while the boundary-sanitized path stays finite. 
**Dimensions attested CLEAN with evidence (source-analysis):** (a) **panic-on-input** — every non-test `unwrap()`/`expect()` is either `#[cfg(test)]` or in the `std`-gated RVF *builder* host tool writing to an in-memory `Vec` (infallible); no `panic!`/`unreachable!`/`todo!`/`get_unchecked` in any hot path. (b) **shape/bounds** — all frame-buffer access is `min()`-clamped (`MAX_SC=32`, `DTW_MAX_LEN`, `LCS_WINDOW`, `PATTERN_LEN`), all index-by-cast sites (`feature_id as usize`, `conclusion_id`, `minute_counter`, `plan_step`) are either compile-time-const-bounded or `if idx <`/`%`-guarded; negative `n_subcarriers` already mapped to 0 (M-01). (c) **memory/leak** — no `move ||` closures, no `mem::forget`/`Box::leak`/`.leak()`; the only `Box::new` is in the `std`-gated `skill_registry` (one-time init, bounded). (d) **secrets** — none (grep clean). **MEASURED build/test evidence:** host `cargo test --features std,medical-experimental` = **672 passed / 0 failed** (was 669 pre-fix; +3 new tests); the real deployment artifacts all build clean on the actual target — `cargo build --target wasm32-unknown-unknown --release` (no_std/panic=abort default lib), `--bin ghost_hunter --no-default-features --features standalone-bin`, and `--features medical-experimental` (toolchain 1.89 per `rust-toolchain.toml`). No ADR slot needed — a single LOW defense-in-depth boundary fix; CHANGELOG attestation suffices.
- **ADR-131 HOMECORE-UI BFF gateway — public-PR review fixes (PR #1082).** (1) **HIGH — path-traversal / confused-deputy SSRF closed in the `/api/cal/*` reverse-proxy** (`homecore-server/src/gateway.rs`). The wildcard proxy path was interpolated straight into the upstream URL while `proxy()` attaches the server-side calibration bearer, so `/api/cal/v1/../../x` (and percent-encoded `..%2f`, `%2e%2e`, leading `/`, backslash, double-encoded `%252e`) could escape the `…/api/` scope **with the privileged token**. Now `validate_proxy_path()` decode-then-checks and rejects absolute/backslash/dot-segment/encoded-traversal paths with a typed **400 BEFORE the URL is built** (applies to GET **and** POST); legit `v1/...` paths still pass. Pinned by `cal_proxy_rejects_traversal_with_400_before_upstream` (fails on old code) + `validate_proxy_path_rejects_traversal_variants`. (2) **CORS + request-tracing now cover the gateway routes.** `/api/homecore/*` and `/api/cal/*` were `.merge()`d **outside** the layers `homecore-api::router()` applies, leaving them with no CORS allowlist and untraced; the audited `build_cors_layer()` (HC-05) + `TraceLayer` are now applied to the whole merged surface in `main.rs`. Pinned by `gateway_routes_are_cors_covered_after_merge` (Vite-dev-origin preflight succeeds on a gateway route). (3) **Fabricated-data honesty (§6 invariant 3):** the gateway no longer injects a hardcoded `anomaly.threshold: 0.5` — it passes through the REAL upstream threshold or emits `null` (withheld); the dashboard renders a not-available `—` instead of `"null%"`/`"null°C"` for null appliance metrics; the COG panel's Hailo-worker pill reflects the real appliance probe instead of a hardcoded `"connected"`; `rooms.js` treats a null anomaly threshold as withheld, not a fake `0.8` default. 
(4) **Robustness:** a forwarded `hef` that is a string (not an array) no longer throws in the COG panel; the calibration wizard guards `frames/target` against `NaN%`/`Infinity%` and clears its baseline poll timer on Restart / panel teardown (leaked `setTimeout` loop fixed). (5) **Perf:** per-bank RoomState fetches and the appliance service probes now run concurrently (`futures::join_all`; async `tokio::net::TcpStream` + `timeout` replaces the blocking `connect_timeout` that parked a worker per probe); the mock fixture module is now a dynamic `import()` gated on demo mode so production never bundles it. **Note (workspace-wide, not fixed here):** `homecore-server` requests `reqwest`'s `rustls-tls` only, but cargo feature-unification means a sibling crate enabling the default `native-tls` re-introduces OpenSSL into the final binary regardless — a true "no OpenSSL on the appliance" guarantee requires aligning every reqwest-pulling crate on rustls-only. **Note (pre-existing, out of scope):** DEV-mode `allow_any_non_empty()` bearer auth when `HOMECORE_TOKENS` is unset on `0.0.0.0` is unchanged; the loud `warn!` at boot is retained — provision real tokens before network exposure. **Verified:** `cargo test -p homecore-server --no-default-features` = **18/18 pass**, `cargo build -p homecore-server` clean, UI suite (`node tests`) all green, Python proof VERDICT PASS (hash unchanged).
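The decode-then-check idea behind `validate_proxy_path()` in item (1) of the entry above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `homecore-server/src/gateway.rs`: the `ProxyPathError` enum, the `percent_decode_once` helper, and the bounded repeated-decode strategy are all hypothetical, and the real function maps rejections to a typed 400 response rather than returning this enum.

```rust
/// Hypothetical rejection reasons; the real gateway returns a typed HTTP 400.
#[derive(Debug, PartialEq, Eq)]
pub enum ProxyPathError {
    Absolute,
    Backslash,
    DotSegment,
    BadEncoding,
}

/// ASSUMPTION: bound on decode passes; input still changing after this is rejected.
const MAX_DECODE_ROUNDS: usize = 3;

/// Decode every `%XX` escape once; `None` on a truncated or non-hex escape.
fn percent_decode_once(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            let hi = (*bytes.get(i + 1)? as char).to_digit(16)?;
            let lo = (*bytes.get(i + 2)? as char).to_digit(16)?;
            out.push((hi * 16 + lo) as u8);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}

/// Validate the wildcard tail BEFORE it is interpolated into the upstream URL.
pub fn validate_proxy_path(raw: &str) -> Result<(), ProxyPathError> {
    // Decode until stable so double-encoded input such as `%252e` is unmasked.
    let mut decoded = raw.to_owned();
    let mut stable = false;
    for _ in 0..MAX_DECODE_ROUNDS {
        let next = percent_decode_once(&decoded).ok_or(ProxyPathError::BadEncoding)?;
        if next == decoded {
            stable = true;
            break;
        }
        decoded = next;
    }
    if !stable {
        return Err(ProxyPathError::BadEncoding);
    }

    if decoded.starts_with('/') {
        return Err(ProxyPathError::Absolute);
    }
    if decoded.contains('\\') {
        return Err(ProxyPathError::Backslash);
    }
    if decoded.split('/').any(|seg| seg == "." || seg == "..") {
        return Err(ProxyPathError::DotSegment);
    }
    Ok(())
}
```

The sketch is deliberately conservative: a path containing a literal encoded percent sign (`%25`) followed by non-hex characters is rejected as `BadEncoding` rather than passed through, which trades a few false rejections for never forwarding an ambiguous path alongside the privileged token.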
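The NaN-latching hazard described in the `wifi-densepose-wasm-edge` entry above, and the boundary fix for it, can be reproduced in a few lines. Only `sanitize_host_f32` and its non-finite-to-`0.0` mapping come from the changelog; the `Ema` accumulator and `naive_clamp` helper below are hypothetical stand-ins for the crate's long-lived per-module state and its in-crate clamp helpers.

```rust
/// Map any non-finite host value to 0.0 (behaviour stated in the changelog entry).
/// `f32::is_finite` lives in `core`, so this also holds in `no_std` builds.
#[inline]
pub fn sanitize_host_f32(v: f32) -> f32 {
    if v.is_finite() { v } else { 0.0 }
}

/// Hypothetical comparison-based clamp: `NaN < lo` and `NaN > hi` are both
/// false, so a NaN input falls through to the final branch unchanged.
pub fn naive_clamp(v: f32, lo: f32, hi: f32) -> f32 {
    if v < lo {
        lo
    } else if v > hi {
        hi
    } else {
        v
    }
}

/// Hypothetical long-lived accumulator (exponential moving average).
pub struct Ema {
    alpha: f32,
    value: f32,
}

impl Ema {
    pub fn new(alpha: f32) -> Self {
        Self { alpha, value: 0.0 }
    }

    /// One NaN sample makes `value` NaN, and every later update stays NaN
    /// because the previous value feeds into the next one.
    pub fn update(&mut self, sample: f32) -> f32 {
        self.value = self.alpha * sample + (1.0 - self.alpha) * self.value;
        self.value
    }
}
```

Feeding the accumulator through `sanitize_host_f32` at the read site keeps it finite after a bad frame, whereas the raw path never recovers. Substituting `0.0` does bias the average for that one frame, which is the price of a single chokepoint instead of a guard in every downstream module.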
62 changes: 40 additions & 22 deletions docker/Dockerfile.rust
@@ -34,12 +34,42 @@ RUN cargo build --release -p wifi-densepose-sensing-server --features mqtt 2>&1
&& cargo build --release -p homecore-server 2>&1 \
&& strip target/release/sensing-server target/release/cog-ha-matter target/release/homecore-server

-# Stage 2: Runtime
-FROM debian:bookworm-slim
+# Copy the UI assets into the builder and sanity-check the full set the runtime
+# serves (regression guard for #520/#514 — the published image must include the
+# observatory and pose-fusion dashboards, not just the legacy `index.html` set,
+# plus the three release binaries). This runs here in the builder because the
+# runtime stage is now distroless (#1140) and has no shell to RUN checks in. A
+# missing asset or non-executable binary fails the build, so a stale image can't
+# be silently pushed.
+COPY ui/ /build/ui/
+RUN set -e; \
+for f in /build/ui/index.html /build/ui/observatory.html /build/ui/pose-fusion.html /build/ui/viz.html; do \
+test -f "$f" || { echo "FATAL: missing UI asset $f"; exit 1; }; \
+done; \
+for d in /build/ui/observatory /build/ui/pose-fusion /build/ui/components /build/ui/services; do \
+test -d "$d" || { echo "FATAL: missing UI directory $d"; exit 1; }; \
+done; \
+test -x /build/target/release/sensing-server || { echo "FATAL: sensing-server is not executable"; exit 1; }; \
+test -x /build/target/release/cog-ha-matter || { echo "FATAL: cog-ha-matter is not executable"; exit 1; }; \
+test -x /build/target/release/homecore-server || { echo "FATAL: homecore-server is not executable"; exit 1; }; \
+echo "image assets OK"

-RUN apt-get update && apt-get install -y --no-install-recommends \
-ca-certificates \
-&& rm -rf /var/lib/apt/lists/*
+# Stage 2: Runtime — distroless (Issue #1140).
+#
+# Previously debian:bookworm-slim, which carries the full Debian base package set
+# (perl-base, apt, tar, gnupg, pam, shadow, systemd libs, ...). Docker Scout
+# flagged 37 base-image CVEs against those source packages. None are reachable —
+# the container runs a single Rust binary, with no Perl/Python/compiler — but
+# they pad every scan, and none have a fix in Debian 12, so re-pulling the slim
+# base clears nothing. Distroless cc-debian12 ships only glibc + libgcc/libstdc++
+# + libssl + the CA bundle (everything a glibc-linked Rust binary needs) and
+# drops the large majority of those findings at the source.
+#
+# We use the :debug variant on purpose: the entrypoint is a /bin/sh script with
+# real security logic (#864 fail-closed auth, cog-ha-matter/homecore routing).
+# :debug bundles a static busybox shell at /busybox/sh, through which the
+# entrypoint is invoked explicitly below.
+FROM gcr.io/distroless/cc-debian12:debug

WORKDIR /app

@@ -51,22 +81,6 @@ COPY --from=builder /build/target/release/homecore-server /app/homecore-server
# Copy UI assets
COPY ui/ /app/ui/

-# Sanity-check the assets the runtime actually serves (regression guard for
-# #520/#514 — the published image must include the observatory and pose-fusion
-# dashboards, not just the legacy `index.html` set). Build fails if any of
-# these are missing, so a stale image can't be silently pushed.
-RUN set -e; \
-for f in /app/ui/index.html /app/ui/observatory.html /app/ui/pose-fusion.html /app/ui/viz.html; do \
-test -f "$f" || { echo "FATAL: missing UI asset $f"; exit 1; }; \
-done; \
-for d in /app/ui/observatory /app/ui/pose-fusion /app/ui/components /app/ui/services; do \
-test -d "$d" || { echo "FATAL: missing UI directory $d"; exit 1; }; \
-done; \
-test -x /app/sensing-server || { echo "FATAL: /app/sensing-server is not executable"; exit 1; }; \
-test -x /app/cog-ha-matter || { echo "FATAL: /app/cog-ha-matter is not executable"; exit 1; }; \
-test -x /app/homecore-server || { echo "FATAL: /app/homecore-server is not executable"; exit 1; }; \
-echo "image assets OK"
-
# Optional bearer-token auth on /api/v1/*: leave unset for LAN-mode (default),
# set to enforce `Authorization: Bearer <token>` (see bearer_auth module, #443).
# docker run -e RUVIEW_API_TOKEN=$(openssl rand -hex 32) ...
@@ -103,5 +117,9 @@ COPY docker/docker-entrypoint.sh /app/docker-entrypoint.sh
# Exec-form ENTRYPOINT so Docker appends user arguments correctly.
# Pass flags directly: docker run <image> --source esp32 --tick-ms 500
# Or use env vars: docker run -e CSI_SOURCE=esp32 <image>
-ENTRYPOINT ["/app/docker-entrypoint.sh"]
+#
+# Invoked through the distroless busybox shell (#1140): the runtime base has no
+# /bin/sh, so the entrypoint script is run explicitly via /busybox/sh rather than
+# relying on its `#!/bin/sh` shebang.
+ENTRYPOINT ["/busybox/sh", "/app/docker-entrypoint.sh"]
CMD []