mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split)#228
Conversation
…+ scoped policy) The IoT path is an alternative wire transport with its own threat model: cloud-side certs, IoT policy wildcards, CA-substitution MITM, camera URL capture. This PR closes those vectors. Modified: - strands_robots/mesh/iot/provision.py (+350/-10) -- CA pinning (defeats CA-substitution MITM), strict thing-name regex (anchored, not just match), per-recv timeout bound, IoT policy scope tightening (no Resource: '*' wildcards, per-thing topic prefixes only). - strands_robots/mesh/iot/camera_offload.py (+35/-12) -- privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation, ACL on S3 PutObject path. - strands_robots/mesh/iot/shadow.py (+2/-2) -- minor type fix. - strands_robots/mesh/iot/__init__.py (+5/-5) -- export surface. Carries review fixes from strands-labs#195: iot-CA (CA-substitution MITM defence), iot-policy-scope (no wildcards), R22 (camera offload privacy kill- switch + ACL). Tests (896 LOC across 7 files): CA pin verification, thing-name regex anchored, IoT policy scope (no '*' Resource), camera offload kill- switch, S3 ACL, schema round-trip. Independent of LAN-side Zenoh changes -- depends only on PR-2 (is_safe_* host validators) and PR-4 (audit emit). Can land in parallel with PR-5/6/7. Tracking: strands-labs#219 Source PR: strands-labs#195 Lands after: strands-labs#220 (PR-1), PR-2, PR-4
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
PR-8 hardens the AWS IoT provisioning surface: pins the Amazon Root CA1 SHA-256, replaces the wildcard iot:Receive/Subscribe resources in both robot and operator policies with per-Thing scoped resources, validates Thing names against a strict alphanumeric subset before any AWS or filesystem call, and adds a per-socket-timed CA download path that avoids socket.setdefaulttimeout global mutation. The CameraOffloader default presigned-URL TTL drops from 3600s to 60s with a 1-hour cap.
The security story is mostly solid and well-tested (CA pin verification, on-disk re-use raw-checks the pin even under break-glass, O_NOFOLLOW on both read paths, scoped Receive policy regression-pinned). However, two of the four bullets in the PR description do not match what the diff actually does, which is the dominant reviewable concern.
What's good
- CA pinning with break-glass-doesn't-apply-to-on-disk semantics is the right design and is regression-tested.
_THING_NAME_RErationale (strict subset of AWS's charset due to NTFS / classic Mac:semantics) is well-documented in both the regex comment and the user-facingprovision_robotdocstring.- Scoped-Receive replacement is regression-pinned in
test_iot_policy_scope.pyso a future refactor that re-introducestopic/strands/*fails loudly. _download_with_per_socket_timeoutcorrectly avoidssocket.setdefaulttimeoutglobal mutation — a non-obvious but correct improvement.- 64 KiB body cap on the CA download defeats captive-portal-returns-multi-MB-HTML DoS.
- Symlink-refusal on
verify_ca_pincloses the asymmetric gap with_ensure_ca.
Concerns
- PR description claims do not match the diff. Two bullets in the description are unsupported by code in this branch:
- "camera_offload.py — privacy kill-switch (
STRANDS_MESH_CAMERA_OFFLOAD_DISABLE)" — no code instrands_robots/readsSTRANDS_MESH_CAMERA_OFFLOAD_DISABLEorSTRANDS_MESH_CAMERA_DISABLED. The kill-switch tests pass for incidental reasons (see inline). If the kill-switch lands in a sibling PR, say so; otherwise this needs adding. - "ACL on S3 PutObject path" —
s3.put_object(...)incamera_offload.py:131-136is unchanged and passes noACL=kwarg. If bucket policy / ownership controls satisfy the threat model, drop the bullet; if PutObject ACL was intended, it's missing.
- "camera_offload.py — privacy kill-switch (
- Reformat noise in
__init__.py. Lines 5-7 / 23 / 30 collapse aligned columns to single-spaced text — unrelated to the PR's stated scope. Either keep the original alignment or call it out as deliberate cleanup; right now it just adds diff noise. - CA pin rotation operationally fragile. The pin tuple is hard-coded with one entry; the comment says rotation "ships as a code change" plus optional
STRANDS_MESH_CA_PINSenv var. Worth a follow-up issue to either auto-fetch from a signed manifest or document the runbook so on-call doesn't have to re-derive the recompute command at 3 AM.
Verification suggestions
# Confirm no production reader for the advertised kill-switch env var
rg -n 'STRANDS_MESH_CAMERA_(OFFLOAD_)?DISABLED' strands_robots/
# Confirm scoped-Receive really replaces the wildcard everywhere
python -c 'from strands_robots.mesh.iot.provision import _ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC; import json; print(json.dumps([_ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC], indent=2))' | rg ':topic/strands/\*'
# Round-trip the CA pin against the canonical URL
python -c "import hashlib, urllib.request as u; print(hashlib.sha256(u.urlopen('https://www.amazontrust.com/repository/AmazonRootCA1.pem').read()).hexdigest())"
# Should match strands_robots.mesh.iot.provision._AMAZON_ROOT_CA1_PINS[0]
hatch run test tests/mesh/test_iot_ca_pin.py tests/mesh/test_iot_policy_scope.py tests/mesh/test_iot_provision.py -v| @@ -120,7 +134,7 @@ def upload_frame(self, peer_id: str, cam_name: str, jpeg_bytes: bytes, ts: float | |||
| Body=jpeg_bytes, | |||
| ContentType="image/jpeg", | |||
| ) | |||
| except Exception as exc: | |||
| except Exception as exc: # noqa: BLE001 -- boto3 raises ClientError, EndpointConnectionError, NoCredentialsError, etc.; offload is best-effort | |||
There was a problem hiding this comment.
PR description bullet says this PR adds "ACL on S3 PutObject path" but put_object(...) is unchanged and passes no ACL= (or ObjectOwnership / BucketOwnerEnforced enforcement) kwarg. Either:
- Drop the bullet from the description because bucket-level ownership controls already satisfy the threat model (and document that decision in the module docstring), or
- Add the intended hardening, e.g.
ACL="private"orChecksumAlgorithm="SHA256", with a regression test pinning the kwarg.
Reviewers should not have to grep to discover the description doesn't match the diff (AGENTS.md > Review Learnings (#86) > "Match docstrings to semantics").
| m = Mesh(MagicMock(), peer_id="cam-test") | ||
| with patch.object(m, "publish") as mock_put: | ||
| m._publish_cameras_once() | ||
| mock_put.assert_not_called() |
There was a problem hiding this comment.
This test does not actually verify any privacy kill-switch.
Mesh(MagicMock(), ...) makes m.robot a MagicMock, so inner = getattr(r, "robot", None) is a truthy MagicMock and cam_cfg = getattr(getattr(inner, "config", None), "cameras", None) is also a MagicMock — not a dict. _publish_cameras_once then returns at if not isinstance(cam_cfg, dict) (core.py:421), not because of any kill-switch. Removing the monkeypatch.setenv("STRANDS_MESH_CAMERA_DISABLED", "true") line leaves the test passing.
Grepping strands_robots/ for STRANDS_MESH_CAMERA_DISABLED / STRANDS_MESH_CAMERA_OFFLOAD_DISABLE returns zero hits in production code, only this test and test_camera_privacy_kill_switch.py (which exercises _bool_env directly, not any caller). Per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes," this would have caught the missing implementation — instead it provides false reassurance.
Suggested fix: either land the kill-switch in _publish_cameras_once/_publish_cameras_once_with_offload and rewrite this test to set up a fake camera dict + mock inner.is_connected = True, or remove the test until the feature lands.
| os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS)) | ||
| ) | ||
| ttl_raw = presign_ttl or int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))) | ||
| if ttl_raw > MAX_PRESIGN_TTL_SECONDS: |
There was a problem hiding this comment.
presign_ttl=0 (operator passing zero explicitly) is falsy, so the or short-circuits to int(os.getenv(...)) and returns the env default — not the floor of 1 the docstring/test suggests. test_zero_or_negative_clamped_up even asserts presign_ttl == DEFAULT_PRESIGN_TTL_SECONDS despite the test name implying clamping to 1.
This is a footgun: an operator who explicitly disables presigned URLs by passing presign_ttl=0 (a reasonable thing to try) silently gets a 60-second TTL, and the kwarg's documented "override" semantics quietly stop working. The same bug applies to presign_ttl=None vs. an explicit 0.
Suggested: ttl_raw = presign_ttl if presign_ttl is not None else int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))) and let the existing < 1 → 1 clamp handle the rest. Update test_zero_or_negative_clamped_up to actually assert presign_ttl == 1.
| # up after rollout. Operators can also extend the accepted pins via | ||
| # ``STRANDS_MESH_CA_PINS`` (comma-separated 64-char lowercase hex). The | ||
| # env var augments the built-in tuple; it does not replace it. | ||
| _AMAZON_ROOT_CA1_PINS: tuple[str, ...] = ("2c43952ee9e000ff2acc4e2ed0897c0a72ad5fa72c3d934e81741cbd54f05bd1",) |
There was a problem hiding this comment.
Single-pin tuple at code level + optional env-var augmentation makes CA rotation operationally awkward: an AWS root rotation requires a tagged release of strands_robots, then waiting for fleets to upgrade, or every operator setting STRANDS_MESH_CA_PINS out-of-band.
Not a blocker for this PR, but worth a tracked follow-up issue per AGENTS.md > Project Dashboard ("create GitHub issues and add them to this board"): either ship a signed pins.json manifest the library can fetch, or at minimum document the rotation runbook (recompute command + grace-period strategy + how to land the new pin while the old one is still valid) in README.md. The recompute one-liner in the docstring is a start; an actual playbook is what on-call needs at 3 AM.
🎯 Pentest evidence for this PR (#228 — IoT hardening)Live run on 2026-05-26 against This PR's scope covers 5 confirmed findings, of which 1 is CRITICAL and 3 are HIGH. Full evidence + reproduction in Findings → fix mapping in this PR
Pinned regression tests waiting for these fixesFrom the pentest repo, ready to copy into
Assertion messages embed the F-NN/B-NN tags + Posture confirmed by this PR (also pin as positive controls)
The CC'ing this and the other 6 mesh-security PRs (#221/#222/#223/#224/#225/#227) in |
Part 8 / 9 of the split of #195 — tracked by #219.
Drafted until PR-2 (#223) and PR-4 (#221) land.
The IoT path is an alternative wire transport with its own threat model: cloud-side certs, IoT policy wildcards, CA-substitution MITM, camera URL capture. This PR closes those vectors.
What's in this PR
strands_robots/mesh/iot/provision.py(+350/-10) — CA pinning (defeats CA-substitution MITM), strict thing-name regex (anchored, not justmatch), per-recv timeout bound, IoT policy scope tightening (noResource: '*'wildcards, per-thing topic prefixes only).strands_robots/mesh/iot/camera_offload.py(+35/-12) — privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation, ACL on S3 PutObject path.strands_robots/mesh/iot/shadow.py(+2/-2),iot/__init__.py(+5/-5).Reviewer focus
^[a-zA-Z0-9:_-]+$, not unanchoredmatch().Resource, never*.Carries review fixes from #195
iot-CA (CA-substitution MITM defence), iot-policy-scope (no wildcards), R22 (camera offload privacy kill-switch + ACL).
Stacking note
Independent of LAN-side Zenoh changes — depends only on PR-1 (#220), PR-2 (#223) for
is_safe_*host validators, PR-4 (#221) for audit emit. Can land in parallel with PR-5/6/7. CI on this branch in isolation is expected red until #223 and #221 land.Landing order: PR-1 → PR-2/3/4/5 → PR-6 → PR-7 → PR-8 (parallel with 6/7) → PR-9. Full plan:
PR_LIST.md. Tracking: #219.