mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split) by cagataycali · Pull Request #228 · strands-labs/robots

cagataycali · 2026-05-25T20:38:21Z

Part 8 / 9 of the split of #195 — tracked by #219.

Drafted until PR-2 (#223) and PR-4 (#221) land.

The IoT path is an alternative wire transport with its own threat model: cloud-side certs, IoT policy wildcards, CA-substitution MITM, camera URL capture. This PR closes those vectors.

What's in this PR

strands_robots/mesh/iot/provision.py (+350/-10) — CA pinning (defeats CA-substitution MITM), strict thing-name regex (anchored, not just match), per-recv timeout bound, IoT policy scope tightening (no Resource: '*' wildcards, per-thing topic prefixes only).
strands_robots/mesh/iot/camera_offload.py (+35/-12) — privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation, ACL on S3 PutObject path.
strands_robots/mesh/iot/shadow.py (+2/-2), iot/__init__.py (+5/-5).
7 test files (896 LOC).

Reviewer focus

CA-substitution MITM defence — CA pinning verifies the AWS IoT chain is the one the operator pinned, not whatever the broker presents.
Thing-name regex anchored — ^[a-zA-Z0-9:_-]+$, not unanchored match().
Per-recv timeout — prevents a malicious broker from holding a connection open indefinitely.
IoT policy scope — explicit per-thing prefix in Resource, never *.
Camera S3 offload privacy kill-switch — operator can disable cloud camera publishing without code changes.

Carries review fixes from #195

iot-CA (CA-substitution MITM defence), iot-policy-scope (no wildcards), R22 (camera offload privacy kill-switch + ACL).

Stacking note

Independent of LAN-side Zenoh changes — depends only on PR-1 (#220), PR-2 (#223) for is_safe_* host validators, PR-4 (#221) for audit emit. Can land in parallel with PR-5/6/7. CI on this branch in isolation is expected red until #223 and #221 land.

Landing order: PR-1 → PR-2/3/4/5 → PR-6 → PR-7 → PR-8 (parallel with 6/7) → PR-9. Full plan: PR_LIST.md. Tracking: #219.

…+ scoped policy) The IoT path is an alternative wire transport with its own threat model: cloud-side certs, IoT policy wildcards, CA-substitution MITM, camera URL capture. This PR closes those vectors. Modified: - strands_robots/mesh/iot/provision.py (+350/-10) -- CA pinning (defeats CA-substitution MITM), strict thing-name regex (anchored, not just match), per-recv timeout bound, IoT policy scope tightening (no Resource: '*' wildcards, per-thing topic prefixes only). - strands_robots/mesh/iot/camera_offload.py (+35/-12) -- privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation, ACL on S3 PutObject path. - strands_robots/mesh/iot/shadow.py (+2/-2) -- minor type fix. - strands_robots/mesh/iot/__init__.py (+5/-5) -- export surface. Carries review fixes from strands-labs#195: iot-CA (CA-substitution MITM defence), iot-policy-scope (no wildcards), R22 (camera offload privacy kill- switch + ACL). Tests (896 LOC across 7 files): CA pin verification, thing-name regex anchored, IoT policy scope (no '*' Resource), camera offload kill- switch, S3 ACL, schema round-trip. Independent of LAN-side Zenoh changes -- depends only on PR-2 (is_safe_* host validators) and PR-4 (audit emit). Can land in parallel with PR-5/6/7. Tracking: strands-labs#219 Source PR: strands-labs#195 Lands after: strands-labs#220 (PR-1), PR-2, PR-4

yinsong1986

Summary

PR-8 hardens the AWS IoT provisioning surface: pins the Amazon Root CA1 SHA-256, replaces the wildcard iot:Receive/Subscribe resources in both robot and operator policies with per-Thing scoped resources, validates Thing names against a strict alphanumeric subset before any AWS or filesystem call, and adds a per-socket-timed CA download path that avoids socket.setdefaulttimeout global mutation. The CameraOffloader default presigned-URL TTL drops from 3600s to 60s with a 1-hour cap.

The security story is mostly solid and well-tested (CA pin verification, on-disk re-use raw-checks the pin even under break-glass, O_NOFOLLOW on both read paths, scoped Receive policy regression-pinned). However, two of the four bullets in the PR description do not match what the diff actually does, which is the dominant reviewable concern.

What's good

CA pinning with break-glass-doesn't-apply-to-on-disk semantics is the right design and is regression-tested.
_THING_NAME_RE rationale (strict subset of AWS's charset due to NTFS / classic Mac : semantics) is well-documented in both the regex comment and the user-facing provision_robot docstring.
Scoped-Receive replacement is regression-pinned in test_iot_policy_scope.py so a future refactor that re-introduces topic/strands/* fails loudly.
_download_with_per_socket_timeout correctly avoids socket.setdefaulttimeout global mutation — a non-obvious but correct improvement.
64 KiB body cap on the CA download defeats captive-portal-returns-multi-MB-HTML DoS.
Symlink-refusal on verify_ca_pin closes the asymmetric gap with _ensure_ca.

Concerns

PR description claims do not match the diff. Two bullets in the description are unsupported by code in this branch:
1. "camera_offload.py — privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE)" — no code in strands_robots/ reads STRANDS_MESH_CAMERA_OFFLOAD_DISABLE or STRANDS_MESH_CAMERA_DISABLED. The kill-switch tests pass for incidental reasons (see inline). If the kill-switch lands in a sibling PR, say so; otherwise this needs adding.
2. "ACL on S3 PutObject path" — s3.put_object(...) in camera_offload.py:131-136 is unchanged and passes no ACL= kwarg. If bucket policy / ownership controls satisfy the threat model, drop the bullet; if PutObject ACL was intended, it's missing.
Reformat noise in __init__.py. Lines 5-7 / 23 / 30 collapse aligned columns to single-spaced text — unrelated to the PR's stated scope. Either keep the original alignment or call it out as deliberate cleanup; right now it just adds diff noise.
CA pin rotation operationally fragile. The pin tuple is hard-coded with one entry; the comment says rotation "ships as a code change" plus optional STRANDS_MESH_CA_PINS env var. Worth a follow-up issue to either auto-fetch from a signed manifest or document the runbook so on-call doesn't have to re-derive the recompute command at 3 AM.

Verification suggestions

# Confirm no production reader for the advertised kill-switch env var
rg -n 'STRANDS_MESH_CAMERA_(OFFLOAD_)?DISABLED' strands_robots/

# Confirm scoped-Receive really replaces the wildcard everywhere
python -c 'from strands_robots.mesh.iot.provision import _ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC; import json; print(json.dumps([_ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC], indent=2))' | rg ':topic/strands/\*'

# Round-trip the CA pin against the canonical URL
python -c "import hashlib, urllib.request as u; print(hashlib.sha256(u.urlopen('https://www.amazontrust.com/repository/AmazonRootCA1.pem').read()).hexdigest())"
# Should match strands_robots.mesh.iot.provision._AMAZON_ROOT_CA1_PINS[0]

hatch run test tests/mesh/test_iot_ca_pin.py tests/mesh/test_iot_policy_scope.py tests/mesh/test_iot_provision.py -v

yinsong1986 · 2026-05-25T20:59:15Z

@@ -120,7 +134,7 @@ def upload_frame(self, peer_id: str, cam_name: str, jpeg_bytes: bytes, ts: float
                Body=jpeg_bytes,
                ContentType="image/jpeg",
            )
-        except Exception as exc:
+        except Exception as exc:  # noqa: BLE001 -- boto3 raises ClientError, EndpointConnectionError, NoCredentialsError, etc.; offload is best-effort


PR description bullet says this PR adds "ACL on S3 PutObject path" but put_object(...) is unchanged and passes no ACL= (or ObjectOwnership / BucketOwnerEnforced enforcement) kwarg. Either:

Drop the bullet from the description because bucket-level ownership controls already satisfy the threat model (and document that decision in the module docstring), or

Add the intended hardening, e.g. ACL="private" or ChecksumAlgorithm="SHA256", with a regression test pinning the kwarg.

Reviewers should not have to grep to discover the description doesn't match the diff (AGENTS.md > Review Learnings (#86) > "Match docstrings to semantics").

yinsong1986 · 2026-05-25T20:59:15Z

+        m = Mesh(MagicMock(), peer_id="cam-test")
+        with patch.object(m, "publish") as mock_put:
+            m._publish_cameras_once()
+        mock_put.assert_not_called()


This test does not actually verify any privacy kill-switch.

Mesh(MagicMock(), ...) makes m.robot a MagicMock, so inner = getattr(r, "robot", None) is a truthy MagicMock and cam_cfg = getattr(getattr(inner, "config", None), "cameras", None) is also a MagicMock — not a dict. _publish_cameras_once then returns at if not isinstance(cam_cfg, dict) (core.py:421), not because of any kill-switch. Removing the monkeypatch.setenv("STRANDS_MESH_CAMERA_DISABLED", "true") line leaves the test passing.

Grepping strands_robots/ for STRANDS_MESH_CAMERA_DISABLED / STRANDS_MESH_CAMERA_OFFLOAD_DISABLE returns zero hits in production code, only this test and test_camera_privacy_kill_switch.py (which exercises _bool_env directly, not any caller). Per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes," this would have caught the missing implementation — instead it provides false reassurance.

Suggested fix: either land the kill-switch in _publish_cameras_once/_publish_cameras_once_with_offload and rewrite this test to set up a fake camera dict + mock inner.is_connected = True, or remove the test until the feature lands.

yinsong1986 · 2026-05-25T20:59:15Z

-            os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))
-        )
+        ttl_raw = presign_ttl or int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS)))
+        if ttl_raw > MAX_PRESIGN_TTL_SECONDS:


presign_ttl=0 (operator passing zero explicitly) is falsy, so the or short-circuits to int(os.getenv(...)) and returns the env default — not the floor of 1 the docstring/test suggests. test_zero_or_negative_clamped_up even asserts presign_ttl == DEFAULT_PRESIGN_TTL_SECONDS despite the test name implying clamping to 1.

This is a footgun: an operator who explicitly disables presigned URLs by passing presign_ttl=0 (a reasonable thing to try) silently gets a 60-second TTL, and the kwarg's documented "override" semantics quietly stop working. The same bug applies to presign_ttl=None vs. an explicit 0.

Suggested: ttl_raw = presign_ttl if presign_ttl is not None else int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))) and let the existing < 1 → 1 clamp handle the rest. Update test_zero_or_negative_clamped_up to actually assert presign_ttl == 1.

yinsong1986 · 2026-05-25T20:59:15Z

+# up after rollout. Operators can also extend the accepted pins via
+# ``STRANDS_MESH_CA_PINS`` (comma-separated 64-char lowercase hex). The
+# env var augments the built-in tuple; it does not replace it.
+_AMAZON_ROOT_CA1_PINS: tuple[str, ...] = ("2c43952ee9e000ff2acc4e2ed0897c0a72ad5fa72c3d934e81741cbd54f05bd1",)


Single-pin tuple at code level + optional env-var augmentation makes CA rotation operationally awkward: an AWS root rotation requires a tagged release of strands_robots, then waiting for fleets to upgrade, or every operator setting STRANDS_MESH_CA_PINS out-of-band.

Not a blocker for this PR, but worth a tracked follow-up issue per AGENTS.md > Project Dashboard ("create GitHub issues and add them to this board"): either ship a signed pins.json manifest the library can fetch, or at minimum document the rotation runbook (recompute command + grace-period strategy + how to land the new pin while the old one is still valid) in README.md. The recompute one-liner in the docstring is a start; an actual playbook is what on-call needs at 3 AM.

cagataycali · 2026-05-27T02:10:10Z

🎯 Pentest evidence for this PR (#228 — IoT hardening)

Live run on 2026-05-26 against cagataycali/robots-pentest@dbfe2b0 (us-west-2 sandbox account 947951559549).

This PR's scope covers 5 confirmed findings, of which 1 is CRITICAL and 3 are HIGH. Full evidence + reproduction in BUGS.md + RESULTS.md.

Findings → fix mapping in this PR

Finding	Sev	What	Where to fix
F-15 / B-09	High	Robot-A successfully published forged response on `strands/pentest-robot-b/response/<turn>` (broker rc=0). Operator matches by `turn_id` only — first response wins.	`provision.py:_ROBOT_POLICY_DOC` `AllowResponseToAnyOperator.Resource` — scope to `${iot:Connection.Thing.ThingName}/response/` instead of `/response/*`
F-16 / B-10	High	20 estops from a stolen leaf cert → 20 Lambda invocations. CloudWatch shows `sender=unknown` — Lambda can't tell legit operator from attacker.	`_ROBOT_POLICY_DOC.AllowSafetyEstop` — either remove from robot policy entirely or restrict via IoT Rule SQL `WHERE peer_id IN <operator-things>`. Lambda extract `principal()` and reject non-operator.
F-19 / B-13	CRITICAL	Walked the full leaked-claim-cert flow → registered `pentest-rogue-1779842156` Thing with `strands-mesh-role=robot` attribute and `strands-robot` IoT policy attached. Rogue Thing left in account as evidence.	`bootstrap.py:_ensure_provisioning_template` — always pass `preProvisioningHook` ARN. Default Lambda body returns `allowProvisioning=False` until operator overrides. Or refuse to enable template unless `STRANDS_MESH_PROVISIONING_HOOK_ARN` env is set.
F-20 / B-14	High	Operator wrote `shadow.reported.poc06` on every Thing in account (incl. non-strands Things).	`_OPERATOR_POLICY_DOC.OperatorShadow.Resource` — scope to `arn:aws:iot:::thing/strands-` not `arn:aws:iot:::thing/`
F-21 / B-15	Medium	All 3 IAM trust policies (`strands-mesh-iot-action-role`, `strands-mesh-lambda-role`, `strands-mesh-provisioning-role`) lack `aws:SourceAccount` / `aws:SourceArn` conditions. Lambda's inline `iot:Publish` uses `arn:aws:iot:::topic/strands/*` (region/account wildcards).	`bootstrap.py:_ensure__role` — add `Condition: {StringEquals: {aws:SourceAccount: <bootstrapping-account>}, ArnLike: {aws:SourceArn: arn:aws:iot:<region>:<account>:rule/strands_}}`. Pin `iot:Publish` resources to bootstrapping region/account.

Pinned regression tests waiting for these fixes

From the pentest repo, ready to copy into tests/mesh/iot/ of this PR once the corresponding fix lands:

test_pentest_b12_iot_subscription_scope.py — 4 tests for B-09 / B-12 / B-14 / F-15 / F-17 / F-20 (currently 2 pass / 2 fail against main)
test_pentest_b13_provisioning_hook.py — 1 test for F-19 / B-13 (currently fails against main)

Assertion messages embed the F-NN/B-NN tags + RESULTS.md URLs for inline evidence.

Posture confirmed by this PR (also pin as positive controls)

F-17 / B-12 — per-robot iot:Subscribe correctly denies state/health/cmd/input/+/camera/+/$aws/things/+/shadow/get/accepted — only 3 of 9 SUBACK granted to a stolen-leaf attacker. ✅
F-22 / B-16 — cross-region cert reuse blocked at TLS handshake (us-west-2 cert vs us-east-1 broker). ✅

The test_robot_policy_denies_high_value_subscribes and test_robot_policy_allows_documented_recon_topics tests in test_pentest_b12_iot_subscription_scope.py already pass and pin this posture.

CC'ing this and the other 6 mesh-security PRs (#221/#222/#223/#224/#225/#227) in PLAN.md. Happy to push fix-commits to your branch if you want me to take a swing.

This was referenced May 25, 2026

feat(mesh): security hardening on Zenoh built-ins (mTLS + ACL + DoS bounds) #195

Draft

[mesh] Security hardening (Zenoh-native): umbrella tracker for split of #195 #219

Open

cagataycali marked this pull request as ready for review May 25, 2026 20:41

cagataycali added this to Strands Labs - Robots May 25, 2026

github-project-automation Bot moved this to Backlog in Strands Labs - Robots May 25, 2026

yinsong1986 reviewed May 25, 2026

View reviewed changes

yinsong1986 mentioned this pull request May 25, 2026

docs(mesh): README env-var matrix + CHANGELOG entry (9/9 of #195 split) #226

Open

cagataycali added security mesh Zenoh mesh networking / fleet coordination iot labels May 26, 2026

cagataycali mentioned this pull request May 27, 2026

mesh(audit): tamper-evident audit log (HMAC + per-peer seq + rotation) (4/9 of #195 split) #221

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split)#228

mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split)#228
cagataycali wants to merge 1 commit into
strands-labs:mainfrom
cagataycali:pr8/mesh-iot-provisioning-hardening

cagataycali commented May 25, 2026

Uh oh!

yinsong1986 left a comment

Uh oh!

yinsong1986 May 25, 2026

Uh oh!

yinsong1986 May 25, 2026

Uh oh!

yinsong1986 May 25, 2026

Uh oh!

yinsong1986 May 25, 2026

Uh oh!

cagataycali commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cagataycali commented May 25, 2026

What's in this PR

Reviewer focus

Carries review fixes from #195

Stacking note

Uh oh!

yinsong1986 left a comment

Choose a reason for hiding this comment

Summary

What's good

Concerns

Verification suggestions

Uh oh!

yinsong1986 May 25, 2026

Choose a reason for hiding this comment

Uh oh!

yinsong1986 May 25, 2026

Choose a reason for hiding this comment

Uh oh!

yinsong1986 May 25, 2026

Choose a reason for hiding this comment

Uh oh!

yinsong1986 May 25, 2026

Choose a reason for hiding this comment

Uh oh!

cagataycali commented May 27, 2026

🎯 Pentest evidence for this PR (#228 — IoT hardening)

Findings → fix mapping in this PR

Pinned regression tests waiting for these fixes

Posture confirmed by this PR (also pin as positive controls)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants