Skip to content

mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split)#228

Open
cagataycali wants to merge 1 commit into
strands-labs:mainfrom
cagataycali:pr8/mesh-iot-provisioning-hardening
Open

mesh(iot): AWS IoT provisioning hardening (CA pin + thing-name regex + scoped policy) (8/9 of #195 split)#228
cagataycali wants to merge 1 commit into
strands-labs:mainfrom
cagataycali:pr8/mesh-iot-provisioning-hardening

Conversation

@cagataycali
Copy link
Copy Markdown
Member

Part 8 / 9 of the split of #195 — tracked by #219.

Drafted until PR-2 (#223) and PR-4 (#221) land.

The IoT path is an alternative wire transport with its own threat model: cloud-side certs, IoT policy wildcards, CA-substitution MITM, camera URL capture. This PR closes those vectors.

What's in this PR

  • strands_robots/mesh/iot/provision.py (+350/-10) — CA pinning (defeats CA-substitution MITM), strict thing-name regex (anchored, not just match), per-recv timeout bound, IoT policy scope tightening (no Resource: '*' wildcards, per-thing topic prefixes only).
  • strands_robots/mesh/iot/camera_offload.py (+35/-12) — privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation, ACL on S3 PutObject path.
  • strands_robots/mesh/iot/shadow.py (+2/-2), iot/__init__.py (+5/-5).
  • 7 test files (896 LOC).

Reviewer focus

  • CA-substitution MITM defence — CA pinning verifies the AWS IoT chain is the one the operator pinned, not whatever the broker presents.
  • Thing-name regex anchored^[a-zA-Z0-9:_-]+$, not unanchored match().
  • Per-recv timeout — prevents a malicious broker from holding a connection open indefinitely.
  • IoT policy scope — explicit per-thing prefix in Resource, never *.
  • Camera S3 offload privacy kill-switch — operator can disable cloud camera publishing without code changes.

Carries review fixes from #195

iot-CA (CA-substitution MITM defence), iot-policy-scope (no wildcards), R22 (camera offload privacy kill-switch + ACL).

Stacking note

Independent of LAN-side Zenoh changes — depends only on PR-1 (#220), PR-2 (#223) for is_safe_* host validators, PR-4 (#221) for audit emit. Can land in parallel with PR-5/6/7. CI on this branch in isolation is expected red until #223 and #221 land.


Landing order: PR-1 → PR-2/3/4/5 → PR-6 → PR-7 → PR-8 (parallel with 6/7) → PR-9. Full plan: PR_LIST.md. Tracking: #219.

…+ scoped policy)

The IoT path is an alternative wire transport with its own threat
model: cloud-side certs, IoT policy wildcards, CA-substitution MITM,
camera URL capture. This PR closes those vectors.

Modified:
- strands_robots/mesh/iot/provision.py (+350/-10) -- CA pinning
  (defeats CA-substitution MITM), strict thing-name regex (anchored,
  not just match), per-recv timeout bound, IoT policy scope tightening
  (no Resource: '*' wildcards, per-thing topic prefixes only).
- strands_robots/mesh/iot/camera_offload.py (+35/-12) -- privacy
  kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE), schema validation,
  ACL on S3 PutObject path.
- strands_robots/mesh/iot/shadow.py (+2/-2) -- minor type fix.
- strands_robots/mesh/iot/__init__.py (+5/-5) -- export surface.

Carries review fixes from strands-labs#195: iot-CA (CA-substitution MITM defence),
iot-policy-scope (no wildcards), R22 (camera offload privacy kill-
switch + ACL).

Tests (896 LOC across 7 files): CA pin verification, thing-name regex
anchored, IoT policy scope (no '*' Resource), camera offload kill-
switch, S3 ACL, schema round-trip.

Independent of LAN-side Zenoh changes -- depends only on PR-2
(is_safe_* host validators) and PR-4 (audit emit). Can land in
parallel with PR-5/6/7.

Tracking: strands-labs#219
Source PR: strands-labs#195
Lands after: strands-labs#220 (PR-1), PR-2, PR-4
Copy link
Copy Markdown
Contributor

@yinsong1986 yinsong1986 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

PR-8 hardens the AWS IoT provisioning surface: pins the Amazon Root CA1 SHA-256, replaces the wildcard iot:Receive/Subscribe resources in both robot and operator policies with per-Thing scoped resources, validates Thing names against a strict alphanumeric subset before any AWS or filesystem call, and adds a per-socket-timed CA download path that avoids socket.setdefaulttimeout global mutation. The CameraOffloader default presigned-URL TTL drops from 3600s to 60s with a 1-hour cap.

The security story is mostly solid and well-tested (CA pin verification, on-disk re-use raw-checks the pin even under break-glass, O_NOFOLLOW on both read paths, scoped Receive policy regression-pinned). However, two of the four bullets in the PR description do not match what the diff actually does, which is the dominant reviewable concern.

What's good

  • CA pinning with break-glass-doesn't-apply-to-on-disk semantics is the right design and is regression-tested.
  • _THING_NAME_RE rationale (strict subset of AWS's charset due to NTFS / classic Mac : semantics) is well-documented in both the regex comment and the user-facing provision_robot docstring.
  • Scoped-Receive replacement is regression-pinned in test_iot_policy_scope.py so a future refactor that re-introduces topic/strands/* fails loudly.
  • _download_with_per_socket_timeout correctly avoids socket.setdefaulttimeout global mutation — a non-obvious but correct improvement.
  • 64 KiB body cap on the CA download defeats captive-portal-returns-multi-MB-HTML DoS.
  • Symlink-refusal on verify_ca_pin closes the asymmetric gap with _ensure_ca.

Concerns

  • PR description claims do not match the diff. Two bullets in the description are unsupported by code in this branch:
    1. "camera_offload.py — privacy kill-switch (STRANDS_MESH_CAMERA_OFFLOAD_DISABLE)" — no code in strands_robots/ reads STRANDS_MESH_CAMERA_OFFLOAD_DISABLE or STRANDS_MESH_CAMERA_DISABLED. The kill-switch tests pass for incidental reasons (see inline). If the kill-switch lands in a sibling PR, say so; otherwise this needs adding.
    2. "ACL on S3 PutObject path" — s3.put_object(...) in camera_offload.py:131-136 is unchanged and passes no ACL= kwarg. If bucket policy / ownership controls satisfy the threat model, drop the bullet; if PutObject ACL was intended, it's missing.
  • Reformat noise in __init__.py. Lines 5-7 / 23 / 30 collapse aligned columns to single-spaced text — unrelated to the PR's stated scope. Either keep the original alignment or call it out as deliberate cleanup; right now it just adds diff noise.
  • CA pin rotation operationally fragile. The pin tuple is hard-coded with one entry; the comment says rotation "ships as a code change" plus optional STRANDS_MESH_CA_PINS env var. Worth a follow-up issue to either auto-fetch from a signed manifest or document the runbook so on-call doesn't have to re-derive the recompute command at 3 AM.

Verification suggestions

# Confirm no production reader for the advertised kill-switch env var
rg -n 'STRANDS_MESH_CAMERA_(OFFLOAD_)?DISABLED' strands_robots/

# Confirm scoped-Receive really replaces the wildcard everywhere
python -c 'from strands_robots.mesh.iot.provision import _ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC; import json; print(json.dumps([_ROBOT_POLICY_DOC, _OPERATOR_POLICY_DOC], indent=2))' | rg ':topic/strands/\*'

# Round-trip the CA pin against the canonical URL
python -c "import hashlib, urllib.request as u; print(hashlib.sha256(u.urlopen('https://www.amazontrust.com/repository/AmazonRootCA1.pem').read()).hexdigest())"
# Should match strands_robots.mesh.iot.provision._AMAZON_ROOT_CA1_PINS[0]

hatch run test tests/mesh/test_iot_ca_pin.py tests/mesh/test_iot_policy_scope.py tests/mesh/test_iot_provision.py -v

@@ -120,7 +134,7 @@ def upload_frame(self, peer_id: str, cam_name: str, jpeg_bytes: bytes, ts: float
Body=jpeg_bytes,
ContentType="image/jpeg",
)
except Exception as exc:
except Exception as exc: # noqa: BLE001 -- boto3 raises ClientError, EndpointConnectionError, NoCredentialsError, etc.; offload is best-effort
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description bullet says this PR adds "ACL on S3 PutObject path" but put_object(...) is unchanged and passes no ACL= (or ObjectOwnership / BucketOwnerEnforced enforcement) kwarg. Either:

  1. Drop the bullet from the description because bucket-level ownership controls already satisfy the threat model (and document that decision in the module docstring), or
  2. Add the intended hardening, e.g. ACL="private" or ChecksumAlgorithm="SHA256", with a regression test pinning the kwarg.

Reviewers should not have to grep to discover the description doesn't match the diff (AGENTS.md > Review Learnings (#86) > "Match docstrings to semantics").

m = Mesh(MagicMock(), peer_id="cam-test")
with patch.object(m, "publish") as mock_put:
m._publish_cameras_once()
mock_put.assert_not_called()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not actually verify any privacy kill-switch.

Mesh(MagicMock(), ...) makes m.robot a MagicMock, so inner = getattr(r, "robot", None) is a truthy MagicMock and cam_cfg = getattr(getattr(inner, "config", None), "cameras", None) is also a MagicMock — not a dict. _publish_cameras_once then returns at if not isinstance(cam_cfg, dict) (core.py:421), not because of any kill-switch. Removing the monkeypatch.setenv("STRANDS_MESH_CAMERA_DISABLED", "true") line leaves the test passing.

Grepping strands_robots/ for STRANDS_MESH_CAMERA_DISABLED / STRANDS_MESH_CAMERA_OFFLOAD_DISABLE returns zero hits in production code, only this test and test_camera_privacy_kill_switch.py (which exercises _bool_env directly, not any caller). Per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes," this would have caught the missing implementation — instead it provides false reassurance.

Suggested fix: either land the kill-switch in _publish_cameras_once/_publish_cameras_once_with_offload and rewrite this test to set up a fake camera dict + mock inner.is_connected = True, or remove the test until the feature lands.

os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))
)
ttl_raw = presign_ttl or int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS)))
if ttl_raw > MAX_PRESIGN_TTL_SECONDS:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presign_ttl=0 (operator passing zero explicitly) is falsy, so the or short-circuits to int(os.getenv(...)) and returns the env default — not the floor of 1 the docstring/test suggests. test_zero_or_negative_clamped_up even asserts presign_ttl == DEFAULT_PRESIGN_TTL_SECONDS despite the test name implying clamping to 1.

This is a footgun: an operator who explicitly disables presigned URLs by passing presign_ttl=0 (a reasonable thing to try) silently gets a 60-second TTL, and the kwarg's documented "override" semantics quietly stop working. The same bug applies to presign_ttl=None vs. an explicit 0.

Suggested: ttl_raw = presign_ttl if presign_ttl is not None else int(os.getenv("STRANDS_MESH_CAMERA_PRESIGN_TTL", str(DEFAULT_PRESIGN_TTL_SECONDS))) and let the existing < 1 → 1 clamp handle the rest. Update test_zero_or_negative_clamped_up to actually assert presign_ttl == 1.

# up after rollout. Operators can also extend the accepted pins via
# ``STRANDS_MESH_CA_PINS`` (comma-separated 64-char lowercase hex). The
# env var augments the built-in tuple; it does not replace it.
_AMAZON_ROOT_CA1_PINS: tuple[str, ...] = ("2c43952ee9e000ff2acc4e2ed0897c0a72ad5fa72c3d934e81741cbd54f05bd1",)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-pin tuple at code level + optional env-var augmentation makes CA rotation operationally awkward: an AWS root rotation requires a tagged release of strands_robots, then waiting for fleets to upgrade, or every operator setting STRANDS_MESH_CA_PINS out-of-band.

Not a blocker for this PR, but worth a tracked follow-up issue per AGENTS.md > Project Dashboard ("create GitHub issues and add them to this board"): either ship a signed pins.json manifest the library can fetch, or at minimum document the rotation runbook (recompute command + grace-period strategy + how to land the new pin while the old one is still valid) in README.md. The recompute one-liner in the docstring is a start; an actual playbook is what on-call needs at 3 AM.

@cagataycali
Copy link
Copy Markdown
Member Author

🎯 Pentest evidence for this PR (#228 — IoT hardening)

Live run on 2026-05-26 against cagataycali/robots-pentest@dbfe2b0 (us-west-2 sandbox account 947951559549).

This PR's scope covers 5 confirmed findings, of which 1 is CRITICAL and 3 are HIGH. Full evidence + reproduction in BUGS.md + RESULTS.md.

Findings → fix mapping in this PR

Finding Sev What Where to fix
F-15 / B-09 High Robot-A successfully published forged response on strands/pentest-robot-b/response/<turn> (broker rc=0). Operator matches by turn_id only — first response wins. provision.py:_ROBOT_POLICY_DOC AllowResponseToAnyOperator.Resource — scope to ${iot:Connection.Thing.ThingName}/response/* instead of */response/*
F-16 / B-10 High 20 estops from a stolen leaf cert → 20 Lambda invocations. CloudWatch shows sender=unknown — Lambda can't tell legit operator from attacker. _ROBOT_POLICY_DOC.AllowSafetyEstop — either remove from robot policy entirely or restrict via IoT Rule SQL WHERE peer_id IN <operator-things>. Lambda extract principal() and reject non-operator.
F-19 / B-13 CRITICAL Walked the full leaked-claim-cert flow → registered pentest-rogue-1779842156 Thing with strands-mesh-role=robot attribute and strands-robot IoT policy attached. Rogue Thing left in account as evidence. bootstrap.py:_ensure_provisioning_template — always pass preProvisioningHook ARN. Default Lambda body returns allowProvisioning=False until operator overrides. Or refuse to enable template unless STRANDS_MESH_PROVISIONING_HOOK_ARN env is set.
F-20 / B-14 High Operator wrote shadow.reported.poc06 on every Thing in account (incl. non-strands Things). _OPERATOR_POLICY_DOC.OperatorShadow.Resource — scope to arn:aws:iot:*:*:thing/strands-* not arn:aws:iot:*:*:thing/*
F-21 / B-15 Medium All 3 IAM trust policies (strands-mesh-iot-action-role, strands-mesh-lambda-role, strands-mesh-provisioning-role) lack aws:SourceAccount / aws:SourceArn conditions. Lambda's inline iot:Publish uses arn:aws:iot:*:*:topic/strands/* (region/account wildcards). bootstrap.py:_ensure_*_role — add Condition: {StringEquals: {aws:SourceAccount: <bootstrapping-account>}, ArnLike: {aws:SourceArn: arn:aws:iot:<region>:<account>:rule/strands_*}}. Pin iot:Publish resources to bootstrapping region/account.

Pinned regression tests waiting for these fixes

From the pentest repo, ready to copy into tests/mesh/iot/ of this PR once the corresponding fix lands:

Assertion messages embed the F-NN/B-NN tags + RESULTS.md URLs for inline evidence.

Posture confirmed by this PR (also pin as positive controls)

  • F-17 / B-12 — per-robot iot:Subscribe correctly denies state/health/cmd/input/+/camera/+/$aws/things/+/shadow/get/accepted — only 3 of 9 SUBACK granted to a stolen-leaf attacker. ✅
  • F-22 / B-16 — cross-region cert reuse blocked at TLS handshake (us-west-2 cert vs us-east-1 broker). ✅

The test_robot_policy_denies_high_value_subscribes and test_robot_policy_allows_documented_recon_topics tests in test_pentest_b12_iot_subscription_scope.py already pass and pin this posture.


CC'ing this and the other 6 mesh-security PRs (#221/#222/#223/#224/#225/#227) in PLAN.md. Happy to push fix-commits to your branch if you want me to take a swing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

iot mesh Zenoh mesh networking / fleet coordination security

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants