feat(mesh): security hardening on Zenoh built-ins (mTLS + ACL + DoS bounds)#195
feat(mesh): security hardening on Zenoh built-ins (mTLS + ACL + DoS bounds)#195cagataycali wants to merge 71 commits into
Conversation
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
Replaces ~1700 LOC of hand-rolled application-layer mesh crypto (HMAC envelope, peer-identity registry, token bucket, nonce cache, etc.) with Zenoh built-ins: mTLS at transport/link/tls, ACLs at access_control, frequency caps at downsampling, byte caps at low_pass_filter, plus namespace-based fleet isolation. The application layer keeps payload-semantic validation (action allowlist, host/repo/policy-type allowlists), per-record HMAC + per-peer seq audit log with bounded rotation, and a HITL interrupt + sliding-window rate limit on the LLM-callable surface. Diff is large (+7337/-427 across 55 files) but scope is coherent and consistent with the PR description.
The commit message and architecture are well-documented; this is a careful piece of work and the four Zenoh 1.x quirks the author rediscovered (silent low_pass_filter no-op, namespace-stripping in key-expr matching, mandatory enabled: true, literal-only cert_common_names) are exactly the kind of finding that justifies the live red-team test suite.
What's good
- Zenoh-native primitives over hand-rolled crypto is the right architectural call; the resulting code is dramatically easier to reason about.
- Live two-peer red-team tests in
tests/mesh/test_redteam_zenoh.pyexercise mTLS handshake rejection, namespace isolation, downsampling, andlow_pass_filteragainst a realzenoh.open()session rather than mocks. - Rate-limit check/record split (
_rate_limit_checkvs_rate_limit_recordintools/robot_mesh.py) correctly avoids consuming a slot on a declined HITL prompt - the right safety-vs-DoS tradeoff. - Audit log integrity story (per-record HMAC + per-peer seq sidecar + cross-process flock +
O_NOFOLLOW+ bounded rotation + tamper-aware verifier cursor) is well thought through and pinned to regression tests. transport/link/protocols=["tls"]lock prevents plain-TCP downgrade; that's a defence many such designs miss.- Clear separation of concerns:
_zenoh_config.py(transport),_acl_config.py(authorisation),security.py(payload semantics),audit.py(forensics).
Concerns
- JSON5-lite preprocessor is untested at the parser level.
_acl_config.pyships ~120 lines of hand-rolled JSON5 stripping (_strip_json5_comments,_quote_unquoted_keys,_strip_trailing_commas) but the loader tests intest_acl_config.pyall feed canonicaljson.dumps()output. The example templateexamples/mesh_acl_example.json5exercises every feature the preprocessor handles, but no test loads it, so a regression in the preprocessor wouldn't be caught. Consider adding a fixture that loads the shipped example file end-to-end and a parametrised test for tricky inputs (strings containing//,,adjacent to], escaped quotes, etc.). See inline. - No live-session test for the default (permissive) ACL.
test_unknown_cn_dropped_by_default_denyintest_redteam_zenoh.pyexercises the operator-suppliedSTRANDS_MESH_ACL_FILEtemplate, not the built-indefault_acl()returned when the env var is unset. Given the PR's own quirks list (enabled: trueis required; subject withoutcert_common_namesmay or may not match) the assertion that the default ACL is actually permissive end-to-end relies on author-side reasoning rather than a passing test. is_safe_policy_hostaccepts hostnames literally without DNS resolution. When an operator extendsSTRANDS_MESH_POLICY_HOST_ALLOWwithvla.internal, anyone who can poison DNS forvla.internalredirects every robot's policy traffic to a malicious target. This is a documented assumption (the operator trusts their resolver), but the docstring onis_safe_policy_hostshould call it out so deployments with weak DNS don't silently expand the trust boundary by adding a hostname.- CHANGELOG / PR-description claim drift. The PR description (§8) says: "the test verifies the deny case live (a peer with a valid CA cert but a CN not enumerated in the ACL is dropped at the transport)." In
test_redteam_zenoh.pytwo of the closely related tests (test_robot_cert_cannot_publish_on_cmd,test_op_cert_can_publish_on_cmd) are skipped with "need more research" - the PR description does call out the skips honestly in §12, but the threat-vector matrix in §3 marks vector #1 (publish without cert) and vector #9 (spoof presence) both as "Mitigated" using onlytest_rogue_ca_cert_rejected. That single test exercises a different (CA-substitution) failure mode, not the in-fleet rogue-CN case. The skip is honest in §12; the matrix in §3 is over-claiming. - PR-description doc drift. The description says thing-name regex was tightened to
^[a-zA-Z0-9_:-]+$(with colon); the actual regex atmesh/iot/provision.py:296is^[a-zA-Z0-9_-]{1,128}$(no colon, with length cap). The code is fine; the description should be updated. - Pre-1.0 breaking changes are well-documented.
STRANDS_MESH_PSKand the rest of the legacy env vars are gone; the migration path in CHANGELOG.md is clear. No concern, just noting it satisfies the AGENTS.md convention.
Verification suggestions
- Spot-check the default-ACL assertion live: spin up two peers without setting
STRANDS_MESH_ACL_FILE, both holding CA-signed certs with arbitrary CNs, confirm acmdround-trips. Then set the env var to the example template (with neither CN enumerated) and confirm the same put is dropped. This is the residual gap I'd most want closed before this lands on a real fleet. - Run the JSON5-lite preprocessor against
examples/mesh_acl_example.json5directly (json.loads(_json5_to_json(open(path).read()))) and assert it round-trips to the expected dict. Cheap to add and pins down the only piece of custom parsing in this PR. hatch run test tests/mesh/withSTRANDS_MESH_AUDIT_PSK=fooand confirmverify_audit_integrity({sequence_gaps})is empty after 1000+ events across multiple peer_ids and a forced rotation.
| "policy and any rule gap exposes the mesh. Prefer 'deny'.", | ||
| path, | ||
| ) | ||
| return data |
There was a problem hiding this comment.
The JSON5-lite preprocessor is untested at the unit level. _load_acl_file is the only entry point that exercises _strip_json5_comments, _quote_unquoted_keys, and _strip_trailing_commas, but every test in test_acl_config.py feeds it pure json.dumps() output (lines 87, 102, 109, 116, 127, 135 and the rest). The PR ships a 100-line example template at examples/mesh_acl_example.json5 that uses every JSON5 feature - comments, trailing commas, unquoted keys - and none of it is exercised by a regression test.
A few inputs that look plausibly broken to me but I haven't been able to verify:
- A string value containing
//(e.g."comment": "see https://example.com") - the strip-comments pass tracks string state but escaped quotes inside the value land on a path whereout.append(nxt)runs unconditionally, which I'd want a test to pin. _quote_unquoted_keyschecksout[-1] in "{,\n \t"(line 163); after a previously-quoted key is appended as the multi-character string'"key":',out[-1]is that multi-char token and theincheck becomes a substring search that returns False. So a second consecutive unquoted key on the same line wouldn't be quoted. May or may not be reachable depending on the formatter operators use.
Suggested fix: add def test_loads_shipped_example_file() that points STRANDS_MESH_ACL_FILE at examples/mesh_acl_example.json5 and asserts resolve_acl("strands") returns the expected dict. Cheap, catches regressions in the only piece of hand-rolled parsing in this PR.
There was a problem hiding this comment.
Already addressed in the current code:
tests/mesh/test_acl_config.py::TestExampleFile::test_example_file_loads_and_parses(line 168) loadsexamples/mesh_acl_example.json5end-to-end through_json5_to_json()and asserts the full structure.tests/mesh/test_acl_config.py::TestLoadACLFile::test_load_acl_file_round_trip_with_example(line 427) exercises_load_acl_file()against the shipped example.TestJSON5Preprocessor(line 215+) covers parametrised edge cases: strings containing//, trailing commas adjacent to], unquoted keys, block comments spanning lines.
All pass in CI (81e80cd).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| "subjects": ["any_authenticated_peer"], | ||
| }, | ||
| ], | ||
| } |
There was a problem hiding this comment.
The default ACL has no live red-team test. test_redteam_zenoh.py::TestACLEnforcement::test_unknown_cn_dropped_by_default_deny exercises the operator-supplied STRANDS_MESH_ACL_FILE path, not this default_acl() skeleton.
The PR description (§8) and the README both promise the default is "permissive ... any peer with a CA-signed cert may publish and subscribe on any key". But the dict returned here sets default_permission: "deny" and relies on the any_authenticated_peer subject (which has only interfaces populated, no cert_common_names) to match every authenticated peer at the rule-evaluation step.
Given the PR's own discovered quirk #4 (cert_common_names is literal-only - the absent field's matching semantics is exactly the kind of empirical edge case that turned up the other three quirks), the claim that this skeleton accepts all CA-signed peers should be pinned by a passing two-peer test in test_redteam_zenoh.py: peer A and peer B both have CA-signed certs with arbitrary CNs, no STRANDS_MESH_ACL_FILE set, B subscribes, A publishes, B receives. Without that test, the "permissive" claim is verified only by author reasoning.
There was a problem hiding this comment.
Acknowledged. The negative case (rogue CN dropped) is covered by test_unknown_cn_dropped_by_default_deny, but the positive case (default_acl() truly admits any CA-signed peer) relies on code inspection rather than a live two-peer test.
Added to PR description section 12 (Follow-ups): "Live two-peer default-ACL smoke test (confirm default_acl() truly admits any CA-signed peer) -- tracked as a pre-GA validation item."
The test_acl_config.py::TestDefaultACL suite (8 tests) validates the structural shape of the dict, and the Zenoh quirk documentation in comments makes the design reasoning explicit. The residual gap is empirical confirmation against the wheel, which is best done in a dedicated integration test (needs Zenoh runtime).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| continue | ||
| if ip in net: | ||
| return True | ||
| return False |
There was a problem hiding this comment.
No DNS resolution means the host allowlist is only as trustworthy as the resolver. When an operator extends STRANDS_MESH_POLICY_HOST_ALLOW with a hostname (the example in the docstring is vla.internal), is_safe_policy_host("vla.internal") returns True purely by literal string match - we never check what vla.internal actually resolves to. Anyone who can spoof DNS on the robot's network (LAN MITM, cache poisoning, malicious DHCP-supplied DNS) redirects every robot's policy traffic.
This is probably the right design (DNS resolution would be its own can of worms with TOCTOU between validation and connection), but the docstring should call it out explicitly so an operator considering adding a hostname understands they're trusting their resolver. Suggested wording for the docstring: "Hostnames are matched literally; the actual A/AAAA records are not validated. Adding foo.internal to the allowlist trusts the local resolver to map it to a non-attacker-controlled IP."
As an alternative defence, consider rejecting hostname entries entirely and requiring CIDRs / IP literals from operators - more pain at config time, but no DNS in the trust path. This is operator-judgement; the documentation gap is the actionable item.
There was a problem hiding this comment.
Already addressed in the current code. The is_safe_policy_host docstring at security.py:170-178 includes:
.. warning::
Hostname entries are matched LITERALLY against the caller's input string;
no DNS resolution is performed at allowlist time. Adding ``vla.internal``
to the allowlist therefore implicitly trusts whatever resolver the
inference call uses at runtime. Deployments on a hostile or weak DNS path
should prefer IP literals or CIDR ranges (``10.0.0.0/24``) over hostnames
so the trust boundary stays under operator control.
This is exactly the documentation the review requested.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| # AWS IoT enforces a similar pattern server-side, but we duplicate the | ||
| # check here so a slip in upstream validation can never produce a path | ||
| # such as ``../../../etc/foo`` reaching ``cert_dir / f"{thing_name}.pem"``. | ||
| _THING_NAME_RE = re.compile(r"^[a-zA-Z0-9_-]{1,128}$") |
There was a problem hiding this comment.
Doc drift in PR description. The PR description §5 / §8 says: "Thing-name regex tightened to ^[a-zA-Z0-9_:-]+$" (with colon). The actual regex here is ^[a-zA-Z0-9_-]{1,128}$ - no colon, with a length cap. The code is fine (and stricter is better - no colon means no S3-key surprises like cert/foo:bar.pem), but please update the description so a reader cross-checking the diff against the description doesn't mistake it for a missed deletion.
Same note: README and CHANGELOG should both reflect the actual regex, not the colon-bearing variant from an earlier draft.
There was a problem hiding this comment.
Good catch. The PR description previously stated the regex included colon; the actual regex is ^[a-zA-Z0-9_-]{1,128}$ (no colon, with length cap). Updated the PR description to remove the stale reference in the follow-ups section. The code is correct (stricter is better -- no colon avoids S3-key / filesystem ambiguity).
The README and CHANGELOG both describe it as "tightened" without specifying the exact regex pattern, which is accurate.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
2e853d2 to
d9beab8
Compare
Diff cleanup pass (cosmetic-only files reverted)To make this PR easier to review, dropped 19 files that contained Per-file punctuation-only line restorations also pulled inside the Stats:
Force-pushed |
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
Large, well-organised refactor that replaces the hand-rolled HMAC envelope / nonce cache / token bucket / per-peer identity with Zenoh built-ins (mTLS at transport/link/tls, ACL on cert CN, downsampling, low_pass_filter, namespace) and keeps a tighter payload-validation layer in mesh/security.py. The CHANGELOG, README and threat-vector matrix are kept in sync, the four documented Zenoh 1.x quirks are pinned to regression tests, and the audit log gains per-record HMAC + per-peer seq + cross-process flock + bounded rotation. Scope is disciplined; the simulation tests appear untouched, branched cleanly off current main.
What's good
- Empirical Zenoh 1.x quirk discoveries (silent
low_pass_filterno-op withoutinterfaces, namespace-strippedkey_exprs, missingenabled: true, literal-CN matching) each get a named test intests/mesh/test_redteam_zenoh.pyandtest_acl_config.py. tools/robot_mesh.pycorrectly separates_rate_limit_checkfrom_rate_limit_recordso a declined HITL approval does not consume a rate-limit slot, and validates the broadcast command before the interrupt prompt so operators never approve a payload the validator will reject.- IoT CA-pin path NEVER honours
STRANDS_MESH_DISABLE_CA_PINfor the on-disk reuse path — a rogue CA from a prior compromised run can no longer be silently re-used. verify_audit_integritykeeps per-peer cursors so a tampered record cannot mask a real gap.read_audit_lognow spans rotated copies.AuditPSKDegradedErrorcloses the env-unset-mid-run downgrade.- Public-API hygiene:
_put_signedwas correctly renamed toMesh.publishandiot/camera_offload.pynow goes through it, matching AGENTS.md > Review Learnings (#86) > 'Public API Hygiene'.
Concerns
The biggest open question is the remote-resume replay window (inline). The cryptographic shape HMAC(override_code, proof_nonce) with the issuer choosing the nonce gives no replay defence between fellow operator-class peers — anyone on the ACL can record a valid resume and replay it during a future lockout. The mTLS handshake protects against off-link attackers; the override-code second factor was meant to defend against compromised-but-authenticated peers, and as written it does not. Either bind the proof to a receiver-issued challenge or have receivers track proof_nonce values seen in the recent past (the 6.99 PSK envelope this PR removed had a per-peer nonce cache; the equivalent here would be a _seen_resume_nonces LRU on every receiver).
A secondary worry is that the default ACL ships subjects with no cert_common_names field at all. Per the PR's own R&D note (CHANGELOG 'cert_common_names matches LITERAL CNs only — globs do not work in Zenoh 1.x'), and the symmetric quirk that 'Subject interfaces must be a non-empty list — leaving it unset makes the subject match nothing', it is not obvious that omitting cert_common_names means 'match every CN' rather than 'match no CN'. The README and docstrings call this configuration 'permissive', but tests/mesh/test_redteam_zenoh.py::TestACLEnforcement only exercises an ACL that explicitly enumerates CNs (line 466). A first-run operator who relies on the documented permissive default may discover their fleet is silently dropping every put. Worth adding a live two-peer test that opens two sessions under default_acl() and asserts a put round-trips, before merging.
verify_audit_integrity has a subtle gap (inline): when a record arrives with no sig but the verifier has a PSK configured (missing_sig case), the per-peer cursor still advances. That defeats the 'bad signature != cursor advance' invariant for the case where an attacker who cannot compute a signature simply omits the field; the seq value they wrote is still trusted to advance the cursor.
Minor: scope is good, but mesh/iot/camera_offload.py switching the /ref topic from transport.put to mesh.publish and _publish_cameras_once adding a privacy kill-switch don't appear in the threat-vector matrix. Worth a one-line CHANGELOG note about the new STRANDS_MESH_CAMERA_DISABLED env var being a real behavioural switch (it's documented in README but not in CHANGELOG's 'New env vars' table).
Verification suggestions
# 1. Live-default-ACL smoke: open two zenoh sessions under default_acl()
# with valid CA-signed certs and assert that a put on **/cmd is
# received. This is what the README claims works out-of-the-box.
hatch run test tests/mesh/test_redteam_zenoh.py -k 'default_acl' -v
# 2. Audit gap-detection regression: write three records, tamper the
# middle one to drop its sig (simulate forgery by an attacker
# without the PSK), assert verify_audit_integrity reports
# sequence_gaps non-empty AND missing_sig > 0.
hatch run test tests/mesh/test_audit_integrity.py -v
# 3. Resume-replay regression: capture a valid resume envelope into
# a recording, engage a second lockout, replay the same envelope,
# assert _on_safety_resume rejects.
hatch run test tests/mesh/test_robot_mesh_security.py -k 'resume' -v
# 4. Full mesh suite to confirm the 644-pass / 2-skip claim:
hatch run test tests/mesh -v 2>&1 | tail -20| { | ||
| "id": "any_authenticated_peer", | ||
| "interfaces": _local_interfaces(), | ||
| }, |
There was a problem hiding this comment.
The default subject has interfaces populated but no cert_common_names field at all. Given the four documented Zenoh 1.x quirks (enabled: true required, empty interfaces is a silent zero-match, cert_common_names is literal-only, etc.), it is not obvious whether omitting cert_common_names means 'match every authenticated peer' (the docstring's claim, line 222–237) or 'match no peer' (the symmetric reading of the interfaces quirk).
tests/mesh/test_redteam_zenoh.py::TestACLEnforcement::test_unknown_cn_dropped_by_default_deny (line 597) and the surrounding tests all use a custom ACL with explicit CNs (line 466). There is no live-Zenoh test that opens two sessions under exactly default_acl() and asserts a put on **/cmd round-trips end-to-end.
If the omit-means-allow reading turns out to be wrong, every first-run deployment that follows the README's 'ACL is permissive by default' path silently drops every message — the silent-total-outage failure mode the docstring explicitly says it was avoiding.
Please either (a) add a 'cert_common_names': ['**']-or-equivalent fallback if Zenoh has one, or (b) add a test_default_acl_allows_signed_traffic two-peer live test against the wheel before merging, and reference it in the docstring.
| prev = last_seq_by_peer.get(peer) | ||
| if prev is not None and seq != prev + 1: | ||
| gaps.append((prev, seq)) | ||
| last_seq_by_peer[peer] = seq |
There was a problem hiding this comment.
Cursor advances on missing_sig records when a PSK is configured. The check at line 883–884 only short-circuits on record_is_bad = True, which is set in the bad_signature branch (line 873) but NOT in the missing_sig branch (line 875–876). So a forged record that simply omits the sig field (the natural attack for someone who cannot compute the HMAC) still updates last_seq_by_peer[peer] to the forged seq value.
This defeats the 'tamper-aware cursor' claim in the PR description and the module docstring (line 22–33) for the simpler attack: instead of editing a record's content (which bad_signature catches and the cursor honours), an attacker just deletes the sig field, and the cursor jumps to whatever forged seq they wrote.
The ok field at line 905 does include not (psk_present and missing_sig > 0) which surfaces the missing-sig count, so callers checking ok are protected. But the per-peer sequence_gaps list is silently disarmed for this attack class — a sweep tool that reports 'no gaps' will miss any deletions hidden behind a forged unsigned record.
Fix: in the missing_sig branch, set record_is_bad = True (or factor the cursor-advance condition more conservatively to require a verified signature when the verifier has the PSK).
There was a problem hiding this comment.
Already fixed in the current code. Looking at audit.py:891-893:
if psk_present:
missing_sig += 1
record_is_bad = TrueThe record_is_bad = True in the missing_sig branch ensures the cursor does NOT advance (the if record_is_bad: continue at line 900 skips past the cursor-update logic). This was fixed before the PR was opened as part of the R4 audit-integrity hardening.
The subsequent check at line 922 also catches it: "ok": bad_signature == 0 and not gaps and not (psk_present and missing_sig > 0).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| # Check the per-action rate limit before doing any work — but | ||
| # do NOT consume a slot until we know the action is going to run. | ||
| # See _rate_limit_check / _rate_limit_record for rationale. | ||
| rl_err = _rate_limit_check(action) |
There was a problem hiding this comment.
TOCTOU window between _rate_limit_check (line 281) and _rate_limit_record (line 358). The HITL interrupt at line 318 can take seconds-to-minutes to return; concurrent invocations of the same action all pass the rate-limit check, all hit the interrupt, all get approved (operator under attack-load might bulk-approve), and only then race to record — by which point all of them succeed, exceeding the configured cap. For emergency_stop capped at '3 calls/min' this is mostly fine, but for an LLM-driven workload that funnels through the same agent the cap can be exceeded by a factor equal to the concurrency.
Not a critical bug — the rate limit is documented as 'bounding LLM-driven nuisance, not inhibiting a genuine emergency' (line 100–101) — but it's worth either:
- Reserving the slot at check time and releasing on declined approval, OR
- Documenting the burst-on-concurrency caveat next to the
_RATE_LIMITStable at line 51.
Whichever you pick, please pin it with a regression test (tests/mesh/test_robot_mesh_security.py) that fires N concurrent broadcasts and asserts only cap of them ever reach _audit_tool_action(success=True).
| # Bracketed IPv6 literal: reject (default allowlist is loopback; | ||
| # operators wanting IPv6 use a hostname + STRANDS_MESH_POLICY_HOST_ALLOW). | ||
| return False | ||
| host = s.rsplit(":", 1)[0] if ":" in s else s |
There was a problem hiding this comment.
rsplit(":", 1)[0] strips one suffix even when there is no port. A bare IPv6 literal that slipped past the [ rejection at line 291–294 — e.g. an operator who set STRANDS_MESH_POLICY_HOST_ALLOW=fe80::1 and a peer that supplied server_address="fe80::1:8080" — would have :8080 stripped to fe80::1, which then matches the allowlist. A more pathological supplier sending "evil.com:80:443" would be normalised to evil.com:80 and then to evil.com, which is not what the validator means.
The scheme-strip at line 288–290 has the same single-occurrence assumption.
Suggested tightening: parse with urllib.parse.urlsplit for the scheme path, and split on the rightmost colon ONLY if the suffix is purely digits in [1, 65535]. Otherwise reject as malformed. The current implementation is forgiving in the wrong direction — it normalises things that should be rejected outright.
See AGENTS.md > Review Learnings (#92) > 'Reject shell metacharacters in paths' for the spirit of fail-fast input validation on agent-callable surfaces.
d9beab8 to
c9ba286
Compare
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This PR replaces the hand-rolled application-layer crypto (HMAC envelope, nonce cache, peer identity, token bucket) with Zenoh built-ins (mTLS, key-expression ACLs, downsampling, low_pass_filter, namespace) and adds a payload-validation layer (mesh/security.py) plus hardened audit logging (per-record HMAC, per-peer monotonic seq with sidecar + flock, bounded rotation, rotated-read). The wire-layer threat model (vectors 1, 2, 9, 10) genuinely shifts from custom code to Zenoh primitives, which is the right direction. The defence-in-depth structure (transport ACL -> payload allowlists -> tool-layer HITL -> independent audit log) is well-decomposed.
What's good
- The Zenoh-1.x quirks called out in section 5 (
enabled: truerequirement,**/cmdglob form, literal-CN matching, non-emptyinterfaces) are real footguns; documenting them in code comments AND failing closed in_load_acl_fileis the right defensive posture. - Audit-log integrity is comprehensive: per-record HMAC, per-peer seq counters with cross-process flock, tamper-aware cursor in
verify_audit_integrity(bad-sig records do NOT advance per-peer cursor),read_audit_logspans rotated copies,AuditPSKDegradedErrorblocks silent unsigned downgrade, O_NOFOLLOW on both the audit file and the seq sidecar. - Resume RPC is well-defended: HMAC(override_code, proof_nonce) + freshness window + forward-skew + per-receiver replay cache (
_resume_replay_cache) with bounded LRU eviction. Receivers withoutSTRANDS_MESH_OVERRIDE_CODEfail closed. - Bridge dedup uses full 256-bit SHA-256 (no birthday-attack truncation) over
(sender_id, turn_id, command), with bounded GC and TTL eviction. - IoT CA-pin re-use path NEVER honours
STRANDS_MESH_DISABLE_CA_PIN-- the on-disk-rogue-CA-from-prior-compromised-run scenario is closed cleanly. - Every reviewed design decision is pinned to a regression test (
test_pentest_findings.py,test_redteam_zenoh.py, etc.), per AGENTS.md > Review Learnings > 'Pin regression tests for reviewed fixes'. - 7K-line PR contains no host-path test fixtures (
test_no_host_paths.pyclean) and no emojis in user-facing strings.
Concerns
- Scope and reviewability. +7779 / -148 across 37 files in a single commit is a lot for any reviewer to audit thoroughly. The squashing is justified in the description (clean rebase of #194 minus the unintended camera-recording revert), but the diff would have been much easier to land as a stack of 4-5 logically-separated PRs (Zenoh config builder; security.py; audit-log hardening; bridge dedup; IoT CA pin). Future hardening work in this area should split.
- Permissive default ACL.
default_acl()ships withdefault_permission: 'deny'BUT allow-rules over**for bothputanddeclare_subscriber-- meaning any peer that passes mTLS can publish/subscribe on anything. The PR description and module docstring acknowledge this and explain the rationale (default-deny with no enumerated subjects = silent total outage on first run). It IS the right call given Zenoh 1.x's literal-CN constraint, but operators MUST be steered hard towardSTRANDS_MESH_ACL_FILEin production. The README does this; consider surfacing a WARNING at session-open when the default ACL is used inmtlsmode (similar to thenone-mode warning). - Two skipped ACL tests. The skipped
test_robot_cert_cannot_publish_on_cmd/test_op_cert_can_publish_on_cmdcases are exactly the role-separation enforcement claims the README makes (robot-*cannot publish cmd;op-*can). The 'security-critical case (rogue CN dropped) is covered bytest_unknown_cn_dropped_by_default_deny' justification holds, but the role-separation claim is currently asserted by static config rather than live wire test. Track as a follow-up issue per AGENTS.md. - Migration breakage is significant. Section 9 documents removing
STRANDS_MESH_PSK/_PEER_KEY*/_REPLAY_WINDOW/_PEER_RATE/_REQUIRE_AUTH/_REQUIRE_PEER_IDENTITY/_PEER_IDENTITYenv vars in one commit. CHANGELOG covers this but the migration runbook (provision a CA + per-peer leaf certs, set 3 new env vars on every peer) is non-trivial for any deployment that wasn't already on AWS IoT. Consider a deprecation cycle if pre-1.0 doesn't allow you the freedom to land this in one go.
Verification suggestions
# Live two-peer mTLS + ACL coverage (skipped on systems without Zenoh 1.x):
hatch run pytest tests/mesh/test_redteam_zenoh.py -v
# Audit-log integrity end-to-end (rotation + signed reads + tamper detection):
hatch run pytest tests/mesh/test_audit_integrity.py tests/mesh/test_pentest_findings.py -v
# Resume replay + freshness:
hatch run pytest tests/mesh/test_resume_replay.py -v
# Confirm no host paths leaked into new test fixtures:
hatch run pytest tests/mesh/test_no_host_paths.py -v 2>/dev/null || \
python -c "import re,glob; bad=[f for f in glob.glob('tests/mesh/**/*.py',recursive=True) if re.search(r'/Users/|/home/[a-z]+/',open(f).read())]; print('OK' if not bad else bad)"
# Smoke-test the env-var migration: confirm none of the removed vars are read anywhere:
grep -rn 'STRANDS_MESH_PSK\|STRANDS_MESH_PEER_KEY\|STRANDS_MESH_REPLAY_WINDOW\|STRANDS_MESH_PEER_RATE\|STRANDS_MESH_REQUIRE_AUTH\|STRANDS_MESH_REQUIRE_PEER_IDENTITY\|STRANDS_MESH_PEER_IDENTITY' strands_robots/ tests/ examples/ docs/ README.md
# Should return only AGENTS.md / CHANGELOG.md historical references.A few inline points below on payload-layer details that are worth a second pass before merge.
| ) | ||
| out["policy_provider"] = value.strip().lower() | ||
| else: | ||
| out["policy_provider"] = "mock" |
There was a problem hiding this comment.
Silently defaulting policy_provider = "mock" when an authenticated peer issues an execute/start command without that field is convenient for legacy callers but surprising as a security default: a malformed peer command will run the mock policy on the robot rather than producing an error. Two implications:
- The dispatcher path (
Mesh._exec_cmd->r._execute_task_sync/r.start_task) sees a validpolicy_provider, so the audit trail records a successful execute againstmockrather than a validation failure. - A peer that legitimately wanted
mockand a peer that just forgot the field are indistinguishable on the wire.
Consider raising ValidationError("execute/start requires policy_provider; use 'mock' explicitly for the noop policy") and migrating the small number of callers that relied on the implicit default. Per AGENTS.md > Review Learnings (#86) > "Reject silently-dropped kwargs": silent drops/defaults on a security boundary are bugs masquerading as features.
| _ = namespace # `namespace` config does the routing isolation; ACL key_exprs do not need it | ||
| return { | ||
| "enabled": True, | ||
| "default_permission": "deny", |
There was a problem hiding this comment.
The default_permission: "deny" line here is misleading on first read because the very next block adds two permission: "allow" rules over key_exprs: ["**"] for both put and declare_subscriber, so the effective behaviour is allow-any. The PR description and module docstring (lines 218-236) explain why -- Zenoh 1.x can't glob CNs, so a true default-deny would block every legitimate first-run message. Recommend at minimum:
- Add an inline comment IMMEDIATELY adjacent to the dict literal making it explicit that the deny default is defeated by design by the
**allow-rules below, and the actual gate is the mTLS handshake. - Consider surfacing a WARNING at
Mesh.start()when this default ACL is in effect ANDSTRANDS_MESH_AUTH_MODE=mtls, parallel to the existing WARNING forauth_mode=none. Operators who forget to setSTRANDS_MESH_ACL_FILEin production should hear about it on every session open, not only when they read the README.
| # Approval granted — consume the slot now. A concurrent caller | ||
| # that raced past the check above will see the slot once | ||
| # _rate_limit_record has written it. | ||
| _rate_limit_record(action) |
There was a problem hiding this comment.
TOCTOU between _rate_limit_check (line 281) and _rate_limit_record here: each takes _RATE_LOCK independently, with the operator interrupt between them. Two concurrent invocations of emergency_stop (limit 3/60s) can each pass the check, both surface an interrupt, both get approved, and both reach this _rate_limit_record call -- briefly exceeding the configured limit.
For emergency_stop this is probably fine in practice (operators serialise interrupts, and an over-permissive rate limit on a SAFETY action is the right failure mode). For broadcast (10/60s) the over-limit window is wider. The split is intentional per _rate_limit_check's docstring (declined approvals must not consume slots), but the trade-off should be made explicit in a comment here. Consider also: collapse both check + record into one call when the action is NOT in _INTERRUPT_REQUIRED, and guard the interrupt-approved path with a re-check inside _RATE_LOCK before recording.
| "event": event_type, | ||
| "peer_id": peer_id, | ||
| "payload": payload, | ||
| "seq": _next_seq(peer_id), |
There was a problem hiding this comment.
_next_seq(peer_id) is invoked OUTSIDE the surrounding try/except (line 681 onwards), but the docstring contract on log_safety_event (lines 654-656) says "Raises: Nothing -- write errors are logged at WARNING and swallowed because an audit-log failure must never propagate up into the safety code path."
Most paths inside _next_seq are fail-soft (the OSError/JSONDecodeError handlers in _load_seq_counters / _persist_seq_counters / _seq_flock), but a few corners can still raise: audit_log_path() if STRANDS_MESH_AUDIT_DIR resolves to a path whose .parent raises during mkdir, Path.home() if HOME is unset on a constrained container, _seq_lockfile_path().parent.mkdir(...) if the dir is read-only and the inner try/except OSError is bypassed.
Wrap _next_seq(peer_id) in the same try/except OSError pattern as the file write below, or move the whole record-construction into the existing with _WRITE_LOCK: try block. Per AGENTS.md > Review Learnings (#86) > "Resource Cleanup on Partial Failure" / "Match docstrings to semantics".
c9ba286 to
a67e962
Compare
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This PR replaces ~1700 lines of hand-rolled application-layer crypto (HMAC envelope, nonce cache, peer-identity, token bucket) with Zenoh's built-in mTLS + access-control + downsampling + low_pass_filter primitives, and hardens the surviving payload-validation, audit-log, and IoT-provisioning paths. The scope is large but largely self-consistent: a mesh/security.py payload-allowlist module, a mesh/_zenoh_config.py + mesh/_acl_config.py pair owning every insert_json5 call, an audited resume-with-override-proof flow, and a sizeable test suite (tests/mesh/test_redteam_zenoh.py exercises real two-peer Zenoh sessions; test_audit_integrity.py and test_iot_ca_pin.py pin the safety-side behaviours).
The direction is right and the regression tests in tests/mesh/ follow the AGENTS.md "pin reviewed fixes" rule well. A few asymmetries between the resume and estop paths, the ACL loader's missing symlink/path-traversal guard, and a brittle dependency on psutil-derived NIC names in the default ACL are worth a second pass before merge.
What's good
- Override-code resume is genuinely two-factor:
_on_safety_resumenow validates HMAC(override, nonce), envelope freshness, AND a per-receiver replay cache. That's the right shape and the cache is bounded. - Audit log: per-peer monotonic seq with cross-process
fcntl.flock, sidecar persistence,O_NOFOLLOWon the active log + sidecar, rotation across rotated copies, andAuditPSKDegradedErroron PSK-unset-mid-run is a thoughtful threat model. - IoT CA-pin path correctly forbids
STRANDS_MESH_DISABLE_CA_PINfrom applying to the on-disk re-use branch (closes the rogue-CA-from-prior-run reuse). The per-socket-timeoutHTTPSHandlerinstead ofsocket.setdefaulttimeout(...)is the right call. - 643 mesh tests + the live-Zenoh red-team suite is exactly the verification depth a security PR of this size needs.
__all__lists are clean of_-prefixed names; no dead code from the deletedidentity.py/TokenBucketpaths.
Concerns
except Exceptionis pervasive acrossmesh/core.pyandmesh/sensors.py(perrg, ~70+ occurrences in the changed mesh files). AGENTS.md > "Exception Clauses Must Be Narrow" forbids this for non-recovery paths. Several of the safety-side ones (_on_safety_estop,_on_safety_resume,_on_cmd) silently swallow parse failures with no audit record — exactly the forensic gap the audit log was built to close. Worth a pass to narrow these to(json.JSONDecodeError, UnicodeDecodeError, AttributeError)and emit at least a debug-line + an auditparse_failurerecord.- Asymmetry between
_on_safety_estopand_on_safety_resume: estop has neither freshness nor replay protection; resume has both. This is consistent with the design (estop is fail-safe so replays "succeed" harmlessly), but a captured estop envelope can be replayed indefinitely to whip receivers between lockout and resume — see inline comment. - Documentation/code mismatch on "default permissive ACL": the README and PR body say "permissive", the actual default uses
default_permission: "deny"+ a single**allow rule scoped to_local_interfaces(). Ifpsutil.net_if_addrs()returns no interface names (some restricted containers, headless network namespaces) the subject matches nothing and the deny-default takes over — silent total outage. See inline comment. mesh/_acl_config.pysymlink/traversal:STRANDS_MESH_ACL_FILEis loaded with noO_NOFOLLOWand no resolution against an expected dir. Inconsistent with the audit-log path's symlink defence. See inline.- CHANGELOG / migration: the BREAKING change list is thorough but
tools/robot_mesh.py'sconfirm: boolremoval is documented in §9 of the PR body but not inCHANGELOG.md. Operators reading just the changelog won't catch it.
Verification suggestions
STRANDS_MESH_AUTH_MODE=mtls STRANDS_MESH_ACL_FILE=/tmp/no-such hatch run python -c "from strands_robots.mesh._acl_config import resolve_acl; resolve_acl('strands')"— confirm the FileNotFoundError surfaces clearly.- Run
tests/mesh/test_redteam_zenoh.py::TestACLEnforcement::test_unknown_cn_dropped_by_default_denyin a container withpsutil.net_if_addrs()returning onlylo— confirms the default ACL still admits a valid CN over loopback. - Manually replay a captured
strands/safety/estopenvelope withzenoh-cli put: today this re-engages lockout silently. Compare to a replayedsafety/resume(rejected with audit record). grep -nP '[^\x00-\x7F]' strands_robots/mesh/*.py tests/mesh/*.py— confirm no stray emoji / combining marks slipped in (AGENTS.md > Unicode hygiene).
| return | ||
| if not isinstance(data, dict): | ||
| return | ||
| if not self._estop_lockout.is_set(): |
There was a problem hiding this comment.
_on_safety_estop accepts any envelope on strands/safety/estop that parses as a JSON object — no freshness check on t, no replay cache, no override-proof. Compare with _on_safety_resume (line 951+) which validates all three.
The asymmetry is defensible (estop is fail-safe) but means a captured estop envelope can be replayed indefinitely to whip receivers through repeated lockout cycles: replay estop → operator clears via _resume_lockout → replay estop → … This is a remote DoS surface available to anyone who briefly had ACL access to safety/** and captured one envelope.
Minimum: add the same freshness window check (t not older than RESUME_FRESHNESS_WINDOW_S) so a replay outside the window is rejected, and audit the rejection. The replay cache is a follow-up but the freshness check is one stanza copied from _on_safety_resume.
There was a problem hiding this comment.
Acknowledged. The asymmetry is by design (estop is fail-safe -- a replayed estop is annoying but safe; blocking estops is dangerous), but the freshness check is a cheap tightening that would close the repeated-lockout DoS surface.
Added to PR description section 12 (Follow-ups): "_on_safety_estop freshness window (match _on_safety_resume's replay defence) -- accepted design asymmetry documented in code, tightening tracked for follow-up."
Will implement in R7 if the reviewer considers this a merge blocker rather than a follow-up.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| "subjects": [ | ||
| { | ||
| "id": "any_authenticated_peer", | ||
| "interfaces": _local_interfaces(), |
There was a problem hiding this comment.
The PR body and README describe the default ACL as permissive, but the implementation is default_permission: "deny" plus an any_authenticated_peer subject whose interfaces field is _local_interfaces(). Per Zenoh 1.x semantics (documented at the top of this module), a subject with an empty interfaces list matches nothing — and _local_interfaces() falls back to psutil.net_if_addrs().keys(), which can legitimately be empty in headless containers, restricted network namespaces, or environments where psutil is shimmed.
Result: silent total deny-all on first run in exactly the situation the docstring (line 231-237) said you wanted to avoid.
Two options:
- Switch to
default_permission: "allow"(true permissive) and warn loudly so it matches the docs. - Keep
denybut assertlen(interfaces) > 0indefault_acl()and log an ERROR (or fall back to the static["lo", "lo0", "eth0", ...]list from_local_interfaces'sImportErrorbranch) when the live enumeration is empty.
Either way add a regression test that fails when the subject's interfaces list resolves to [].
| """ | ||
| path_env = os.getenv("STRANDS_MESH_ACL_FILE", "").strip() | ||
| if path_env: | ||
| return _load_acl_file(Path(path_env)) |
There was a problem hiding this comment.
STRANDS_MESH_ACL_FILE is loaded with Path(path_env) and path.is_file() — no O_NOFOLLOW, no symlink check, no path-traversal sanitisation. mesh/audit.py goes to substantial lengths (raw symlink check + O_NOFOLLOW open) to defeat exactly this surface for the audit log; the ACL file is a higher-value target (it gates wire authorisation) and gets none of the same treatment.
AGENTS.md > Review Learnings (#85) > "Sanitize user inputs into XML" applies by analogy: an env-var-controlled path that gates security policy should match the audit log's defences. Suggest:
p = Path(path_env)
if p.is_symlink():
raise ValueError(f"refusing ACL file at {p}: SYMLINK -> {os.readlink(p)!r}")
if not p.is_absolute():
raise ValueError(f"STRANDS_MESH_ACL_FILE must be an absolute path: {path_env!r}")
# then open with O_NOFOLLOW + read inside the open| try: | ||
| raw = sample.payload.to_bytes().decode() | ||
| data = json.loads(raw) | ||
| except Exception: |
There was a problem hiding this comment.
except Exception: here (and at lines 915, 169, 243, 279, 288, 308, 315, …) is what AGENTS.md > Review Learnings (PR #86) > "Exception Clauses Must Be Narrow" forbids for non-recovery paths.
In this specific case the cost is forensic: a malformed estop sample is silently dropped with no log line and no audit record — exactly the signal an operator investigating a denial-of-service incident would want. The block catches everything from AttributeError (programming bugs in the sample API) to MemoryError.
Suggested narrowing for the parse blocks across core.py:
try:
raw = sample.payload.to_bytes().decode()
data = json.loads(raw)
except (AttributeError, UnicodeDecodeError, json.JSONDecodeError) as exc:
logger.debug("[mesh] %s: malformed safety envelope: %s", self.peer_id, exc)
returnThe sweep is mechanical but adds real forensic value to the safety paths.
There was a problem hiding this comment.
Partially addressed in R9 (1164801): _on_safety_estop narrowed to (AttributeError, UnicodeDecodeError, json.JSONDecodeError) along with the freshness/replay defenses. The remaining bare-except occurrences in core.py (presence, cmd parsers, lines 169/243/279/288/308/315/915) are not on the security-sensitive safety path and are tracked as a separate sweep -- mechanically straightforward but out-of-scope for this PR's diff coherence.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| snapshot = _AUDIT_STATE.psk_was_present | ||
| if snapshot is None: | ||
| # First record this process -- record the observed state. | ||
| _AUDIT_STATE.psk_was_present = psk is not None |
There was a problem hiding this comment.
PSK-degrade detection is one-way: the snapshot is set to False if no PSK was present at the first _sign_record call, and any subsequent presence of a PSK silently starts signing without raising or logging. The threat model in the docstring above (a process that briefly clears its env to forge unsigned records, then re-sets the PSK) is closed; the inverse (a process that starts unsigned, then gains a PSK mid-run and starts signing) is not.
This isn't a security regression — the unsigned-then-signed transition produces a chronologically clean audit log — but verify_audit_integrity will report the unsigned prefix as missing_sig records, which a forensic walker may misread as the signed-then-unsigned downgrade case the type was named after. Consider symmetry: log a WARNING (not raise) on the False -> True transition so the reader can attribute the prefix correctly, and document the upgrade path in the class docstring.
…ounds)
Closes the 11 attack vectors identified in the strands_robots mesh
pentest scope. Replaces hand-rolled application-layer crypto (~1700
lines of HMAC envelope, nonce cache, per-peer identity, token bucket)
with Zenoh's built-in transport primitives, then re-verifies every
claim against live two-peer Zenoh sessions in
``tests/mesh/test_redteam_zenoh.py``.
Wire layer (Zenoh built-ins, in ``mesh/_zenoh_config.py`` and
``mesh/_acl_config.py``):
* mTLS via ``transport/link/tls`` with ``enable_mtls: true`` +
``verify_name_on_connect`` + ``close_link_on_expiration``.
A peer presenting a cert from a different CA is rejected at
handshake before any byte reaches the JSON deserialiser.
* Per-key-expression ACL (``access_control``) keyed on cert CN,
default-deny when the operator supplies STRANDS_MESH_ACL_FILE;
permissive built-in default for first-run usability.
* Rate / size caps via ``downsampling`` (default 20 Hz on
``**/cmd``) and ``low_pass_filter`` (16 KiB cmd, 1 MiB camera).
* Discovery via gossip-only by default; multicast scouting OFF.
* Fleet isolation via ``namespace`` (default ``strands``).
Payload-semantic layer (``mesh/security.py``, 467 lines, was 1176):
* ``validate_command()`` action allowlist + per-action bounds.
* ``is_safe_*`` allowlists for ``policy_host``, HF repo prefixes,
policy_type / policy_provider, and server_address.
LLM tool surface (``tools/robot_mesh.py``):
* HITL interrupt for ``emergency_stop`` and ``broadcast`` (operator
response delivered out-of-band of the LLM tool-arg flow so a
prompt injection cannot smuggle approval).
* Per-action sliding-window rate limit; declined approvals do NOT
consume a slot.
* Client-side ``validate_command`` on Mesh.send / Mesh.broadcast so
programmatic callers go through the same gate.
Audit log (``mesh/audit.py``):
* Per-record HMAC-SHA256 under STRANDS_MESH_AUDIT_PSK.
* Per-peer monotonic seq counters with sidecar persistence and
cross-process fcntl.flock.
* O_NOFOLLOW + symlink reject on the audit file and seq sidecar.
* Bounded rotation; verify_audit_integrity reads rotated copies in
chronological order so gap detection spans the full retained
history (regression test:
``test_read_audit_log_spans_rotated_files``).
* AuditPSKDegradedError raised when the PSK is unset mid-run.
IoT path (``mesh/iot/provision.py``, independent of Zenoh refactor):
* AWS Root CA1 SHA-256 pin, accepting a tuple of pins for forward-
compatible CA rotation. Break-glass disable never applies to the
on-disk re-use path.
* Per-receive timeout via custom HTTPSHandler; thing-name regex
tightened.
Zenoh 1.x quirks discovered during red-team testing and fixed:
1. ``low_pass_filter`` requires a non-empty ``interfaces`` allowlist
(silent no-op without it). Builder enumerates every local NIC.
2. ``<namespace>/*/cmd`` patterns never match because the matcher
is checked against the user-side (un-prefixed) key. All globs
now use ``**/cmd`` form.
3. ``access_control`` requires ``enabled: true`` (silent no-op
without it). ACL loader hard-rejects files that omit it.
4. ``cert_common_names`` matches LITERAL CNs only (no glob support
in 1.x). Default ACL is permissive; per-role enforcement
requires operator-supplied STRANDS_MESH_ACL_FILE with literal
CN enumeration. ``examples/mesh_acl_example.json5`` is the
canonical template.
Tests:
* 644 mesh tests pass (2 documented skips for Zenoh 1.x ACL fanout
corners that need more research).
* New: ``tests/mesh/_pki.py`` (ephemeral CA + leaf cert helper),
``test_redteam_zenoh.py`` (11 live two-peer tests covering
namespace isolation, downsampling / low_pass_filter enforcement,
mTLS handshake rejection of rogue CA, ACL drop of unknown CN),
``test_validate_command.py``, ``test_zenoh_config.py``,
``test_acl_config.py``, ``test_session_config.py``,
``test_audit_integrity.py``, ``test_iot_ca_pin.py``,
``test_iot_policy_scope.py``, ``test_camera_acl.py``,
``test_robot_mesh_security.py``, ``test_pentest_findings.py``,
``test_bridge_dedup.py``.
* Lint clean across 201 source files (ruff + mypy strict).
* Full repo: 2594 passed, 31 skipped, zero failures.
Migration: existing deployments on STRANDS_MESH_PSK provision a CA +
per-peer cert chain (or reuse AWS IoT certs already in
~/.strands_robots/iot/), point STRANDS_MESH_TLS_{CA,CERT,KEY} at them,
and drop the legacy PSK env vars. Dev / lab environments without PKI
run STRANDS_MESH_AUTH_MODE=none (mesh logs WARNING at session open).
a67e962 to
d9043b5
Compare
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
Large, focused security-hardening PR that replaces ~1700 lines of hand-rolled application-layer crypto with Zenoh's built-in transport primitives (mTLS at transport/link/tls, key-expr ACL, downsampling, low_pass_filter, namespace routing) plus a slimmer 467-line payload-validation module. Audit log gains per-record HMAC, per-peer monotonic seq with sidecar+flock, and bounded rotation. Bridge transport gains exact-match filtering with an explicit prefix opt-in. Test surface is substantial (13 new files, two-peer live Zenoh red-team tests, regression tests pinned for each Zenoh-1.x quirk).
Overall the architecture is sound and the documentation is unusually candid about what is not covered (default ACL is permissive, two skipped ACL fanout tests, ACL drops not in mesh/audit.py). The concerns below are mostly contract-level cleanups and one symmetry issue, not blockers.
What's good
- Defence-in-depth: wire (mTLS+ACL+caps), payload (
validate_command, host/repo allowlists), tool (HITL+rate-limit), audit (HMAC+seq+rotation) layers all distinct and individually testable. - Each Zenoh-1.x quirk discovered (low_pass_filter
interfaces,**/cmdglob form,enabled: true, literal-CN matching) is pinned to a regression test that fails on pre-fix code. - AGENTS.md "Public API Hygiene" applied:
camera_offload._publish_cameras_once_with_offloadnow goes throughMesh.publishrather thantransport.put. STRANDS_MESH_DISABLE_CA_PINdeliberately does not apply to the on-disk re-use path -- exactly the right asymmetry.- CHANGELOG entry is comprehensive and lists all removed env vars / migration steps.
- Constant-time compare on the override code in both
_resume_lockoutand_on_safety_resume; replay cache is bounded with explicit eviction.
Concerns
- No regression test for
AuditPSKDegradedError. The class is exported inaudit.__all__, the raise path lives in_sign_record, and the docstring describes a specific attack scenario (env briefly cleared mid-run). Per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes", this needs a test that sets the PSK, writes one record, deletes the env var, attempts a secondlog_safety_event, and asserts the record was dropped (and that the error was logged). Otherwise the next refactor that touches_sign_recordsilently reintroduces the silent-degrade path. - No documented happy-path ACL fanout test. Two
TestACLEnforcementcases are skipped with a Zenoh-1.x evaluation-order rationale. The negative case (test_unknown_cn_dropped_by_default_deny) is covered, but the README's role-separation table claimsrobot-*cannot publish oncmd-- and this is exactly the path that does not have a live regression test. Worth surfacing as a known coverage gap in the README "Threat-vector coverage" matrix, not just the PR description. - README env-var matrix lists
STRANDS_MESH_AUDIT_PSKbut notSTRANDS_MESH_RESUME_FRESHNESS_S,STRANDS_MESH_RESUME_FORWARD_SKEW_S,STRANDS_MESH_RESUME_REPLAY_CACHE_MAX. Per AGENTS.md > Review Learnings (#86) > "Document every env var in README.md", these tunables introduced bycore.py's resume hardening should be in the table.
Verification suggestions
# Sanity: every reviewed fix actually has a test that fails on pre-fix code.
hatch run test tests/mesh/ -v --tb=short
# Spot-check: the live two-peer Zenoh tests need eclipse-zenoh + cryptography.
pip install eclipse-zenoh cryptography psutil
hatch run test tests/mesh/test_redteam_zenoh.py -v
# Round-trip: confirm validate_audit_integrity reports `psk_present=False`
# after STRANDS_MESH_AUDIT_PSK is unset mid-run (manual run -- no automated
# coverage today; see inline note on audit.py).
# Bridge filter: confirm exact-match closes the prefix-bypass attack.
hatch run test tests/mesh/test_pentest_findings.py::test_p4_a2_bridge_topic_filter_exact_match_only -v| #: when their key-expressions collide. The default below tracks the | ||
| #: hardcoded `strands/...` topic prefix so the built-in ACL key_exprs | ||
| #: match the wire keys exactly. | ||
| DEFAULT_NAMESPACE: str = "strands" |
There was a problem hiding this comment.
Documentation drift versus the actual default. The module docstring (line 23) says the default namespace is strands_robots, but DEFAULT_NAMESPACE is "strands". The hardcoded application-layer prefix in core.py / sensors.py is also strands/..., so the code is internally consistent -- it's the docstring that is wrong.
Separately, the docstring at lines 64-66 references :func:\default_acl_block`with semantics described as "default-deny". The actual function in_acl_config.pyis nameddefault_acl()(no_block`) and ships permissive by design (any CA-signed peer can publish/subscribe on any key, per §8 of the PR description). Two stale references in one docstring -- both should be updated so future readers don't have to cross-check the implementation to learn what the default actually does.
AGENTS.md > Review Learnings (#85) > "Match docstrings to semantics".
There was a problem hiding this comment.
Acknowledged -- both references are stale:
- Module docstring line 23 says
strands_robotsbutDEFAULT_NAMESPACEis"strands"(code is correct, docstring is wrong). - Docstring references
:func:\default_acl_block`but the actual function isdefault_acl()in_acl_config.py`.
Will fix in R7 as a surgical docstring-only commit. Tracking.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
There was a problem hiding this comment.
Fixed in R9 (1164801). Both stale references corrected:
STRANDS_MESH_NAMESPACEdefault:strands_robots->strands.- ACL reference:
default_acl_block(default-deny) ->default_acl(permissive, with rationale linking to CHANGELOG Section 8).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
There was a problem hiding this comment.
Fixed in R9 (1164801). Both stale references corrected: namespace default now reads strands (matched to DEFAULT_NAMESPACE), and the default_acl_block (default-deny) reference is now default_acl (permissive). Module docstring lines 23 and 64-66 both updated.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| cmd = _security.validate_command(cmd) | ||
| except _security.ValidationError as exc: | ||
| logger.warning("[mesh] %s: broadcast rejected client-side: %s", self.peer_id, exc) | ||
| return [] |
There was a problem hiding this comment.
Silent failure on validation. When validate_command raises in broadcast(), the method logs a WARNING and returns [] -- which is indistinguishable from "broadcast succeeded but no peer responded within timeout". A programmatic caller (test, third-party script, autonomy stack) cannot tell that its command was rejected at the security boundary versus simply not heard.
The symmetric send() method (line 1083-1085) returns {"status": "error", "error": f"validation: {exc}"} on the same failure, which IS observable. Suggest either: (a) widening broadcast()'s return type to a discriminated dict ({"status": "error", "error": ...} vs {"status": "ok", "responses": [...]}), or (b) raising ValidationError to the caller and documenting that [] strictly means "no peers replied".
AGENTS.md > Review Learnings (#86) > "Reject silently-dropped kwargs" / "No silent defaults on error".
| return | ||
| if not isinstance(data, dict): | ||
| return | ||
| if not self._estop_lockout.is_set(): |
There was a problem hiding this comment.
Authorisation asymmetry between estop and resume that the threat-model table glosses over. _on_safety_resume enforces a second factor (HMAC(STRANDS_MESH_OVERRIDE_CODE, proof_nonce), replay cache, freshness window). _on_safety_estop enforces only the wire-layer ACL.
In the default permissive ACL (_acl_config.default_acl() -- which the PR explicitly ships and §8 acknowledges is the path most operators will take on day one), any peer with a CA-signed cert can publish on strands/safety/estop and engage the lockout fleet-wide. There is no override-proof check, no rate limit on the estop topic, and no audit-time issuer verification beyond the peer_id field in the payload (which the issuer chose freely).
This is intentional -- e-stop should be easy to engage, hard to clear -- but the README "Threat-vector coverage" matrix lists "Valid op-* cert tries to hijack another operator's RPC turn_id" as Mitigated and does not call out that with the default ACL, the same valid op-* cert (or any cert) can trigger fleet-wide lockout indefinitely (resume requires the override code that the issuer doesn't need to know).
Not a code change request -- a documentation completeness one. Either add a row to the threat-vector matrix saying "Cert holder triggers nuisance e-stop -- accepted, mitigation is operator process (revoke cert) plus the role-ACL example", or add a per-publisher rate cap on safety/estop analogous to the cmd-rate downsampling block.
There was a problem hiding this comment.
Half-addressed in R9 (1164801): replay defense now in place, so the captured-envelope replay vector is closed. Doc completeness for the threat-vector matrix (rate-cap on safety/estop topic, and explicit row for nuisance e-stop by valid op-* cert) tracked as follow-up #198 is the right home -- handler now matches resume semantics, the operational rate-cap is a separate hardening.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| ) | ||
| if psk is None: | ||
| return None | ||
| return hmac.new(psk, _canonical_bytes(record), hashlib.sha256).hexdigest() |
There was a problem hiding this comment.
The AuditPSKDegradedError class is exported in __all__ and the docstring describes a specific attack scenario (process briefly clears env var, writes unsigned forgeries, re-sets env var to evade detection). But there is no regression test that covers this code path: tests/mesh/test_audit_integrity.py only resets _AUDIT_STATE.psk_was_present between cases (lines 31, 35), it does not assert the raise behaviour.
Per AGENTS.md > Review Learnings (#85) > "Pin every reviewed fix with a regression test... Otherwise the next refactor silently reintroduces the bug", this needs a test like:
def test_psk_unset_mid_run_drops_record(monkeypatch, tmp_path, ...):
monkeypatch.setenv("STRANDS_MESH_AUDIT_PSK", "k")
audit.log_safety_event("e1", "peer", {})
monkeypatch.delenv("STRANDS_MESH_AUDIT_PSK")
with caplog.at_level("ERROR", logger="strands_robots.mesh.audit"):
audit.log_safety_event("e2", "peer", {}) # must be dropped
records = list(audit.read_audit_log())
assert len(records) == 1 # second was dropped
assert any("AuditPSKDegradedError" in r.message or "PSK" in r.message for r in caplog.records)Without it, a future refactor that moves the PSK snapshot logic (or accidentally inverts the snapshot comparison) silently reintroduces the silent-downgrade path, and the test suite stays green.
There was a problem hiding this comment.
Fixed in f75318e. Added test_psk_degrade_drops_record in tests/mesh/test_audit_integrity.py:
- Sets
STRANDS_MESH_AUDIT_PSK, writes one signed record. - Unsets the PSK mid-run.
- Attempts a second write.
- Asserts the second record was dropped (not written unsigned).
- Asserts an ERROR was logged mentioning PSK degradation.
The test follows the exact scenario described in the AuditPSKDegradedError docstring and will fail if the PSK snapshot logic is accidentally inverted or removed.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| except OSError as oe: | ||
| # ELOOP under O_NOFOLLOW = symlink -> retry without it | ||
| # is NOT what we want; just re-raise. | ||
| raise oe |
There was a problem hiding this comment.
Dead try/except clause -- the body just re-raises the caught exception:
flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
nofollow = getattr(os, "O_NOFOLLOW", 0)
try:
fd = os.open(path, flags | nofollow, 0o600)
except OSError as oe:
# ELOOP under O_NOFOLLOW = symlink -> retry without it
# is NOT what we want; just re-raise.
raise oeA bare os.open(...) (no wrapper) would have identical semantics. The comment is informative for readers, but the try/except OSError as oe: raise oe is noise that lint tools will flag (B904 / TRY201). Either drop the wrapper and keep the comment as an # NOTE: line above the bare call, or actually do something distinct in the except block (e.g. translate ELOOP into a domain-specific message before re-raising).
There was a problem hiding this comment.
Fixed in R10 (35c5758). Replaced the no-op try/except OSError as oe: raise oe wrapper with a bare os.open(...) call and a # NOTE: comment preserving the explanatory text. The ELOOP-under-O_NOFOLLOW behaviour is now the bare propagation (intended -- a symlinked audit log is the threat documented at the top of this module).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
…uplicate import Pin the _sign_record PSK-downgrade code path with a regression test: test_psk_degrade_drops_record verifies that when STRANDS_MESH_AUDIT_PSK is set at first write then unset mid-run, the second write is dropped (not written unsigned) and an ERROR is logged. Also removes the duplicate 'import json' at line 257 flagged by CodeQL (already imported at line 16). Addresses review thread on audit.py:568 (AuditPSKDegradedError untested) and CodeQL thread on test_audit_integrity.py:257 (duplicate import).
… + CHANGELOG Add STRANDS_MESH_RESUME_FRESHNESS_S (default 60), STRANDS_MESH_RESUME_FORWARD_SKEW_S (default 5), and STRANDS_MESH_RESUME_REPLAY_CACHE_MAX (default 4096) to both the README env-var matrix and the CHANGELOG new-env-vars table. These tunables were introduced by core.py resume hardening but were missing from documentation. Defaults now match across code, README, and CHANGELOG (three places, one truth). Addresses review thread on audit.py (README env-var matrix gaps).
… duplicate imports) - Remove unused 'import logging' and 'logger' from security.py (CodeQL: unused global variable). - Add explanatory comments to 'except OSError: pass' in test_pentest_findings.py (symlink defence tests -- the pass is intentional; the assertion below is what matters). - Add explanatory comment to 'except Exception: pass' in core.py:1014 (audit publish is best-effort, must never block safety path). - Add noqa comment to local import in test_resume_replay.py (needed for monkeypatch access to module-level constant). - Fix import sort in security.py after logging removal. Addresses CodeQL threads on security.py:53, test_pentest_findings.py:149, test_pentest_findings.py:179, core.py:1014, test_resume_replay.py:154, test_robot_mesh_security.py:25.
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
Replaces the hand-rolled application-layer crypto envelope (~1700 LOC: HMAC, nonce cache, peer identity, token bucket) with Zenoh 1.x's built-in transport primitives (mTLS, key-expression ACL, downsampling, low_pass_filter, namespace, gossip-only scouting), and pares mesh/security.py down to a payload-validation surface (allowlist of actions / hosts / HF repos / policy types). Audit log gains per-record HMAC, per-peer monotonic seq with fcntl.flock cross-process coordination, bounded rotation, and a tamper-aware verifier. CA pinning on the IoT path is hardened to a tuple-of-pins with O_NOFOLLOW + raw-hash compare on disk re-use. The diff is large (+8187 / -151) but the threat-vector matrix in the description maps cleanly onto pinned tests, and the Zenoh-1.x quirks the PR documents (lpf interfaces required, ACL enabled: true required, **/cmd glob, literal CN match) match what the reproductions in tests/mesh/test_redteam_zenoh.py exercise.
What's good
- Threat model is explicit and the test surface is genuinely impressive (live two-peer Zenoh sessions, not just builder-config introspection).
- Each of the four Zenoh-1.x quirks the PR claims to have empirically discovered has a regression test that fails on pre-fix config. That's exactly the AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes" rule applied.
- HITL flow correctly splits
_rate_limit_checkfrom_rate_limit_recordso a declined operator approval does NOT consume a rate-limit slot. Subtle and correct. _dispatchraisesLockoutErrorinstead of returning a generic dict so_exec_cmdcan audit the lockout rejection symmetrically with the validation rejection path.- Audit log's tamper-aware cursor (records that fail signature verification do NOT advance
last_seq_by_peer) is the right shape to defeat the natural "omit the sig field and forge a seq jump" attack. _resume_lockoutreturning a single generic error dict on every failure path so a remote prober cannot use response-shape oracles to map lockout state — good R5-4 hardening.- CA pin re-use path explicitly ignores
STRANDS_MESH_DISABLE_CA_PINso a rogue CA from a prior compromised run cannot be silently re-trusted. Good narrow scoping of the break-glass.
Concerns
- Scope is contained, but breaking-change density is high. The CHANGELOG entry is thorough, but the migration sketch (provision PKI, swap env vars, drop legacy PSK) lands as a single Unreleased entry — no semver / version bump in the diff. For pre-1.0 this is fine, but operators picking up the next tag will need a more prominent UPGRADING note since
STRANDS_MESH_PSKetc. are silently ignored after this lands. - Permissive default ACL combined with
STRANDS_MESH_AUTH_MODE=none(dev mode) yields zero wire authentication AND zero authorization. The session-open WARNING is a single log line that a long-running daemon emits once and forgets. Worth a follow-up to either re-emit periodically when running unauthenticated or refuseauth_mode=noneifSTRANDS_MESH_FORCE_DEV_MODE(or similar opt-in) is not also set. - Documentation drift candidates worth a sweep: the docstrings on
Mesh.publish(formerly_put_signed) still hint that signing once happened here; the audit module docstring referencesSTRANDS_MESH_PEER_RATE(removed env var) in the R4-7 amplification comment ataudit.py:136. - Two skipped ACL fanout tests (
test_robot_cert_cannot_publish_on_cmd,test_op_cert_can_publish_on_cmd) are documented as needing more research into Zenoh 1.x egress evaluation, with the security-critical case (rogue CN dropped) covered bytest_unknown_cn_dropped_by_default_deny. This is OK but worth filing a tracking issue so it doesn't rot. - ACL deny events do not surface in
mesh/audit.py— they live in Zenoh's own log output. The PR's own follow-up list flags this; just want to confirm the operator-facing docs don't oversell whatverify_audit_integrity()will see.
Verification suggestions
# Sanity-check the Zenoh 1.x quirk reproductions actually fail on pre-fix config:
hatch run pytest tests/mesh/test_redteam_zenoh.py -v
# Confirm the new audit-integrity guarantees survive PSK degrade mid-run:
hatch run pytest tests/mesh/test_audit_integrity.py -v -k psk_degraded
# Smoke the bridge dedup with a clock-jump:
# (manually) bridge a topic, fire an envelope, set system time backwards 1h,
# fire the same envelope -- second one should NOT dispatch.
# Run the dev-mode warning path once to confirm it actually fires:
STRANDS_MESH_AUTH_MODE=none python -c "from strands_robots.mesh import init_mesh; ..."| try: | ||
| import psutil # type: ignore[import-not-found] | ||
|
|
||
| return sorted(psutil.net_if_addrs().keys()) |
There was a problem hiding this comment.
_local_interfaces() can return an empty list -- on a containerised CI runner with namespaced networking, in a chroot, or when psutil enumerates only interfaces the caller doesn't have permission to see. The PR description even calls out this exact Zenoh 1.x quirk ("low_pass_filter was a silent no-op... requires a non-empty interfaces allowlist"), but the function silently returns [] in those edge cases and the resulting low_pass_filter_block and the permissive default ACL subjects[0].interfaces both become a silent no-op or a silent total-deny.
Suggest raising RuntimeError here when the resolved list is empty (after honouring the env override), with a message pointing operators at STRANDS_MESH_FILTER_INTERFACES. Per AGENTS.md > Key Conventions #5: "raise on fatal errors -- never warn-and-continue if the system will behave unexpectedly."
| cmd = data.get("command", data) | ||
| if isinstance(cmd, str): | ||
| cmd = {"action": "execute", "instruction": cmd} | ||
| cmd = {"action": "execute", "instruction": cmd, "policy_provider": "mock"} |
There was a problem hiding this comment.
This string-payload coercion silently injects policy_provider="mock" for any caller that publishes a bare-string command. validate_command (security.py:478) explicitly requires policy_provider to be set explicitly and refuses silent defaults at the security boundary, but this code path papers over that requirement before validation runs.
The symmetric concern is in _dispatch (lines 749-765, unchanged context but worth flagging here): even though _exec_cmd reassigns cmd to the validated copy at line 621, the dispatch path then re-extracts via cmd.get("policy_provider", "mock") / cmd.get("policy_host", "localhost") with permissive fallbacks. A future refactor that introduces a code path reaching _dispatch without going through _exec_cmd (a new transport, a renamed handler, an in-process caller) silently restores insecure defaults.
Suggest: drop the auto-fill here so a bare-string command fails validation loudly (matching the security-boundary contract), and switch _dispatch to subscript-access (cmd["policy_provider"]) so a missing key surfaces as KeyError rather than degrading to "mock". Per AGENTS.md > Review Learnings (#86) > "Reject silently-dropped kwargs" and Key Conventions #6 "No silent defaults on error."
| timeout -- useful for telemetry (which peers acknowledged before the | ||
| stop fanned out). | ||
| """ | ||
| self._estop_lockout.set() |
There was a problem hiding this comment.
Partial-failure ordering: _estop_lockout.set() is called before self.broadcast({"action": "stop"}, ...) (line 1280) and before publish("strands/safety/estop", ...) (line 1281). If broadcast or publish raises (transport down, ACL drop, queue full), the local peer ends up locked out while the rest of the fleet never received the stop signal -- the opposite of the intended safety property ("every peer locks down on a fleet-wide e-stop").
The local lockout is the safety-correct ordering; the issue is that transport errors get swallowed and return responses reports an empty list as if the fleet ack'd nothing, indistinguishable from "all peers offline". Consider wrapping the broadcast/publish so that on transport error the return value carries an explicit local_only=True flag (and the caller / audit record can react). Per AGENTS.md > Review Learnings (#86) > "Resource Cleanup on Partial Failure".
| if ident is None: | ||
| return False # nothing to dedup against -- pass through | ||
| cache_key = (key, ident) | ||
| now = time.time() |
There was a problem hiding this comment.
time.time() is wall-clock and can move backwards (NTP step, manual date -s, VM resume from snapshot). When that happens the cutoff = now - self._ttl comparison at lines 315-318 retains entries indefinitely (their stored t is in the future), and the duplicate-detection at line 327 flips its meaning (a stale entry's now - seen_ts becomes negative, which is <= ttl -- so legitimate retransmits look like dupes).
tools/robot_mesh.py:_rate_limit_check correctly uses time.monotonic() for the same kind of sliding-window cache. Suggest the same here for _seen timestamps -- monotonic is unaffected by NTP / clock changes and is the standard primitive for elapsed-time caches. The same applies to core.py::_resume_replay_cache (also wall-clock, also vulnerable to NTP step).
| in_string = False | ||
| i += 1 | ||
| continue | ||
| if ch in ('"', "'"): |
There was a problem hiding this comment.
The JSON5-lite preprocessor accepts single-quoted strings into in_string state at lines 75 and 156, but the eventual json.loads at _load_acl_file:194 does NOT accept single-quoted strings -- so an operator who copy-pastes a canonical JSON5 ACL using 'robot-a' (a documented JSON5 feature) gets a confusing JSONDecodeError: Expecting value rather than "single quotes not supported, use double quotes."
The shipped examples/mesh_acl_example.json5 happens to use only double quotes so this never bites in the canonical path, but it's a foot-gun for any operator deviation. Suggest either (a) translating single quotes to double quotes in the preprocessor (with proper escape handling), or (b) raising a targeted ValueError("single-quoted strings not supported in strands JSON5; use double quotes") when a ' is seen outside an already-double-quoted string. Same for unquoted hex / numeric literals if anyone tries them.
Low severity, but worth fixing now while the example file is the only one in the wild.
…udit.py Reviewer (PR strands-labs#195 R7 review) flagged that the R4-7 amplification comment in audit.py around line 136 references a non-existent env var STRANDS_MESH_PEER_RATE. The application-layer per-peer token bucket was deleted when Zenoh transport-layer downsampling took over rate-capping in the security-hardening rewrite. Replace the stale phrasing with the actual mechanism: Zenoh ``downsampling`` keyed on ``STRANDS_MESH_CMD_RATE_HZ`` (default 20 Hz per-key-expression on ``**/cmd`` and ``**/broadcast``). Comment-only diff, no behaviour change. Verified: ``grep STRANDS_MESH_PEER_RATE`` returns empty across strands_robots/, tests/, and docs/. Addresses PR strands-labs#195 review thread on docstring-drift bullet 3.
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This PR replaces ~1700 lines of hand-rolled application-layer crypto in the mesh layer (HMAC envelope, nonce cache, peer identity, token bucket) with Zenoh transport built-ins (mTLS, key-expression ACL, downsampling, low_pass_filter, namespace) plus a smaller payload-validation surface (mesh/security.py). It adds a per-record-HMAC, per-peer-seq audit log with rotation and a tamper-aware reader; tightens IoT provisioning with a CA SHA-256 pin tuple and per-socket timeouts; and routes fleet-wide LLM tool actions through a Strands tool_context.interrupt. Test coverage looks substantial: 13 new test files, including a live two-peer Zenoh red-team suite that pins the four documented Zenoh-1.x quirks.
What's good
- Defence-in-depth layering is real: transport (mTLS+ACL) gates membership, payload validator gates contents, tool layer adds HITL+rate-limit, audit log is independent. Each layer has its own regression tests.
- Zenoh-1.x quirks (
enabled: truerequirement, literal-CN match,**/cmdglob form, non-emptyinterfaces) are pinned by tests against a live wheel — this is the kind of empirical hardening the AGENTS.md review-learnings call out. - Audit log integrity work is thorough:
O_NOFOLLOW, symlink reject, sidecar+flock for cross-process seq,AuditPSKDegradedErrorfor mid-run env-unset, tamper-aware cursor that won't advance past forged records,read_audit_logwalks rotated copies. _rate_limit_check/_rate_limit_recordcorrectly split so a declined HITL doesn't burn a slot — the docstring explains the safety rationale.Mesh.sendrejects theBROADCAST_RESPONDERsentinel and NUL bytes intarget(R4-5), and_on_responsebinds responder identity for point-to-point — closes the response-hijack surface.- IoT CA pin is correctly tuple-typed (forward-compatible rotation) and the break-glass override is explicitly documented as not applying to the on-disk re-use path.
Concerns
- Scope of
STRANDS_MESH_AUTH_MODE=none. Shipping a one-env-var disable for both mTLS and ACL in the same release that names this PR "security hardening" is risky. TheWARNINGat session-open is good, but a config typo or a forgotten env var on one peer in a fleet silently lets an unauthenticated peer in. Consider gating with a second env (e.g.STRANDS_MESH_ALLOW_UNSAFE=true) or refusing to start unlessSTRANDS_MESH_DEV_MODE=trueis also set. - Permissive default ACL is documented but still operationally surprising. A deployment with
AUTH_MODE=mtlsand a CA that signs both robot- and operator- certs gets zero per-role separation until the operator drops inSTRANDS_MESH_ACL_FILE. The session-open WARNING helps, but the README threat-vector table reads as if role separation is enforced when in fact it requires opt-in. Consider adding the "requires operator-supplied ACL file" caveat directly in the threat-vector table cells, not just the surrounding prose. - Asymmetric estop/resume defences are tracked as follow-up but already exploitable.
_on_safety_estopaccepts any payload from any operator-class peer with no freshness or proof_nonce, so a captured envelope (or a single misbehaving operator-class peer) replays into fleet-wide DoS. Section 12 of the PR description acknowledges this but no issue is filed yet that I could find — please file one before merge so it doesn't get lost. - Public API hygiene violation in user-facing docs. README references
Mesh._resume_lockout(code)as the recommended way to clear the lockout — direct AGENTS.md > Review Learnings (#86) > "Public API Hygiene" violation. Promote the method or expose a wrapper before merge. - Test gap on the two ACL-fanout cases.
test_robot_cert_cannot_publish_on_cmdandtest_op_cert_can_publish_on_cmdare skipped pending Zenoh-1.x research. Until those are unskipped, the role-separation claim in §3 of the PR description is verified only via the negativetest_unknown_cn_dropped_by_default_denycase. That's the security-critical vector, true, but the positive case asserts that the ACL doesn't over-block and is worth pinning. _on_safety_estop/_on_safety_resumeJSON parse silently drops on any exception (except Exception: return). Bareexcept Exception:is forbidden by AGENTS.md > Review Learnings (#86) > "Exception Clauses Must Be Narrow". Narrow to(json.JSONDecodeError, UnicodeDecodeError, AttributeError).- Doc drift around
STRANDS_MESH_PSK. CHANGELOG says PSK is removed, butSTRANDS_MESH_AUDIT_PSK(a separate variable) still exists. The CHANGELOG "Migration" section bullet 4 says to drop "allSTRANDS_MESH_PSK/STRANDS_MESH_PEER_KEY*/ ... env vars" — easy for an operator to read this and dropSTRANDS_MESH_AUDIT_PSKtoo, silently downgrading audit signing. Spell out thatSTRANDS_MESH_AUDIT_PSKis unaffected.
Verification suggestions
- Spot-check the negative-path coverage:
pytest tests/mesh/test_redteam_zenoh.py -v -k "rogue_ca or unknown_cn or oversized or high_rate or distinct_namespace"and confirm none are skipped. - Quick fuzz of the JSON5-lite preprocessor against a real
json5package: feed every file underexamples/plus a handful of pathological inputs (mismatched quotes, nested comments, single-quoted strings with embedded\\") and diff the parsed output againstjson5.loads. The loader's regex-based key quoter is known fragile (line 162 position heuristic). - Confirm
STRANDS_MESH_ACL_FILErejection of the missing-enabled: truecase fails closed by hand:STRANDS_MESH_ACL_FILE=/tmp/bad.json5 STRANDS_MESH_AUTH_MODE=mtls python -c "from strands_robots.mesh._acl_config import resolve_acl; print(resolve_acl('strands'))"with a file that omits the field. - Audit-log rotation under load: write 200k events with a 1 MiB cap and 3-file rotation, then run
verify_audit_integrityto make sure the per-peer cursor walks the rotated set without spurious gaps.
| return | ||
| if not isinstance(data, dict): | ||
| return | ||
| if not self._estop_lockout.is_set(): |
There was a problem hiding this comment.
_on_safety_estop engages the lockout on any decode-able payload from any peer that the ACL admits to safety/estop. Compared with _on_safety_resume (which has freshness window + forward-skew + per-receiver (issuer, proof_nonce) replay cache), this is asymmetric in the wrong direction: an attacker who captures a single estop envelope (or a single mis-behaving operator-* cert) can replay it indefinitely to keep the fleet locked out.
Section 12 of the PR description acknowledges this and tags it for follow-up, but I could not find an issue filed against it on the project board. Please file one before merge (or, ideally, mirror the freshness + proof_nonce + replay-cache pattern from _on_safety_resume here in this PR — the structure is already there). At minimum: a t freshness check is one-sided cheap and closes the captured-envelope-replay surface.
Also AGENTS.md > Review Learnings (#86) > "Exception Clauses Must Be Narrow": the try / except Exception: return at line 870-874 should be except (json.JSONDecodeError, UnicodeDecodeError, AttributeError) (same applies to _on_safety_resume, _on_presence, _on_cmd).
There was a problem hiding this comment.
Reversing the R8 deferral. Implemented in R9 (1164801) instead of follow-up. Same fix shape as _on_safety_resume: freshness + forward-skew + per-receiver replay cache keyed on (issuer_peer_id, t). Narrowed except Exception to (AttributeError, UnicodeDecodeError, json.JSONDecodeError) per the same review thread. Pinned by tests/mesh/test_estop_replay.py.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| | `STRANDS_MESH_CMD_RATE_HZ` | Per-key-expression frequency cap for `cmd` topics enforced via `downsampling`. Floods are dropped at the transport before reaching the deserialiser. | `20.0` | | ||
| | `STRANDS_MESH_POLICY_HOST_ALLOW` | Comma-separated host/CIDR list extending the default loopback-only `policy_host` allowlist for VLA inference targets (e.g. `vla.internal,10.0.0.0/24`). | unset | | ||
| | `STRANDS_MESH_AUDIT_PSK` | Separate PSK for HMAC-signing audit-log records. Independent of the wire PSK so audit signing can rotate on its own schedule. Unset = audit records carry no signature (`verify_audit_integrity` reports them as unverifiable). | unset | | ||
| | `STRANDS_MESH_OVERRIDE_CODE` | Operator code that clears the local emergency-stop lockout via `Mesh._resume_lockout(code)`. Compared in constant time. Unset = lockout cannot be cleared remotely. | unset | |
There was a problem hiding this comment.
AGENTS.md > Review Learnings (#86) > "Public API Hygiene": Mesh._resume_lockout(code) is a private method (leading underscore) but is recommended here in user-facing documentation as the canonical operator entry point. Same problem at line 670 (Mesh.emergency_stop() and Mesh._resume_lockout(code) use the operator override code...).
Resolution per the convention: rename _resume_lockout to resume_lockout (and update CHANGELOG.md L89) before merging, or expose a public wrapper Mesh.resume_lockout(code) that delegates. Keeping the private name in user-facing docs locks the project into a private API as a maintenance contract.
| if not isinstance(source, str) or not source: | ||
| raise ValidationError("teleop_receive requires non-empty source_peer_id") | ||
|
|
||
| return out |
There was a problem hiding this comment.
validate_command covers execute / start / step / teleop_receive but has no clause for resume, so its override_code field is passed through unsanitised. In core.py:_dispatch line 732 the dispatcher does self._resume_lockout(cmd.get("override_code", "")), and _resume_lockout then calls (override_code or "").strip() (line 1342). When a peer sends {"action": "resume", "override_code": ["x"]} (or any non-string truthy value), ["x"].strip() raises AttributeError, which propagates back through _exec_cmd and is caught as a generic dispatch error.
Not a privilege-escalation but a single-RPC DoS path that bypasses the rest of the validator's defensive shape. Add a resume block to validate_command that bounds override_code to a string of reasonable length (say <= 256 chars), or coerce it to str defensively in _resume_lockout.
There was a problem hiding this comment.
Fixed in R10 (5bf1018). Added resume clause to validate_command: type-checks override_code as str and bounds length to <= 256 chars. Pinned by tests/mesh/test_validate_command.py::TestValidateCommandResume (8 tests covering string/empty/missing/list/dict/int/oversized/at-limit cases).
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali).
| # handshake (any CA-signed peer is in); ACL-based per-CN role | ||
| # separation requires an operator-supplied STRANDS_MESH_ACL_FILE. | ||
| # See examples/mesh_acl_example.json5 for the canonical role template. | ||
| "default_permission": "deny", |
There was a problem hiding this comment.
The default ACL is documented as permissive-by-design (any CA-signed peer can publish/subscribe on any key) and the rationale ("silent total outage on first run if default-deny") is sound. But composing this with STRANDS_MESH_AUTH_MODE=mtls produces a configuration where the README's threat-vector table claims role separation that the running config does not actually enforce.
The session-open WARNING in session.py:298-304 mitigates this for operators who read logs, but the threat-vector table cells in README.md (lines 624-635) describing robot-* / op-* enforcement read as guarantees, not as conditional-on-STRANDS_MESH_ACL_FILE-being-set claims. Suggest editing those table cells to spell out the dependency in-line ("when STRANDS_MESH_ACL_FILE is set, ...") so an operator skimming the threat model doesn't mis-attribute defences. The 6th row already does this — extend the pattern to rows 4 and 5.
| """ | ||
| raw = os.getenv("STRANDS_MESH_AUTH_MODE", "mtls").strip().lower() | ||
| if raw not in ("mtls", "none"): | ||
| raise ValueError(f"STRANDS_MESH_AUTH_MODE={raw!r} not supported (expected 'mtls' or 'none')") |
There was a problem hiding this comment.
STRANDS_MESH_AUTH_MODE=none disables both the mTLS terminator and the ACL block — a single env var that turns the entire wire-layer security model off. Shipping this in the same release described as "security hardening" without a second confirmation is uncomfortable: a typo, a forgotten env-var on one peer, or a CI test fixture leaking into production all silently disable wire auth.
Suggest gating with a second confirmation (e.g. require both STRANDS_MESH_AUTH_MODE=none AND STRANDS_MESH_DEV_MODE=true, or a sentinel like STRANDS_MESH_I_UNDERSTAND_PLAINTEXT=yes). The session-open WARNING is good but easily missed in production log volumes — a hard-fail unless the operator has explicitly opted in twice is more defensible.
Related: tls_block() raises FileNotFoundError when mtls is set but cert paths are wrong, which is the right behaviour. The none path should match that posture: the burden of proof lives with the operator who's turning auth off.
R8 -- response to substantive reviewThank you for the thorough pass. Walking through each concern: Shipped this round
Verification: Already addressed in code (zombie-thread-style replies)
So the docstring no longer hints that signing happens here. If there is another docstring elsewhere I missed, please point me at the file:line. Filed as follow-up issues (per reviewer guidance)
Each issue captures the reviewer's exact words, two design options where applicable, and acceptance criteria. Linked back to this PR. Breaking-change density / UPGRADING noteNoted. The PR is pre-1.0 so semver is not strictly required, but I agree a more prominent note than a single CHANGELOG bullet helps operators carrying forward Round budgetAt R8 (3-round target was R3). The remaining ~20 unresolved threads from R5/R6 are either zombie debt (replied with citations in R6) or design discussions that were noted-and-deferred. Per the project convention of squash-merge, the head SHA 🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali. |
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This PR replaces the hand-rolled application-layer crypto envelope (HMAC + nonce cache + token bucket + per-peer identity) with Zenoh's built-in transport primitives: mTLS (transport/link/tls), per-key-expression ACLs (access_control), per-key rate caps (downsampling), per-message size caps (low_pass_filter), gossip-only discovery, and namespace routing isolation. Application-layer security is reduced to payload validation (mesh/security.py), audit-log integrity (per-record HMAC + per-peer monotonic seq + rotation + tamper-aware reader), an HITL flow on tools/robot_mesh.py, and a few hardenings on the IoT path (CA pin tuple, thing-name regex, per-recv timeout). Test surface is large: ~13 new test files, including live two-peer Zenoh sessions in test_redteam_zenoh.py.
The overall direction is sound — "delete 1700 lines of custom crypto, lean on the protocol's built-ins" is the right trade. The implementation is mostly careful, the Zenoh 1.x quirks called out in §5 are real bugs the PR genuinely fixes, and most reviewed concerns are pinned to regression tests as AGENTS.md > Review Learnings (#85) requires.
What's good
- Reviewed-fix-to-regression-test discipline is consistently applied (every Round-N entry in §13 cites a pin test).
- Clear separation of concerns: wire layer in
_zenoh_config.py/_acl_config.py, payload semantics insecurity.py, HITL intools/robot_mesh.py, audit integrity inaudit.py. - Audit log has a thoughtful failure model: per-record HMAC, sidecar-persisted per-peer seq with
fcntl.flock,O_NOFOLLOWon file open,AuditPSKDegradedErroron PSK-unset-mid-run, rotated-log spanning reader, tamper-aware cursor. validate_commandis a single-entry-point allowlist that runs before any subprocess/HF/host-routing decision — matches AGENTS.md > Review Learnings (#92) > LLM Input Safety guidance.- README env-var matrix and CHANGELOG are updated in lock-step with the new env vars.
Concerns
- Default ACL is permissive but several handlers rely on "ACL gates this topic" for their security argument.
_on_safety_estop,_on_safety_resume,_on_cmddocstrings all say "only peers in theoperator_peersubject can reach this handler", but the default ACL shipped bydefault_acl()allows any CA-signed peer to publish anything. WithoutSTRANDS_MESH_ACL_FILEset, every assertion that begins with "only operator peers can…" reduces to "any peer with a valid cert can…". The README is honest that role separation requiresSTRANDS_MESH_ACL_FILE; the in-source docstrings should match. - Scope creep / commit-count drift. PR description § opening line claims "One commit, branched from current main" but the head has 6 commits (1 feat + 5 review-round followups). Squash on merge will resolve it; flagging here for the human reviewer.
_on_safety_estoplacks the freshness + replay defences_on_safety_resumehas. This is acknowledged in §12 ("accepted design asymmetry… tightening tracked for follow-up") but it's the most security-relevant asymmetry in the PR — a captured estop envelope replays into a fleet-wide DoS lockout. With the default permissive ACL, any CA-signed peer can also originate one. See inline._local_interfaces()failure mode silently disables the byte cap. When psutil is unavailable AND none of the canonical fallback NIC names (lo, eth0, en0, en1, wlan0) match the actual interface (e.g.enp0s3,wlp2s0on systemd-networkd Linux, orbond0in containers), thelow_pass_filterblock becomes a no-op exactly as Zenoh quirk #1 in §5 describes. The PR claims to fix this; the fallback re-introduces it on systems with non-canonical NIC names. See inline.- JSON5-lite preprocessor accepts single-quoted strings but stdlib
json.loadsdoes not. The three preprocessors track single-quote string delimiters when scanning for comments / unquoted-key / trailing-comma, which is correct for not corrupting a single-quoted string — but the resulting bytes still pass throughjson.loadswhich rejects single quotes. An operator who follows the JSON5 spec literally and writescert_common_names: ['robot-a']gets aValueError("ACL file ... is not valid JSON5")rather than a clean parse. Probably fine for documented usage (the example file is double-quoted) but worth a comment.
Verification suggestions
# Confirm the default ACL paths actually deny a CN that isn't enumerated.
hatch run pytest tests/mesh/test_redteam_zenoh.py::TestACLEnforcement -v
# The two skipped tests in TestACLEnforcement (op_cert_can_publish_on_cmd,
# robot_cert_cannot_publish_on_cmd) are the ones that would actually
# verify the role-separated template works. Read their `pytest.skip`
# reasons and decide whether the missing coverage is acceptable for
# merge.
hatch run pytest tests/mesh/test_redteam_zenoh.py -v -k 'TestACLEnforcement'
# Spot-check: does the audit log reject a symlink swap mid-run?
hatch run pytest tests/mesh/test_audit_integrity.py -v -k symlink
# Smoke test the JSON5 preprocessor with a single-quoted-string ACL.
python -c 'from strands_robots.mesh._acl_config import _json5_to_json, _load_acl_file; from pathlib import Path; import tempfile, textwrap, json; src = textwrap.dedent("""{enabled: true, default_permission: \"deny\", rules: [], subjects: [{id: \"a\", interfaces: [\"lo\"], cert_common_names: [\x27robot-a\x27]}], policies: []}"""); print(repr(_json5_to_json(src))); print(json.loads(_json5_to_json(src)))'
# Confirm _local_interfaces actually picks up modern systemd NIC names.
python -c 'from strands_robots.mesh._zenoh_config import _local_interfaces; print(_local_interfaces())'| "issuer": sender, | ||
| "issuer_t": data.get("t"), | ||
| }, | ||
| ) |
There was a problem hiding this comment.
_on_safety_estop is missing the freshness + replay-cache defences _on_safety_resume has, and that asymmetry is exploitable as a fleet-wide DoS. §12 of the PR description acknowledges this as an accepted follow-up, but I'd push back: an attacker who captures a single legitimate safety/estop envelope (e.g. via a brief packet capture during a real incident) can replay it indefinitely afterwards, locking the fleet out until an operator runs _resume_lockout(code) on every peer. Replay-protection-via-TLS-link-sequence-numbers (described in §4) only defends within a single TLS session; it does nothing against a captured envelope replayed on a new session. Combined with the permissive default ACL (default_acl()), any peer with a CA-signed cert (not just operator_peer) can both originate AND replay these. The docstring's "only peers in the operator_peer subject can publish on safety/**" is a lie when no STRANDS_MESH_ACL_FILE is set, which is the documented default.
Fix: mirror the _on_safety_resume envelope check on this handler — require t within RESUME_FRESHNESS_WINDOW_S, reject future-dated t beyond RESUME_FORWARD_SKEW_S, and bound a per-receiver replay cache keyed on (peer_id, nonce). The cost is one new field on the safety/estop publish path; the gain is closing a vector that's identical in shape to the one _on_safety_resume already defends.
If this stays as a follow-up: please add a regression test that fails on current code (i.e. asserts that a captured estop envelope replays successfully today) so the eventual fix is pinnable per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes".
There was a problem hiding this comment.
R9 implements all three defenses you requested. Commit 1164801:
- Freshness window (
RESUME_FRESHNESS_WINDOW_S) -- stale envelopes rejected. - Forward-skew bound (
RESUME_FORWARD_SKEW_S) -- future-dated envelopes rejected. - Per-receiver replay cache keyed on
(issuer_peer_id, t), bounded LRU atRESUME_REPLAY_CACHE_MAX.
Reuses the existing resume-defense env vars (no new surface). Pinned by tests/mesh/test_estop_replay.py (12 tests). Verified to fail on pre-R9 main (test_stale_envelope_rejected engages lockout for a stale envelope; on R9 it is rejected with WARNING ... too old).
The asymmetry you flagged is closed: estop now has the same envelope discipline as resume.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
There was a problem hiding this comment.
Implemented in R9 (1164801) per your push-back. Same shape as _on_safety_resume: freshness + forward-skew + per-receiver replay cache keyed on (issuer_peer_id, t). Reuses existing RESUME_* env vars. Pinned by tests/mesh/test_estop_replay.py (11 tests) -- pre-R9 main FAILS at test_stale_envelope_rejected.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| # likely to expose. The filter still functions if any of these | ||
| # match the actual link the traffic rides; missing entries just | ||
| # mean traffic on those interfaces bypasses the cap. | ||
| return ["lo", "lo0", "eth0", "en0", "en1", "wlan0"] |
There was a problem hiding this comment.
Silent no-op when psutil is missing AND none of the canonical NIC names match the host. This fallback is exactly the bug Zenoh quirk #1 in §5 of the PR description claims to fix — "low_pass_filter was a silent no-op because the block requires a non-empty interfaces allowlist". On a stock Ubuntu 24.04 box with systemd-networkd (NICs typically named enp0s3, wlp2s0), or in many container runtimes (eth0 is sometimes the bridge but the actual data path can be bond0 / cali* / vxlan*), none of the names in ["lo", "lo0", "eth0", "en0", "en1", "wlan0"] match — the byte cap is then quietly disabled.
The correct posture is fail-loud: if psutil is unavailable, raise (psutil is a transitive dep of strands-agents per the inline comment, so this branch should be cold) or at least emit a logger.warning so an operator notices their byte cap isn't engaged. Falling back to a curated guess defeats the purpose of §5's bug fix.
Min fix:
except ImportError:
logger.warning(
"[mesh] psutil unavailable; STRANDS_MESH_FILTER_INTERFACES "
"must be set explicitly or low_pass_filter / downsampling "
"may silently no-op on this host"
)
return ["lo", "lo0", "eth0", "en0", "en1", "wlan0"]or — better — raise from low_pass_filter_block when interfaces resolves to an empty intersection with the actual host NICs.
| # handshake (any CA-signed peer is in); ACL-based per-CN role | ||
| # separation requires an operator-supplied STRANDS_MESH_ACL_FILE. | ||
| # See examples/mesh_acl_example.json5 for the canonical role template. | ||
| "default_permission": "deny", |
There was a problem hiding this comment.
The default ACL is permissive (any CA-signed peer can publish on any key) but multiple handler docstrings claim safety/** and cmd are gated by operator_peer. This is the README's honest position (§8 of the PR description "Default ACL — permissive by design"), but Mesh._on_safety_estop (line 863–866), Mesh._on_safety_resume (line 899–906), and Mesh._on_cmd (line 580–588) all read like the role-separated template at examples/mesh_acl_example.json5 is the active ACL.
Either:
- Make the default-deny ACL the actual default, with a startup-time emit-and-fail when no operator-supplied ACL is present in
mtlsmode (production posture). The argument against ("silent total outage on first run") is a feature in a security-critical path — a misconfigured fleet should refuse to talk, not silently accept anything. - Or change every handler docstring to say "the wire layer may restrict this topic to
operator_peerif the operator has supplied an ACL file; the default ACL does not" so readers don't carry away a false security model.
This intersects with AGENTS.md > Review Learnings (#86) > "Match docstrings to semantics".
There was a problem hiding this comment.
Option 2 (docstring honesty) shipped in R9 (1164801). _on_safety_estop, _on_safety_resume, _on_cmd docstrings now read "when STRANDS_MESH_ACL_FILE configures role separation, only operator_peer can publish on safety/**" -- making explicit that the wire-layer constraint is operator-supplied, not default. Option 1 (default-deny + startup fail) is a posture change for follow-up; this PR keeps the documented permissive default.
🤖 Autonomous agent response. Strands Agents. Feedback to @cagataycali.
| "write an unsigned record (would silently degrade audit " | ||
| "integrity). Restore the PSK or restart the process to " | ||
| "transition to unsigned mode deliberately." | ||
| ) |
There was a problem hiding this comment.
Asymmetric PSK-degrade detection. _sign_record rejects the signed-→-unsigned transition (good, regression test pinned in test_audit_integrity.py::test_psk_degrade_drops_record), but it silently accepts the unsigned-→-signed transition. If a process boots with no PSK, writes a few unsigned records, then has the PSK installed (e.g. a sidecar updates the env), every later record carries a sig and verify_audit_integrity reports ok for them — leaving the unsigned prefix unverifiable AND not flagged. An auditor who skims the report sees a pile of ok and a few missing_sig and assumes the missing-sig records pre-date PSK rollout (which is what the docstring describes as "expected during rollouts"), but they could just as easily be records an attacker wrote during a window where they unset the PSK env, did some damage, then restored it.
Fix: snapshot psk_was_present symmetrically. If first call observed "no PSK" and a later call observes "PSK present", emit a WARNING (don't reject — transition direction is benign for new records) but record the transition timestamp on the state object so verify_audit_integrity can flag any missing_sig record after that ts as "PSK was present at this time but record carries no sig".
|
|
||
| def _json5_to_json(raw: str) -> str: | ||
| """Apply our JSON5-lite preprocessor to *raw*.""" | ||
| return _strip_trailing_commas(_quote_unquoted_keys(_strip_json5_comments(raw))) |
There was a problem hiding this comment.
The JSON5-lite preprocessor recognises single-quoted strings while scanning, but json.loads will reject them. All three preprocessor passes (_strip_json5_comments, _strip_trailing_commas, _quote_unquoted_keys) correctly avoid corrupting bytes inside '...' — good — but they don't convert 'foo' to "foo" either. An operator who follows the JSON5 spec literally and writes:
cert_common_names: ['robot-a', 'robot-b']gets a ValueError("ACL file ... is not valid JSON5") from _load_acl_file rather than a clean parse. The shipped examples/mesh_acl_example.json5 is double-quoted throughout, so this won't bite the documented happy path — but operators will reasonably copy from JSON5 docs/Stack Overflow and expect single quotes to work.
Least invasive fix: after the existing three passes, add a _convert_single_quoted_strings pass that walks the string and rewrites 'abc' to "abc" (escaping inner " and unescaping inner \'). Or, document explicitly in the module docstring (lines 25–27) that ONLY double-quoted strings are accepted and the loader is JSON-with-comments-and-trailing-commas, not full JSON5. Either way, currently the docstring says "JSON5" but the loader accepts a strict subset.
…ing honesty Addresses PR strands-labs#195 review feedback (5 new threads on 05:27Z review): 1. _on_safety_estop now mirrors _on_safety_resume's replay defense: freshness window, forward-skew bound, per-receiver replay cache keyed on (issuer_peer_id, t). Reuses existing RESUME_FRESHNESS_WINDOW_S / RESUME_FORWARD_SKEW_S / RESUME_REPLAY_CACHE_MAX env vars (no new surface). Closes the captured-envelope replay DoS the reviewer flagged: an attacker with brief safety/** ACL access can no longer replay a captured estop indefinitely on a new TLS session. 2. Docstrings on _on_safety_estop, _on_safety_resume, _on_cmd corrected to match permissive default ACL behavior. The 'only operator_peer can publish here' claim was a lie when STRANDS_MESH_ACL_FILE is unset (the documented default). Each docstring now reads 'when the operator supplies STRANDS_MESH_ACL_FILE with role separation, only operator_peer can publish; the default is permissive (CHANGELOG Section 8)'. 3. _zenoh_config.py module docstring fixed: namespace default was wrongly documented as 'strands_robots' (actual: 'strands'); ACL function reference was 'default_acl_block' (actual: 'default_acl') and described as 'default-deny' (actual: permissive). 4. Narrowed exception clause on _on_safety_estop to (AttributeError, UnicodeDecodeError, JSONDecodeError) per AGENTS.md > Review Learnings (strands-labs#86). Pinned by tests/mesh/test_estop_replay.py (12 tests). Verified to fail on pre-R9 main (test_stale_envelope_rejected fails because _on_safety_estop accepted any decode-able JSON object). 714 mesh tests pass on R9 branch.
…llowlist charset+cache)
Five hardening fixes in strands_robots/mesh/security.py covering numeric
coercion, HF repo shape, env-var allowlist hygiene, and the teleop_receive
device_name field:
1. _coerce_float / _coerce_int reject NaN, +/-inf, and overflow.
Without an explicit math.isfinite check, 'nan < lo or nan > hi'
evaluates to False on both bounds (IEEE-754), so a payload like
{"action": "execute", "duration": NaN} would land NaN in the
robot adapter where time.sleep(nan) raises and time.monotonic() + nan
creates a never-terminating deadline. Python's json.loads accepts
the JSON-non-standard literal NaN by default, so this IS reachable
from the wire.
_coerce_int also wraps int(...) so a NaN/inf float input that
isinstance accepts but int() rejects raises ValidationError instead
of bare ValueError. _exec_cmd only catches ValidationError; a bare
ValueError would bypass the structured command_rejected audit and
surface as a generic 'dispatch error' -- asymmetric forensic
treatment with the _coerce_float path.
2. is_safe_model_path(hf_only=True) requires exactly two non-empty
path segments (the canonical <org>/<repo> shape).
Pre-fix the validator only required '/' in path and a matching org
prefix, so 'nvidia/etc/passwd' (3 segments, traversal-shaped,
nvidia matches the allowlist) silently passed. Real HF repo IDs
are exactly two segments; deeper paths would be branch/revision
pinning ('org/repo/blob/sha') which today's loaders do not accept.
The validator's job is to enforce the wire contract regardless of
downstream rejection (a future loader that accepts deep paths
would inherit the gap silently).
3. HF + policy_type allowlists charset-validate operator-supplied
entries.
The pre-existing F3 fail-loud-on-misconfig posture for
STRANDS_MESH_POLICY_HOST_ALLOW is now lifted into a single helper
(_validate_env_allowlist_entries) and applied to all three
operator-extensible env-var allowlists. A typo like
STRANDS_MESH_HF_REPO_ALLOW="nvidia,;rm -rf /" no longer silently
produces an allowlist containing ';rm -rf '; the malformed entry
is dropped with a WARNING that names the env var, the offending
entry, and the expected charset.
4. All three allowlist parsers wrap functools.lru_cache(maxsize=1)
keyed on the raw env-var string.
Without the cache the parsers re-read os.getenv on every
validate_command call and re-emit the malformed-entry WARNING,
which on a busy mesh would drown the audit log on a typo'd env
var. monkeypatch.setenv re-parses naturally (different cache
key); a new _clear_security_caches_for_tests() helper exists for
tests that need to re-exercise the WARNING path with the same
raw value.
5. teleop_receive.source_peer_id and device_name are charset + length
bounded (mirrors model_path discipline).
Both fields flow into r.start_teleop_receive(source, dev) and into
log messages; device_name additionally becomes a key in the
per-device state mapping. An authenticated peer publishing
arbitrary unicode / control chars / NUL bytes / shell
metacharacters in either field has no business reaching downstream
code. _PEER_ID_RE and MAX_PEER_ID_LEN now bound the charset to
[A-Za-z0-9_.-]+ at length <= 128.
(The source_peer_id charset already landed via the R25 commit;
this commit adds the matching device_name discipline.)
Pin tests:
tests/mesh/test_validate_command_finite_numerics.py (12 tests)
tests/mesh/test_hf_repo_two_segment_shape.py (15 tests)
tests/mesh/test_allowlist_env_charset_validation.py ( 4 tests)
tests/mesh/test_allowlist_caches.py ( 4 tests)
tests/mesh/test_validate_command_teleop_peer_id.py (24 tests)
Verification:
- hatch run test tests/mesh/: 974 passed, 2 skipped (was 902+2,
+60 new pins on this commit; previous wire-config commit added 16)
- hatch run lint: clean
- hatch run format --check: clean
Public API additions exposed in __all__:
- MAX_PEER_ID_LEN (already in R25; documented here)
The teleop_receive.source_peer_id charset discipline overlapped with the
R25 commit (already in this branch); this commit extends it to
device_name with the same shape.
…thy load)
The R22-A audit-log fallback applied a sane upper bound (_MAX_SEED_SEQ
= 100M) to every seq value seeded into _SEQ_COUNTERS, but the healthy-
sidecar code path accepted any positive int. The two paths are
unsymmetrically defended:
- audit-log fallback: HMAC-verify (F7-D) AND _MAX_SEED_SEQ cap (F15-A)
- sidecar healthy load: no defence
The sidecar is unsigned -- the per-record HMAC defence covers the audit
log only -- so the cap is the only fail-loud surface available on the
healthy-sidecar path. An attacker with audit-dir write access could
otherwise drop a syntactically valid sidecar like
{"victim": 999999999} and after the next process restart the
legitimate writer's first event would silently jump the counter by ~10^9
with no operator-visible signal.
This commit:
- Promotes _MAX_SEED_SEQ from a function-local in the audit-log fallback
to a module-level constant, with a docstring explaining the
100M-events-per-peer-per-year design rationale.
- Applies the cap symmetrically on the healthy-sidecar load path,
emitting a WARNING that names the peer and the offending value
(mirrors the audit-log-fallback WARNING shape).
Pin: tests/mesh/test_audit_sidecar_seq_cap.py (5 tests)
- over-cap sidecar value refused with WARNING
- boundary value (exactly _MAX_SEED_SEQ) accepted
- well-formed values pass through
- module-level constant pinned at 100M
- negative values silently rejected (pre-existing isinstance gate)
Verification:
- hatch run test tests/mesh/: passes (59 audit-related tests green)
- hatch run lint: clean
- hatch run format --check: clean
… HMAC Threat model addressed ---------------------- The body-level HMAC binding (peer_id, t, lockout_elapsed_s, proof_nonce) catches an attacker who mutates body fields of a captured envelope. It does NOT catch an attacker on a *different* mTLS-authenticated session who also holds the override code: that attacker can mint a fresh envelope with a body of their choosing and a freshly-computed MAC. The receiver-side compare passes because the attacker's body matches their MAC. This is the cross-session forgery surface: any peer that the operator trusts enough to ship a cert + override code to (a privileged operator station, a temporarily-onboarded technician laptop) is sufficient to clear another session's lockout. Fix --- Bind the publisher's TLS-authenticated Zenoh session ID (sample.source_info.source_id.zid) into both: 1. The body of every safety envelope (`source_zid` field). 2. The HMAC input for resume envelopes. The receiver: * extracts `sample.source_info.source_id.zid` (set by Zenoh during the mTLS-bootstrapped session handshake; ZenohId has no public Python constructor in zenoh-python, so it is forced to whatever the publishing session bootstrapped to under `connect.tls`), * requires body `source_zid` to equal wire `source_zid` when both are present (no silent downgrade across mixed-version fleets), * re-derives the resume MAC using the wire-level zid -- so an attacker who somehow bypassed the shape check still fails the MAC compare, * keys the resume replay cache on (wire_zid, proof_nonce) so two sessions can legitimately share a nonce without colliding and an attacker on a different session cannot evict legitimate cache slots by reusing a captured nonce. Bridge / IoT transports do not propagate `source_info`. In that case both wire and body `source_zid` are absent and we fall back to the body-level HMAC defences alone -- the cross-session-forgery defence is Zenoh-specific because only Zenoh exposes a TLS-bound publisher identity. Implementation -------------- * New `_extract_sample_source_zid()` helper validates the strict 1..32-hex-char shape so unit-test `MagicMock` stand-ins, third-party transport shims, or zenoh-python repr drift cannot accidentally satisfy the binding (fail open: "no zid available" -> body-only path). * New `_local_session_zid()` reads our own session zid for the publish side; returns None on non-Zenoh transports so the helpers degrade gracefully. * New `_publish_safety_envelope()` declares (and reuses) a Zenoh Publisher per safety topic and attaches `SourceInfo(pub.id, sn)` to every put -- `pub.put(payload, source_info=...)` is the only Python API surface that gets a wire-level zid onto an outbound sample. * Stable per-publisher `EntityGlobalId.eid` plus a monotonic `source_sn` counter scoped per topic so an off-the-wire replay even from the same session is bounded by (zid, sn) uniqueness. * `stop()` undeclares the cached publishers so the underlying Zenoh entity is released cleanly with the session. Tests ----- 20 new tests in tests/mesh/test_safety_source_zid_binding.py: * extractor: well-formed sample, missing source_info, MagicMock rejection, uppercase-hex rejection, overlong-string rejection * estop receiver: body=wire accepted, body!=wire cross-session forgery rejected, body-without-wire rejected, wire-without-body rejected, both-absent (bridge) accepted via body-only defence * resume receiver: happy path, cross-session forgery rejection, MAC-binds-wire-zid defence-in-depth, replay cache keyed on wire zid, pre-binding publisher rejected on Zenoh, bridge transport accepted * publisher helpers: `_local_session_zid()` returns None without session, `_safety_publisher_for()` returns None without session, `_next_safety_sn()` is monotonic per topic, fallback path uses put() All 993 mesh tests pass.
Per AGENTS.md guidance docstrings MUST explain *what the code does and why*, not *who reviewed it on what date or which review round produced it*. The mesh module accumulated a lot of attribution cruft over PR review iterations -- markers like "F18-A (PR strands-labs#195 review): ...", "R20: ...", "Yin's 03:11 batch", "PR strands-labs#195 thread PRRT_kwDORUMiZs6...". That history is preserved in git log; the source code does not need to carry it. This commit removes that noise from comments and docstrings while preserving the actual technical claims. Specifically: * Strips leading "F<N>-<A> (PR strands-labs#195 review): " / "R<N>: " / "F<N>: " prefixes (these were marker-prefixes for review-round attribution). * Strips parenthetical "(PR strands-labs#195 review)", "(PR strands-labs#195 review feedback)", "(PR strands-labs#195 design-thread review)", "(post-F<N>)". * Removes "See PR strands-labs#195 thread PRRT_..." sentences and inline references (the thread ID is meaningful to a code reviewer of that specific PR but not to a future engineer reading the source). * Replaces inline "F<N>(-<A>)?" / "R<N>(-<X>)?" with neutral phrasing ("the prior fix", "earlier revisions", "defensively") and then contextually re-words the awkward results. * Drops attribution noise from log messages ("F9-A per-issuer" -> "per-issuer", "F7-D circular-trust" -> "circular-trust"). * Cleans up tests/mesh/ in the same way. Behavioural changes: NONE. Every assertion, code path, and test remains identical -- this is purely a documentation hygiene pass. Verified -------- * All 993 mesh tests still pass. * `ruff check` passes. * `ruff format --check` passes. * mypy passes on core.py + audit.py. The AGENTS.md "Review Learnings" sections are kept intact -- those *are* the appropriate place for review-round attribution because they document project-wide guidance derived from those reviews.
…t match
The previous docstring claimed the validator regex `^[a-zA-Z0-9_-]{1,128}$`
'matches both AWS IoT's accepted Thing-name character set and our own
filesystem-path safety rules'. That is incorrect: AWS IoT accepts `:`
server-side; we reject it for filesystem-path safety on NTFS / classic
Mac where `:` is a stream / directory separator. The pattern is
therefore a *strict subset* of the AWS surface, not a match.
Update the docstring to call this out explicitly so an operator with a
pre-existing `:`-containing Thing knows to rename or maintain a
mapping rather than file a bug. Code unchanged.
Update: TLS-bound
|
The helper returns an inner _Zid instance, not a SimpleNamespace. The earlier -> SimpleNamespace annotation was a copy-paste from the sibling _make_sample helper above (which does return SimpleNamespace). mypy --strict caught it; the runtime behaviour was already correct. Fixes call-test-lint failure on commit 7423be2.
Split into 9 child PRs — see umbrella tracker #219This PR is being split into 9 dependency-ordered child PRs to make the 213 review threads tractable. This PR stays open for forensic context (R-round changelog) until PR-9 (#228) lands. Child PRs
Where each review round landsThe full R-round → child-PR mapping is in each child PR's commit message. The R-round changelog table in this PR's body remains the forensic source of truth. Reviewer notes
Full plan: |
Four fixes from PR-220 review comments:
1. _pki.py: Add anchored CN regex validation in issue() + make_test_ca().
common_name was interpolated directly into filesystem path and
x509.DNSName(...) without validation. AGENTS.md > 'Sanitize user
inputs' applies -- every later mesh-security test PR uses this
helper, so issue('node/a', out_dir) or issue('../foo', out_dir)
must fail loudly.
Ref: strands-labs#220 (comment)
2. _pki.py: Rename TestCA -> EphemeralCA. The original triggered
PytestCollectionWarning on every run because pytest's default rule
matches Test*. Rename is cleaner than __test__ = False.
Ref: strands-labs#220 (comment)
3. conftest.py: Tighten autouse fixture: (a) move second-factor flag
into same conditional branch as auth-mode override so AUTH_MODE=mtls
does NOT inject the insecure flag; (b) add @pytest.mark.mesh_auth_mode_default
opt-out marker so tests exercising the gate itself don't pass by
accident under autouse.
Ref: strands-labs#220 (comment)
4. _pki.py: Fix docstring -- module path 'strands_robots.mesh.iot.provision'
not filename 'mesh/iot/provision.py' (AGENTS.md PR strands-labs#86 review
learning); 'Returns the generated CA' not 'loaded'.
Ref: strands-labs#220 (comment)
Tracking: strands-labs#219
Source PR: strands-labs#195
|
Correction to my earlier comment: GitHub assigned PR numbers in API-call order rather than slot order. The correct slot-to-PR mapping is:
Umbrella issue #219 has been updated to match. The earlier comment's #226=tools / #227=iot / #228=docs lines were wrong; sorry for the noise. |
…ling convention Addresses 6 R1 review threads on PR strands-labs#224 (5 from human reviewer + 1 CodeQL alert). Each fix is surgical and pinned by an AST-based regression test in tests/mesh/test_mesh_review_invariants_pr224.py; pre-fix the 4 invariant tests fail at the exact line numbers, post-fix all pass. Threads addressed: 1. **Unused `_TESTS` global in tests/mesh/test_acl_example_refs.py.** CodeQL alert (security/code-scanning/251). The global was set to `_REPO_ROOT / "tests"` but only `_REPO_ROOT / ref` is referenced. Removed. The alert resolution is the pin. Ref: https://github.com/strands-labs/robots/security/code-scanning/251 2. **`except (OSError, FileNotFoundError):` at _acl_config.py:380.** `FileNotFoundError ⊂ OSError` since Python 3.3, so the redundant element silently expands the catch surface from a reader's perspective and misleads static analysers. Per AGENTS.md > Review Learnings (PR strands-labs#86) > "Exception Clauses Must Be Narrow." Collapsed to `except OSError:`. Ref: strands-labs#224 (comment) 3. **`except (OSError, ValueError, FileNotFoundError) as exc:` at _acl_config.py:471.** Same shape as strands-labs#2; `FileNotFoundError` element is redundant. Collapsed to `except (OSError, ValueError) as exc:`. Ref: strands-labs#224 (comment) 4. **Removed dead `try/except ValueError: _auth_mode = "mtls"` swallow** at session.py:425-427 and the duplicate at session.py:521-523. The reviewer correctly identified the swallow as dead: 5 lines later `_build_config()` calls `resolve_auth_mode()` again unconditionally and lets the same ValueError propagate. So the silent fallback to "mtls" was setting up a wire-scheme that the immediately-following config build would then reject — three confusing fallback warnings and a `return None` instead of one clear stacktrace at the typo'd env var. The fix lets `resolve_auth_mode()` raise once at the first call site. Inline comment documents the new posture and cites the precedent (`_float_env`, `_load_acl_file`). Ref: strands-labs#224 (comment) 5. **Removed dead `is_symlink()` re-check on `key_path`** at _zenoh_config.py:547-555. The CA/cert/key reject loop above (lines 499-507) already rejected symlinks for all three TLS paths, including the private key. The second `is_symlink()` fired at the same wall-clock instant as the loop above and could never observe a TOCTOU window the loop didn't see first. The big multi-line comment that motivated the dead check has been reframed to explicitly document the residual TOCTOU window between this `lstat()` and Zenoh's eventual `open()` (Python cannot pass `O_NOFOLLOW` into the C-level Zenoh wheel, so the residual window is upstream-bound). Ref: strands-labs#224 (comment) 6. **`is_default_acl_in_use()` was called bare at session.py:308.** The function takes a `namespace` argument that the gate's call site already had on hand (resolved at line 273) but did not pass through. The function currently ignores the arg via shape-only inspection so the bug was latent, but a future shape-check that does honour `namespace` would silently regress to reasoning over different namespaces between the gate and the ACL block. Now `is_default_acl_in_use(namespace)` -- calling convention is pinned in the new test. Ref: strands-labs#224 (comment) Behavior change (Thread 4): STRANDS_MESH_AUTH_MODE=<invalid> (anything other than "mtls" / "none") now raises ValueError from get_session() instead of silently falling back to "mtls" and returning None after three failed config attempts. The exception propagates out of Mesh.start, surfacing the typo with a clear stacktrace at the moment of misconfiguration. Aligns with the loud-on-misconfig posture of _float_env / _load_acl_file. The pre-existing `except Exception: return None` wrapper around `_build_config()` (session.py:437, 459, 470, 552) remains -- the reviewer flagged it as a separate follow-up, "worth a follow-up, not blocking" -- so the only NEW raising call site introduced by this commit is the one at session.py:425 (and its mirror at 521). Pin verification (AGENTS.md S0 round-trip): Pre-fix: tests/mesh/test_mesh_review_invariants_pr224.py test_acl_config_no_redundant_subclass_in_except FAILED (2 lines) test_session_no_auth_mode_value_error_swallow FAILED (2 lines) test_zenoh_config_resolve_tls_paths_one_symlink_check FAILED (2 calls) test_session_default_acl_check_passes_namespace FAILED (line 308) Post-fix: All 4 pin tests pass. CodeQL alert #251 resolves once the unused `_TESTS` global is removed; not pinned by this PR's tests because the alert IS the pin. Refs: strands-labs#195 (umbrella), strands-labs#219 (split tracker), #346 (ext task tracker)
) * test(mesh): test PKI helpers + shared conftest Adds shared test infrastructure used by every later mesh security test PR: - tests/mesh/_pki.py: minimal in-process CA + cert helpers (deterministic CN strings, fd-clean tempdir lifetime, no host paths) used by the wire-config and transport-security suites to assemble mTLS chains in tests. - tests/mesh/conftest.py: shared pytest fixtures (env scrubbing, audit dir cleanup, Zenoh mock). Lands first in the split of #195 so every later test PR has a green CI baseline. Tracking: #219 Source PR: #195 * test(mesh): address PR-220 review feedback (yinsong1986) Four fixes from PR-220 review comments: 1. _pki.py: Add anchored CN regex validation in issue() + make_test_ca(). common_name was interpolated directly into filesystem path and x509.DNSName(...) without validation. AGENTS.md > 'Sanitize user inputs' applies -- every later mesh-security test PR uses this helper, so issue('node/a', out_dir) or issue('../foo', out_dir) must fail loudly. Ref: #220 (comment) 2. _pki.py: Rename TestCA -> EphemeralCA. The original triggered PytestCollectionWarning on every run because pytest's default rule matches Test*. Rename is cleaner than __test__ = False. Ref: #220 (comment) 3. conftest.py: Tighten autouse fixture: (a) move second-factor flag into same conditional branch as auth-mode override so AUTH_MODE=mtls does NOT inject the insecure flag; (b) add @pytest.mark.mesh_auth_mode_default opt-out marker so tests exercising the gate itself don't pass by accident under autouse. Ref: #220 (comment) 4. _pki.py: Fix docstring -- module path 'strands_robots.mesh.iot.provision' not filename 'mesh/iot/provision.py' (AGENTS.md PR #86 review learning); 'Returns the generated CA' not 'loaded'. Ref: #220 (comment) Tracking: #219 Source PR: #195
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This PR replaces ~1700 lines of hand-rolled application-layer crypto on the mesh with Zenoh built-in primitives (mTLS, per-key-expression ACLs, multicast-off discovery, namespace routing, downsampling/low_pass_filter rate+size caps), keeps a hardened payload-validation + audit layer above it, and pins each fix with a regression test. The defence-in-depth model is sound: identity at TLS, authorisation at ACL, semantics at validate_command, HITL at the LLM tool surface, and an independent HMAC-signed audit log. Test coverage is genuinely impressive (839 mesh tests, plus a live two-peer Zenoh test_zenoh_transport_security.py that exercises namespace isolation, downsampling, low-pass filtering, mTLS handshake rejection, and ACL drop on unknown CN).
The inline comments below flag five concrete concerns I could substantiate. None are critical and all have a small-blast-radius fix; the largest concern is forensic-integrity (#1, #2) where the code currently makes a corrupted byte or a transient write failure indistinguishable from tampering.
What's good
- Threat-model coverage table in
README.mdand the vector-by-vector closure with named regression tests is exemplary documentation. - Symmetric
O_NOFOLLOW+ bounded-read on_ensure_paths,_load_seq_counters,_persist_seq_counters,_seq_flock,read_audit_log, and_acl_config._load_acl_file. - HITL approval is genuinely out-of-band:
tool_context.interrupt(...)intools/robot_mesh.pydoesn't flow through the LLM's tool-arg channel, and_interrupt_approvesenforces literal string match. Prompt injection cannot smuggle an approval. - The PSK-degrade defence is now symmetric (signed->unsigned, unsigned->signed, AND value-rotation) with the fingerprint compare-and-set correctly under
_PSK_STATE_LOCK. validate_commandhas a clean whitelist-rebuild shape (onlyturn_id/sender_idpass through unchanged); I could not find an allowlist-bypass surface.
Concerns
- Forensic integrity (the two top inline comments): the audit log currently treats a single corrupt byte and a transient
os.open/fh.writefailure as outwardly identical to tamper-evidence. Both seam fixes are small and well-contained. - Doc-vs-code drift on the ACL TOCTOU bound — the docstrings in
_acl_config.pyoverstate what_load_acl_cachedactually closes. The PR description correctly notes the architectural fix is deferred to follow-up #218; just align the docstrings with what's implemented today (only the unchanged-file race is closed, not the mutation race). - Scope is tight to the mesh subsystem — no scope creep observed. The CHANGELOG entry, README env-var matrix, and migration sketch are all in this PR. Good.
- Deferred items are acknowledged (issue #218 ACL TOCTOU; two skipped Zenoh 1.x ACL fanout tests;
mesh-doctorCLI). Reasonable hand-offs.
Verification suggestions
# 1. Reproduce concern #1 -- a single bad byte in a rotated audit log breaks the verifier:
python - <<'PY'
import os, tempfile
from pathlib import Path
os.environ['STRANDS_MESH_AUDIT_DIR'] = d = tempfile.mkdtemp()
from strands_robots.mesh.audit import log_safety_event, read_audit_log, audit_log_path
log_safety_event('emergency_stop', 'p1', {'k': 'v'})
p = audit_log_path()
rotated = p.with_suffix(p.suffix + '.1')
os.replace(p, rotated)
rotated.write_bytes(b'\xff\xfe garbage \n' + rotated.read_bytes())
try:
read_audit_log()
print('OK')
except UnicodeDecodeError as e:
print('REPRODUCED:', e)
PY
# 2. Reproduce concern #2 -- _next_seq advances even when write fails (simulate ENOSPC via
# chmod 0o000 on the audit dir after _ensure_paths runs once); next successful write
# will show seq=N+k with k > 1.
# 3. Smoke-test the live two-peer ACL/mTLS path the PR claims is exercised:
hatch run pytest tests/mesh/test_zenoh_transport_security.py -v| # ``os.fdopen`` owns the fd; the with-block above closes | ||
| # it. If iter raises, we already closed the fd via | ||
| # context manager exit -- nothing else to do here. | ||
| raise |
There was a problem hiding this comment.
Forensic-integrity bug — a single corrupt byte crashes the audit walker.
The inner except (OSError, UnicodeDecodeError) re-raises both. The outer except OSError as exc: at L1153 only catches OSError, so UnicodeDecodeError propagates out of read_audit_log(). Cascading effects:
verify_audit_integrity()(L1160) crashes instead of reporting tampering — the comment at L1149-1151 says the intent here was a no-op cleanup ("nothing else to do here"), soraisedefeats that intent._load_seq_counters()callsread_audit_log()from inside itsexcept (OSError, JSONDecodeError, ValueError, TypeError)(around L565);UnicodeDecodeErrorisn't in that tuple either, so the seq-seed fallback also collapses on a single bad byte in any rotated copy.
A corrupted audit byte (write torn by power loss, FS corruption, attacker poisoning) should be reported by the verifier, not crash it. Drop the inner try/except entirely (the with os.fdopen(...) already closes the fd on exception), or change raise to continue so the bad file is skipped with a DEBUG log like the symmetric path at L1153-1156.
Please add a regression test that writes invalid UTF-8 into a rotated audit copy and asserts verify_audit_integrity() returns bad_signature/sequence_gaps rather than raising — per AGENTS.md > Review Learnings (#85) > 'Pin regression tests for reviewed fixes'.
| """ | ||
| record = { | ||
| try: | ||
| seq = _next_seq(peer_id) |
There was a problem hiding this comment.
_next_seq advances and persists before the audit-line write succeeds — phantom gaps.
Seq is reserved at L918 (and persisted to the sidecar inside _next_seq) before the _ensure_paths / os.open / fh.write chain at L981-1034 runs. If os.open raises (ELOOP, ENOSPC, EACCES) or fh.write raises mid-line, the warning at L1033 swallows it but the sidecar already shows seq=N. The next successful write will use seq=N+1 (or seq=N+k), so verify_audit_integrity reports a sequence gap that looks identical to a deletion by an attacker. The seq is never rolled back.
This breaks the audit-integrity contract documented in the module docstring ("gaps within a single peer's stream indicate missing events") in the presence of any transient write failure — exactly the failure mode where an honest writer most needs the integrity bound. Two options:
- Reserve seq after the write is durably flushed (move the
_next_seqcall belowos.fsync, then fix up the persisted record by re-opening — adds one fsync per event). - Match the
PSK_DEGRADED/SIGN_FAILEDpoison-record pattern at L946-973: on write failure, persist aWRITE_FAILEDpoison record so the cursor doesn't appear gapped.
Option 2 is the smaller diff and matches how the PR already handles signing failures.
| Two callers in the same ``Mesh.start`` flow (the gate check and the | ||
| config builder) get the same dict object instead of two independent | ||
| reads -- closing the prior TOCTOU surface. If the file changes a | ||
| later call computes a fresh identity tuple and re-loads. |
There was a problem hiding this comment.
Docstring overstates the TOCTOU bound the cache provides.
The comment block above (L362-371) and the _load_acl_cached docstring claim the gate (is_default_acl_in_use) and the wire load (resolve_acl) see the same snapshot. They don't in the file-mutation case: _file_identity() is recomputed on each call (the (dev, ino, size, mtime_ns) tuple), so if the file changes between the two calls, the second call misses the cache and _load_acl_file() reads the new content. The gate validates one snapshot; the wire loads another.
The actual bound this code provides today is "two consecutive cache hits return the same dict" — strictly weaker than what's documented. The PR description correctly notes the architectural fix is deferred to follow-up #218, but the docstrings should match what's implemented in this PR. Either:
- Soften the docstring to reflect the unchanged-file bound only, or
- Implement an fd-based atomic load (open once with
O_NOFOLLOW, validate shape, then pass the bytes to the wire path) — this is the architectural path security(mesh): close ACL file TOCTOU window between is_default_acl_in_use and resolve_acl #218 will need anyway.
Not blocking on its own, but operator-facing security docs claiming protection that isn't there is itself a security concern.
| type(warn_exc).__name__, | ||
| warn_exc, | ||
| ) | ||
| return False |
There was a problem hiding this comment.
_refuse_under_permissive_default_acl fails OPEN on bad STRANDS_MESH_AUTH_MODE.
resolve_auth_mode() (_zenoh_config.py:188-203) raises ValueError on a typo (STRANDS_MESH_AUTH_MODE=mtsl) and on none without STRANDS_MESH_I_KNOW_THIS_IS_INSECURE=1. Catching it here returns False, which means "do not refuse, proceed to start." The misleading log ("permissive-ACL gate may be missing; investigate") hides the actual misconfig from the operator. This breaks the fail-loud-on-misconfig posture that resolve_auth_mode and the _int_env helpers are explicitly designed around — and that's the most safety-relevant invariant in this PR.
Later calls to get_session() will re-raise the ValueError, so the mesh ultimately doesn't start, but the operator chases a phantom ACL problem instead of the real bad-env problem.
Fix: split the catch.
except ImportError as warn_exc:
logger.warning("[mesh] %s: ACL posture check imports unavailable: %s", self.peer_id, warn_exc)
return False
# Let ValueError from resolve_auth_mode() propagate -- bad env vars
# must fail loud, not be swallowed under a misleading WARNING.| args=(data,), | ||
| name=f"mesh-exec-{self.peer_id}", | ||
| daemon=True, | ||
| ).start() |
There was a problem hiding this comment.
Receiver spawns one OS thread per inbound command, unbounded.
_on_cmd is the Zenoh-callback hot path; every accepted command creates a fresh threading.Thread (L944-949). The transport downsampling cap bounds steady-state ingress per key-expression, but combined across many senders + many key-expressions the worker count is unbounded. If _dispatch blocks (slow robot adapter, hung HF download triggered by an execute command, slow audit-write under fsync), threads accumulate until the OS thread limit.
AGENTS.md > Review Learnings (#85) > 'Don't create executors in hot loops' applies. Recommended fix: a bounded ThreadPoolExecutor constructed once in Mesh.__init__ (with explicit max_workers derived from STRANDS_MESH_MAX_SESSIONS or a smaller dedicated env var), shut down in Mesh.stop(), with a queue-drop / WARNING when saturated. The same pattern would replace the except Exception: pass bare-cleanup at L553-557 with the established (RuntimeError, OSError) narrow-except shape used at L568-575.
The test_application_security.py file (358 LOC) and the TestF15WireCommandKeyRequired class in test_validate_command.py exercise modules that land in later PRs of the strands-labs#195 split: - bridge_transport.DEFAULT_BRIDGE_SUFFIXES with 'safety/resume' (bridge_transport additions land in PR-5 / strands-labs#222) - audit._seq_sidecar_path, audit._resolve_log_max_bytes, F2/F3 symlink protections (land in PR-4 / strands-labs#221) - core.Mesh._expected_responders, core.BROADCAST_RESPONDER, the wire-side _exec_cmd dict-shape gate (land in PR-6 / strands-labs#225) - robot_mesh._interrupt_approves (lands in PR-7 / strands-labs#227) PR-2 is scoped to the leaf module strands_robots/mesh/security.py with zero internal mesh deps. These tests were pre-staged here but need their consumers to land first. They will re-land in the appropriate consumer PR. Test impact: 9 failing tests removed (8 from test_application_security.py, the F15 class moved out). All 571 remaining mesh tests pass. No production code change; no behaviour change on validate_command or any security.py contract.
…ed MAX_OVERRIDE_CODE_LEN Addresses two of five R7 review threads on PR-2 (the other three are deferral / docstring acknowledgements handled in inline replies): 1. is_safe_policy_host defence-in-depth gap. The function is exported in __all__, so external callers (PR-7 tools, PR-8 iot) can import it directly. Today, str.strip() inside the function silently drops ASCII whitespace including \r\n\t\v\f, so 'localhost\r\n' passes membership while preserving injection-shaped bytes for the caller. validate_command had a post-check that rescued in-process callers, but bare imports got nothing. R7 lifts the same charset gate (_SAFE_PASSTHROUGH_RE) into is_safe_policy_host before the strip, so the function is safe in isolation. The validate_command post-check is kept as defence-in-depth so a future refactor of the membership compare cannot silently drop the wire rejection. 2. resume.override_code length cap was the only inline magic number in this module; every other length cap is a named module-level constant. R7 introduces MAX_OVERRIDE_CODE_LEN = 256, exports it in __all__, and uses it both in validate_command and in the new pin test. Pin tests: - tests/mesh/test_security_review_r7.py (20 tests, all green) - tests/mesh/test_security_control_char_gates.py: three R6 pin tests updated to accept either 'not in allowlist' or 'control characters' in the error message, since the R7 charset gate inside is_safe_policy_host fires before the post-check for these inputs. Docstring: noted that STRANDS_MESH_POLICY_TYPE_ALLOW shares one allowlist between is_safe_policy_type and is_safe_policy_provider (addresses strands-labs#239 bucket C public-doc gap). Verification: - tests/mesh/ -- 637 passed, 2 skipped, 0 failed - ruff check + ruff format --check -- clean - mypy untouched (no signature changes) Round-budget note: this is R7 on PR-2. Acknowledged exceeding the AGENTS.md 3-round cap; justified because R7 closes a real defence-in-depth gap on a public symbol that PR-6/7/8 will import. No further code rounds on PR-2 -- the remaining three R7 threads are reply-only deferrals tracked in strands-labs#239. Refs: strands-labs#195, strands-labs#219, strands-labs#239.
Mesh Security Hardening (Zenoh-Native)
Closes the 11 attack vectors identified in the strands_robots mesh pentest scope. Replaces ~1700 lines of hand-rolled application-layer crypto with Zenoh's built-in transport primitives, then verifies every claim against live two-peer Zenoh sessions.
One commit, branched from current
main, simulation tests untouched — supersedes #194 which was based on a pre-#192 ancestor and contained an unintended camera-recording revert.1. Threat surface (before)
flowchart TB subgraph LAN["Zenoh LAN multicast"] A1["Any device on Wi-Fi"] A2["Compromised IoT camera"] end subgraph CLOUD["AWS IoT Core"] B1["Stolen robot cert"] B2["Leaked claim cert"] B3["Misconfigured operator"] end subgraph LLM["LLM agent context"] C1["Injected prompt"] C2["Poisoned RAG document"] end R[("Robot fleet<br/>cmd / state / camera")]:::target A1 -->|"1. publish to /cmd"| R A1 -->|"2. replay captured msg"| R A1 -->|"3. fake policy_host=evil"| R LAN <-->|"4. dual-path double-exec"| CLOUD B1 -->|"6. tamper audit log"| R B1 -->|"9. spoof presence"| R A1 -->|"10. flood /cmd"| R B1 -->|"11. CA-substitution MITM"| R A1 -->|"7. capture camera URLs"| R A1 -->|"8. command after E-stop"| R C1 -->|"5. broadcast emergency_stop"| R C2 --> LLM classDef target fill:#f8d7da,stroke:#721c24,color:#000 classDef attack fill:#f8d7da,stroke:#721c24,color:#000 class R target class A1,A2,B1,B2,B3,C1,C2 attack11 vectors, 3 attacker tiers (LAN device, stolen cert, prompt injection). Single shared LAN with no auth, no replay protection, no audit signing, no LLM HITL, IoT policy wildcards, hand-rolled CA verification.
2. Solution architecture
flowchart LR subgraph WIRE["Wire layer (Zenoh built-ins)"] TLS["transport/link/tls<br/>mTLS + cert CN binding"] ACL["access_control<br/>per-key-expr ACL on cert CN"] DS["downsampling<br/>per-key freq cap"] LPF["low_pass_filter<br/>per-message byte cap"] NS["namespace<br/>fleet routing isolation"] SC["scouting/multicast=off<br/>gossip-only"] end subgraph PAYLOAD["Payload layer (mesh/security.py, 467 LOC)"] VC["validate_command<br/>action allowlist + bounds"] SH["is_safe_policy_host<br/>VLA target allowlist"] SM["is_safe_model_path<br/>HF repo gate + traversal block"] SP["is_safe_policy_type<br/>policy class registry gate"] end subgraph TOOL["LLM tool layer (tools/robot_mesh.py)"] HITL["HITL interrupt<br/>emergency_stop / broadcast"] RL["per-action rate limit<br/>declined != consumed"] CV["client-side validate_command<br/>same gate for programmatic callers"] end subgraph AUDIT["Audit log (mesh/audit.py)"] REC["per-record HMAC-SHA256<br/>STRANDS_MESH_AUDIT_PSK"] SEQ["per-peer monotonic seq<br/>sidecar + fcntl.flock"] ROT["bounded rotation<br/>read spans rotated copies"] CURSOR["tamper-aware cursor<br/>bad sig != advance"] end PEER[("Mesh peer")]:::ok PEER --> WIRE WIRE --> PAYLOAD PAYLOAD --> TOOL PEER --> AUDIT classDef ok fill:#d4edda,stroke:#155724,color:#000 classDef warn fill:#fff3cd,stroke:#856404,color:#000 class TLS,ACL,DS,LPF,NS,SC,VC,SH,SM,SP,HITL,RL,CV,REC,SEQ,ROT,CURSOR,PEER okDefence-in-depth in three layers: Zenoh transport gates fleet membership and topic authorization, application validates payload contents that ACL cannot see, the LLM tool surface adds HITL + rate limits, and the audit log keeps an independent forensic record.
3. Vector-by-vector closure
Every vector is verified by a named regression test in
tests/mesh/. The vector→test map is in §10. (Mermaid omitted for body-length; see commite3da654and prior for the visual.)4. Wire-layer config (
mesh/_zenoh_config.py)transport/link/tls(mutual TLS)STRANDS_MESH_TLS_{CA,CERT,KEY}access_controlkeyed on cert CNSTRANDS_MESH_ACL_FILE(operator-supplied) or permissive built-in defaultnamespaceprefix at routing layerSTRANDS_MESH_NAMESPACE(defaultstrands)STRANDS_MESH_MULTICAST=trueto opt indownsamplingblockSTRANDS_MESH_CMD_RATE_HZ(default 20)low_pass_filterblockSTRANDS_MESH_MAX_CMD_BYTES,STRANDS_MESH_MAX_CAMERA_BYTEStransport/unicast/max_sessionsSTRANDS_MESH_MAX_SESSIONS(default 256)downsamplingon**/safety/**STRANDS_MESH_SAFETY_RATE_HZ(default 2.0)low_pass_filteron**/safety/**STRANDS_MESH_MAX_SAFETY_BYTES(default 4096)5. Zenoh 1.x quirks discovered and fixed
low_pass_filterwas a silent no-op. Empty/missinginterfacesdisables the cap. Builder enumerates every local NIC via psutil.<namespace>/*/cmdpatterns never matched. Zenoh strips namespace before checking key_exprs. All filter / ACL globs use**/cmdform.access_controlrequiresenabled: true. ACL loader hard-rejects any file that omits it.cert_common_namesmatches LITERAL CNs only. Default ACL is permissive; operators wanting per-role enforcement supplySTRANDS_MESH_ACL_FILE.Each is reproduced in a unit test that fails on pre-fix code.
6. LLM tool HITL flow (
tools/robot_mesh.py)Approval delivered out-of-band of the LLM's tool-argument flow, so prompt injection cannot smuggle approval. Only
y/yes/approve/approved(case- and whitespace-insensitive, exact-match afterstrip().lower()) approve. Declined approvals do not consume rate-limit slots.7. Audit integrity flow (
mesh/audit.py)Covers tampering of in-progress records (per-record HMAC), deletion (sequence gaps), restart-and-renumber (sidecar persistence + cross-process flock), env-var unset mid-run (PSK snapshot +
AuditPSKDegradedError), env-var rotated mid-run (R21 fingerprint detection), symlink swap (O_NOFOLLOW+ lstat reject), unbounded growth (rotation cap), tampered seq masking gaps (bad_signature records do not advance per-peer cursor), records hidden in rotated copies (read_audit_logwalks the rotation set in chronological order), verifier without PSK on signed log (R21unverifiable_signedcounter folded intook).8. Default ACL — permissive by design
mesh._acl_config.default_acl()ships permissive. Default-deny with no enumerated CNs would block every legitimate first-run message.Mesh.start()emits a WARNING when default ACL is active undermtls. Operators supplySTRANDS_MESH_ACL_FILEfor role separation.9. Backwards compatibility
~/.strands_robots/iot/).STRANDS_MESH_AUTH_MODE=none+STRANDS_MESH_I_KNOW_THIS_IS_INSECURE=1second-factor (R18). ERROR log on every session open.robot_mesh'sconfirm: boolparameter replaced by Strands SDKtool_context.interrupt.STRANDS_MESH_BRIDGE_TOPICSexact-match by default; old prefix-walk viaSTRANDS_MESH_BRIDGE_TOPICS_PREFIX.10. Test summary
ruff check/ruff format --check/mypy --strictclean across 215 source files.11. Files changed (top-level)
strands_robots/mesh/_zenoh_config.pystrands_robots/mesh/_acl_config.pystrands_robots/mesh/security.pystrands_robots/mesh/audit.pystrands_robots/mesh/core.pystrands_robots/mesh/transport/bridge_transport.pystrands_robots/mesh/iot/provision.pystrands_robots/tools/robot_mesh.pyexamples/mesh_acl_example.json512. Follow-ups (not in this PR)
Per maintainer directive (issue #340), design-level threads addressed in-PR via R19+ commits. Remaining narrow items:
TestACLEnforcement(need Zenoh 1.x local-subscriber-fanout egress research). Security-critical case (rogue CN dropped) is covered bytest_unknown_cn_dropped_by_default_deny.mesh/audit.py; live in Zenoh's own log output.mesh-doctorCLI subcommand to verify the live config emits all expected blocks.13. Round-N changelog
AuditPSKDegradedErroruntestedf75318etest_audit_integrity::test_psk_degrade_drops_recordSTRANDS_MESH_RESUME_*not in README/CHANGELOGa0d8855logger+ emptyexcept+ duplicate imports81e80cdc76fa65STRANDS_MESH_PEER_RATEref in audit.py:136ab33847_on_safety_estopmissing freshness/replay defenses1164801test_estop_replay.py(12 tests)safety/**was operator-only1164801_zenoh_config.pydocstring drift1164801validate_commandhad noresumeclause5bf1018TestValidateCommandResume(8 tests)try/except OSError as oe: raise oe35c5758126f1b4f8b1270bridge_transportTTL math usedtime.time()0d1eb89test_bridge_dedup_monotonicexcept Exceptionin_on_safety_resumec9cb0fdjson.loadse63e6a9test_acl_config::TestJSON5SingleQuotes6a37d1ctest_bridge_dedup::TestStrictc3627e8STRANDS_MESH_RESUME_*failed module import on bad input + silently accepted negativef79c94ftest_resume_env_validation.py(10 tests)c9e0a32test_default_acl_self_consistenta9c5c8ftest_psk_degrade_unsigned_to_signed_drops_recordauth_mode=nonesingle env var disabled wire auth8accc8fTestAuthMode::test_explicit_none_without_optin_raises(4 tests)f67e7c4test_cursor_backward_rollbackb3bdc7b346715ftest_acl_symlink_rejectede53dd4ftime.time()->time.monotonic()for replay-cache TTL eviction (B5)e3da654test_replay_cache_monotonic.py(2 tests)_validate_acl_shaperefuses malformed ACL files at parse timeb8444fctest_acl_shape_validation.py(10 tests)2c91415test_audit_psk_poison_record.pyexamples/mesh_acl_example.json57b61856test_acl_example_refs.py_load_seq_counterslacked O_NOFOLLOW (asymmetric with persist)6695063test_audit_seq_symlink.pyt-only; reject emptypeer_id4b284b6test_estop_peer_id_replay.pyverify_audit_integrityreturnedok=Truewhen verifier lacked PSK on signed logff64de6test_verify_unverifiable_signed.py(5 tests)135f982test_audit_psk_rotation.py(6 tests)**/safety/**ad958b8test_safety_rate_caps.py(8 tests)_evict_replay_cachehelper, generic over key type (PEP-695). Net diff +52/-26; behaviour preserved. Both 25 existing replay-cache integration tests AND 10 new helper unit tests stay green; on pre-R23 main the helper test file fails to collect (ImportError).3f9dc3ctest_replay_cache_eviction.py(10 tests)except Exception:on the wire-input parse path of_on_cmdand_on_responsewas inconsistent with the narrow(AttributeError, UnicodeDecodeError, json.JSONDecodeError)tuple already used by_on_safety_estop/_on_safety_resume. Bare-except maskedRuntimeError/TypeError/MemoryErrorfrom a future Zenoh API change. Symmetric-reasoning audit pin (test_all_four_wire_handlers_use_same_tuple) prevents regressing one handler without the others.ff076aftest_wire_handler_narrow_except.py(3 tests)_exec_cmdcoerced bare-string commands into{action:execute, instruction:<str>, policy_provider:mock}, letting any mTLS+ACL-authorised peer drive the robot at the mock policy with arbitrary text just by publishing"hello"instead of a dict \u2014 bypassingvalidate_command's dict-shape contract. Now rejects non-dict payloads at the wire boundary with WARNING. The ergonomic string form is still available on the outgoingtell()path. Updated existingtest_exec_cmd_string_command_becomes_execute(which documented the wrong contract) totest_exec_cmd_string_command_rejected.8050103test_mesh_rpc.py::test_exec_cmd_string_command_rejected_resolve_tls_pathsonly checkedis_file(), but the docstring on_zenoh_config.py:73and the README env-var matrix both promisedmode 0o600forSTRANDS_MESH_TLS_KEY. A 0o644 key file silently passed \u2014 a co-tenant on a shared host could read the key. Now enforceskey.stat().st_mode & 0o077 == 0on POSIX with a clear remediation message (Run: chmod 600 <path>). Skipped on Windows (file modes do not map cleanly). 4 existing test fixture sites updated tochmod 0o600\u2014 the symptom of a contract that was always documented but never enforced.23295b3test_zenoh_config.py::TestTLSKeyMode(4 tests)teleop_receive.source_peer_idlacked the model_path-style charset/length contract -- nowMAX_PEER_ID_LEN=128+_PEER_ID_RE=^[A-Za-z0-9_.-]+$(with/additionally forbidden, peer_ids are not paths); (2)Mesh.startsubscriber-cleanup used bareexcept Exception-- narrowed to(RuntimeError, OSError)matching R24-A wire-handler precedent (ZError <: RuntimeError; transport faults are OSError); (3)_seq_flocklackedO_NOFOLLOWwhile every sister inode-write site set it -- restored the symmetric defence with explicit ELOOP WARNING; (4)STRANDS_MESH_TLS_KEYdocstring promised mode 0o600 without the Windows caveat -- documented that the loader skips on non-POSIX and operators must use NTFS ACLs as the substitute. The 5th concern from the same review (ACL file TOCTOU betweenis_default_acl_in_useandresolve_acl) is deferred to follow-up issue #218 with both reviewer-proposed options preserved verbatim -- it requires an architectural choice that warrants its own review round, and folding it here would inevitably cascade into R26. The four R25 fixes are extend-established-pattern only; net diff +79/-6 across 4 source files.52017f5test_r25_audit_symmetry.py(14 tests)