mesh(wire): Zenoh + ACL config builders (mTLS, downsampling, low_pass_filter) (3/9 of #195 split)#224
Conversation
…_filter)
The wire-layer foundation: Zenoh built-ins gate fleet membership and
topic authorization before any payload reaches the application layer.
New modules:
- strands_robots/mesh/_zenoh_config.py (620 LOC) -- builders for
transport/link/tls (mutual TLS), namespace prefix, scouting/multicast,
downsampling (per-key freq cap), low_pass_filter (per-message byte
cap including NIC enumeration), session-count bound, **/safety/**
rate + size caps. Honours STRANDS_MESH_TLS_{CA,CERT,KEY} +
STRANDS_MESH_NAMESPACE + STRANDS_MESH_MULTICAST + STRANDS_MESH_*_RATE_HZ
+ STRANDS_MESH_MAX_*_BYTES + STRANDS_MESH_SAFETY_RATE_HZ.
- strands_robots/mesh/_acl_config.py (500 LOC) -- ACL builder + JSON5-lite
loader + shape validator. Permissive by default with operator
remediation via STRANDS_MESH_ACL_FILE. Hard-rejects malformed files at
parse time. O_NOFOLLOW + bounded read on load.
Modified:
- strands_robots/mesh/session.py (+107/-3) -- integration site that
consumes the two builders and emits a WARNING when default ACL is
active under mtls.
- examples/mesh_acl_example.json5 -- role-separated ACL template.
- pyproject.toml -- adds json5>=0.9.0,<1.0.0 to the [mesh] extra
(for the JSON5-lite loader's whitespace/comment subset).
Carries review fixes from strands-labs#195: R7 (CodeQL duplicate import in resume
test), R14 (JSON5 single-quoted strings break json.loads), R15 (ACL
load O_NOFOLLOW + bounded read), R18 (default ACL drift, auth_mode=none
second-factor), R21 (safety rate + size caps), R23 (wire-config NaN
env-var, strict ACL bool, Windows TLS-mode warning), R24-C (TLS key
mode 0o600 enforcement on POSIX), R25-d (TLS key mode docstring +
Windows caveat).
Tests (2,440 LOC across 11 files): config-builder coverage including
the four documented Zenoh 1.x quirks (low_pass_filter NIC enum,
namespace-stripped key globs, 'enabled: true' requirement, literal
CN match), ACL shape validator, default-ACL WARNING, env-var bounds.
Tracking: strands-labs#219
Source PR: strands-labs#195
Lands after: strands-labs#220 (PR-1)
Lands before: PR-6 (which imports session.start)
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This is the wire-layer foundation PR — config-builder modules for Zenoh transport hardening (_zenoh_config.py) and ACL (_acl_config.py), plus the session._build_config integration site. The split is well-scoped (no payload-layer logic leaked in), the prior-review fixes are pinned with regression tests, and the four documented Zenoh 1.x quirks are surfaced both in code comments and the operator template.
What's good
- Strong fail-closed posture across the loader: literal-
Trueidentity check onenabled,_validate_acl_shapecovering the typo footguns (interface:singular, missingcert_common_names, unknown rule/subject ids),_float_envrejecting NaN/inf, and_load_acl_filedoingO_NOFOLLOW+ bounded read. - The
STRANDS_MESH_AUTH_MODE=nonesecond-factor gate (STRANDS_MESH_I_KNOW_THIS_IS_INSECURE) is the right shape for a security toggle that disables the entire wire layer. - Test surface is meaningful — 2,440 LOC across 11 files, with explicit pin tests for prior review fixes (R7/R14/R15/R18/R21/R23/R24/R25). The
test_default_acl_warning.pysuite covers the dangerousmtls + permissive default ACLinteraction. - ACL identity-tuple cache (
_load_acl_cached) closes a real TOCTOU between the gate check and the wire load; the 4-entry cap keeps it from being a memory amplifier.
Concerns
- Two
except (X, Y)clauses where Y is a subclass of X in_acl_config.py(lines 380 and 471 — see inline). AGENTS.md > Review Learnings (PR #86) > 'Exception Clauses Must Be Narrow' explicitly calls this out as a bug pattern: the redundant element silently expands the catch surface. session.pyswallowsresolve_auth_mode()ValueError twice with different semantics (see inline). The early swallow picks_auth_mode='mtls'for the local-endpoint scheme, but the immediately-following_build_config()call re-invokesresolve_auth_mode()and lets it raise. The first try/except is therefore dead for the cases the comment claims to handle (AUTH_MODE=nonewithoutI_KNOW, typo'd values).- Dead symlink check in
_resolve_tls_paths— the loop at lines 501-507 already rejects symlinks for all three paths; the secondis_symlink()onkey_pathat lines 547-555 cannot fire. It's not a bug, but it makes the file longer and the security claim harder to audit, and a future hardener may believe a TOCTOU is mitigated when only the easy half is. - Existing broad
except Exceptionclauses around_build_config()(session.py:439, 459, 470, 552 — pre-existing, not in this hunk) silently downgrade what may be legitimate ACL shape errors / TLS misconfig errors toWARNING + return None. This contradicts the rest of the module's loud-on-misconfig posture. Worth a follow-up, not blocking. _NON_POSIX_TLS_WARNED['v']is mutated without a lock under the multi-threaded session-open path. Effectively benign (worst case is two log lines), but the comment on lines 98-103 promises one-shot behaviour the code does not actually guarantee.
Verification suggestions
# Confirm the resolve_auth_mode swallow on a typo:
STRANDS_MESH_AUTH_MODE=foo python -c 'from strands_robots.mesh.session import _build_config; _build_config()'
# Expected: ValueError. If get_session() instead returns None silently,
# the swallow at session.py:425-427 is masking the typo.
# Confirm the dead-code claim — patch out the first symlink-reject
# loop and verify the second is_symlink() at line 547 is reached:
hatch run test tests/mesh/test_tls_key_mode_non_posix_warning.py -v
# Stress the ACL gate under mtls + permissive-shape operator file:
# (the dangerous case test_default_acl_warning.py covers).
hatch run test tests/mesh/test_default_acl_warning.py -v| """Return ``(path_str, dev, ino, size, mtime_ns)`` or None on stat err.""" | ||
| try: | ||
| st = os.stat(str(path), follow_symlinks=False) | ||
| except (OSError, FileNotFoundError): |
There was a problem hiding this comment.
FileNotFoundError is a subclass of OSError, so this tuple collapses to except OSError. AGENTS.md > Review Learnings (PR #86) > 'Exception Clauses Must Be Narrow' calls this exact pattern out: except (ImportError, Exception) is a bug because Exception swallows ImportError. Same shape here.
Drop the redundant element:
except OSError:
return NoneSame issue at line 471 (except (OSError, ValueError, FileNotFoundError) as exc).
| return True | ||
| try: | ||
| resolved = _load_acl_cached(Path(path_env)) | ||
| except (OSError, ValueError, FileNotFoundError) as exc: |
There was a problem hiding this comment.
Second occurrence of the OSError / FileNotFoundError tuple collapse. FileNotFoundError ⊂ OSError — drop the trailing element so a future reader (or static analyser) is not misled into thinking the catch is narrower than it is.
except (OSError, ValueError) as exc:
...| cfg = zenoh.Config() | ||
| _auth_mode = resolve_auth_mode() | ||
| except ValueError: | ||
| _auth_mode = "mtls" |
There was a problem hiding this comment.
This try/except ValueError: _auth_mode = "mtls" swallow is misleading. Five lines down, _build_config() (line 432) calls resolve_auth_mode() again and lets the same ValueError propagate — so for the cases this block implies it protects against (typo'd STRANDS_MESH_AUTH_MODE, =none without I_KNOW_THIS_IS_INSECURE), the eventual outcome is still a raise from _build_config(), not the silent fallback to mtls that this block sets up. The fallback is dead for those inputs.
Either:
- Remove the swallow and let
resolve_auth_mode()raise once at config-build time (matches the loud-on-misconfig posture of_float_env,_load_acl_file). - Cache the resolved value once and pass it into
_build_config()so the two call sites cannot disagree.
Same pattern at session.py:523.
| _NON_POSIX_TLS_WARNED["v"] = True | ||
| else: | ||
| key_path = paths[2] | ||
| if key_path.is_symlink(): |
There was a problem hiding this comment.
This is_symlink() check is unreachable: lines 501-507 above already rejected symlinks for all three paths (CA / cert / key) before this branch runs. The big multi-line comment at 523-531 implies the symlink-reject + lstat is doing real defence-in-depth on the key path specifically, but the symlink reject already happened in the loop above.
If the intent is defence-in-depth against a TOCTOU between line 502's is_symlink() and the Zenoh wheel's later read (which is a real residual concern — Zenoh opens the file without O_NOFOLLOW from Python's perspective), this is the wrong primitive to address it: this check fires at the same wall-clock instant as the loop above. Either:
- Remove the dead
if key_path.is_symlink(): raise ...and let the comment focus on the lstat/mode check that is actually doing work, or - Document the residual TOCTOU window explicitly so a future hardener does not believe the problem is fully solved.
| "true", | ||
| "yes", | ||
| ) | ||
| if _acl_config.is_default_acl_in_use() and not accept_permissive: |
There was a problem hiding this comment.
is_default_acl_in_use() is called without the namespace argument that was just resolved on line 273. It defaults to "strands" so this works in the common case, but if the operator sets STRANDS_MESH_NAMESPACE=fleet-a, the gate check and the actual ACL block end up reasoning over different namespaces. Pass the resolved value through:
if _acl_config.is_default_acl_in_use(namespace) and not accept_permissive:Minor — currently the function ignores its namespace arg via shape-only inspection — but the calling convention should be consistent so a future shape-check that does care about namespace doesn't silently regress.
…ling convention Addresses 6 R1 review threads on PR strands-labs#224 (5 from human reviewer + 1 CodeQL alert). Each fix is surgical and pinned by an AST-based regression test in tests/mesh/test_mesh_review_invariants_pr224.py; pre-fix the 4 invariant tests fail at the exact line numbers, post-fix all pass. Threads addressed: 1. **Unused `_TESTS` global in tests/mesh/test_acl_example_refs.py.** CodeQL alert (security/code-scanning/251). The global was set to `_REPO_ROOT / "tests"` but only `_REPO_ROOT / ref` is referenced. Removed. The alert resolution is the pin. Ref: https://github.com/strands-labs/robots/security/code-scanning/251 2. **`except (OSError, FileNotFoundError):` at _acl_config.py:380.** `FileNotFoundError ⊂ OSError` since Python 3.3, so the redundant element silently expands the catch surface from a reader's perspective and misleads static analysers. Per AGENTS.md > Review Learnings (PR strands-labs#86) > "Exception Clauses Must Be Narrow." Collapsed to `except OSError:`. Ref: strands-labs#224 (comment) 3. **`except (OSError, ValueError, FileNotFoundError) as exc:` at _acl_config.py:471.** Same shape as strands-labs#2; `FileNotFoundError` element is redundant. Collapsed to `except (OSError, ValueError) as exc:`. Ref: strands-labs#224 (comment) 4. **Removed dead `try/except ValueError: _auth_mode = "mtls"` swallow** at session.py:425-427 and the duplicate at session.py:521-523. The reviewer correctly identified the swallow as dead: 5 lines later `_build_config()` calls `resolve_auth_mode()` again unconditionally and lets the same ValueError propagate. So the silent fallback to "mtls" was setting up a wire-scheme that the immediately-following config build would then reject — three confusing fallback warnings and a `return None` instead of one clear stacktrace at the typo'd env var. The fix lets `resolve_auth_mode()` raise once at the first call site. Inline comment documents the new posture and cites the precedent (`_float_env`, `_load_acl_file`). Ref: strands-labs#224 (comment) 5. **Removed dead `is_symlink()` re-check on `key_path`** at _zenoh_config.py:547-555. The CA/cert/key reject loop above (lines 499-507) already rejected symlinks for all three TLS paths, including the private key. The second `is_symlink()` fired at the same wall-clock instant as the loop above and could never observe a TOCTOU window the loop didn't see first. The big multi-line comment that motivated the dead check has been reframed to explicitly document the residual TOCTOU window between this `lstat()` and Zenoh's eventual `open()` (Python cannot pass `O_NOFOLLOW` into the C-level Zenoh wheel, so the residual window is upstream-bound). Ref: strands-labs#224 (comment) 6. **`is_default_acl_in_use()` was called bare at session.py:308.** The function takes a `namespace` argument that the gate's call site already had on hand (resolved at line 273) but did not pass through. The function currently ignores the arg via shape-only inspection so the bug was latent, but a future shape-check that does honour `namespace` would silently regress to reasoning over different namespaces between the gate and the ACL block. Now `is_default_acl_in_use(namespace)` -- calling convention is pinned in the new test. Ref: strands-labs#224 (comment) Behavior change (Thread 4): STRANDS_MESH_AUTH_MODE=<invalid> (anything other than "mtls" / "none") now raises ValueError from get_session() instead of silently falling back to "mtls" and returning None after three failed config attempts. The exception propagates out of Mesh.start, surfacing the typo with a clear stacktrace at the moment of misconfiguration. Aligns with the loud-on-misconfig posture of _float_env / _load_acl_file. The pre-existing `except Exception: return None` wrapper around `_build_config()` (session.py:437, 459, 470, 552) remains -- the reviewer flagged it as a separate follow-up, "worth a follow-up, not blocking" -- so the only NEW raising call site introduced by this commit is the one at session.py:425 (and its mirror at 521). Pin verification (AGENTS.md S0 round-trip): Pre-fix: tests/mesh/test_mesh_review_invariants_pr224.py test_acl_config_no_redundant_subclass_in_except FAILED (2 lines) test_session_no_auth_mode_value_error_swallow FAILED (2 lines) test_zenoh_config_resolve_tls_paths_one_symlink_check FAILED (2 calls) test_session_default_acl_check_passes_namespace FAILED (line 308) Post-fix: All 4 pin tests pass. CodeQL alert #251 resolves once the unused `_TESTS` global is removed; not pinned by this PR's tests because the alert IS the pin. Refs: strands-labs#195 (umbrella), strands-labs#219 (split tracker), #346 (ext task tracker)
Fix CI lint failure: ruff format --check exit 1. Pure formatter output (string-concat collapse on a single assert message). No semantic changes. ruff format --check strands_robots tests tests_integ now clean (197 files already formatted, this one brought to spec).
yinsong1986
left a comment
There was a problem hiding this comment.
Summary
This is the wire-layer foundation: builders for Zenoh's namespace / scouting / transport-caps / mTLS / ACL config, plus an integration site in session.py that consumes them. The split is well-scoped (no behavioural change to core.py, sensors.py, input.py), the env-var matrix is documented in module docstrings, and the security-critical defaults (mTLS on, multicast off, second-factor for auth_mode=none, fail-closed ACL parsing) are well-chosen. The 0o600 key-mode enforcement with documented Windows caveat is exactly right.
What's good
- Fail-closed by default.
_load_acl_filerejects missingenabled: true(literal-bool identity check), refuses symlinks underO_NOFOLLOW, bounds reads at 256 KiB, and validates per-rule/subject/policy shape so typos likeinterface:(singular) surface at parse time rather than as a silent total outage. - Loud-on-misconfig.
_float_envrejects NaN / inf with a named-env-var diagnostic,_int_envclamps with bounds,resolve_auth_mode()requires the second-factorSTRANDS_MESH_I_KNOW_THIS_IS_INSECUREfornone. Pin tests intest_env_float_finite_check.pyandtest_acl_enabled_strict_bool.pylock these in. - TOCTOU defence. The ACL identity-tuple cache (
(path, dev, ino, size, mtime_ns)) collapses the gate-check / config-build double-read into a single snapshot. - Scope discipline. 3/9 of the #195 split is genuinely independent of #220 / #221 / #222; only
session.pyis touched in a way that lands cleanly on top of PR-1.
Concerns
-
Test file
tests/mesh/test_default_acl_warning.pyreferences behaviour that does not exist in this PR. The tests assert ERROR-level"PERMISSIVE DEFAULT ACL ACTIVE UNDER MTLS"/"Mesh NOT STARTED"breadcrumbs, an INFO-downgrade onSTRANDS_MESH_ACCEPT_PERMISSIVE_ACL=1, a staticassert "except (ImportError, ValueError) as warn_exc:" in srcagainstcore.py, and aTestF17PreSessionGatethat assertsget_session()is never called when the gate refuses. Searching the source:grep -rn 'PERMISSIVE DEFAULT ACL ACTIVE\|Mesh NOT STARTED\|ACCEPT_PERMISSIVE_ACL\|except (ImportError, ValueError)' strands_robots/mesh/core.pyreturns nothing —core.pyis unchanged in this PR. This means either (a) the tests are pinning a fix that lives in another PR in the stack and shouldn't be in this one yet, or (b) the correspondingcore.pychanges were dropped during the split. Either way the test suite as shipped will fail at PR-3 isolation. (See inline ontests/mesh/test_default_acl_warning.py.) -
Auto-listener path silently swallows
_build_config()errors.session.py:445and:465useexcept Exceptionto fall back to client mode (debug-level) and then returnNone(warning-level). When_build_config()raisesFileNotFoundError(missing TLS cert) orValueError(bad env-var clamp / unknown auth_mode), the failure is indistinguishable from "port already bound" and the mesh quietly stays off — exactly the silent-fallback the surrounding R1 comment block claims to be fixing. The pre-configresolve_auth_mode()call only catches the auth-mode-typo case; everything else still gets swallowed. (Inline.) -
namespaceparameter is unused in the ACL builder chain but threaded through three function signatures (default_acl,is_default_acl_in_use,resolve_acl,acl_block).default_acldiscards it (_ = namespace); the others only forward it. The docstrings claim namespace-prefixed key-globs were broken (Zenoh quirk #4) which is exactly why ACLkey_exprsuse**/cmdinstead — i.e. the parameter is a vestige of an earlier design that should be removed before this becomes the public contract for PR-6's consumers. (Inline.) -
Lazy
import json5inside_parse_json5defeats the dependency-extra contract.pyproject.tomladdsjson5>=0.9.0,<1.0.0to the[mesh]extra. An operator who installspip install strands-robots[mesh]and setsSTRANDS_MESH_ACL_FILEwill succeed; an operator who installs the same extra but lazy-loads (e.g. development plugin paths) will hit the lazy-import on first ACL load, but the diagnostic is buried in a deferred path. Move the import to module-load withrequire_optional()(the project's documented pattern instrands_robots/utils.py) so misconfiguration surfaces atimport strands_robots.mesh._acl_configtime rather than first-use. (Inline.) -
_NON_POSIX_TLS_WARNEDglobal mutated without lock. Minor race: two concurrent session opens on Windows could both observeFalseand both log the WARNING. Worst case is a duplicate log, not a correctness bug, but the comment at line 100 explicitly invokes "explicit at the call site without needing aglobaldeclaration" as the design rationale — that misses the concurrency point. Either usethreading.Lockor anitertools.count-style atomic, or document that duplicate warnings are acceptable. (Not blocking; not worth an inline.)
Verification suggestions
# 1. Test suite isolation against PR-3 alone (depends on PR-1 fixtures):
hatch run test tests/mesh/
# Expect failures in test_default_acl_warning.py until the matching
# core.py change lands or the tests are split out of this PR.
# 2. Confirm _build_config raises (not swallows) on missing TLS files:
STRANDS_MESH_AUTH_MODE=mtls \
STRANDS_MESH_TLS_CA=/nonexistent/ca.pem \
STRANDS_MESH_TLS_CERT=/nonexistent/cert.pem \
STRANDS_MESH_TLS_KEY=/nonexistent/key.pem \
python -c "from strands_robots.mesh.session import _build_config; _build_config()"
# Currently raises FileNotFoundError directly (good) but get_session()
# auto-listener path (line 437-454) catches it as `except Exception` and
# downgrades to DEBUG — verify the operator-visible behaviour.
# 3. Static check that key-mode enforcement is POSIX-gated, not skipped
# entirely on tmpfs / overlayfs (Docker /tmp):
python -c "
import os, tempfile
from strands_robots.mesh._zenoh_config import _resolve_tls_paths
with tempfile.NamedTemporaryFile(suffix='.pem') as f:
os.chmod(f.name, 0o644)
os.environ.update(STRANDS_MESH_TLS_CA=f.name, STRANDS_MESH_TLS_CERT=f.name, STRANDS_MESH_TLS_KEY=f.name)
try: _resolve_tls_paths()
except ValueError as e: print('OK:', e)
"
# 4. Smoke that namespace-stripped key globs actually match in live
# Zenoh — the live-session test already exists at
# tests/mesh/test_zenoh_transport_security.py; verify it runs in CI
# rather than skip-on-no-zenoh.| # Must NOT raise -- the prior fix made this refuse-and-return rather than refuse-and-raise. | ||
| records = _start_with_stub_session(stub_robot, caplog) | ||
|
|
||
| # The operator-facing ERROR IS logged. |
There was a problem hiding this comment.
These tests pin behaviour that does not exist in this PR. strands_robots/mesh/core.py is not in the PR's diff, and grep against the head SHA shows no "PERMISSIVE DEFAULT ACL ACTIVE UNDER MTLS", no "Mesh NOT STARTED", no STRANDS_MESH_ACCEPT_PERMISSIVE_ACL consumer in core.py, and no except (ImportError, ValueError) as warn_exc: clause. The static-source assertion at line 160 will fail outright; test_mtls_plus_default_acl_does_not_start will fail because Mesh.start() does not currently inspect is_default_acl_in_use() at all (only session.py:_build_config emits a WARNING).
Either (a) the matching core.py changes were dropped during the #195 split — re-include them here so the tests pin a real fix per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes", or (b) move this file to whichever PR in the stack actually lands the Mesh.start() gate. Shipping pin tests for absent code makes them silent regressions waiting to revert.
| scheme = "tls" if _auth_mode == "mtls" else "tcp" | ||
| local_ep = f"{scheme}/127.0.0.1:{mesh_port}" | ||
|
|
||
| try: |
There was a problem hiding this comment.
The auto-listener try block here (and its except Exception at line 445, just below this hunk) swallows _build_config() failures — missing/unreadable TLS cert -> FileNotFoundError, bad env-var bound -> ValueError, mode 0o644 key -> ValueError — at DEBUG level, then retries the same call in client mode where the same exceptions raise again and get caught as a generic "Zenoh session open failed (client mode)" WARNING. Net effect: the mesh silently stays off on a misconfigured mTLS setup.
The surrounding R1 comment block at lines 408-432 explicitly claims this path is now loud-on-misconfig, but only the resolve_auth_mode() typo case actually surfaces — every other config-build error is indistinguishable from the benign "port already bound" case. Per AGENTS.md > Review Learnings (#86) > "Exception Clauses Must Be Narrow": narrow the except to except OSError (which covers "port already bound" via OSError: [Errno 98]) and let FileNotFoundError / ValueError from _build_config() propagate. The same fix applies to the duplicate auto-listener at _get_zenoh_session_directly.
| layer ``validate_command`` gates payload semantics. ACL is the | ||
| third line of defence and operators opt in explicitly. | ||
| """ | ||
| _ = namespace # `namespace` config does the routing isolation; ACL key_exprs do not need it |
There was a problem hiding this comment.
The namespace parameter is unused (_ = namespace here, only forwarded by the other three functions in this file). The docstring at lines 18-20 of this module already explains why: ACL key_exprs see the namespace-stripped key, so the matcher uses **/cmd regardless of namespace.
This is a public-API hygiene concern per AGENTS.md > Review Learnings (#86) > "Public API Hygiene". PR-6 is going to import acl_block(namespace) and feed it a namespace it has no business needing — locking in a vestigial parameter that future contributors will assume controls something. Drop the parameter from default_acl, resolve_acl, acl_block, and is_default_acl_in_use before this becomes the consumer-facing contract; if a future Zenoh release fixes the namespace-stripping quirk, re-introduce it then with a real consumer.
| permissive default. | ||
| """ | ||
| try: | ||
| import json5 # type: ignore[import-not-found] |
There was a problem hiding this comment.
Lazy-importing json5 defeats the [mesh] extra contract. The comment block at lines 66-73 justifies this by saying eager import would break operators "running with no ACL file ... and no mesh extra installed" — but that's already covered by the lazy import zenoh inside _build_config(), which gates the entire wire stack. If the operator has zenoh, they have the [mesh] extra, which now declares json5 as a hard dep (pyproject.toml line 65).
A cleaner pattern, and the one AGENTS.md flags as the project standard: from strands_robots.utils import require_optional and json5 = require_optional("json5", "mesh") at module top-level. That gives operators a single, named diagnostic at import strands_robots.mesh._acl_config rather than buried inside the deferred parse path. Not blocking, but the rationale comment as written contradicts the dep declaration in pyproject.toml.
| # ``audit.py:_ensure_paths``, ``_load_seq_counters``, and | ||
| # ``_acl_config.py:_load_acl_file``. | ||
| # | ||
| if not _is_posix(): |
There was a problem hiding this comment.
_NON_POSIX_TLS_WARNED is mutated without a lock. The comment at line 102 invokes "explicit at the call site without needing a global declaration" as the design rationale, but that misses the concurrency point — the dict-as-mutable-cell trick gets you out of global keyword bookkeeping, not out of GIL-vs-multiple-session-opens races. Two concurrent _resolve_tls_paths() calls on Windows can both observe False and both emit the WARNING.
Worst case is a duplicate warning, not a correctness bug, so this is a soft note rather than a blocker — but it intersects AGENTS.md > Review Learnings (#85) > "Lock ALL ... mutations". A threading.Lock (matching the discipline used by _ACL_CACHE_LOCK in _acl_config.py) or just deleting the de-dup logic and accepting the once-per-session-open WARNING would resolve.
Per umbrella tracker strands-labs#219 the four PR-2/3/4/5 slices land in parallel and must each go green standalone. test_r25_audit_symmetry.py imports modules introduced by sibling slices that are not present on this branch: from strands_robots.mesh import _zenoh_config as zc # lives on PR strands-labs#224 from strands_robots.mesh import audit as audit_mod # lives on this PR from strands_robots.mesh import core as core_mod # lives on this PR from strands_robots.mesh import security as sec # lives on PR strands-labs#223 That makes the file structurally un-passable on the audit slice in isolation, which is why call-test-lint fails mypy here with "Module strands_robots.mesh has no attribute 'security'" / "... no attribute '_zenoh_config'". The four pin sites the file describes (peer-id charset on security.py, partial-failure cleanup on core.py, O_NOFOLLOW on the seq lockfile in audit.py, and the STRANDS_MESH_TLS_KEY mode-0o600 docstring on _zenoh_config.py) span four sibling slices and only make sense as a cross-cutting integration pin once strands-labs#223, strands-labs#224, strands-labs#221 and strands-labs#225 have all merged. The natural home is PR-6 (strands-labs#225, mesh/core replay caches + override-resume + safety topic handlers) which is the integration slice that needs all the listed modules present anyway, or a follow-up issue queued behind PR-9 (strands-labs#226, docs). The audit-side pin (point 3, O_NOFOLLOW on the seq lockfile) is already covered on this PR by tests/mesh/test_audit_seq_symlink.py and tests/mesh/test_audit_psk_poison_record.py, so removing this file does not weaken regression coverage on the audit slice itself.
sundargthb
left a comment
There was a problem hiding this comment.
Wire-layer security approved.
Two non-blocking notes for the record. First, _validate_acl_shape does not warn when cert_common_names entries contain * or ?. Zenoh 1.x matches CNs literally, so a subject configured with "robot-*" silently matches nothing and produces the exact silent-total-outage failure mode the rest of this module is designed to prevent. The example template documents this for operators who read it, but the validator does not. A five-line WARNING in the cns validation block would keep the loud-on-misconfig posture consistent with the rest of the loader. Can land here or in v0.4.1.
Second, the refuse-to-start gate documented at _acl_config.py:449-451 (Mesh.start emits ERROR + refuses to start when permissive ACL is combined with auth_mode=mtls without STRANDS_MESH_ACCEPT_PERMISSIVE_ACL=1) is implemented in PR-6, not in this PR. In PR-3 alone the session.py gate at line 308-315 is WARNING-only. If PR-3 lands without PR-6, deployments running mTLS without an explicit ACL file run on the permissive default with only a log warning. The dependency should be explicit in the PR-3 description and in the v0.4.0 migration documentation. The intentional breaking changes (mTLS default, multicast off, key mode 0o600 enforcement, strands namespace prefix) all go in the right direction but require the migration guide to call each one out explicitly.
🎯 Pentest evidence for this PR (#224 — Zenoh + ACL config)2026-05-26 pentest run Findings landing in this PR
Pinned regression tests waiting for these fixes
Posture confirmed by this PR (positive control)
Three harness bugs we found while testing mtls (already fixed in pentest repo)
All three manifest as the same B-02 silent-fail symptom — worth keeping the Zenoh-1.x schema sample (e.g. Mapping for all PRs: |
Part 3 / 9 of the split of #195 — tracked by #219.
The wire-layer foundation: Zenoh built-ins gate fleet membership and topic authorization before any payload reaches the application layer.
What's in this PR
New modules:
strands_robots/mesh/_zenoh_config.py(620 LOC) — builders for transport/link/tls (mutual TLS), namespace prefix, scouting/multicast, downsampling (per-key freq cap), low_pass_filter (per-message byte cap including NIC enumeration), session-count bound,**/safety/**rate + size caps. HonoursSTRANDS_MESH_TLS_{CA,CERT,KEY},STRANDS_MESH_NAMESPACE,STRANDS_MESH_MULTICAST,STRANDS_MESH_*_RATE_HZ,STRANDS_MESH_MAX_*_BYTES,STRANDS_MESH_SAFETY_RATE_HZ.strands_robots/mesh/_acl_config.py(500 LOC) — ACL builder + JSON5-lite loader + shape validator. Permissive by default with operator remediation viaSTRANDS_MESH_ACL_FILE. Hard-rejects malformed files at parse time.O_NOFOLLOW+ bounded read on load.Modified:
strands_robots/mesh/session.py(+107/-3) — integration site that consumes the two builders and emits a WARNING when default ACL is active undermtls.examples/mesh_acl_example.json5— role-separated ACL template.pyproject.toml— addsjson5>=0.9.0,<1.0.0to the[mesh]extra (for the JSON5-lite loader's whitespace/comment subset).Reviewer focus
0o600enforcement (POSIX-only with documented Windows caveat — NTFS ACLs are operator's responsibility).low_pass_filterNIC enum, namespace-stripped key globs,enabled: truerequirement, literal CN match.Carries review fixes from #195
R7 (CodeQL duplicate import in resume test), R14 (JSON5 single-quoted strings break json.loads), R15 (ACL load O_NOFOLLOW + bounded read), R18 (default ACL drift, auth_mode=none second-factor), R21 (safety rate + size caps), R23 (wire-config NaN env-var, strict ACL bool, Windows TLS-mode warning), R24-C (TLS key mode 0o600 enforcement on POSIX), R25-d (TLS key mode docstring + Windows caveat).
Tests
2,440 LOC across 11 files: config-builder coverage, ACL shape validator, default-ACL WARNING, env-var bounds.
Stacking note
Depends only on PR-1 (#220) for shared test fixtures. CI on this branch in isolation passes once PR-1 lands. PR-6 will later import
session.startfrom this module.Landing order: PR-1 → PR-3 (parallel with PR-2/4/5) → PR-6 → PR-7 → PR-8 → PR-9. Full plan:
PR_LIST.md. Tracking: #219.