Skip to content

mesh(wire): Zenoh + ACL config builders (mTLS, downsampling, low_pass_filter) (3/9 of #195 split)#224

Open
cagataycali wants to merge 3 commits into
strands-labs:mainfrom
cagataycali:pr3/mesh-wire-zenoh-acl-config
Open

mesh(wire): Zenoh + ACL config builders (mTLS, downsampling, low_pass_filter) (3/9 of #195 split)#224
cagataycali wants to merge 3 commits into
strands-labs:mainfrom
cagataycali:pr3/mesh-wire-zenoh-acl-config

Conversation

@cagataycali
Copy link
Copy Markdown
Member

Part 3 / 9 of the split of #195 — tracked by #219.

The wire-layer foundation: Zenoh built-ins gate fleet membership and topic authorization before any payload reaches the application layer.

What's in this PR

New modules:

  • strands_robots/mesh/_zenoh_config.py (620 LOC) — builders for transport/link/tls (mutual TLS), namespace prefix, scouting/multicast, downsampling (per-key freq cap), low_pass_filter (per-message byte cap including NIC enumeration), session-count bound, **/safety/** rate + size caps. Honours STRANDS_MESH_TLS_{CA,CERT,KEY}, STRANDS_MESH_NAMESPACE, STRANDS_MESH_MULTICAST, STRANDS_MESH_*_RATE_HZ, STRANDS_MESH_MAX_*_BYTES, STRANDS_MESH_SAFETY_RATE_HZ.
  • strands_robots/mesh/_acl_config.py (500 LOC) — ACL builder + JSON5-lite loader + shape validator. Permissive by default with operator remediation via STRANDS_MESH_ACL_FILE. Hard-rejects malformed files at parse time. O_NOFOLLOW + bounded read on load.

Modified:

  • strands_robots/mesh/session.py (+107/-3) — integration site that consumes the two builders and emits a WARNING when default ACL is active under mtls.
  • examples/mesh_acl_example.json5 — role-separated ACL template.
  • pyproject.toml — adds json5>=0.9.0,<1.0.0 to the [mesh] extra (for the JSON5-lite loader's whitespace/comment subset).

Reviewer focus

  • mTLS chain handling, ACL shape validator strictness (no silent drops), JSON5 single-quote support.
  • Key file mode 0o600 enforcement (POSIX-only with documented Windows caveat — NTFS ACLs are operator's responsibility).
  • Default-ACL permissive WARNING.
  • The four documented Zenoh 1.x quirks: low_pass_filter NIC enum, namespace-stripped key globs, enabled: true requirement, literal CN match.

Carries review fixes from #195

R7 (CodeQL duplicate import in resume test), R14 (JSON5 single-quoted strings break json.loads), R15 (ACL load O_NOFOLLOW + bounded read), R18 (default ACL drift, auth_mode=none second-factor), R21 (safety rate + size caps), R23 (wire-config NaN env-var, strict ACL bool, Windows TLS-mode warning), R24-C (TLS key mode 0o600 enforcement on POSIX), R25-d (TLS key mode docstring + Windows caveat).

Tests

2,440 LOC across 11 files: config-builder coverage, ACL shape validator, default-ACL WARNING, env-var bounds.

Stacking note

Depends only on PR-1 (#220) for shared test fixtures. CI on this branch in isolation passes once PR-1 lands. PR-6 will later import session.start from this module.


Landing order: PR-1 → PR-3 (parallel with PR-2/4/5) → PR-6 → PR-7 → PR-8 → PR-9. Full plan: PR_LIST.md. Tracking: #219.

…_filter)

The wire-layer foundation: Zenoh built-ins gate fleet membership and
topic authorization before any payload reaches the application layer.

New modules:
- strands_robots/mesh/_zenoh_config.py (620 LOC) -- builders for
  transport/link/tls (mutual TLS), namespace prefix, scouting/multicast,
  downsampling (per-key freq cap), low_pass_filter (per-message byte
  cap including NIC enumeration), session-count bound, **/safety/**
  rate + size caps. Honours STRANDS_MESH_TLS_{CA,CERT,KEY} +
  STRANDS_MESH_NAMESPACE + STRANDS_MESH_MULTICAST + STRANDS_MESH_*_RATE_HZ
  + STRANDS_MESH_MAX_*_BYTES + STRANDS_MESH_SAFETY_RATE_HZ.

- strands_robots/mesh/_acl_config.py (500 LOC) -- ACL builder + JSON5-lite
  loader + shape validator. Permissive by default with operator
  remediation via STRANDS_MESH_ACL_FILE. Hard-rejects malformed files at
  parse time. O_NOFOLLOW + bounded read on load.

Modified:
- strands_robots/mesh/session.py (+107/-3) -- integration site that
  consumes the two builders and emits a WARNING when default ACL is
  active under mtls.
- examples/mesh_acl_example.json5 -- role-separated ACL template.
- pyproject.toml -- adds json5>=0.9.0,<1.0.0 to the [mesh] extra
  (for the JSON5-lite loader's whitespace/comment subset).

Carries review fixes from strands-labs#195: R7 (CodeQL duplicate import in resume
test), R14 (JSON5 single-quoted strings break json.loads), R15 (ACL
load O_NOFOLLOW + bounded read), R18 (default ACL drift, auth_mode=none
second-factor), R21 (safety rate + size caps), R23 (wire-config NaN
env-var, strict ACL bool, Windows TLS-mode warning), R24-C (TLS key
mode 0o600 enforcement on POSIX), R25-d (TLS key mode docstring +
Windows caveat).

Tests (2,440 LOC across 11 files): config-builder coverage including
the four documented Zenoh 1.x quirks (low_pass_filter NIC enum,
namespace-stripped key globs, 'enabled: true' requirement, literal
CN match), ACL shape validator, default-ACL WARNING, env-var bounds.

Tracking: strands-labs#219
Source PR: strands-labs#195
Lands after: strands-labs#220 (PR-1)
Lands before: PR-6 (which imports session.start)
Comment thread tests/mesh/test_acl_example_refs.py Fixed
Copy link
Copy Markdown
Contributor

@yinsong1986 yinsong1986 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This is the wire-layer foundation PR — config-builder modules for Zenoh transport hardening (_zenoh_config.py) and ACL (_acl_config.py), plus the session._build_config integration site. The split is well-scoped (no payload-layer logic leaked in), the prior-review fixes are pinned with regression tests, and the four documented Zenoh 1.x quirks are surfaced both in code comments and the operator template.

What's good

  • Strong fail-closed posture across the loader: literal-True identity check on enabled, _validate_acl_shape covering the typo footguns (interface: singular, missing cert_common_names, unknown rule/subject ids), _float_env rejecting NaN/inf, and _load_acl_file doing O_NOFOLLOW + bounded read.
  • The STRANDS_MESH_AUTH_MODE=none second-factor gate (STRANDS_MESH_I_KNOW_THIS_IS_INSECURE) is the right shape for a security toggle that disables the entire wire layer.
  • Test surface is meaningful — 2,440 LOC across 11 files, with explicit pin tests for prior review fixes (R7/R14/R15/R18/R21/R23/R24/R25). The test_default_acl_warning.py suite covers the dangerous mtls + permissive default ACL interaction.
  • ACL identity-tuple cache (_load_acl_cached) closes a real TOCTOU between the gate check and the wire load; the 4-entry cap keeps it from being a memory amplifier.

Concerns

  • Two except (X, Y) clauses where Y is a subclass of X in _acl_config.py (lines 380 and 471 — see inline). AGENTS.md > Review Learnings (PR #86) > 'Exception Clauses Must Be Narrow' explicitly calls this out as a bug pattern: the redundant element silently expands the catch surface.
  • session.py swallows resolve_auth_mode() ValueError twice with different semantics (see inline). The early swallow picks _auth_mode='mtls' for the local-endpoint scheme, but the immediately-following _build_config() call re-invokes resolve_auth_mode() and lets it raise. The first try/except is therefore dead for the cases the comment claims to handle (AUTH_MODE=none without I_KNOW, typo'd values).
  • Dead symlink check in _resolve_tls_paths — the loop at lines 501-507 already rejects symlinks for all three paths; the second is_symlink() on key_path at lines 547-555 cannot fire. It's not a bug, but it makes the file longer and the security claim harder to audit, and a future hardener may believe a TOCTOU is mitigated when only the easy half is.
  • Existing broad except Exception clauses around _build_config() (session.py:439, 459, 470, 552 — pre-existing, not in this hunk) silently downgrade what may be legitimate ACL shape errors / TLS misconfig errors to WARNING + return None. This contradicts the rest of the module's loud-on-misconfig posture. Worth a follow-up, not blocking.
  • _NON_POSIX_TLS_WARNED['v'] is mutated without a lock under the multi-threaded session-open path. Effectively benign (worst case is two log lines), but the comment on lines 98-103 promises one-shot behaviour the code does not actually guarantee.

Verification suggestions

# Confirm the resolve_auth_mode swallow on a typo:
STRANDS_MESH_AUTH_MODE=foo python -c 'from strands_robots.mesh.session import _build_config; _build_config()'
# Expected: ValueError. If get_session() instead returns None silently,
# the swallow at session.py:425-427 is masking the typo.

# Confirm the dead-code claim — patch out the first symlink-reject
# loop and verify the second is_symlink() at line 547 is reached:
hatch run test tests/mesh/test_tls_key_mode_non_posix_warning.py -v

# Stress the ACL gate under mtls + permissive-shape operator file:
# (the dangerous case test_default_acl_warning.py covers).
hatch run test tests/mesh/test_default_acl_warning.py -v

Comment thread strands_robots/mesh/_acl_config.py Outdated
"""Return ``(path_str, dev, ino, size, mtime_ns)`` or None on stat err."""
try:
st = os.stat(str(path), follow_symlinks=False)
except (OSError, FileNotFoundError):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileNotFoundError is a subclass of OSError, so this tuple collapses to except OSError. AGENTS.md > Review Learnings (PR #86) > 'Exception Clauses Must Be Narrow' calls this exact pattern out: except (ImportError, Exception) is a bug because Exception swallows ImportError. Same shape here.

Drop the redundant element:

except OSError:
    return None

Same issue at line 471 (except (OSError, ValueError, FileNotFoundError) as exc).

Comment thread strands_robots/mesh/_acl_config.py Outdated
return True
try:
resolved = _load_acl_cached(Path(path_env))
except (OSError, ValueError, FileNotFoundError) as exc:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second occurrence of the OSError / FileNotFoundError tuple collapse. FileNotFoundErrorOSError — drop the trailing element so a future reader (or static analyser) is not misled into thinking the catch is narrower than it is.

except (OSError, ValueError) as exc:
    ...

Comment thread strands_robots/mesh/session.py Outdated
cfg = zenoh.Config()
_auth_mode = resolve_auth_mode()
except ValueError:
_auth_mode = "mtls"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try/except ValueError: _auth_mode = "mtls" swallow is misleading. Five lines down, _build_config() (line 432) calls resolve_auth_mode() again and lets the same ValueError propagate — so for the cases this block implies it protects against (typo'd STRANDS_MESH_AUTH_MODE, =none without I_KNOW_THIS_IS_INSECURE), the eventual outcome is still a raise from _build_config(), not the silent fallback to mtls that this block sets up. The fallback is dead for those inputs.

Either:

  1. Remove the swallow and let resolve_auth_mode() raise once at config-build time (matches the loud-on-misconfig posture of _float_env, _load_acl_file).
  2. Cache the resolved value once and pass it into _build_config() so the two call sites cannot disagree.

Same pattern at session.py:523.

Comment thread strands_robots/mesh/_zenoh_config.py Outdated
_NON_POSIX_TLS_WARNED["v"] = True
else:
key_path = paths[2]
if key_path.is_symlink():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is_symlink() check is unreachable: lines 501-507 above already rejected symlinks for all three paths (CA / cert / key) before this branch runs. The big multi-line comment at 523-531 implies the symlink-reject + lstat is doing real defence-in-depth on the key path specifically, but the symlink reject already happened in the loop above.

If the intent is defence-in-depth against a TOCTOU between line 502's is_symlink() and the Zenoh wheel's later read (which is a real residual concern — Zenoh opens the file without O_NOFOLLOW from Python's perspective), this is the wrong primitive to address it: this check fires at the same wall-clock instant as the loop above. Either:

  • Remove the dead if key_path.is_symlink(): raise ... and let the comment focus on the lstat/mode check that is actually doing work, or
  • Document the residual TOCTOU window explicitly so a future hardener does not believe the problem is fully solved.

Comment thread strands_robots/mesh/session.py Outdated
"true",
"yes",
)
if _acl_config.is_default_acl_in_use() and not accept_permissive:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_default_acl_in_use() is called without the namespace argument that was just resolved on line 273. It defaults to "strands" so this works in the common case, but if the operator sets STRANDS_MESH_NAMESPACE=fleet-a, the gate check and the actual ACL block end up reasoning over different namespaces. Pass the resolved value through:

if _acl_config.is_default_acl_in_use(namespace) and not accept_permissive:

Minor — currently the function ignores its namespace arg via shape-only inspection — but the calling convention should be consistent so a future shape-check that does care about namespace doesn't silently regress.

cagataycali and others added 2 commits May 25, 2026 22:44
…ling convention

Addresses 6 R1 review threads on PR strands-labs#224 (5 from human reviewer + 1
CodeQL alert). Each fix is surgical and pinned by an AST-based
regression test in tests/mesh/test_mesh_review_invariants_pr224.py;
pre-fix the 4 invariant tests fail at the exact line numbers, post-fix
all pass.

Threads addressed:

1. **Unused `_TESTS` global in tests/mesh/test_acl_example_refs.py.**
   CodeQL alert (security/code-scanning/251). The global was set to
   `_REPO_ROOT / "tests"` but only `_REPO_ROOT / ref` is referenced.
   Removed. The alert resolution is the pin.
   Ref: https://github.com/strands-labs/robots/security/code-scanning/251

2. **`except (OSError, FileNotFoundError):` at _acl_config.py:380.**
   `FileNotFoundError ⊂ OSError` since Python 3.3, so the redundant
   element silently expands the catch surface from a reader's
   perspective and misleads static analysers. Per AGENTS.md > Review
   Learnings (PR strands-labs#86) > "Exception Clauses Must Be Narrow." Collapsed
   to `except OSError:`.
   Ref: strands-labs#224 (comment)

3. **`except (OSError, ValueError, FileNotFoundError) as exc:` at
   _acl_config.py:471.** Same shape as strands-labs#2; `FileNotFoundError`
   element is redundant. Collapsed to `except (OSError, ValueError) as exc:`.
   Ref: strands-labs#224 (comment)

4. **Removed dead `try/except ValueError: _auth_mode = "mtls"` swallow**
   at session.py:425-427 and the duplicate at session.py:521-523.
   The reviewer correctly identified the swallow as dead: 5 lines
   later `_build_config()` calls `resolve_auth_mode()` again
   unconditionally and lets the same ValueError propagate. So the
   silent fallback to "mtls" was setting up a wire-scheme that the
   immediately-following config build would then reject — three
   confusing fallback warnings and a `return None` instead of one
   clear stacktrace at the typo'd env var.
   The fix lets `resolve_auth_mode()` raise once at the first call
   site. Inline comment documents the new posture and cites the
   precedent (`_float_env`, `_load_acl_file`).
   Ref: strands-labs#224 (comment)

5. **Removed dead `is_symlink()` re-check on `key_path`** at
   _zenoh_config.py:547-555. The CA/cert/key reject loop above (lines
   499-507) already rejected symlinks for all three TLS paths,
   including the private key. The second `is_symlink()` fired at the
   same wall-clock instant as the loop above and could never observe
   a TOCTOU window the loop didn't see first. The big multi-line
   comment that motivated the dead check has been reframed to
   explicitly document the residual TOCTOU window between this
   `lstat()` and Zenoh's eventual `open()` (Python cannot pass
   `O_NOFOLLOW` into the C-level Zenoh wheel, so the residual
   window is upstream-bound).
   Ref: strands-labs#224 (comment)

6. **`is_default_acl_in_use()` was called bare at session.py:308.**
   The function takes a `namespace` argument that the gate's call
   site already had on hand (resolved at line 273) but did not pass
   through. The function currently ignores the arg via shape-only
   inspection so the bug was latent, but a future shape-check that
   does honour `namespace` would silently regress to reasoning over
   different namespaces between the gate and the ACL block.
   Now `is_default_acl_in_use(namespace)` -- calling convention is
   pinned in the new test.
   Ref: strands-labs#224 (comment)

Behavior change (Thread 4):

  STRANDS_MESH_AUTH_MODE=<invalid> (anything other than "mtls" / "none")
  now raises ValueError from get_session() instead of silently falling
  back to "mtls" and returning None after three failed config attempts.
  The exception propagates out of Mesh.start, surfacing the typo with
  a clear stacktrace at the moment of misconfiguration. Aligns with
  the loud-on-misconfig posture of _float_env / _load_acl_file. The
  pre-existing `except Exception: return None` wrapper around
  `_build_config()` (session.py:437, 459, 470, 552) remains -- the
  reviewer flagged it as a separate follow-up, "worth a follow-up,
  not blocking" -- so the only NEW raising call site introduced by
  this commit is the one at session.py:425 (and its mirror at 521).

Pin verification (AGENTS.md S0 round-trip):

  Pre-fix:
    tests/mesh/test_mesh_review_invariants_pr224.py
      test_acl_config_no_redundant_subclass_in_except      FAILED (2 lines)
      test_session_no_auth_mode_value_error_swallow         FAILED (2 lines)
      test_zenoh_config_resolve_tls_paths_one_symlink_check FAILED (2 calls)
      test_session_default_acl_check_passes_namespace       FAILED (line 308)

  Post-fix:
    All 4 pin tests pass.

CodeQL alert #251 resolves once the unused `_TESTS` global is
removed; not pinned by this PR's tests because the alert IS the pin.

Refs: strands-labs#195 (umbrella), strands-labs#219 (split tracker), #346 (ext task tracker)
Fix CI lint failure: ruff format --check exit 1.
Pure formatter output (string-concat collapse on a single
assert message). No semantic changes.

ruff format --check strands_robots tests tests_integ now clean
(197 files already formatted, this one brought to spec).
Copy link
Copy Markdown
Contributor

@yinsong1986 yinsong1986 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This is the wire-layer foundation: builders for Zenoh's namespace / scouting / transport-caps / mTLS / ACL config, plus an integration site in session.py that consumes them. The split is well-scoped (no behavioural change to core.py, sensors.py, input.py), the env-var matrix is documented in module docstrings, and the security-critical defaults (mTLS on, multicast off, second-factor for auth_mode=none, fail-closed ACL parsing) are well-chosen. The 0o600 key-mode enforcement with documented Windows caveat is exactly right.

What's good

  • Fail-closed by default. _load_acl_file rejects missing enabled: true (literal-bool identity check), refuses symlinks under O_NOFOLLOW, bounds reads at 256 KiB, and validates per-rule/subject/policy shape so typos like interface: (singular) surface at parse time rather than as a silent total outage.
  • Loud-on-misconfig. _float_env rejects NaN / inf with a named-env-var diagnostic, _int_env clamps with bounds, resolve_auth_mode() requires the second-factor STRANDS_MESH_I_KNOW_THIS_IS_INSECURE for none. Pin tests in test_env_float_finite_check.py and test_acl_enabled_strict_bool.py lock these in.
  • TOCTOU defence. The ACL identity-tuple cache ((path, dev, ino, size, mtime_ns)) collapses the gate-check / config-build double-read into a single snapshot.
  • Scope discipline. 3/9 of the #195 split is genuinely independent of #220 / #221 / #222; only session.py is touched in a way that lands cleanly on top of PR-1.

Concerns

  1. Test file tests/mesh/test_default_acl_warning.py references behaviour that does not exist in this PR. The tests assert ERROR-level "PERMISSIVE DEFAULT ACL ACTIVE UNDER MTLS" / "Mesh NOT STARTED" breadcrumbs, an INFO-downgrade on STRANDS_MESH_ACCEPT_PERMISSIVE_ACL=1, a static assert "except (ImportError, ValueError) as warn_exc:" in src against core.py, and a TestF17PreSessionGate that asserts get_session() is never called when the gate refuses. Searching the source: grep -rn 'PERMISSIVE DEFAULT ACL ACTIVE\|Mesh NOT STARTED\|ACCEPT_PERMISSIVE_ACL\|except (ImportError, ValueError)' strands_robots/mesh/core.py returns nothing — core.py is unchanged in this PR. This means either (a) the tests are pinning a fix that lives in another PR in the stack and shouldn't be in this one yet, or (b) the corresponding core.py changes were dropped during the split. Either way the test suite as shipped will fail at PR-3 isolation. (See inline on tests/mesh/test_default_acl_warning.py.)

  2. Auto-listener path silently swallows _build_config() errors. session.py:445 and :465 use except Exception to fall back to client mode (debug-level) and then return None (warning-level). When _build_config() raises FileNotFoundError (missing TLS cert) or ValueError (bad env-var clamp / unknown auth_mode), the failure is indistinguishable from "port already bound" and the mesh quietly stays off — exactly the silent-fallback the surrounding R1 comment block claims to be fixing. The pre-config resolve_auth_mode() call only catches the auth-mode-typo case; everything else still gets swallowed. (Inline.)

  3. namespace parameter is unused in the ACL builder chain but threaded through three function signatures (default_acl, is_default_acl_in_use, resolve_acl, acl_block). default_acl discards it (_ = namespace); the others only forward it. The docstrings claim namespace-prefixed key-globs were broken (Zenoh quirk #4) which is exactly why ACL key_exprs use **/cmd instead — i.e. the parameter is a vestige of an earlier design that should be removed before this becomes the public contract for PR-6's consumers. (Inline.)

  4. Lazy import json5 inside _parse_json5 defeats the dependency-extra contract. pyproject.toml adds json5>=0.9.0,<1.0.0 to the [mesh] extra. An operator who installs pip install strands-robots[mesh] and sets STRANDS_MESH_ACL_FILE will succeed; an operator who installs the same extra but lazy-loads (e.g. development plugin paths) will hit the lazy-import on first ACL load, but the diagnostic is buried in a deferred path. Move the import to module-load with require_optional() (the project's documented pattern in strands_robots/utils.py) so misconfiguration surfaces at import strands_robots.mesh._acl_config time rather than first-use. (Inline.)

  5. _NON_POSIX_TLS_WARNED global mutated without lock. Minor race: two concurrent session opens on Windows could both observe False and both log the WARNING. Worst case is a duplicate log, not a correctness bug, but the comment at line 100 explicitly invokes "explicit at the call site without needing a global declaration" as the design rationale — that misses the concurrency point. Either use threading.Lock or an itertools.count-style atomic, or document that duplicate warnings are acceptable. (Not blocking; not worth an inline.)

Verification suggestions

# 1. Test suite isolation against PR-3 alone (depends on PR-1 fixtures):
hatch run test tests/mesh/
# Expect failures in test_default_acl_warning.py until the matching
# core.py change lands or the tests are split out of this PR.

# 2. Confirm _build_config raises (not swallows) on missing TLS files:
STRANDS_MESH_AUTH_MODE=mtls \
  STRANDS_MESH_TLS_CA=/nonexistent/ca.pem \
  STRANDS_MESH_TLS_CERT=/nonexistent/cert.pem \
  STRANDS_MESH_TLS_KEY=/nonexistent/key.pem \
  python -c "from strands_robots.mesh.session import _build_config; _build_config()"
# Currently raises FileNotFoundError directly (good) but get_session()
# auto-listener path (line 437-454) catches it as `except Exception` and
# downgrades to DEBUG — verify the operator-visible behaviour.

# 3. Static check that key-mode enforcement is POSIX-gated, not skipped
# entirely on tmpfs / overlayfs (Docker /tmp):
python -c "
import os, tempfile
from strands_robots.mesh._zenoh_config import _resolve_tls_paths
with tempfile.NamedTemporaryFile(suffix='.pem') as f:
    os.chmod(f.name, 0o644)
    os.environ.update(STRANDS_MESH_TLS_CA=f.name, STRANDS_MESH_TLS_CERT=f.name, STRANDS_MESH_TLS_KEY=f.name)
    try: _resolve_tls_paths()
    except ValueError as e: print('OK:', e)
"

# 4. Smoke that namespace-stripped key globs actually match in live
# Zenoh — the live-session test already exists at
# tests/mesh/test_zenoh_transport_security.py; verify it runs in CI
# rather than skip-on-no-zenoh.

# Must NOT raise -- the prior fix made this refuse-and-return rather than refuse-and-raise.
records = _start_with_stub_session(stub_robot, caplog)

# The operator-facing ERROR IS logged.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests pin behaviour that does not exist in this PR. strands_robots/mesh/core.py is not in the PR's diff, and grep against the head SHA shows no "PERMISSIVE DEFAULT ACL ACTIVE UNDER MTLS", no "Mesh NOT STARTED", no STRANDS_MESH_ACCEPT_PERMISSIVE_ACL consumer in core.py, and no except (ImportError, ValueError) as warn_exc: clause. The static-source assertion at line 160 will fail outright; test_mtls_plus_default_acl_does_not_start will fail because Mesh.start() does not currently inspect is_default_acl_in_use() at all (only session.py:_build_config emits a WARNING).

Either (a) the matching core.py changes were dropped during the #195 split — re-include them here so the tests pin a real fix per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes", or (b) move this file to whichever PR in the stack actually lands the Mesh.start() gate. Shipping pin tests for absent code makes them silent regressions waiting to revert.

scheme = "tls" if _auth_mode == "mtls" else "tcp"
local_ep = f"{scheme}/127.0.0.1:{mesh_port}"

try:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The auto-listener try block here (and its except Exception at line 445, just below this hunk) swallows _build_config() failures — missing/unreadable TLS cert -> FileNotFoundError, bad env-var bound -> ValueError, mode 0o644 key -> ValueError — at DEBUG level, then retries the same call in client mode where the same exceptions raise again and get caught as a generic "Zenoh session open failed (client mode)" WARNING. Net effect: the mesh silently stays off on a misconfigured mTLS setup.

The surrounding R1 comment block at lines 408-432 explicitly claims this path is now loud-on-misconfig, but only the resolve_auth_mode() typo case actually surfaces — every other config-build error is indistinguishable from the benign "port already bound" case. Per AGENTS.md > Review Learnings (#86) > "Exception Clauses Must Be Narrow": narrow the except to except OSError (which covers "port already bound" via OSError: [Errno 98]) and let FileNotFoundError / ValueError from _build_config() propagate. The same fix applies to the duplicate auto-listener at _get_zenoh_session_directly.

layer ``validate_command`` gates payload semantics. ACL is the
third line of defence and operators opt in explicitly.
"""
_ = namespace # `namespace` config does the routing isolation; ACL key_exprs do not need it
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The namespace parameter is unused (_ = namespace here, only forwarded by the other three functions in this file). The docstring at lines 18-20 of this module already explains why: ACL key_exprs see the namespace-stripped key, so the matcher uses **/cmd regardless of namespace.

This is a public-API hygiene concern per AGENTS.md > Review Learnings (#86) > "Public API Hygiene". PR-6 is going to import acl_block(namespace) and feed it a namespace it has no business needing — locking in a vestigial parameter that future contributors will assume controls something. Drop the parameter from default_acl, resolve_acl, acl_block, and is_default_acl_in_use before this becomes the consumer-facing contract; if a future Zenoh release fixes the namespace-stripping quirk, re-introduce it then with a real consumer.

permissive default.
"""
try:
import json5 # type: ignore[import-not-found]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazy-importing json5 defeats the [mesh] extra contract. The comment block at lines 66-73 justifies this by saying eager import would break operators "running with no ACL file ... and no mesh extra installed" — but that's already covered by the lazy import zenoh inside _build_config(), which gates the entire wire stack. If the operator has zenoh, they have the [mesh] extra, which now declares json5 as a hard dep (pyproject.toml line 65).

A cleaner pattern, and the one AGENTS.md flags as the project standard: from strands_robots.utils import require_optional and json5 = require_optional("json5", "mesh") at module top-level. That gives operators a single, named diagnostic at import strands_robots.mesh._acl_config rather than buried inside the deferred parse path. Not blocking, but the rationale comment as written contradicts the dep declaration in pyproject.toml.

# ``audit.py:_ensure_paths``, ``_load_seq_counters``, and
# ``_acl_config.py:_load_acl_file``.
#
if not _is_posix():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_NON_POSIX_TLS_WARNED is mutated without a lock. The comment at line 102 invokes "explicit at the call site without needing a global declaration" as the design rationale, but that misses the concurrency point — the dict-as-mutable-cell trick gets you out of global keyword bookkeeping, not out of GIL-vs-multiple-session-opens races. Two concurrent _resolve_tls_paths() calls on Windows can both observe False and both emit the WARNING.

Worst case is a duplicate warning, not a correctness bug, so this is a soft note rather than a blocker — but it intersects AGENTS.md > Review Learnings (#85) > "Lock ALL ... mutations". A threading.Lock (matching the discipline used by _ACL_CACHE_LOCK in _acl_config.py) or just deleting the de-dup logic and accepting the once-per-session-open WARNING would resolve.

@cagataycali cagataycali added security mesh Zenoh mesh networking / fleet coordination labels May 26, 2026
cagataycali added a commit to cagataycali/robots that referenced this pull request May 26, 2026
Per umbrella tracker strands-labs#219 the four PR-2/3/4/5 slices land in parallel and
must each go green standalone. test_r25_audit_symmetry.py imports modules
introduced by sibling slices that are not present on this branch:

  from strands_robots.mesh import _zenoh_config as zc   # lives on PR strands-labs#224
  from strands_robots.mesh import audit as audit_mod    # lives on this PR
  from strands_robots.mesh import core as core_mod      # lives on this PR
  from strands_robots.mesh import security as sec       # lives on PR strands-labs#223

That makes the file structurally un-passable on the audit slice in
isolation, which is why call-test-lint fails mypy here with
"Module strands_robots.mesh has no attribute 'security'" /
"... no attribute '_zenoh_config'".

The four pin sites the file describes (peer-id charset on
security.py, partial-failure cleanup on core.py, O_NOFOLLOW on the
seq lockfile in audit.py, and the STRANDS_MESH_TLS_KEY mode-0o600
docstring on _zenoh_config.py) span four sibling slices and only
make sense as a cross-cutting integration pin once strands-labs#223, strands-labs#224, strands-labs#221
and strands-labs#225 have all merged. The natural home is PR-6 (strands-labs#225, mesh/core
replay caches + override-resume + safety topic handlers) which is
the integration slice that needs all the listed modules present
anyway, or a follow-up issue queued behind PR-9 (strands-labs#226, docs).

The audit-side pin (point 3, O_NOFOLLOW on the seq lockfile) is
already covered on this PR by tests/mesh/test_audit_seq_symlink.py
and tests/mesh/test_audit_psk_poison_record.py, so removing this
file does not weaken regression coverage on the audit slice itself.
Copy link
Copy Markdown
Contributor

@sundargthb sundargthb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wire-layer security approved.

Two non-blocking notes for the record. First, _validate_acl_shape does not warn when cert_common_names entries contain * or ?. Zenoh 1.x matches CNs literally, so a subject configured with "robot-*" silently matches nothing and produces the exact silent-total-outage failure mode the rest of this module is designed to prevent. The example template documents this for operators who read it, but the validator does not. A five-line WARNING in the cns validation block would keep the loud-on-misconfig posture consistent with the rest of the loader. Can land here or in v0.4.1.

Second, the refuse-to-start gate documented at _acl_config.py:449-451 (Mesh.start emits ERROR + refuses to start when permissive ACL is combined with auth_mode=mtls without STRANDS_MESH_ACCEPT_PERMISSIVE_ACL=1) is implemented in PR-6, not in this PR. In PR-3 alone the session.py gate at line 308-315 is WARNING-only. If PR-3 lands without PR-6, deployments running mTLS without an explicit ACL file run on the permissive default with only a log warning. The dependency should be explicit in the PR-3 description and in the v0.4.0 migration documentation. The intentional breaking changes (mTLS default, multicast off, key mode 0o600 enforcement, strands namespace prefix) all go in the right direction but require the migration guide to call each one out explicitly.

@cagataycali
Copy link
Copy Markdown
Member Author

🎯 Pentest evidence for this PR (#224 — Zenoh + ACL config)

2026-05-26 pentest run cagataycali/robots-pentest@1731f62. Three rounds against this PR's surface.

Findings landing in this PR

Finding Sev What Where to fix
F-08 / B-02 HIGH (monitoring) When Zenoh session open fails (bad ACL schema, wrong listener proto, missing TLS keys), the SDK logs at WARNING but the agent + HTTP shim continue running. Container reports healthy via /health while mesh is dead. For a fleet of 100+ robots this is a 100% silent-failure surface. mesh/session.py — expose session_alive() -> bool. Add STRANDS_MESH_REQUIRED=true env that turns Zenoh open failure into hard exit.
F-10 / B-03 Medium 2000 cmds in 0.011s (175k msg/s). Receiver absorbed them but the health publisher (1 Hz) dropped to zero samples during/after the burst — heartbeat thread starved. _zenoh_config.py — lower cmd topic low_pass_filter (currently 16 KiB / 20 Hz / per-key). Or wrap _on_cmd dispatcher in Semaphore(N=64) to bound per-cmd thread spawning.
F-14 / B-08 High With MESH_AUTH=mtls + minted attacker.crt (CN=attacker, signed by same CA), the attacker still saw state=199 / presence=40 / health=10 in 20s and reached the cmd dispatch path. The shipped permissive acl.json5 (default_permission='allow' rules=allow_all_dev) is the only thing limiting what attacker.crt can do, and it allows everything. _acl_config.py — ship a deny-by-default acl_template.json5 next to the example. README points at the template. Promote STRANDS_MESH_ACCEPT_PERMISSIVE_ACL warning to ERROR or refuse to start in production by default.

Pinned regression tests waiting for these fixes

Posture confirmed by this PR (positive control)

  • F-11 / B-06 — namespace is a routing partition (different ns sees 0 traffic) — NOT auth, but a useful default. Pin a doc test.
  • F-13 — mTLS without attacker cert blocks all 4 PoCs (passive subscribe, inject, router squat, ...). ✅

Three harness bugs we found while testing mtls (already fixed in pentest repo)

  • acl.json5 had old Zenoh 0.x declare_publisher verb — boots fail with Json5Err
  • acl.json5 had policies[].permission field — not in current Zenoh schema
  • ZENOH_LISTEN=tcp/0.0.0.0:7447 hardcoded — mTLS posture needs tls/...

All three manifest as the same B-02 silent-fail symptom — worth keeping the Zenoh-1.x schema sample (e.g. examples/mesh_acl_example.json5) in sync with what the operator-side mint script generates.


Mapping for all PRs: PLAN.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mesh Zenoh mesh networking / fleet coordination security

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants