Skip to content

docs(mesh): README env-var matrix + CHANGELOG entry (9/9 of #195 split)#226

Open
cagataycali wants to merge 6 commits into
strands-labs:mainfrom
cagataycali:pr9/docs-mesh-readme-changelog
Open

docs(mesh): README env-var matrix + CHANGELOG entry (9/9 of #195 split)#226
cagataycali wants to merge 6 commits into
strands-labs:mainfrom
cagataycali:pr9/docs-mesh-readme-changelog

Conversation

@cagataycali
Copy link
Copy Markdown
Member

@cagataycali cagataycali commented May 25, 2026

Part 9 / 9 of the split of #195 — tracked by #219.

Drafted until every other PR in the split lands. The env-var matrix and CHANGELOG entry can only be accurate once the final env-var surface is known.

What's in this PR

  • README.md (+190/-1) — mesh security configuration matrix covering every env var introduced across PRs 2-8.
  • CHANGELOG.md (+219) — consolidated entry covering all 8 code PRs.

Reviewer focus

  • Env-var matrix completeness vs the actual code surface that landed in PRs 2-8.
  • CHANGELOG entry mirrors the umbrella issue's vector-by-vector closure summary.

Stacking note

Lands last. No code changes, easy to review in isolation, no test churn.


§13 Round-N changelog

R1 — 5 surgical doc-drift fixes (5 commits, plain git push, no force)

Commit File / lines Concern Fix
253ad47 README L692, L697 Threat-vector matrix verdict reads as "Mitigated" but the cell body admits the default ACL is permissive — operator scanning the table mis-reads the default-state guarantee. Split each row into a leading **Mitigated** when STRANDS_MESH_ACL_FILE is set ... / **Not mitigated by default** ... form, and surface STRANDS_MESH_ACCEPT_PERMISSIVE_ACL (CHANGELOG L141) as the actual default-state safety net.
3cabbb3 CHANGELOG L222 Said auth_mode=none logs at WARNING; README L508 + CHANGELOG matrix L139 + the _zenoh_config.py emit all say ERROR. Change to ERROR and cite the matrix row + README line for cross-doc consistency.
9893db7 README L716 User-facing prose recommended mesh.transport.bridge_transport._CommandDeduplicator (underscore-prefixed, private). Drop the private symbol; describe the dedup behaviour at the module level (fingerprints (sender_id, turn_id, command) tuples, dispatches once).
44ea5a7 CHANGELOG L159 User-facing prose named _parse_positive_float_env / _parse_positive_int_env (private helpers). Replace with a behavioural description ("internal positive-float / positive-int helpers in strands_robots.mesh.core"). Pin reference unchanged.
bcf96fd README L506 (above the _PREFIX row) STRANDS_MESH_BRIDGE_TOPICS referenced by the _PREFIX row description but had no row of its own — operator could not look up its semantics. Add a row above _PREFIX documenting format, consumer module, full default suffix set (verified against DEFAULT_BRIDGE_SUFFIXES in bridge_transport.py:88-97), the not-bridged-by-default list (state, pose, imu, odom, lidar, camera, input, hand), and the match-semantics interaction with _PREFIX.

Verification: docs-only diff; git diff origin/pr9/docs-mesh-readme-changelog~5 --stat = README.md +8/-4, CHANGELOG.md +8/-8. CI was green on a64bf03 and re-runs on bcf96fd. No test surface change. Each of the 5 review threads on this PR has a corresponding commit; reviewer can resolve each thread against the linked SHA.


Landing order: PR-1 → PR-2/3/4/5 → PR-6 → PR-7 → PR-8 → PR-9. Full plan: PR_LIST.md. Tracking: #219.

Final PR in the split of strands-labs#195. Lands last because the env-var matrix
and CHANGELOG entry can only be accurate once every code PR has
landed and the final env-var surface is known.

- README.md (+190/-1) -- mesh security configuration matrix:
  STRANDS_MESH_TLS_{CA,CERT,KEY}, STRANDS_MESH_ACL_FILE,
  STRANDS_MESH_NAMESPACE, STRANDS_MESH_MULTICAST,
  STRANDS_MESH_CMD_RATE_HZ, STRANDS_MESH_MAX_{CMD,CAMERA,SAFETY}_BYTES,
  STRANDS_MESH_MAX_SESSIONS, STRANDS_MESH_SAFETY_RATE_HZ,
  STRANDS_MESH_AUDIT_PSK, STRANDS_MESH_AUTH_MODE,
  STRANDS_MESH_I_KNOW_THIS_IS_INSECURE,
  STRANDS_MESH_RESUME_*, STRANDS_MESH_BRIDGE_TOPICS{,_PREFIX,_DEDUP_STRICT},
  STRANDS_MESH_POLICY_HOST_ALLOW, STRANDS_MESH_HF_REPO_ALLOW,
  STRANDS_MESH_POLICY_TYPE_ALLOW, STRANDS_MESH_CAMERA_OFFLOAD_DISABLE.

- CHANGELOG.md (+219) -- consolidated entry covering all 8 code PRs.

Tracking: strands-labs#219
Source PR: strands-labs#195
Lands after: every other PR in the split
Copy link
Copy Markdown
Contributor

@yinsong1986 yinsong1986 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Docs-only PR (CHANGELOG + README) capturing the mesh security hardening landed across PRs 2-8 of the #195 split. The matrix is dense and well-organised; the threat-vector table is the kind of thing operators actually need. As the PR description acknowledges, this is the last slice and depends on the other 8 to be in main before the documented surface is real.

What's good

  • Env-var matrix covers nearly every new var in the dependent branches (AUTH_MODE, TLS_*, ACL_FILE, MAX_*, RATE_HZ, RESUME_*, CA_PINS, etc.). Cross-referenced against the source on mesh/zenoh-native-security — the new-var coverage matches the code surface.
  • Migration block (CHANGELOG L180-217) is concrete: enumerates removed env vars and shows the new policy_provider requirement with before/after code.
  • Defaults in the table match the source (e.g. MAX_SESSIONS=256, CMD_RATE_HZ=20.0, RESUME_FRESHNESS_S=60) — verified against mesh/_zenoh_config.py and mesh/core.py on the author's full branch.
  • Captures the four Zenoh 1.x quirks empirically (CHANGELOG L57-84) — useful institutional memory.

Concerns

  • PR is not actually marked Draft. The description says "Drafted until every other PR in the split lands" but the GitHub state is draft: false. None of the dependent PRs (#220-#225, #227, #228) are merged yet. Either flip to draft or update the description so reviewers don't assume this is mergeable in isolation.
  • Pre-existing env vars omitted from the matrix. The README is intended as a comprehensive matrix but a few mesh env vars currently in main aren't listed: STRANDS_MESH_BACKEND (used in mesh/core.py, mesh/session.py, mesh/transport/factory.py), STRANDS_MESH_CAMERA_S3_BUCKET, STRANDS_MESH_CAMERA_S3_PREFIX, plus the per-sensor frequency tunables (STRANDS_MESH_CAMERA_HZ, _HAND_HZ, _HEALTH_HZ, _IMU_HZ, _LIDAR_SUMMARY_HZ, _MAP_INFO_HZ, _ODOM_HZ, _POSE_HZ). Out of scope for the security hardening PR per se, but worth a follow-up issue if the matrix is meant to be exhaustive.
  • STRANDS_MESH_BRIDGE_TOPICS is referenced inside another row's description (L506) but has no row of its own — readers can't look up what it does or its default.
  • Inconsistent log-level claim for auth_mode=none. Three places, two different levels: README L508 and CHANGELOG L139 say ERROR; CHANGELOG L222 says WARNING. Pick one (the source on mesh/_zenoh_config.py actually emits ERROR, so CHANGELOG L222 is the wrong one).
  • "839 mesh tests pass" / specific test counts (CHANGELOG L167-178) will become stale fast and there is no CI gate enforcing the number. Consider switching to "the mesh test suite (tests/mesh/) passes; see CI for the live count."
  • No CI gate for env-var matrix drift. Right now nothing notices when a new STRANDS_MESH_* lands without a README row. A simple test that greps the source and the README and asserts the README is a superset would prevent the matrix from going stale on the next PR. Worth a follow-up issue (tests/test_envvar_matrix_completeness.py).

Verification suggestions

After PRs 2-8 merge into main and this PR is rebased, before the human re-reviews:

# Every STRANDS_MESH_* in source has a row in the README:
grep -hoE 'STRANDS_MESH_[A-Z0-9_]+' strands_robots/**/*.py | sort -u > /tmp/source_vars
grep -oE 'STRANDS_MESH_[A-Z0-9_]+' README.md | sort -u > /tmp/doc_vars
comm -23 /tmp/source_vars /tmp/doc_vars   # should be empty

# Defaults in the README match the source:
grep -E 'STRANDS_MESH_(MAX_SESSIONS|CMD_RATE_HZ|RESUME_FRESHNESS_S|MAX_CMD_BYTES|MAX_CAMERA_BYTES)' \
  strands_robots/mesh/_zenoh_config.py strands_robots/mesh/core.py

# Markdown table renders cleanly:
npx markdownlint README.md CHANGELOG.md

Comment thread README.md Outdated
|---|---|
| LAN outsider, no cert | **Mitigated.** TLS handshake rejects the connection. |
| Cert from a different CA | **Mitigated.** Our CA bundle does not verify it; handshake fails. |
| Valid cert, CN does not match any ACL rule | **Mitigated** when `STRANDS_MESH_ACL_FILE` is set with literal CN allowlists. With the built-in default ACL (`any_authenticated_peer`, intentionally permissive to avoid silent fleet outage on first run), any CA-signed peer is admitted to all key-expressions. Operators are expected to ship a literal-CN ACL file in production. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Threat-vector verdict misleading for the default state. This row (and L697) reads as "Mitigated" but the body of the cell admits the default ACL is permissive — any_authenticated_peer admits any CA-signed peer to all key-expressions until the operator ships a literal-CN STRANDS_MESH_ACL_FILE. An operator scanning the table sees "Mitigated" and assumes a default install is safe against rogue-CN peers; it isn't.

Suggest splitting the verdict so the default state is honest:

| Valid cert, CN does not match any ACL rule | **Mitigated** when `STRANDS_MESH_ACL_FILE` is set with literal CN allowlists. **Not mitigated by default** — the built-in permissive ACL admits any CA-signed peer to every key-expression. |

Apply the same wording fix to L697. Note that STRANDS_MESH_ACCEPT_PERMISSIVE_ACL (CHANGELOG L141) requires explicit opt-in to start under the permissive default — that gate is worth surfacing in this row as the actual default-state safety net.

Comment thread CHANGELOG.md Outdated
---

Dev / lab environments without PKI run `STRANDS_MESH_AUTH_MODE=none` AND `STRANDS_MESH_I_KNOW_THIS_IS_INSECURE=1`
to keep plain-TCP behaviour; the mesh logs a WARNING at session
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent log level for auth_mode=none. This line says "the mesh logs a WARNING at session open", but README L508 ("Logs ERROR (not WARNING) on every session open") and the CHANGELOG matrix L139 ("ERROR-level log at every session open") both say ERROR — and STRANDS_MESH_I_KNOW_THIS_IS_INSECURE is explicitly designed to be loud. Source on the author's full branch (mesh/_zenoh_config.py) emits ERROR. This line should say ERROR for consistency.

Comment thread README.md Outdated

Bridge-transport deployments still need cross-transport deduplication
because the same Zenoh + IoT MQTT topic can deliver the same payload
twice. `mesh.transport.bridge_transport._CommandDeduplicator`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underscore-prefixed class referenced in user-facing docs. _CommandDeduplicator is a private name (leading underscore) and AGENTS.md > Review Learnings (PR #86) > "Public API Hygiene" rules out recommending _methods in user-facing docstrings or error messages because it locks in a private dependency contract.

Either (a) drop the class name and describe the behaviour without it, or (b) promote it to a public name (CommandDeduplicator) before merge so the doc reference is on the public API surface.

Comment thread CHANGELOG.md Outdated
| `STRANDS_MESH_RESUME_REPLAY_CACHE_MAX` | `4096` | Maximum entries in the per-receiver resume proof_nonce replay cache (LRU eviction) |

All three resume-hardening env vars are parsed via
``_parse_positive_float_env`` / ``_parse_positive_int_env`` helpers in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Underscore-prefixed helpers in user-facing CHANGELOG. Same AGENTS.md (PR #86) hygiene rule as on README L716: _parse_positive_float_env / _parse_positive_int_env are private. Either rephrase without naming them ("parsed via internal helpers in strands_robots.mesh.core") or promote them to public names if operators are expected to know about them.

Comment thread README.md
| `STRANDS_MESH_AUDIT_DIR` | Directory for the safety audit log (`mesh_audit.jsonl`) and sequence-counter sidecar (`mesh_audit.seq.json`) | `~/.strands_robots/` |
| `STRANDS_MESH_AUDIT_MAX_BYTES` | Maximum size (bytes) of the active audit log before rotation. Hard-capped at 10 GiB. Phase-4 / E1: prevents an attacker who can publish safety events from filling the disk. | `104857600` (100 MiB) |
| `STRANDS_MESH_AUDIT_MAX_FILES` | Maximum number of rotated audit log copies kept (`mesh_audit.jsonl.1` … `.N`). Hard-capped at 100. Older rotations are discarded. | `5` |
| `STRANDS_MESH_BRIDGE_TOPICS_PREFIX` | Comma-separated list of bridge filter entries that match as path-prefix (entry matches `entry/<anything>`). Default: `response` (so `response/<turn-id>` bridges). All other entries in `STRANDS_MESH_BRIDGE_TOPICS` match exactly — Phase-4 / A2 hardening that closes the prefix-bypass attack. | `response` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STRANDS_MESH_BRIDGE_TOPICS referenced but has no row in the matrix. This row's description says "All other entries in STRANDS_MESH_BRIDGE_TOPICS match exactly" but STRANDS_MESH_BRIDGE_TOPICS itself isn't a documented row, so a reader can't look up its semantics, format, or default. Add a row for it (it's a real env var read by mesh.transport.bridge_transport) above this _PREFIX row.

strands-coder added 5 commits May 26, 2026 01:39
…L state

Two rows in the threat-vector matrix (rogue-CN and unknown-CN) read as
"Mitigated" but the body of each cell admits the default ACL is
permissive. An operator scanning the table sees "Mitigated" and
assumes a default install is safe against rogue-CN peers; it isn't
unless STRANDS_MESH_ACL_FILE is shipped or STRANDS_MESH_ACCEPT_PERMISSIVE_ACL
is the explicit opt-in.

Reword each row with a leading "**Mitigated** when ..." / "**Not
mitigated by default** ..." split and surface STRANDS_MESH_ACCEPT_PERMISSIVE_ACL
as the actual default-state safety net (already documented in CHANGELOG L141).

Addresses review feedback on README L692 / L697.
The CHANGELOG line described auth_mode=none as logging a WARNING at
session open, but README L508 and the env-var matrix at CHANGELOG L139
both correctly say ERROR, and STRANDS_MESH_I_KNOW_THIS_IS_INSECURE is
explicitly designed to be loud (the actual emit at session open is at
ERROR level).

Single-source the level so an operator cross-checking the three docs
gets a consistent answer.

Addresses review feedback on CHANGELOG L222.
…duplicator

Per the public-API hygiene rule (don't recommend underscore-prefixed
names in user-facing docs because it locks the project into a private
dependency contract), describe the dedup behaviour at the module level
instead of pointing at the private class name. The behaviour the
operator actually cares about — fingerprint on the (sender_id, turn_id,
command) tuple, dispatch once, forward to the application-layer
handler — is unchanged.

If a future PR promotes the class to a public name we can re-cite it
by symbol; until then, the module-level reference is the correct level
of granularity for an operator-facing README.

Addresses review feedback on README L716.
…lpers

Same public-API hygiene rule as on README L716: don't reference
underscore-prefixed helpers (_parse_positive_float_env /
_parse_positive_int_env) in user-facing CHANGELOG text because it
locks the project into a private contract.

Replace with a behavioural description at the module level — operators
care that malformed values log a WARNING and fall back to the documented
default, not which private helper does the parsing. Pin reference is
unchanged.

Addresses review feedback on CHANGELOG L159.
The neighbouring _PREFIX row references STRANDS_MESH_BRIDGE_TOPICS in
its description ("All other entries in STRANDS_MESH_BRIDGE_TOPICS
match exactly") but the parent var has no row of its own, so a reader
trying to look up its semantics, format, or default has nowhere to land.

Add a row above the _PREFIX entry documenting:
- The format (comma-separated suffix allowlist).
- The consumer (mesh.transport.bridge_transport).
- The default suffix set verbatim (presence, health, safety/event,
  safety/estop, cmd, response, broadcast — verified against
  DEFAULT_BRIDGE_SUFFIXES in bridge_transport.py).
- The not-bridged-by-default list (state, pose, imu, odom, lidar,
  camera, input, hand) so operators understand the cost-math choice
  before they extend the var.
- The match-semantics interaction with _PREFIX.

Addresses review feedback on README L506.
Copy link
Copy Markdown
Contributor

@yinsong1986 yinsong1986 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Docs-only PR that backfills the README env-var matrix and a consolidated CHANGELOG entry covering the 8 mesh-security PRs (#195 split). +411 / -1 across README.md and CHANGELOG.md. Lands last per the stack plan; on upstream/main the referenced code surface (mesh/security.py, mesh/_zenoh_config.py, mesh/_acl_config.py, examples/mesh_acl_example.json5, every new tests/mesh/test_*.py, every new STRANDS_MESH_* security env var) does not yet exist, so the docs are necessarily aspirational at the moment of review.

What's good

  • Scope is tight: two doc files, no code churn, no test churn, no host paths added.
  • The 5-commit R1 round addresses real review threads on the same PR (default-ACL verdict honesty, log-level alignment with the underlying emitter, dropping _CommandDeduplicator from user prose, replacing _parse_positive_* with behavioural prose, adding the missing STRANDS_MESH_BRIDGE_TOPICS row). Each commit cites the line and the fix is surgical.
  • No emojis added by this PR (verified U+1F000–U+1FFFF + U+FE0F sweep over + lines). Pre-existing emojis in the README are out of scope.
  • The threat-vector matrix split into Mitigated / Not mitigated by default rows is exactly the right honesty for an operator scanning the table.

Concerns

  1. Docs land before the code they document. The PR description acknowledges this ("Drafted until every other PR in the split lands") but the PR is in draft: false. Either re-flag as draft until 1–8 land, or add a banner row at the top of the new "Unreleased – mesh security hardening" CHANGELOG section pointing to the umbrella issue (#219) so a reader landing on the CHANGELOG via git blame knows the symbols won't resolve in the working tree at this commit. The current entry reads as if the work is complete and merged.
  2. CHANGELOG "New env vars" table is not actually complete. Reviewer focus #1 is "matrix completeness vs the actual code surface". The README env-var matrix (L503–534) lists ~27 mesh security vars; the CHANGELOG "New env vars" table at L136–155 lists 16. Vars present in README but missing from the CHANGELOG table: STRANDS_MESH_BRIDGE_TOPICS, STRANDS_MESH_BRIDGE_TOPICS_PREFIX, STRANDS_MESH_AUDIT_MAX_BYTES, STRANDS_MESH_AUDIT_MAX_FILES, STRANDS_MESH_DEDUP_TTL, STRANDS_MESH_CAMERA_PRESIGN_TTL, STRANDS_MESH_POLICY_HOST_ALLOW, STRANDS_MESH_AUDIT_PSK, STRANDS_MESH_OVERRIDE_CODE, STRANDS_MESH_HF_REPO_ALLOW, STRANDS_MESH_POLICY_TYPE_ALLOW, STRANDS_MESH_CA_PINS, STRANDS_MESH_DISABLE_CA_PIN. Several of these are mentioned in the CHANGELOG narrative but only the ones in the table will be picked up by anyone grep'ing the CHANGELOG for an env-var name.
  3. R1 private-symbol sweep was incomplete. Commit 9893db7 correctly dropped _CommandDeduplicator from user-facing prose, but several other _-prefixed symbols still appear in the README and CHANGELOG (see inline). AGENTS.md Review Learnings (#86) > Public API Hygiene is explicit: "Never recommend a _method in user-facing docstrings or error messages."
  4. Hardcoded README line numbers in CHANGELOG. CHANGELOG L222–223 reference README L508 and matrix at L139. Any future edit above those points silently breaks the cross-reference and the CHANGELOG turns into archeology bait.

Verification suggestions

# Cross-check the CHANGELOG matrix against the README matrix:
comm -23 \
  <(grep -oE 'STRANDS_MESH_[A-Z_]+' README.md | sort -u) \
  <(awk '/^### New env vars/,/^### Tests/' CHANGELOG.md | grep -oE 'STRANDS_MESH_[A-Z_]+' | sort -u)
# Anything emitted is a var documented in the README but missing from
# the CHANGELOG "New env vars" table.

# Find any other private symbols that slipped into user-facing docs:
grep -nE '`[A-Za-z_.]*\._[a-z_]+' README.md CHANGELOG.md

# When the underlying code lands, sanity-check every env var in the
# README matrix actually has an os.getenv call somewhere:
for v in $(grep -oE 'STRANDS_MESH_[A-Z_]+' README.md | sort -u); do
  rg -q "\"$v\"" strands_robots/ || echo "undocumented in code: $v"
done

Comment thread README.md
| `STRANDS_MESH_RESUME_REPLAY_CACHE_MAX` | Maximum number of `proof_nonce` values remembered in the per-receiver replay cache. Bounded LRU eviction prevents memory exhaustion from high-volume resume attempts. | `4096` |
| `STRANDS_MESH_DEDUP_TTL` | Bridge-transport deduplication window (seconds). Caps how long the same envelope nonce is remembered across the Zenoh + IoT subscriber wrappers. | `120` |
| `STRANDS_MESH_CAMERA_PRESIGN_TTL` | Lifetime (seconds) of presigned S3 GET URLs published in `/camera/.../ref` messages. Capped at 3600. | `60` |
| `STRANDS_MESH_CAMERA_DISABLED` | Set to `true` to disable camera publishing entirely (privacy kill switch). `Mesh._publish_cameras_once` short-circuits before any frame is built. | `false` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private method (_publish_cameras_once) named in a user-facing env-var description. Per AGENTS.md > Review Learnings (#86) > Public API Hygiene ("Never recommend a _method in user-facing docstrings or error messages"). The R1 round already pulled _CommandDeduplicator out via commit 9893db7; that sweep missed several siblings:

  • README.md:530 and CHANGELOG.md:152Mesh._publish_cameras_once
  • README.md:697 and CHANGELOG.md:94Mesh._on_response
  • README.md:621mesh._acl_config.default_acl()
  • README.md:724-725mesh/_zenoh_config.py, mesh/_acl_config.py (modules whose names start with _ are by-convention private; the public surface should be mesh.security/mesh.config or whatever the eventual public re-export is)
  • CHANGELOG.md:97_resume_lockout

Suggest the same treatment as _CommandDeduplicator: describe behaviour at module level (e.g. "the camera publisher short-circuits before any frame is built"), cite the public module path, and let git blame carry the implementation pin. Otherwise the very next refactor that renames any of these private symbols silently invalidates user-facing docs.

Comment thread CHANGELOG.md
| `STRANDS_MESH_CAMERA_DISABLED` | `false` | Privacy kill switch -- when `true`, `Mesh._publish_cameras_once` short-circuits before any frame is built |
| `STRANDS_MESH_FILTER_INTERFACES` | unset (wildcard) | Optional comma-separated NIC allowlist for the `low_pass_filter` rules. Unset means "every link" (Zenoh's `SubjectProperty::Wildcard`). |
| `STRANDS_MESH_RESUME_FRESHNESS_S` | `60` | Maximum age (seconds) of a resume envelope before rejection as stale |
| `STRANDS_MESH_RESUME_FORWARD_SKEW_S` | `5` | Maximum forward clock skew (seconds) tolerated in resume envelope timestamps |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewer focus #1 ("env-var matrix completeness vs the actual code surface") looks not-yet-met: the CHANGELOG "New env vars" table here lists ~16 vars, but the README matrix (L503–534) documents ~27 mesh security vars. Missing from this table but present in the README:

STRANDS_MESH_BRIDGE_TOPICS
STRANDS_MESH_BRIDGE_TOPICS_PREFIX
STRANDS_MESH_AUDIT_MAX_BYTES
STRANDS_MESH_AUDIT_MAX_FILES
STRANDS_MESH_DEDUP_TTL
STRANDS_MESH_CAMERA_PRESIGN_TTL
STRANDS_MESH_POLICY_HOST_ALLOW
STRANDS_MESH_AUDIT_PSK
STRANDS_MESH_OVERRIDE_CODE
STRANDS_MESH_HF_REPO_ALLOW
STRANDS_MESH_POLICY_TYPE_ALLOW
STRANDS_MESH_CA_PINS
STRANDS_MESH_DISABLE_CA_PIN

Several are mentioned in earlier CHANGELOG sections in prose, but the "New env vars" table is the discoverable surface (an operator running grep STRANDS_MESH_AUDIT_PSK CHANGELOG.md to find what behavior it controls only finds the audit-section bullet at L103, not a dedicated row). Either widen this table to mirror the README, or add a sentence here pointing at the README matrix as the canonical source and explaining why this table is a curated subset.

Comment thread CHANGELOG.md

Dev / lab environments without PKI run `STRANDS_MESH_AUTH_MODE=none` AND `STRANDS_MESH_I_KNOW_THIS_IS_INSECURE=1`
to keep plain-TCP behaviour; the mesh logs an ERROR at session
open (matches README L508 and the env-var matrix at L139 — `STRANDS_MESH_I_KNOW_THIS_IS_INSECURE` is explicitly designed to be loud).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded line numbers (README L508, matrix at L139) are brittle citations — any future README edit above L508 silently breaks the reference and the CHANGELOG becomes misleading on the very next docs PR. Same problem with L139 pointing at this CHANGELOG file. Cite by section name instead, e.g. "matches the Mesh security → Authentication: mTLS (default) section in the README and the STRANDS_MESH_I_KNOW_THIS_IS_INSECURE row in this changelog's New env vars table."

Comment thread README.md
| `STRANDS_MESH_AUDIT_DIR` | Directory for the safety audit log (`mesh_audit.jsonl`) and sequence-counter sidecar (`mesh_audit.seq.json`) | `~/.strands_robots/` |
| `STRANDS_MESH_AUDIT_MAX_BYTES` | Maximum size (bytes) of the active audit log before rotation. Hard-capped at 10 GiB. Phase-4 / E1: prevents an attacker who can publish safety events from filling the disk. | `104857600` (100 MiB) |
| `STRANDS_MESH_AUDIT_MAX_FILES` | Maximum number of rotated audit log copies kept (`mesh_audit.jsonl.1` … `.N`). Hard-capped at 100. Older rotations are discarded. | `5` |
| `STRANDS_MESH_BRIDGE_TOPICS` | Comma-separated allowlist of key-expression suffixes the Zenoh / AWS IoT bridge transport forwards across the cloud boundary. Read by `mesh.transport.bridge_transport`. Empty / unset uses the default suffix set (`presence,health,safety/event,safety/estop,cmd,response,broadcast`); high-volume telemetry (`state`, `pose`, `imu`, `odom`, `lidar`, `camera`, `input`, `hand`) is **not** bridged by default — set this var explicitly to opt in. Match semantics are exact for every entry except those listed in `STRANDS_MESH_BRIDGE_TOPICS_PREFIX`. | `presence,health,safety/event,safety/estop,cmd,response,broadcast` |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value here (presence,health,safety/event,safety/estop,cmd,response,broadcast) and the not-bridged-by-default list are documented as facts, but on upstream/main at this commit no code reads STRANDS_MESH_BRIDGE_TOPICS with these defaults yet (grep -r DEFAULT_BRIDGE_SUFFIXES strands_robots/ returns nothing). That's expected for PR 9/9, but it means whichever predecessor PR introduces the constant has to ship with these exact defaults or this row goes stale silently. Suggest pinning the verification: at land-time, add a unit test in PR 7-or-8 that asserts DEFAULT_BRIDGE_SUFFIXES == ("presence", "health", "safety/event", "safety/estop", "cmd", "response", "broadcast") so a future edit to the tuple has to either update the README or fail CI. Currently the only thing tying these two sources together is reviewer attention.

Comment thread CHANGELOG.md
positive-float / positive-int helpers in ``strands_robots.mesh.core``.
Malformed values (non-numeric or negative) log a WARNING and fall
back to the documented default instead of failing module import or
silently disabling the replay cache. Pinned by
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citation pin: tests/mesh/test_resume_env_validation.py doesn't exist on upstream/main at this commit (and neither do the 11 other test files named in the Tests section above). Re-flagging because PR 9/9 lands last — if any of those tests get renamed/restructured during the PR-1–8 review cycle, the CHANGELOG ends up citing a path that never exists anywhere in git log. Suggest a final grep -F -f <test_paths_from_changelog> $(git ls-files tests/mesh/) sanity check immediately before merge, or sketch the citations as tests/mesh/test_resume_env_validation.py (or successor) so the reader knows the path is the intent rather than a guarantee.

@cagataycali cagataycali added documentation Improvements or additions to documentation mesh Zenoh mesh networking / fleet coordination labels May 26, 2026
cagataycali added a commit to cagataycali/robots that referenced this pull request May 26, 2026
Per umbrella tracker strands-labs#219 the four PR-2/3/4/5 slices land in parallel and
must each go green standalone. test_r25_audit_symmetry.py imports modules
introduced by sibling slices that are not present on this branch:

  from strands_robots.mesh import _zenoh_config as zc   # lives on PR strands-labs#224
  from strands_robots.mesh import audit as audit_mod    # lives on this PR
  from strands_robots.mesh import core as core_mod      # lives on this PR
  from strands_robots.mesh import security as sec       # lives on PR strands-labs#223

That makes the file structurally un-passable on the audit slice in
isolation, which is why call-test-lint fails mypy here with
"Module strands_robots.mesh has no attribute 'security'" /
"... no attribute '_zenoh_config'".

The four pin sites the file describes (peer-id charset on
security.py, partial-failure cleanup on core.py, O_NOFOLLOW on the
seq lockfile in audit.py, and the STRANDS_MESH_TLS_KEY mode-0o600
docstring on _zenoh_config.py) span four sibling slices and only
make sense as a cross-cutting integration pin once strands-labs#223, strands-labs#224, strands-labs#221
and strands-labs#225 have all merged. The natural home is PR-6 (strands-labs#225, mesh/core
replay caches + override-resume + safety topic handlers) which is
the integration slice that needs all the listed modules present
anyway, or a follow-up issue queued behind PR-9 (strands-labs#226, docs).

The audit-side pin (point 3, O_NOFOLLOW on the seq lockfile) is
already covered on this PR by tests/mesh/test_audit_seq_symlink.py
and tests/mesh/test_audit_psk_poison_record.py, so removing this
file does not weaken regression coverage on the audit slice itself.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation mesh Zenoh mesh networking / fleet coordination

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants