mesh(transport): bridge cross-transport dedup + monotonic TTL + strict mode (5/9 of #195 split) by cagataycali · Pull Request #222 · strands-labs/robots

cagataycali · 2026-05-25T20:37:24Z

Part 5 / 9 of the split of #195 — tracked by #219.

Bridge dedup correctness across LAN-Zenoh and AWS-IoT transports.

What changes

time.monotonic for TTL eviction (not wall clock — survives clock changes / DST / NTP slew).
Exact-match topic filter via STRANDS_MESH_BRIDGE_TOPICS.
Opt-in prefix-walk via STRANDS_MESH_BRIDGE_TOPICS_PREFIX (back-compat for the old behaviour).
Opt-in strict mode via STRANDS_MESH_BRIDGE_DEDUP_STRICT.
Dedup identity is the canonical RPC triple (sender_id, turn_id, command) hashed by full SHA-256 (no birthday-attack truncation). Callers must not reuse the triple for distinct deliveries; partial canonical payloads fall through to pass-through (default) or full-payload hash (strict).
Adds safety/resume to DEFAULT_BRIDGE_SUFFIXES (paired with safety/estop so the cloud audit timeline can close incident windows). Called out separately because it's an additional bridged topic, not strictly part of the dedup/TTL fix the title advertises.
Narrows seven bare except Exception sites to the documented transport-failure surface tuples (R3 hygiene fix; see §13 row R3-c).

What's in this PR

strands_robots/mesh/transport/bridge_transport.py (+267/-25 baseline; R1+R2+R3 add ~95 LOC of correctness/wiring/hygiene deltas).
strands_robots/mesh/transport/base.py, factory.py, iot_transport.py — minor signature touch-ups.
3 test files (325 LOC of new + modified at baseline; +9 pin tests across R1+R2+R3).

Reviewer focus

TTL clock-source correctness.
Exact-match vs prefix-walk env-var split (back-compat path is opt-in via the _PREFIX variant).
Canonical-tuple identity contract (no partial-canonical aliasing).
Behaviour under transport restart.
Strict-mode reachability from BridgeTransport (R1).
Docstring/code consistency on dedup identity (R2).
Narrow exception surface in transport teardown / IO paths (R3).

Carries review fixes from #195

R12 (TTL math used time.time(), fixed to time.monotonic), R15 (cross-transport dedup opt-in strict mode), R20 (estop replay cache key narrowed to t-only; reject empty peer_id — bridge-side counterpart).

Stacking note

Self-contained subsystem: transport layer is decoupled from core.py via the Transport base class. Depends only on PR-1 (#220). PR-6 will later consume the new bridge contract.

Landing order: PR-1 → PR-5 (parallel with PR-2/3/4) → PR-6 → PR-7 → PR-8 → PR-9. Full plan: PR_LIST.md. Tracking: #219.

§13 Review Round Changelog

Round	Concern	Fix Commit	Pin Test
R1	`STRANDS_MESH_BRIDGE_DEDUP_STRICT` env var advertised but never wired to `_CommandDeduplicator`	`ae4f35c`	`tests/mesh/test_bridge_dedup.py::TestStrictEnvVarWiringR1` (3 cases)
R2	(a) Docstring/code mismatch on "envelope nonce" — class docstring + `declare_subscriber` + test module docstring all promised nonce-based identity; only canonical tuple is read. Resolved by aligning docstrings to actual semantics (option B); nonces are caller-controllable and the canonical tuple is the defensive choice. (b) `_should_bridge` docstring prose mid-sentence-truncated. (c) `safety/resume` added to `DEFAULT_BRIDGE_SUFFIXES` without rationale in the comment block above.	`0230c1b`	`tests/mesh/test_bridge_dedup.py::TestStrictModeIntegrationR2` (2 cases — closes the bridge-layer integration coverage gap for envelope-shaped payloads in default vs strict mode)
R3-a	Partial-canonical false-dedup. `_dedup_id` took the canonical path on any non-None field, so `{sender_id: a}` aliased against `{sender_id: a, extra: 1}`. Fix: canonical path requires all three fields present; partial canonical payloads fall through to pass-through (default) or full-payload-hash (strict). Class docstring strengthened to state the intentional contract (callers must not reuse `(sender, turn, cmd)` for distinct deliveries; `turn_id` monotonic per-sender).	`00ca08e`	`tests/mesh/test_bridge_dedup.py::TestCommandDeduplicator::test_partial_canonical_does_not_alias` and `::test_partial_canonical_strict_mode_uses_full_payload` (2 cases — first fails on pre-fix code; second pins strict-mode fallback shape).
R3-b	Vacuous TTL / clear / first-call tests. `test_first_call_not_duplicate`, `test_ttl_expiry`, and `test_clear` used pass-through payloads (nonce-only) so `_dedup_id` returned `None` and the assertions held trivially regardless of TTL math or `clear()` correctness.	`00ca08e`	All three tests rewritten in place with canonical-tuple payloads so the eviction and clear paths are actually exercised.
R3-c	Seven bare `except Exception` sites in `bridge_transport.py` (handle teardown, connect/close, put, declare_subscriber) violated AGENTS.md > Review Learnings > 'Exception Clauses Must Be Narrow'.	`db192ce`	`tests/mesh/test_bridge_dedup.py::TestNarrowExceptionsR3::test_no_bare_except_exception_in_bridge_transport` (source-grep regression pin; same shape as the R12 `test_no_time_dot_time_in_dedup_path` pin).
R4 (CI)	`db192ce` did not run `hatch run format` over the file before push, so `ruff format --check` flagged 2 trivial whitespace deviations in `tests/mesh/test_bridge_dedup.py` (a multiline assertion message that fits on one line under the 120-char project line-length, plus 2 blank-line spacing nits between test classes). This was the sole `call-test-lint` failure on R3 — `ruff check` and `pytest` were already green.	`cf45728`	Same `hatch run lint` step that failed on `db192ce` is the implicit pin; passes locally (`ruff check` + `ruff format --check` + `mypy` clean). 28/28 tests in `test_bridge_dedup.py` still pass. No semantic change.

Disposition of R3-batch follow-up review (R4 review round, 11:37 UTC)

These are the post-R3 review concerns. Adopting AGENTS.md tenet 8 (round budget = 3, currently at R4 with R5 in flight), each is dispositioned in place rather than re-rolled into a new commit:

Concern (line ref)	Disposition	Rationale
`bridge_transport.py:358` — canonical fingerprint hashes only `(sender, turn, cmd)`, contract fragile if callers reuse `turn_id`	Already addressed in R3-a. The class docstring now explicitly states the contract ("callers must not reuse `(sender, turn, cmd)` for distinct deliveries; `turn_id` monotonic per-sender"). The §13 R3-a row spells this out; the regression pin `test_partial_canonical_does_not_alias` enforces the strict-mode fallback.	Recursion: same concern raised twice in two rounds. The R3-a fix already widened the docstring contract. Additional widening of the hash itself would change the dedup semantics this PR establishes — out of scope.
`bridge_transport.py` no-line — partial-canonical aliases	Already addressed in R3-a (commit `00ca08e`). Pin: `test_partial_canonical_does_not_alias`.	Same concern as the R3-a row; surfaced again in batch review.
`test_bridge_dedup.py` no-line — vacuous TTL / clear tests	Already addressed in R3-b (commit `00ca08e`). Three tests rewritten with canonical-tuple payloads.	Same concern as the R3-b row.
`bridge_transport.py:382` — GC follow-up tracking issue should exist before merge	Already filed: #231. Linked from R3-c row in §13.	Concern is satisfied — issue exists on the project board.
`bridge_transport.py:598` — dedup cache key uses subscription `key_expr`, not delivered `sample.key_expr`; wildcard subscriptions could alias across distinct delivered topics	Deferred — filed as #232. Reviewer concedes "in the typical case this is harmless" because `sender_id` is part of the canonical hash. The wildcard-aliasing case requires either an operator misconfiguring `sender_id` or peer impersonation — both out of the threat model this PR closes.	Wildcard-aliasing is a separate hardening concern from the dedup-correctness scope this PR advertises. Mixing it in would expand R5 into R6.
`bridge_transport.py:355` — `default=str` makes canonical fingerprint non-deterministic for objects with default `__str__` (includes memory address)	Deferred — filed as #233. Reviewer concedes "the bytes/datetime cases are unlikely" and the failure mode is silent in production rather than incorrect. The mesh `command` field is JSON-RPC-shaped per the upstream contract.	Same scope-creep argument as the wildcard case.
`bridge_transport.py:379` — sort-and-slice runs inside dedup lock; lock-hold-time is a separate axis from #231's algorithmic concern	Folded into #231. Will update the issue body to call out lock-granularity as a second axis.	Reviewer marked "not a blocker for this PR".
`bridge_transport.py:99` — safety-topic dedup behaviour change deserves louder log on duplicate-drop	Deferred (separate concern). Reviewer concedes "matches the documented contract". The PR established the contract; tightening operator visibility on safety-topic drops is a separate operator-UX concern that would expand scope.	Operator-UX concern out of dedup-correctness scope.
`bridge_transport.py:225` — `..` rejection only checks tail, head segment unchecked; comment overstates guard	Deferred. Reviewer concedes "For Zenoh keys this is fine (the syntax doesn't interpret `..`)". Minor doc/code alignment; will fold into the PR-3 ACL hardening (which has the broader path-traversal scope).	Cross-PR sequencing: PR-3 (#224) is the natural home for this.

Tracked-as-follow-up (not in this PR)

GC perf concern at _CommandDeduplicator.is_duplicate() (algorithm + lock granularity) — perf(mesh/transport): bound _CommandDeduplicator GC cost under sustained pressure #231.
Wildcard-key dedup aliasing (R4 review batch, line 598) — mesh(transport): bridge dedup cache key uses subscription expr, not delivered topic (wildcard aliasing) #232.
default=str non-determinism on objects without __str__ override (R4 review batch, line 355) — mesh(transport): _dedup_id default=str makes canonical fingerprint non-deterministic for objects with default __str__ #233.
Safety-topic duplicate-drop visibility (R4 review batch, line 99) — operator-UX concern, deferred to safety-handler hardening pass.
Path-traversal head-segment guard (R4 review batch, line 225) — folded into PR-3 (mesh(wire): Zenoh + ACL config builders (mTLS, downsampling, low_pass_filter) (3/9 of #195 split) #224) ACL hardening scope.

…t mode Bridge dedup correctness across LAN-Zenoh and AWS-IoT transports: - time.monotonic for TTL eviction (not wall clock -- survives clock changes / DST / NTP slew) - exact-match topic filter via STRANDS_MESH_BRIDGE_TOPICS - opt-in prefix-walk via STRANDS_MESH_BRIDGE_TOPICS_PREFIX (back-compat for the old behaviour) - opt-in strict mode via STRANDS_MESH_BRIDGE_DEDUP_STRICT - dedup hash collision resistance (full SHA-256 of normalized payload, not truncated) Self-contained subsystem: transport layer is decoupled from core.py via the Transport base class. Lands before PR-6 (core) consumes the new bridge contract. Modified: - strands_robots/mesh/transport/bridge_transport.py (+267/-25) - strands_robots/mesh/transport/base.py (+2/-2) - strands_robots/mesh/transport/factory.py (+2/-2) - strands_robots/mesh/transport/iot_transport.py (+14/-14) Carries review fixes from strands-labs#195: R12 (TTL math used time.time(), fixed to time.monotonic), R15 (cross-transport dedup opt-in strict mode), R20 (estop replay cache key narrowed to t-only; reject empty peer_id -- this PR carries the bridge-side counterpart). Tests (325 LOC): bridge dedup with monotonic clock, exact-match vs prefix-walk, strict-mode behaviour, transport restart resilience. Tracking: strands-labs#219 Source PR: strands-labs#195 Lands after: strands-labs#220 (PR-1) Lands before: PR-6 (core consumes the bridge contract)

yinsong1986

Summary

This PR hardens the bridge transport in three areas: TTL math now uses time.monotonic (so wall-clock skew / NTP / suspend-resume don't break the eviction window); the _should_bridge filter is split into exact-match (default) vs explicit prefix-match (opt-in via STRANDS_MESH_BRIDGE_TOPICS_PREFIX), closing a cloud-pollution amplification on cmd / safety/event; and a new _CommandDeduplicator collapses duplicates that arrive via both Zenoh and MQTT in bridge mode. The dedup uses a full SHA-256 (no birthday-attack truncation) over (sender_id, turn_id, command).

The direction is right and the regression-pin tests for R12 (monotonic clock) and R15 (strict mode behaviour at the deduplicator level) are good practice per AGENTS.md > Review Learnings (#85) > "Pin regression tests for reviewed fixes." The _should_bridge tightening is straightforward and well-reasoned.

That said, there are two material correctness gaps and one documentation drift below — see inline. The biggest is that STRANDS_MESH_BRIDGE_DEDUP_STRICT is advertised in the PR description as an opt-in but isn't actually wired anywhere in the diff: BridgeTransport.__init__ constructs _CommandDeduplicator() with no kwargs, so the env var is a no-op even though R15 is listed as a carried-over fix.

What's good

TTL switch from time.time to time.monotonic is correct and pinned by two regression tests (test_dedup_uses_monotonic_clock, test_no_time_dot_time_in_dedup_path) that fail on pre-fix code.
Full SHA-256 (no truncation) for the content fingerprint — good defensive choice.
_should_bridge exact/prefix split is the right shape and the threat model is documented inline.
Cache key is (topic_key, dedup_id) so coincidentally-matching ids on different topics don't alias.
Cache-size guard (_MAX_DEDUP_ENTRIES) plus stale-eviction-then-oldest-20% drop is a reasonable bounded-memory strategy.
_resolve_dedup_ttl warns on ValueError per AGENTS.md (#86) > "Warn on unrecognized values".

Concerns

Advertised env var not wired. PR description lists STRANDS_MESH_BRIDGE_DEDUP_STRICT as an opt-in; no occurrence in the diff or the rest of the tree. Either the wiring was dropped or the description is wrong. Either way needs reconciling before merge — see inline on BridgeTransport.__init__.
Docstring claims an identity that the code doesn't compute. Several user-visible docstrings say identity is the "envelope nonce when present, otherwise content fingerprint". _dedup_id only inspects sender_id / turn_id / command — the literal key nonce is never read. AGENTS.md > Review Learnings (#86) > "Match docstrings to semantics".
Test names oversell what they cover. test_first_call_not_duplicate and several integration tests pass payloads shaped {"nonce": "...", "payload": {...}}, but in default (non-strict) mode those payloads have no canonical fields, so _dedup_id returns None and the call passes through unconditionally. The assertion holds trivially regardless of dedup behaviour. The legitimate dedup behaviour is only exercised by the canonical-tuple integration tests.
Docstring prose for _should_bridge is mid-sentence-truncated ("The exact / prefix split closes the cloud-pollution attack / The pre-fix attack: ..."). Reads like a merge artefact; small but worth a re-flow before merge.
Scope creep, minor: safety/resume was added to DEFAULT_BRIDGE_SUFFIXES without explanation in the "Why this default" comment block immediately above. The PR title is about dedup + TTL + strict mode; adding a new bridged topic kind is a separate concern that future readers will trip over.

Verification suggestions

hatch run test tests/mesh/test_bridge_dedup.py tests/mesh/test_bridge_transport.py -v

# Sanity-check the strict env var claim:
grep -rn STRANDS_MESH_BRIDGE_DEDUP_STRICT strands_robots/ tests/
# (expected to return zero hits as of d1df5145)

# Manual smoke for the docstring/code mismatch:
python -c "
from strands_robots.mesh.transport.bridge_transport import _CommandDeduplicator
d = _CommandDeduplicator(ttl_s=10.0)
p = {'nonce': 'abc1234567890def', 'payload': {'sender_id': 'a'}}
print('dup1:', d.is_duplicate('k', p))
print('dup2:', d.is_duplicate('k', p))  # docstring implies True; actually False
"

yinsong1986 · 2026-05-25T20:50:32Z

+                    ordered = sorted(self._seen.items(), key=lambda kv: kv[1])
+                    drop = max(1, len(ordered) // 5)
+                    for k, _ in ordered[:drop]:
+                        self._seen.pop(k, None)


Bounded-memory GC strategy is sound but pathological at the boundary. Once len(self._seen) > _MAX_DEDUP_ENTRIES (10000), every subsequent call holds self._lock, walks the full dict to build stale, and then in the worst case sorts the entire dict (O(n log n)) to drop the oldest 20% — under sustained pressure this happens on every subsequent is_duplicate() call until enough entries expire. At the 50 Hz cmd rates this codebase typically operates at this is unlikely to be hot, but the bridge is also used for lower-rate but burstier topics (response, broadcast); a single burst above the cap will throttle every thread that calls into the bridge.

Not a blocker, but two cheap mitigations worth considering:

Hysteresis: only run the GC when len > _MAX * 1.1 (or some band) so the steady-state cost amortises.

Replace the sort-and-slice with heapq.nsmallest(drop, self._seen.items(), key=...) which is O(n log drop) instead of O(n log n).

Either way, the GC path deserves a logger.debug line so operators can tell when it kicks in.

…nv var (addresses thread L407) Add _resolve_dedup_strict() that reads the env var and threads it into BridgeTransport.__init__ -> _CommandDeduplicator(strict=...). Previously the env var was advertised in the PR description (carried-over fix R15) but never wired -- strict mode was unreachable from the bridge. Pin test: tests/mesh/test_bridge_dedup.py::TestStrictEnvVarWiringR1

…antics + safety/resume rationale + strict-mode integration pin Addresses three review threads on PR strands-labs#222: (1) Docstring/code mismatch on "envelope nonce". The class docstring, declare_subscriber() docstring, and test module docstring all promised nonce-based identity; _dedup_id() only ever inspects (sender_id, turn_id, command). The literal key "nonce" is never read. Reconciled by striking every "envelope nonce" reference and stating the actual semantics: SHA-256 over the canonical tuple, with strict-mode opt-in for full-payload-hash fallback. (option B in the reviewer's two-path suggestion -- the canonical tuple is the only intended path, since nonces are attacker-controllable opaque keys and dedup based on actual command identity is more defensive.) (2) Truncated docstring prose for _should_bridge: "The exact / prefix split closes the cloud-pollution attack / The pre-fix attack: ..." re-flowed to a single coherent paragraph. Adds the rationale for why "response" is the sole prefix-walk default. (3) safety/resume rationale gap: added a one-liner to the Why-this-default comment block above DEFAULT_BRIDGE_SUFFIXES to explain the pairing with safety/estop and the audit-timeline justification. Plus a new pin TestStrictModeIntegrationR2 (2 tests) that closes the coverage gap the reviewer flagged: every prior integration test in TestBridgeDedupIntegration drove canonical-tuple payloads, so strict-mode behaviour was unverified end-to-end at the bridge layer. The new tests drive an envelope-shaped payload through the bridge's Zenoh+IoT fanout under both default mode (asserts pass-through, x2 delivery) and strict mode (asserts dedup, x1 delivery). Test module + nearby comments updated to use "canonical tuple" terminology consistently with the implementation. GC pathological-boundary concern (thread on L358) is explicitly deferred this round -- reviewer marked it 'Not a blocker'. Tracked for a follow-up PR so this PR stays focused on dedup + TTL + strict mode wiring; see PR description \xc2\xa713 round-2 changelog. Tests: hatch run test tests/mesh/test_bridge_dedup.py tests/mesh/test_bridge_transport.py -> 68 passed (was 66; +2 new R2 pins)

yinsong1986

Summary

This PR delivers the bulk of what its title promises: time.monotonic() for TTL math, an exact-match / opt-in-prefix env-var split that closes the cloud-pollution attack on cmd / safety/event, opt-in strict mode for non-canonical payloads, and a full-SHA-256 canonical fingerprint. The deduplicator is thread-safe, the bridge degrades gracefully when one side is down, and the §13 changelog cleanly documents the R1/R2 review-round fixes plus their pin tests. Scope is well-disciplined for a 5/9 split.

My main concerns are (1) a real false-dedup hazard in the canonical fingerprint that doesn't match what the PR description / class docstring advertise, and (2) several pin-test assertions that are vacuously true on the pre-fix code path. Details inline.

What's good

time.monotonic() migration is clean and the TestMonotonicClockR12 source-grep regression pin is the right shape (it would catch a regression even if test logic drifts).
Exact-match vs prefix-walk split is well-motivated in the comment block above _DEFAULT_BRIDGE_PREFIX_SUFFIXES; the .. traversal guard at _should_bridge line 223 is appropriate defence-in-depth.
R1 wiring fix (_resolve_dedup_strict() -> constructor) is pinned by three explicit cases including invalid-value warning behaviour — matches the AGENTS.md "warn on unrecognized env values" learning from #86.
R2 integration tests (TestStrictModeIntegrationR2) close the bridge-layer coverage gap correctly: default-mode pass-through and strict-mode dedup are both pinned end-to-end through the fanout.
Per-topic cache key (topic_key, dedup_id) correctly isolates topic buckets; test_dedup_resets_per_topic pins it.

Concerns

Canonical-tuple fingerprint conflates 'same RPC' with 'identical message'. The PR description says "full SHA-256 of normalized payload" but the implementation hashes only (sender_id, turn_id, command) — any two messages that share that triple but differ in other top-level fields (timestamps, audit metadata, signed envelope, future fields) collapse to one delivered call. For RPC where turn_id is monotonic per-sender this may be safe today, but the docstring / PR description need to match the code, OR the implementation needs to widen. See inline.
Partial-canonical payloads false-dedup. A payload with only sender_id set (or only turn_id) takes the canonical path with the other two as None and aliases against every other partial-canonical from the same sender. Inline.
Several deduplicator unit tests pass vacuously. test_ttl_expiry, test_clear, test_first_call_not_duplicate all use payloads with no canonical fields, which take the pass-through path — the assertions hold regardless of TTL / clear() correctness. The R2 integration tests cover this at the bridge layer in strict mode, but the unit-level pins for the deduplicator's TTL eviction path are weaker than they look. Inline.
Bare except Exception appears in seven places in bridge_transport.py (handle teardown, connect/close, put, declare_subscriber). AGENTS.md > Review Learnings (#86) > 'Exception Clauses Must Be Narrow' is explicit that this is forbidden for non-recovery code paths. The narrow-except pattern at line 552 inside _filtered is the correct shape; the surrounding sites should match. Inline.
GC perf concern is acknowledged but untracked. The PR description marks the O(n log n) sort-and-slice as "deferred to follow-up" but I don't see a linked issue. AGENTS.md says "the project board is the source of truth" — please file the follow-up before merge so it doesn't get lost.
Docstring drift in iot_transport.py:204-214. The whitespace-collapse in the comment block (e.g. # (a) -> # (a)) reads like it lost a column of indentation. Not a blocker, but the previous shape was clearer; if this was intentional formatting, fine.

Verification suggestions

# Confirm the strict-mode env var actually reaches the bridge under realistic shape:
pytest tests/mesh/test_bridge_dedup.py::TestStrictEnvVarWiringR1 -v
pytest tests/mesh/test_bridge_dedup.py::TestStrictModeIntegrationR2 -v

# Spot-check that the unit-level TTL test would actually fail on a buggy TTL impl:
# (it currently uses a pass-through payload so it can't — see inline comment).

# Confirm the prefix-walk closure for the cloud-pollution attack:
pytest tests/mesh/test_bridge_transport.py -v -k 'should_bridge or filter'

yinsong1986 · 2026-05-26T07:34:23Z

+            default=str,
+        ).encode("utf-8")
+        # Full 256-bit (64 hex chars) -- no birthday-attack truncation.
+        return "f:" + hashlib.sha256(canonical).hexdigest()


Canonical fingerprint hashes only (sender, turn, cmd), not the full payload. The PR description says the dedup id is a "full SHA-256 of normalized payload" and the class docstring on line 283-288 says "Identity is a SHA-256 fingerprint over the canonical (sender_id, turn_id, command) tuple" — those two statements are inconsistent, and the code matches the second.

Real-world impact: two messages with the same (sender_id, turn_id, command) but different other fields (timestamps, audit metadata, payload args, future-added envelope fields) will be falsely deduped. If turn_id is strictly monotonic per-sender today this is safe in practice, but the contract is fragile — any future caller that reuses a turn_id (retries with mutated args, batching, etc.) silently drops messages.

Either widen the canonical hash to cover the whole payload (matching the PR description) or tighten the docstring + PR description to say "identity is the canonical RPC triple, callers must not reuse (sender, turn, cmd) for distinct deliveries." Pick one and pin a regression test that would fail if a future contributor flipped the choice.

yinsong1986 · 2026-05-26T07:34:23Z

+        turn = payload.get("turn_id")
+        cmd = payload.get("command")
+
+        if sender is None and turn is None and cmd is None:


Partial-canonical payloads alias. This branches on sender is None AND turn is None AND cmd is None. A payload with only sender_id set (no turn_id, no command) falls through to line 332 and hashes {"sender": "alice", "turn": null, "cmd": null} — every other partial-canonical payload from the same sender hashes to the exact same value and gets deduped against it.

Pin test: d.is_duplicate("k", {"sender_id": "a"}); d.is_duplicate("k", {"sender_id": "a", "extra": 1}) — both currently dedup as the same message under the canonical path. That's almost certainly not intended.

Suggest: require all three fields present to take the canonical path; if any one is missing, fall through to strict-mode logic (pass-through in default, full-payload-hash in strict). Add a regression test.

yinsong1986 · 2026-05-26T07:34:23Z

+        payload = {"nonce": "abcdef0123456789"}
+        assert d.is_duplicate("k", payload) is False
+        time.sleep(0.1)
+        assert d.is_duplicate("k", payload) is False  # expired -> re-accepted


This TTL test passes vacuously. Payload {"nonce": "abcdef0123456789"} has no canonical fields, so _dedup_id() returns None (pass-through path) and is_duplicate() returns False regardless of TTL math — the time.sleep(0.1) is decorative. The same applies to test_clear (line 101-106) and test_first_call_not_duplicate (line 44-47): all three exercise the pass-through path, not the TTL eviction / clear paths they claim to pin.

The R2 integration tests cover envelope-shaped payloads in strict mode end-to-end, so the bridge layer is OK, but the unit-level _CommandDeduplicator TTL pin is missing. Suggest: rewrite these three tests with a canonical-tuple payload (e.g. {"sender_id": "a", "turn_id": "t1", "command": {"action": "x"}}) so the assertions actually exercise the eviction / clear paths. AGENTS.md > Review Learnings (#85) > 'Pin regression tests for reviewed fixes' applies — a future regression in TTL math would not be caught by the current pin.

yinsong1986 · 2026-05-26T07:34:24Z

            try:
-                zenoh_sub = self._zenoh.declare_subscriber(key_expr, handler)
+                zenoh_sub = self._zenoh.declare_subscriber(key_expr, make_dedup_handler("zenoh"))
            except Exception as exc:


Bare except Exception here and in seven other sites (lines 400, 466, 470, 513, 520, 575, 581) violates AGENTS.md > Review Learnings (#86) > 'Exception Clauses Must Be Narrow': "except Exception is forbidden for non-recovery code paths. Use the smallest superset of expected exception types."

Note that the equivalent narrow-except inside _filtered at line 552 (except (AttributeError, UnicodeDecodeError, json.JSONDecodeError)) is already the right shape — and the comment on line 553-558 even cites the AGENTS.md rule. The cleanup paths just need the same treatment. Suggested narrowing per site:

sub.undeclare() -> except (RuntimeError, AttributeError)

self._zenoh.close() / self._iot.close() -> the transport's documented close exceptions

self._zenoh.put() / self._iot.put() -> except (RuntimeError, ConnectionError, OSError)

declare_subscriber (this site + line 581) -> same as put

Or if the underlying transports don't document their exception surface, define a TransportError base in base.py and catch that.

yinsong1986 · 2026-05-26T07:34:24Z

+                    ordered = sorted(self._seen.items(), key=lambda kv: kv[1])
+                    drop = max(1, len(ordered) // 5)
+                    for k, _ in ordered[:drop]:
+                        self._seen.pop(k, None)


GC follow-up should be filed before merge. The PR description acknowledges this O(n log n) sort-and-slice as "deferred to follow-up" but I can't find a tracking issue on the project board. AGENTS.md is explicit: "ALWAYS use the project board to track work. ... Never track work only in local markdown — the board is the source of truth."

Please open the follow-up issue (hysteresis band on the GC trigger, heapq.nsmallest replacement, debug log, tests_integ/ microbenchmark) before this PR merges so the deferred concern doesn't get lost in the next refactor. Linking it from the §13 changelog row keeps the audit trail intact.

…-tuple contract + un-vacuous TTL/clear pins Addresses three review-feedback concerns on the bridge dedup R3 round: 1. Partial-canonical false-dedup. The previous _dedup_id took the canonical path whenever any one of (sender_id, turn_id, command) was non-None and serialised the missing fields as null. Two partial payloads from the same sender (e.g. {sender_id: a} and {sender_id: a, extra: 1}) hashed to the same value and silently deduped against each other. Fix: require all three fields present for the canonical path; partial canonical payloads fall through to pass-through (default) or full-payload-hash (strict). 2. Canonical-tuple contract drift. The class docstring said identity is the canonical RPC triple but the prose did not state the implied contract (callers must not reuse the triple for distinct deliveries and turn_id is monotonic per-sender). Strengthened the docstring so future contributors do not silently widen the dedup id without filing an interface change. 3. Vacuous TTL/clear/first-call tests. test_first_call_not_duplicate, test_ttl_expiry, and test_clear all used pass-through payloads (nonce-only, no canonical fields) so _dedup_id returned None and the assertions held trivially regardless of TTL math or clear() correctness. Rewrote the three tests with canonical-tuple payloads so the eviction and clear paths are actually exercised. Pin tests: - TestCommandDeduplicator.test_partial_canonical_does_not_alias asserts default-mode pass-through for partial canonical payloads; fails on pre-fix code where _dedup_id returned the same f: hash for {sender_id: a} and {sender_id: a, extra: 1}. - TestCommandDeduplicator.test_partial_canonical_strict_mode_uses_full_payload asserts strict-mode falls back to full-payload hash for partial canonical, so two partial payloads with extra fields do not alias. - The rewritten test_first_call_not_duplicate, test_ttl_expiry, and test_clear exercise the canonical-tuple identity path so a future regression in TTL math or clear() would be caught. GC perf concern (R2/R3 carried-over) tracked separately by strands-labs#231 per project-board-as-source-of-truth requirement. AI Disclosure: This change was authored by an autonomous AI agent (strands-agents). Tests verified to pass locally (27/27 in tests/mesh/test_bridge_dedup.py) and the partial-canonical pin verified to fail on pre-fix code via inline reproduction.

…s in bridge_transport Addresses the R3 review-feedback concern that bridge_transport.py had seven bare `except Exception` sites (handle teardown, connect/close, put, declare_subscriber) violating AGENTS.md > Review Learnings > 'Exception Clauses Must Be Narrow' (forbidden for non-recovery code paths). Each site was narrowed to the documented transport-failure surface: - _BridgeSubHandle.undeclare() -> (RuntimeError, AttributeError, OSError) - BridgeTransport.close() (zenoh+iot) -> (RuntimeError, ConnectionError, OSError) - BridgeTransport.put() (zenoh+iot) -> (RuntimeError, ConnectionError, OSError) - declare_subscriber (zenoh+iot) -> (RuntimeError, ConnectionError, OSError) The narrow tuples follow the same shape as the existing _filtered handler at line ~552 (which already used (AttributeError, UnicodeDecodeError, json.JSONDecodeError) for payload decode and was the right model). The classification rationale is documented inline at each site: - RuntimeError covers already-closed sessions, undeclared handles, and invalid-state errors raised by the zenoh / awsiot APIs. - ConnectionError covers broker drops mid-call (IoT) and Zenoh peer disconnects mid-publish. - OSError covers socket-level write/teardown races. - AttributeError (teardown only) covers mock handles and partial-init states surfaced during unit tests. Genuine bugs (TypeError, ValueError, KeyError, etc.) now propagate instead of being silently logged at debug level. Pin test: TestNarrowExceptionsR3.test_no_bare_except_exception_in_bridge_transport is a source-grep regression pin (same shape as the existing test_no_time_dot_time_in_dedup_path R12 pin). Fails if any future change reintroduces `except Exception` in bridge_transport.py. Includes a descriptive error message pointing at the AGENTS.md rule and the documented narrow tuples to use. All 28 tests in tests/mesh/test_bridge_dedup.py pass; full mesh suite (468 tests / 2 skipped) is green. AI Disclosure: This change was authored by an autonomous AI agent (strands-agents).

yinsong1986

Summary

PR-5/9 of #195's split. Lands four roughly-orthogonal changes on the bridge transport:

_CommandDeduplicator for cross-transport dedup of duplicate Zenoh+MQTT deliveries, keyed by canonical (sender_id, turn_id, command) SHA-256.
TTL eviction switches from time.time() to time.monotonic() (R12 carry-over).
_should_bridge tightens prefix-walk to exact-match plus an opt-in STRANDS_MESH_BRIDGE_TOPICS_PREFIX allow-list (good cloud-pollution mitigation).
Seven except Exception sites narrowed to (RuntimeError, ConnectionError, OSError) / (RuntimeError, AttributeError, OSError) per AGENTS.md hygiene.

The §13 changelog is unusually thorough (R1/R2/R3 each pinned with a regression test that fails on pre-fix code) and the test file actually exercises the canonical-tuple path in TTL/clear/first-call assertions, fixing the vacuous-test class flagged in R3-b. Scope discipline is good — this is genuinely a transport-layer change with no core.py reach-through.

Main concerns are correctness-leaning rather than blocking:

The dedup cache keys on the subscription key_expr (often a wildcard) rather than the delivered sample.key_expr, which can alias unrelated topics under a single wildcard subscription.
json.dumps(..., default=str) on the canonical command makes the fingerprint non-deterministic for command payloads containing types like bytes, numpy scalars, or objects without stable __str__ — they may dedup or fail to dedup based on object identity.
The over-budget GC path in is_duplicate() runs sort-and-slice inside the dedup lock (already tracked as #231 — flagging the lock-hold-time, not the algorithm).
Safety topics (safety/estop, safety/resume) now silently drop within the 120s TTL window if the canonical triple repeats; that's a behaviour change worth a higher log level than debug.
_should_bridge's default change is a behaviour break for any operator who relied on the old loose prefix-walk for non-response topics. No CHANGELOG / startup warning calls this out.

What's good

time.monotonic() switch is the right call and is pinned by test_no_time_dot_time_in_dedup_path (good source-grep regression pattern, same shape as the R3-c except-clause grep).
Exact-match vs prefix-match split is a legitimate hardening — the cloud-pollution attack surface (10 KiB blob on strands/x/safety/event/<blob> ending up in the DDB audit table) is real.
_dedup_id requiring all three canonical fields rather than "any non-None" closes the partial-aliasing bug, and the regression test (test_partial_canonical_does_not_alias) is a clean fail-on-pre-fix pin.
Three vacuous tests (test_first_call_not_duplicate, test_ttl_expiry, test_clear) rewritten with canonical payloads so the assertions actually exercise the dedup path. Self-review caught what review otherwise wouldn't.
Narrow exception clauses cite AGENTS.md > Review Learnings inline and are pinned by test_no_bare_except_exception_in_bridge_transport.
_resolve_dedup_strict warns on unrecognised env-var values (per STRANDS_ROBOT_MODE precedent in #86 review-learnings).

Concerns

Behaviour break for existing operators. The _should_bridge tightening drops any non-response tail-bearing topic that operators may have been bridging via the old loose prefix walk. There is no CHANGELOG entry, no startup logger.warning listing dropped topics, and the STRANDS_MESH_BRIDGE_TOPICS_PREFIX opt-in is documented only in a comment. If this lands as written, an operator who upgrades and finds their custom cmd/<sub> topic stops bridging will only learn why by reading the diff. Consider either (a) emitting a startup log enumerating what's exact-matched vs prefix-matched, or (b) a CHANGELOG entry citing the env-var migration path.
Dedup caches a per-BridgeTransport instance, not per-subscription, but the TTL is fixed at construction. STRANDS_MESH_DEDUP_TTL is read once at _CommandDeduplicator.__init__ time (line 298), so re-reading the env var at runtime does nothing. Acceptable, but worth a one-line note in the docstring next to the env-var reference.
#231 follow-up is reasonable but the lock-hold-time aspect deserves an inline TODO, since a 10k-entry sort-and-slice inside a threading.Lock will block every other is_duplicate call on the bridge — at the rates implied (cmd/response per turn) this is fine; at heartbeat/presence rates it isn't.
No transport-restart test. The PR description's reviewer focus mentions "behaviour under transport restart" but I don't see a test that exercises connect() → close() → connect() and confirms the dedup cache survives (or doesn't, deliberately) across the cycle. The current _dedup instance lives on self, so it persists across close(), which is probably correct but should be pinned.

Verification suggestions

# Run the new dedup suite
hatch run test tests/mesh/test_bridge_dedup.py -v

# Confirm the source-grep pins still trip on a hand-rolled regression
hatch run test tests/mesh/test_bridge_dedup.py::TestNarrowExceptionsR3 -v

# Smoke-test the prefix-walk break: with default env, confirm
# `strands/peer/cmd/something` is NOT bridged but `strands/peer/response/abc-123` IS.
python -c "from strands_robots.mesh.transport.bridge_transport import _should_bridge, DEFAULT_BRIDGE_SUFFIXES; print(_should_bridge('strands/p/cmd/x', DEFAULT_BRIDGE_SUFFIXES), _should_bridge('strands/p/response/abc', DEFAULT_BRIDGE_SUFFIXES))"
# Expect: False True

No blocking concerns. The R-round changelog and pin tests are exemplary; the inline notes are correctness-leaning suggestions rather than rework requests.

yinsong1986 · 2026-05-26T11:37:38Z

+                    # ``test_wire_handler_narrow_except.py``.
+                    payload = None
+
+                if payload is not None and self._dedup.is_duplicate(key_expr, payload):


Dedup cache key uses the subscription expression, not the delivered topic.

key_expr here is the argument to declare_subscriber (closed over from line 560), which is frequently a wildcard like strands/+/cmd or strands/+/presence. The dedup cache key is then ("strands/+/cmd", canonical_hash) — meaning two distinct delivered topics that match the same wildcard subscription will share a cache slot.

In the typical case this is harmless because sender_id is part of the canonical hash, so cross-sender collisions don't happen. But it does mean that:

The same sample arriving on sample.key_expr = "strands/robot-a/cmd" and on sample.key_expr = "strands/robot-b/cmd" (impossible normally, but possible if an operator misconfigures sender_id or a peer impersonates) will dedup incorrectly across robots.

The test test_different_keys_isolate_payloads in test_bridge_dedup.py uses literal keys "k1" / "k2", so the wildcard-aliasing case isn't exercised.

Consider keying on sample.key_expr instead (the actual delivered topic), with a fallback to key_expr when the sample doesn't expose it. That matches the R3-a docstring contract — "two distinct topics with coincidentally matching dedup_ids don't collide" — more faithfully than the current closure capture does.

yinsong1986 · 2026-05-26T11:37:38Z

+            {"sender": sender, "turn": turn, "cmd": cmd},
+            sort_keys=True,
+            separators=(",", ":"),
+            default=str,


default=str makes the canonical fingerprint non-deterministic for some command shapes.

json.dumps(..., default=str) falls back to str(obj) for non-JSON-serializable values. For:

bytes -> "b'...'" (deterministic, but compares equal across calls only if the bytes object is identical — fine).

datetime / Decimal / numpy.float32 -> stable string form (fine).

Custom objects without __str__ overrides -> "<MyCmd object at 0x7f...>" — the address is included, so two semantically identical commands hash differently and won't dedup.

set -> "{1, 2, 3}" with non-deterministic iteration order pre-3.7 (fine on 3.12).

The command field is operator-controlled ({"action": "move", "args": {...}} shape per the upstream RPC contract) so the bytes/datetime cases are unlikely. But the failure mode is silent: a refactor that puts a non-serializable object on command in 6 months will produce a dedup that appears to work in tests (because mocks have stable __str__) but doesn't dedup in production.

Consider replacing default=str with a strict mode that raises TypeError and a callable hash key like repr(sorted(payload.items())), or document the JSON-only contract for command in the _dedup_id docstring and add a regression test that asserts TypeError on an object whose __str__ includes its address.

yinsong1986 · 2026-05-26T11:37:38Z

+                    self._seen.pop(k, None)
+                if len(self._seen) > _MAX_DEDUP_ENTRIES:
+                    # drop oldest 20%
+                    ordered = sorted(self._seen.items(), key=lambda kv: kv[1])


Sort-and-slice runs inside the dedup lock.

At the over-budget threshold (10k entries), sorted(self._seen.items(), key=lambda kv: kv[1]) is O(n log n) with the threading.Lock held — every concurrent is_duplicate call on the bridge blocks until it completes. At the topic rates this PR targets (cmd/response/safety per turn), 10k is unreachable in any sane window so it's never hit. At heartbeat/presence rates (10 Hz × 50 peers) it's a 20s budget.

#231 covers the algorithmic cleanup (heapq.nsmallest + hysteresis), but the lock-hold-time aspect is a separate concern: even with nsmallest, doing GC under the lock at all means dedup throughput is gated by GC throughput. Consider snapshotting self._seen.items() under the lock, releasing it, computing the eviction set, and re-acquiring to apply — or noting the trade-off explicitly in #231.

Not a blocker for this PR; flagging so the deferred work captures both axes (algorithm + lock granularity).

yinsong1986 · 2026-05-26T11:37:38Z

        "health",
        "safety/event",
        "safety/estop",
+        "safety/resume",


Safety-topic dedup behaviour change deserves a louder log.

With safety/estop and safety/resume both in DEFAULT_BRIDGE_SUFFIXES, a safety event whose canonical (sender, turn, command) triple matches a previous one within the 120 s TTL is silently dropped at logger.debug level (line 599). For non-safety topics this is correct dedup; for safety, two scenarios are uncomfortable:

Legitimate retry: an operator presses E-stop, sees no acknowledgement (LAN flap), presses again. If both presses serialise to the same turn_id (e.g., the UI generates turn_id from the button-click handler, not a fresh UUID per click), the second is dropped.

Audit replay: a cloud-side replay of an E-stop incident (for testing the audit pipeline) within the TTL window is invisible.

The class docstring's contract ("callers must not reuse (sender, turn, cmd) for distinct deliveries") puts the burden on callers, which is fine in principle. But for safety topics specifically, consider either (a) bypassing dedup entirely on safety/* suffixes, or (b) logging duplicate-drop on safety topics at logger.warning so a real-world misconfiguration surfaces in operator logs instead of silently in debug.

Not a blocker — it matches the documented contract — but worth a paragraph in the docstring explicitly calling out the safety implication.

yinsong1986 · 2026-05-26T11:37:38Z

+        rest = suffix[len(head) + 1 :] if "/" in suffix else ""
+        if rest and any(seg == ".." for seg in rest.split("/")):
+            return False
+        return True


.. rejection is the only path-traversal guard, but the head segment isn't checked.

The check any(seg == ".." for seg in rest.split("/")) rejects .. segments only in the tail (after head). The head itself comes from suffix.split("/", 1)[0] and is matched against allowed_prefixes — if .. were ever in allowed_prefixes (it can't be in defaults, but STRANDS_MESH_BRIDGE_TOPICS_PREFIX=".." would put it there), the head guard wouldn't catch it.

Also: the comment says "Defence-in-depth: reject any tail containing path-traversal segments" but the check is exact-equal .., not containing-... "foo.." or "..." would slip through. For Zenoh keys this is fine (the syntax doesn't interpret ..), but the comment overstates the guard.

Minor nit: align the comment to what the code does ("reject any tail segment equal to .."), and consider validating that _resolve_bridge_prefix_filter() rejects ".." and other reserved tokens at parse time so operators can't disable the guard via env var.

The R3 commit (db192ce) added the TestNarrowExceptionsR3 source-grep pin and the partial-canonical regression tests but did not run `hatch run format` over the file before push. `ruff format --check` now flags two trivial whitespace deviations: - A multiline assertion message in test_partial_canonical_does_not_alias joined onto a single line (under the 120-char project line-length). - Two blank-line spacing nits between TestCommandDeduplicator, TestStrictModeIntegrationR2, and TestNarrowExceptionsR3. No semantic change; `pytest tests/mesh/test_bridge_dedup.py` still reports 28 passed before and after. Pin: the same `hatch run lint` step that failed on db192ce now passes locally (`ruff check` + `ruff format --check` + mypy clean) and will green up the call-test-lint check on the next CI run.

cagataycali added this to Strands Labs - Robots May 25, 2026

github-project-automation Bot moved this to Backlog in Strands Labs - Robots May 25, 2026

cagataycali moved this from Backlog to In review in Strands Labs - Robots May 25, 2026

yinsong1986 reviewed May 25, 2026

View reviewed changes

strands-agent and others added 2 commits May 26, 2026 02:05

yinsong1986 reviewed May 26, 2026

View reviewed changes

yinsong1986 mentioned this pull request May 26, 2026

mesh(wire): Zenoh + ACL config builders (mTLS, downsampling, low_pass_filter) (3/9 of #195 split) #224

Open

cagataycali added security mesh Zenoh mesh networking / fleet coordination labels May 26, 2026

cagataycali mentioned this pull request May 26, 2026

perf(mesh/transport): bound _CommandDeduplicator GC cost under sustained pressure #231

Open

6 tasks

strands-robots-agent added 2 commits May 26, 2026 09:27

yinsong1986 reviewed May 26, 2026

View reviewed changes

Conversation

cagataycali commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes

What's in this PR

Reviewer focus

Carries review fixes from #195

Stacking note

§13 Review Round Changelog

Disposition of R3-batch follow-up review (R4 review round, 11:37 UTC)

Tracked-as-follow-up (not in this PR)

Uh oh!

yinsong1986 left a comment

Choose a reason for hiding this comment

Summary

What's good

Concerns

Verification suggestions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yinsong1986 left a comment

Choose a reason for hiding this comment

Summary

What's good

Concerns

Verification suggestions

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yinsong1986 left a comment

Choose a reason for hiding this comment

Summary

What's good

Concerns

Verification suggestions

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cagataycali commented May 25, 2026 •

edited

Loading