Skip to content

Enhance KissModemWrapper with improved command handling and connection resilience#80

Open
agessaman wants to merge 18 commits into
pyMC-dev:devfrom
agessaman:int/espfix-1160
Open

Enhance KissModemWrapper with improved command handling and connection resilience#80
agessaman wants to merge 18 commits into
pyMC-dev:devfrom
agessaman:int/espfix-1160

Conversation

@agessaman
Copy link
Copy Markdown
Contributor

This includes a bunch of improvements to the KISS wrapper, mostly inspired by the investigation I worked on with w00f. We have timeouts on functions that expect responses so they don't block for long periods, we also have logic for handling USB reenumeration that can happen due to modem or Raspberry Pi flakiness.

Several changes to bring the companion up to pre-1.16.0 compatibility, including regions discovery (which now returns regions instead of the owner info) and the ability to send raw packets through the companion wrapper. Also, our repeater now returns its regions—although consider this alpha status.

Some improvements to message send reliability in the companion by modifying the logic on when we report send success.

Significant improvement to the companion repeater login process brings the login success and responsiveness in line with firmware companion.

agessaman and others added 18 commits May 15, 2026 18:39
- Introduced a set of allowed responses for SetHardware commands to handle legitimate OK responses.
- Implemented single-flight command execution to prevent race conditions during concurrent requests.
- Updated methods to accept an optional timeout parameter for better control over response waiting.
- Enhanced response queue management to handle late or out-of-order responses effectively.
- Added tests to ensure correct behavior of command responses and timeout handling.
- Added a serial write lock to ensure atomic writes across threads, preventing interleaving of frame bytes during concurrent operations.
- Updated the _write_frame method to utilize the new lock, enhancing reliability in multi-threaded environments.
- Introduced tests to verify that UART writes from different callers do not overlap, ensuring data integrity during transmission.
- Changed the version from "1.0.10" to "1.0.10-agdev" in both pyproject.toml and __init__.py to reflect the current development status of the project.
- Added mechanisms for managing degraded connection states, including flags for degraded status and reasons.
- Implemented a reconnect worker to handle automatic reconnection attempts after connection failures.
- Updated the connect and disconnect methods to properly manage connection states and thread handling.
- Enhanced error handling during serial communication to ensure robust operation in multi-threaded environments.
- Introduced tests to validate the new reconnect behavior and degraded state management.
…d safety

- Updated connection handling to set `is_connected` based on the success of the handshake process.
- Enhanced thread conditions in the RX and TX workers to check for the serial connection's readiness.
- Modified the `configure_radio` method to validate the serial connection state before proceeding.
- Added tests to ensure correct behavior of connection states during connect and reconnect operations.
- Changed the version from "1.0.10-agdev" to "1.0.10.dev1" in both pyproject.toml and __init__.py to indicate a new development iteration.
- Introduced new parameters for post-connect settling and configuration retries to improve connection robustness.
- Implemented retry logic for the `configure_radio` method, allowing for transient failures during configuration attempts.
- Added tests to validate the new retry behavior and ensure correct connection state management during configuration.
…mWrapper

- In CompanionFrameServer, modified the channel message sending logic to handle failure cases more gracefully by writing an error response when the message fails to send.
- In KissModemWrapper, updated the frame sending method to wait for modem-level TX_DONE confirmation, enhancing reliability and providing clearer error messages when transmission fails.
- Added tests to ensure dispatcher correctly handles cases where the radio send method returns None, and to verify async send behavior in KissModemWrapper.
…d cleanup

- Added parameters for connection retries, post-open delay, and USB reset on connect to the KissModemWrapper constructor.
- Implemented methods for waiting for modem readiness and configuring the radio with retries, improving connection reliability.
- Introduced a cleanup method to handle resource release and abort in-flight operations during shutdown.
- Updated tests to validate the new functionality, ensuring robust behavior during modem connection and shutdown scenarios.
…2-kiss-fixes

Combine single-flight SetHardware/UART serialization and serial recovery
with ESP32 connect resilience, TX_DONE confirmation, and shutdown handling.

Co-authored-by: Cursor <cursoragent@cursor.com>
The official client app's "discover regions from zero-hop repeaters"
returned the repeater's name and owner details instead of the region
list. The virtual companion built anonymous requests with
create_protocol_request(), which emits a PAYLOAD_TYPE_REQ packet and
prepends the protocol code 0x07 to the payload. Repeaters read that
leading 0x07 as REQ_TYPE_GET_OWNER_INFO and replied with owner info,
producing the garbage string.
Add PacketBuilder.create_anon_request(), which builds a true
PAYLOAD_TYPE_ANON_REQ (dest_hash + sender pubkey + cipher) with the
plaintext timestamp followed by the caller's request data verbatim, and
routes direct for zero-hop neighbors (out_path_len >= 0) and flood only
when the path is unknown. This matches firmware BaseChatMesh::sendAnonReq
and satisfies the repeater's isRouteDirect() requirement for region
replies. Rewire send_anon_req() to use it and report is_flood correctly
for zero-hop.
Record the anonymous request sub-type so responses are parsed by
sub-type rather than colliding with BinaryReqType.OWNER_INFO (both use
0x07). Add parsers for region-name lists, anon owner info, and the basic
clock/feature response.
Bump FIRMWARE_VER_CODE from 11 to 12 and add the command codes the
firmware dev branch defines: CMD_GET_ALLOWED_REPEAT_FREQ (60), which
replies with an empty allowed-frequency list since the virtual companion
models no regional restrictions, and CMD_SEND_RAW_PACKET (65), which
delegates to a bridge send_raw_packet() hook (returning unsupported
until that hook is implemented).
…ssion

Add the send_raw_packet method to CompanionBase, allowing the injection of fully-formed on-air packets for transmission. This method parses the raw packet bytes and enqueues them for sending, returning success or failure based on the parsing and sending process.
Include unit tests to verify the functionality, ensuring correct parsing and handling of both valid and invalid packet data.
Incorporate AnonRequestHandler and AnonRateLimiter into the message handlers module, enhancing the functionality for handling anonymous requests. Update the __all__ export to include these new handlers.
…KissModemWrapper

- Introduced HW_ERR_TX_BUSY to indicate when a transmit is already pending, preventing concurrent DATA frame transmissions.
- Implemented a locking mechanism to ensure only one frame is in flight at a time, enhancing reliability during data transmission.
- Updated send_frame_and_wait method to handle TX_BUSY errors gracefully, allowing for fast failure and retry without stalling.
- Added tests to validate single-flight behavior and ensure proper handling of TX_BUSY and serial failures.
Combine ESP32 KISS modem fixes with companion v12 / anon-request work for integration testing.
… repeaters

Logging a companion identity into a zero-hop firmware repeater was slow
and flaky compared to the firmware client: stats often failed on the
first request and only succeeded on a retry, and the post-login flow
fell back to flood routing. Three independent companion-side defects
were responsible; each is corrected to match firmware behavior.

Monotonic request timestamps:
  PacketBuilder._get_timestamp() returned int(time.time()) at
  whole-second resolution, so a login and an immediately following
  stats request issued in the same second carried identical timestamps.
  Firmware repeaters enforce a strictly-greater replay guard
  (timestamp > client->last_timestamp) and silently drop the colliding
  packet, which manifested as a dropped first stats request that only
  succeeded ~15s later on retry. Mirror firmware getCurrentTimeUnique()
  by keeping shared, lock-guarded monotonic state and returning
  max(now, last + 1) so every login/REQ tag is unique and increasing.

Zero-hop request routing:
  create_protocol_request() treated out_path_len <= 0 as flood, so a
  zero-hop neighbor (out_path_len == 0, empty path) always flooded its
  stats/telemetry requests. Route direct whenever out_path_len >= 0,
  matching create_anon_request() and firmware sendRequest(); the empty
  path is sent as a zero-hop direct packet.

Path learning at login:
  The login PATH-return was dropped before path learning because the
  protocol-response handler bailed out when no stats/telemetry waiter
  was registered. Let PATH packets always decrypt so the existing
  _update_contact_path() and reciprocal-PATH send run at login time
  (firmware onContactPathRecv), establishing a direct route for the
  first stats request. Wire the packet injector into the radio
  companion so the reciprocal PATH is actually transmitted.

Also add [PATHDIAG] debug logging across the login/path-learning/
stats path (learned out_path, inbound vs embedded path, reciprocal
contents, request route, and login send/timeout) to make routing and
decrypt decisions diagnosable from logs.

Add regression tests: zero-hop protocol requests route direct,
timestamps are strictly monotonic within a second, a login followed by
a stats request produces increasing tags, and a login PATH-return with
no waiter learns the path and sends a reciprocal.
Introduced a new helper function, _fmt_path, to format and log contact paths in a clearer manner for [PATHDIAG] logs. This function handles various input types for out_path and provides detailed output, including path length and hop count, enhancing the debuggability of path-related logs. Updated existing logging calls to utilize this new function for consistency and clarity.
Login, stats, telemetry and region discovery were slow and unreliable
against firmware repeaters: a single lost packet stalled for the fixed
10-15s wait with no resend, where the firmware companion recovers in
~3s. Firmware sizes each request's timeout from packet airtime and route
(calcFloodTimeoutMillisFor = 500 + 16*airtime; calcDirectTimeoutMillisFor
= 500 + (6*airtime + 250)*(hops+1)) and re-issues on that cadence. This
change reproduces that model so a dropped round-trip recovers in seconds
instead of stalling.

Add companion/timing.py: a Semtech airtime estimator (from the radio's
SF/BW/CR) and the firmware flood/direct timeout formulas, clamped to
1.5-12s, with DEFAULT_MAX_ATTEMPTS = 3. CompanionBase._response_timeout_s
computes the per-attempt timeout from the actual packet.

Wrap the waiting request paths in an internal resend loop (up to 3
attempts, one adaptive timeout each):
  - send_login: rebuild a fresh login packet per attempt (new monotonic
    tag, so the repeater's flood dedup doesn't drop the resend); the
    single login callback resolves whichever reply lands.
  - send_status_request, send_telemetry_request, _send_protocol_request:
    fresh REQ per attempt. The waiter is keyed by contact, so a late
    reply arriving between attempts resolves it immediately instead of
    being dropped as an unmatched straggler.

For fire-and-forget region discovery (send_anon_req), return an adaptive
timeout_ms hint instead of the fixed 10s so the client re-issues on the
firmware ~3s cadence (the same client-driven model firmware uses for
anon/discovery), and size the pending-request expiry to the full retry
budget.

Remove the per-hop _wait_for_path_propagation sleep (0.5s/hop, up to
~1.5s of dead latency for multi-hop contacts); the reciprocal PATH sent
at login plus the adaptive retry make it unnecessary. The method is kept
as a pre-send [PATHDIAG] log only.

Also log anon/binary responses as MATCHED/UNMATCHED at the tag-match
point, so a decryptable response that arrives with no pending request
(timed out, or tag mismatch) is distinguishable from one that never
arrived.

Add tests: timing math vs hand-computed Semtech airtime and the firmware
formulas with clamps; login resend-then-succeed and full-budget-timeout
behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant