Release 0.3.0 - cooperative lifecycle + emergency stop #22
Merged
A single failed ws_send_frame_async used to close the WebSocket connection, which caused spurious disconnects in noisy RF environments where one lost frame is routine. Track per-fd consecutive failure counts; close only after 3 in a row. Successful send resets the counter. Stale fds are pruned on each tick.
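The counting rule above can be sketched on the host as follows. `SendFailureTracker` is a hypothetical helper for illustration, not the library's actual class; only the threshold of 3 and the reset-on-success rule come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper (not the library's real type): counts consecutive
// send failures per socket fd and reports when a client should be closed.
class SendFailureTracker {
public:
    explicit SendFailureTracker(uint8_t maxConsecutive = 3)
        : maxConsecutive_(maxConsecutive) {}

    // Record one send result. Returns true when this fd has now failed
    // maxConsecutive_ times in a row and should be closed.
    bool record(int fd, bool sendOk) {
        if (sendOk) {
            failures_.erase(fd);  // a successful send resets the counter
            return false;
        }
        uint8_t count = ++failures_[fd];  // entry starts at 0 on first use
        if (count >= maxConsecutive_) {
            failures_.erase(fd);  // caller closes the socket
            return true;
        }
        return false;
    }

    // Drop the counter of an fd that is no longer connected (stale-fd pruning).
    void forget(int fd) { failures_.erase(fd); }

    // Number of fds that currently have at least one unresolved failure.
    std::size_t tracked() const { return failures_.size(); }

private:
    uint8_t maxConsecutive_;
    std::unordered_map<int, uint8_t> failures_;
};
```

A single lost frame therefore leaves the session open; only an uninterrupted run of three failures returns `true`.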
Previously /health required the caller to be the DS owner, which meant judges/referees and side-monitoring stations could not observe robot liveness while an active driver was connected. They'd either get 403 or have to steal the ownership slot mid-match. Drop enforceOwner() from handleHealth and handleInfo. Also remove the password field from /info since it is now publicly readable by anything on the AP; passwords should not leak to other teams' scans.
When the DS owner is released (idle timeout, force disconnect, or a new client claims the slot), the gamepad buffer kept its last known values. User teleop loops reading getA() / getRawAxis() would see the last button press frozen, causing motors to keep whatever command was current when the link died. Call _gs.write(now, null, 0, null, 0) in releaseOwner() to zero the state so reads return neutral regardless of the driver's last frame.
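A minimal host-side sketch of the zero-on-release idea, assuming a simplified buffer. `GamepadBuffer`, its field layout, the argument order of `write`, and the 20-axis/20-button capacity are illustrative assumptions; the real `GamepadService` may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the library's gamepad buffer.
struct GamepadBuffer {
    static constexpr std::size_t kMaxAxes = 20;     // assumed capacity
    static constexpr std::size_t kMaxButtons = 20;  // assumed capacity

    std::array<float, kMaxAxes> axes{};
    std::array<bool, kMaxButtons> buttons{};
    uint32_t lastWriteMs = 0;

    // Copy the supplied values and zero every slot beyond them, so a call
    // with (nullptr, 0, nullptr, 0) resets the whole buffer to neutral.
    void write(uint32_t nowMs, const float* ax, std::size_t nAx,
               const bool* btn, std::size_t nBtn) {
        for (std::size_t i = 0; i < kMaxAxes; ++i)
            axes[i] = (ax != nullptr && i < nAx) ? ax[i] : 0.0f;
        for (std::size_t i = 0; i < kMaxButtons; ++i)
            buttons[i] = (btn != nullptr && i < nBtn) ? btn[i] : false;
        lastWriteMs = nowMs;
    }
};

// Called when the owner slot is released (timeout, disconnect, takeover).
inline void onOwnerReleased(GamepadBuffer& gs, uint32_t nowMs) {
    gs.write(nowMs, nullptr, 0, nullptr, 0);
}
```

After `onOwnerReleased`, a teleop loop that keeps reading the buffer sees neutral axes and released buttons instead of the driver's last frame.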
Previously DS timeout always forced Status::STOP, killing the teleop task and requiring a full init+start cycle to resume driving. For teams that want a softer behavior — keep user loops running while the gamepad is zeroed, then resume on reconnect without restart — add PROBOT_DS_TIMEOUT_FORCE_STOP define. Default is 1 (current behavior, safe). Setting it to 0 just releases the owner and closes WS sessions; teleop keeps running with neutral input.
The /joystick WebSocket previously accepted any client at handshake; owner enforcement only covered HTTP routes. A second driver station on the same AP could open a parallel WS and inject joystick frames, silently racing against the legitimate owner. Add an OwnerAuthorizer callback to WsJoystick. DriverStation wires it to enforceOwner (in silent mode — we close the socket instead of emitting an HTTP 403 on a WS frame). Handshake and every subsequent frame re-check ownership; non-owner clients get their session torn down immediately.
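The callback wiring can be illustrated with a small host-side sketch. The `OwnerAuthorizer` signature, the numeric client identity, and the `WsSessionGate` class are assumptions made for the example; the commit only states that a callback exists and that it is checked at handshake and on every frame.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Assumed signature: returns true if the given client currently owns the DS slot.
using OwnerAuthorizer = std::function<bool(uint32_t clientId)>;

// Hypothetical gate consulted at handshake and again for every frame.
class WsSessionGate {
public:
    explicit WsSessionGate(OwnerAuthorizer auth) : auth_(std::move(auth)) {}

    // true  -> the frame may be processed
    // false -> the caller should close the socket (no HTTP 403 on a WS frame)
    bool allow(uint32_t clientId) const { return auth_ && auth_(clientId); }

private:
    OwnerAuthorizer auth_;
};
```

Because the check runs per frame rather than once, a client that loses ownership mid-session is rejected on its very next frame.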
Link-layer reliability release:
* ws_joystick: 3-fail ping tolerance (was single-fail close)
* /health and /info open for monitoring (no owner required)
* /info no longer leaks WiFi password
* gamepad state zeroed on owner release (no stale input)
* PROBOT_DS_TIMEOUT_FORCE_STOP configurable (default 1 = safe)
* /joystick WS enforces owner at handshake and per-frame
- README: bilingual rewrite to accurately describe what 0.2.8 ships (joystick DS + WS transport), drop references to modules that do not exist in this branch (PID/Kalman/LQR/mecanum), document the new configuration macros and the "What changed in 0.2.8" section
- CHANGELOG.md: first entry describing the five reliability changes and upgrade notes for PROBOT_DS_TIMEOUT_FORCE_STOP
- gitignore: exclude connection-test/ and .claude/
Replace the WS PING with a 2-byte BINARY heartbeat ('H', seq) every 2s.
Browsers auto-pong pings invisibly to JS, so the page could never tell
a live link from a dead one. Server-side 3-fail close logic unchanged.
Client: own sends no longer count as link activity (ws.send() into a
dead TCP socket succeeds silently, masking mid-drive link loss); stale
threshold 3s -> 5s (two missed heartbeats). Also fixes the idle
reconnect churn loop.
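The heartbeat frame and the "own sends do not count" rule can be sketched as below. The actual client is browser JavaScript; this C++ version only illustrates the logic, and `makeHeartbeat` and `LinkMonitor` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// 2-byte BINARY heartbeat: the tag 'H' followed by a sequence number.
inline std::array<uint8_t, 2> makeHeartbeat(uint8_t seq) {
    return {{static_cast<uint8_t>('H'), seq}};
}

// Hypothetical client-side staleness tracker: only RECEIVED frames count as
// link activity, because a send into a dead TCP socket can succeed silently.
class LinkMonitor {
public:
    explicit LinkMonitor(uint32_t nowMs, uint32_t staleMs = 5000)
        : lastRxMs_(nowMs), staleMs_(staleMs) {}

    void onFrameReceived(uint32_t nowMs) { lastRxMs_ = nowMs; }

    // Deliberately a no-op: our own sends prove nothing about the link.
    void onFrameSent(uint32_t /*nowMs*/) {}

    // Unsigned subtraction keeps this correct across millisecond wrap-around.
    bool isStale(uint32_t nowMs) const { return nowMs - lastRxMs_ > staleMs_; }

private:
    uint32_t lastRxMs_;
    uint32_t staleMs_;
};
```

With a 2 s heartbeat, the 5 s threshold is only crossed after two consecutive heartbeats have gone missing.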
UI polling hardened: telemetry 50ms -> 150ms with in-flight guard and
timeout, same guard on /getState, hidden tabs stop polling, dead
Password row removed from Logs.
…lect
Owner fields were touched concurrently by the httpd task and sysloop without synchronization - all access now goes through a portMUX critical section, with logging and side effects kept outside the lock. GamepadService::write gains a writer spinlock (called from both WS/HTTP handlers and the owner-release zeroing path); readers stay lock-free.
PROBOT_WIFI_AP_CHANNEL 0 now auto-selects the least congested of the non-overlapping channels (1/5/9/13) by scanning at boot; /info reports the actual channel. New PROBOT_DS_OWNER_TIMEOUT_MS macro (default 5000). httpd pinned to core 0 so core 1 stays exclusive to user code. Dead constants removed from core_config.
probot::devices::Servo drives hobby servos on 50Hz/14-bit LEDC hardware PWM and allocates channels from the top of the range downward, so a timer collision with analogWrite motor PWM (the main software cause of servo jitter) is structurally impossible. No pulses until first write() to avoid the boot jump. Requires arduino-esp32 3.x; no-op stub off-ESP32. ServoTest shows jitter-safe servo control from the joystick (including the separate-BEC power warning); TankDrive is the BTS7960-style dual motor template teams keep asking for.
Wrap-around, overflow-keeps-newest, oversized single message, clear, and seq behavior - the trickiest header had zero coverage.
README rewritten minimal and task-oriented: a quick-start that actually
compiles (the old one called nonexistent methods on io::gamepad()),
correct install name ("probot" in Library Manager) and repo URLs
(nfrproducts/probot-lib), competition channel planning (1/5/9/13 +
auto-select), servo jitter guide, status LED table, and an AI-usage
section with a paste-able prompt.
API.md is the single-page full reference (lifecycle, joystick API,
servo, telemetry, macros, HTTP/WS protocol incl. the binary frame and
heartbeat). llms.txt gives AI tools the rules + raw links. keywords.txt
for IDE highlighting. CONTRIBUTING cleaned of stale branches and
removed examples; FUTURE_WORK pruned to communication-only scope.
Version 0.2.9 synced across manifests (idf_component.yml was stuck at
0.2.7). No tag - 0.2.9 stays open for further changes.
GitHub reports nfrproducts/probot-lib as moved; raw links in llms.txt and the AI prompt must point at the canonical location.
TX power: WIFI_POWER_19_5dBm (=78) quantizes DOWN to 18 dBm in the IDF API ([72,79]->72); any value >=80 yields the true API max of 20 dBm. Request WIFI_POWER_21dBm for +2 dB over the previous setting (S3 datasheet PHY caps at 21 dBm on 11b rates; 20 dBm is the API ceiling).
802.11b rates off by default (PROBOT_WIFI_ENABLE_11B=1 restores): beacons drop from 1 Mbps DSSS (~2.5% airtime per AP) to 6 Mbps OFDM (~0.4%) - significant at a venue with dozens of robot APs. OFDM is mandatory for all 11g+ clients, i.e. any 2010+ tablet.
httpd: TCP_NODELAY via open_fn (esp_http_server never sets it; Nagle x delayed-ACK holds small server->client frames for tens of ms), TCP keepalive 3/2/2 (~7 s dead-client detection at the TCP layer), send_wait_timeout 5s->2s (bounds stalls on a wedged client), max_open_sockets 10.
WS sends serialized with a mutex: concurrent httpd_ws_send_frame_async calls to one fd interleave header/payload and corrupt frames (esp-idf issues #14495, #5405). Heartbeat skips a round rather than blocking the timer-service task.
New macros: PROBOT_INPUT_TIMEOUT_MS (input zeroing window, FRC=125 FTC=300 XRP=500), PROBOT_WIFI_PMF_REQUIRED (deauth-spoof protection, off by default for old-tablet compat).
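The TX-power arithmetic above can be checked with a short sketch. It models only the two ranges quoted in the commit message ([72,79]->72 and >=80->80) and relies on the API unit being 0.25 dBm; `quantizeTxPower` and `txPowerDbm` are hypothetical helper names, not IDF functions.

```cpp
#include <cstdint>

// The IDF max-TX-power API takes values in units of 0.25 dBm.
constexpr double txPowerDbm(int8_t quarterDbm) { return quarterDbm * 0.25; }

// Models ONLY the two ranges stated in the commit message; lower ranges
// are passed through unchanged because they are not described there.
constexpr int8_t quantizeTxPower(int8_t requested) {
    return requested >= 80 ? static_cast<int8_t>(80)
         : requested >= 72 ? static_cast<int8_t>(72)
                           : requested;
}

static_assert(quantizeTxPower(78) == 72, "19.5 dBm request lands on the 18 dBm step");
static_assert(quantizeTxPower(84) == 80, "21 dBm request lands on the 20 dBm ceiling");
```

So a request of 78 (19.5 dBm) yields 72 x 0.25 = 18 dBm, while a request of 84 (21 dBm) yields 80 x 0.25 = 20 dBm, which is the +2 dB gain the commit describes.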
…untime channel switch
Push: a dedicated task streams 'S' frames (state+health JSON, on change or every 1s - doubles as the heartbeat) and 'T' frames (telemetry text, on change) to all WS clients via mutex-guarded sendToAll. The page no longer polls /telemetry, /getState or /health over HTTP; it falls back to 1 Hz HTTP polling only while the WS is down, plus one /health every 10s for the RTT display. The WS stays connected across Init/Stop; with no gamepad the client sends a 1-byte 'P' keepalive every 2s so the owner slot and DS activity stay alive. Telemetry ring buffer gained a spinlock (user-core writer vs network-core reader had none) and a copyBuffer API; GamepadService gained lastWriteMs for true frame age.
Captive portal (PROBOT_CAPTIVE_PORTAL=0 disables): DNS catch-all on the softAP + spoofed OS connectivity probes (Android generate_204, iOS hotspot-detect, Windows connecttest/ncsi) pop a landing page that sends users to the DS in a real browser; wpad.dat answered with a plain 404 to stop proxy-probe storms; all other 404s redirect to the portal.
Diagnostics: AP STA join/leave logged with MAC + IEEE reason code; /health and the S frame now carry joyAgeMs, sta count and the last disconnect reason; shown on the Logs page.
Channel: NVS override (UI > macro precedence, 0 = auto-scan) and /setChannel?ch=N switches 1-13 live via CSA (csa_count=3) so clients migrate without dropping; ch=0 applies at next boot. Logs page gained a channel picker; /info reports chSource.
Four parallel reviewers (runtime/concurrency, DS server vs IDF 5.5 sources, web UI JS, docs/examples consistency) audited the whole library. Everything found is fixed:
CRITICAL: DS-timeout FORCE_STOP never stopped the robot (0.2.8 bug) - setting lastStatus alongside the status suppressed the STOP transition, so teleop kept running and robotEnd never ran.
CRITICAL: state/gamepad/telemetry CAS spinlocks livelocked under same-core priority inversion - all three converted to portMUX critical sections (readers too).
Runtime: init/end workers no longer delete their own handles (use-after-free); cooperative stop with 60ms grace before vTaskDelete; sysloop subscribes to the task watchdog (it supervised zero tasks); deadline-miss warning made one-shot (was ~1000x/s, wiping telemetry); INITED reported only after robotInit completes; task-create failures logged; autoLen clamped.
DS server: client-list arrays sized for max sockets (broadcasts died entirely above 8 fds); broadcast-time owner filter (handshake-time authorization stops being invoked on newer IDF); WS control-frame payloads drained/echoed per RFC 6455; oversize frames close cleanly; >20-axis frames rejected instead of misparsed into phantom buttons; 11b-rate disable moved between wifi stop/start per API contract; owner-release side effects no longer clobber a concurrent re-acquire; SSID sanitized into /info JSON and portal HTML; portal buffer sized; updateController reads split bodies; joyAge clamped.
Servo: channels claimed once per instance - repeated DS Init burned a fresh LEDC channel per attach until all servos died mid-match.
Web UI: joystick frames clamp to 20 axes/20 buttons (HOTAS-class controllers were silently ignored); gamepad dropdown rebuilt only on device changes (60Hz rebuild reset the selection); late onclose from a dying socket can no longer orphan its replacement; 600ms command grace window stops stale S frames reverting the button; healthFailCount reset by WS health; status detail visible; portrait channel row fits.
Examples: TankDrive autonomous rewritten - delay(2000) tripped the deadline-miss kill on every single run.
Docs/meta: CHANGELOG no longer describes the dead 'H'-frame design; CI runs on dev/dev-* and compiles all examples; idf_component.yml URL and library.json name fixed; keywords.txt completed; FUTURE_WORK updated (incl. OTA verification of 11b/CSA on hardware).
Boot-time auto channel select was reachable via PROBOT_WIFI_AP_CHANNEL 0
and risky for a fleet: robots powered on together all scan an empty band
(none beaconing yet), every candidate scores 0, and they converge on
channel 1 — the opposite of distribution. For a coordinated competition,
deterministic manual assignment is correct; auto removes operator control.
Auto-select is now OFF by default and gated behind a separate, explicit
PROBOT_WIFI_AUTO_CHANNEL macro (single-robot use):
- PROBOT_WIFI_AP_CHANNEL must be 1-13 when auto is off; bare channel 0
is now a clear compile error pointing to the new flag (was: silent scan).
- NVS ch=0 repurposed to "clear the manual pin, use firmware default"
(was "auto at boot"); a manual pin 1-13 still wins over auto.
- Channel 13 kept in candidates {1,5,9,13}; docs warn region-locked
clients may not see ch12-13 — assign 1-11 by hand if a device can't.
- CSA live-switch doc softened (some tablets drop+rescan; don't switch
mid-match). Serial now prints "Channel: N (macro/nvs/auto)".
Docs/UI/CHANGELOG synced. All 3 examples and the auto-on branch compile;
9/9 host tests pass.
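The fleet convergence problem described in this commit can be reproduced with a small sketch of a least-congested picker. The scoring input and the first-wins tie-break are assumptions chosen to match the described behavior; `pickChannel` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>

// Candidate channels named in the commit message.
constexpr std::array<int, 4> kCandidates{{1, 5, 9, 13}};

// scores[i] is an assumed congestion score for kCandidates[i]
// (for example, the number of APs seen on it); lower is better.
// Ties go to the earliest candidate.
inline int pickChannel(const std::array<int, 4>& scores) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] < scores[best]) best = i;
    }
    return kCandidates[best];
}
```

When every robot boots into an empty band, all scores are 0 and every robot picks channel 1, which is why the commit makes manual assignment the default for fleets.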
The NeoPixel mutex was taken with portMAX_DELAY and held across the blocking pixel.show(); a task killed mid-show() orphaned it forever and could wedge the sysloop (which also drives the LED). setColor/set/setBrightness now only stash an atomic word; the blocking show() runs from a single task (the sysloop) via flush(). No mutex, one show() caller, no reentrancy.
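The stash-then-flush pattern can be sketched without any hardware. `StatusLed` and the `ShowFn` callback are hypothetical stand-ins for the library's LED driver and the blocking `pixel.show()` call.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: setters only store one atomic word; the blocking
// hardware update happens in flush(), which a single task must own.
class StatusLed {
public:
    using ShowFn = void (*)(uint32_t rgb);  // stands in for the blocking show()

    explicit StatusLed(ShowFn show) : show_(show) {}

    // Safe from any task: one atomic store, never blocks, takes no lock.
    void setColor(uint8_t r, uint8_t g, uint8_t b) {
        uint32_t rgb = (uint32_t(r) << 16) | (uint32_t(g) << 8) | uint32_t(b);
        pending_.store(rgb, std::memory_order_release);
    }

    // Call from exactly one task (the supervisor loop in the commit's design).
    void flush() {
        uint32_t want = pending_.load(std::memory_order_acquire);
        if (want != shown_) {  // shown_ is touched only by the flushing task
            show_(want);
            shown_ = want;
        }
    }

private:
    ShowFn show_;
    std::atomic<uint32_t> pending_{0};
    uint32_t shown_ = 0xFFFFFFFFu;  // sentinel so the first flush always shows
};
```

Since no task ever holds a lock across the blocking call, killing a task that was setting a color cannot leave the LED path wedged.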
… emergency stop
Root-cause fix for the issue #21 freeze. Replace the per-phase task create/vTaskDelete model with ONE persistent user task (core 1, created at boot, never killed in normal operation) running a phase state machine. Buttons set an atomic "requested mode"; the task performs the transition at its own loop boundary, so a Stop or phase change can no longer interrupt user code mid-transaction and orphan a Wire/malloc lock. The pure logic lives in core/lifecycle.hpp (host unit-testable).
- Stall watchdog is halt-safe: a loop iteration past PROBOT_LOOP_DEADLINE_MS zeroes inputs and holds safe (red LED) with NO kill and NO reboot, preserving homed/relative mechanism state.
- TWDT watches only the sysloop: a wedge reboots only on a supervisor/library fault, never on user code.
- Terminal emergencyStop(): cut the enable pin, kill the user task, run robotEnd() in a fresh task under a PROBOT_ESTOP_END_MS watchdog (reboot if it hangs on an orphaned bus), then latch until reboot.
- New macros: PROBOT_LOOP_DEADLINE_MS, PROBOT_WDT_TIMEOUT_S, PROBOT_ESTOP_END_MS, PROBOT_ESTOP_ENABLE_PIN, USER_LOOP_PERIOD_MS. STACK_USER 4096 -> 8192 (one task now hosts all six hooks).
BREAKING: Stop takes effect at the next loop boundary (up to one iteration of latency); a wedged autonomous no longer auto-recovers to teleop (it falls to halt-safe). The 6-hook API is source-compatible.
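The cooperative transition can be sketched as a tiny host-testable state machine. This is a simplified illustration, not the contents of core/lifecycle.hpp: the `Mode` values, the `Lifecycle` class, and the two-callback shape are assumptions (the real API has six hooks).

```cpp
#include <atomic>
#include <cstdint>

enum class Mode : uint8_t { Stop, Init, Teleop, Auto };  // assumed phase set

// Hypothetical sketch of the cooperative model: other tasks only REQUEST a
// mode; the persistent user task applies it at its own loop boundary.
class Lifecycle {
public:
    // Called from button/HTTP handlers on another task or core.
    void request(Mode m) { requested_.store(m, std::memory_order_release); }

    Mode current() const { return current_; }

    // One iteration of the persistent user task.
    template <typename OnEnter, typename Loop>
    void step(OnEnter&& onEnter, Loop&& loop) {
        Mode want = requested_.load(std::memory_order_acquire);
        if (want != current_) {  // transition happens only here,
            current_ = want;     // never in the middle of user code
            onEnter(current_);
        }
        loop(current_);
    }

private:
    std::atomic<Mode> requested_{Mode::Stop};
    Mode current_ = Mode::Stop;  // owned by the user task only
};
```

A request made while `loop` is running is picked up on the next call to `step`, which is the "up to one iteration of latency" noted under BREAKING.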
…meout
/robotControl gains cmd=estop (requests the terminal emergency stop, run by the sysloop) and cmd=reboot (ESP.restart, the way to clear the latch). While estop-latched, init/start are refused with 409. The estop flag is published on the 'S' push frame and /getState so the UI can show a persistent banner. httpd recv_wait_timeout lowered 5s -> 2s, symmetric with send_wait_timeout, to bound a half-open client's worker hold.
Add a red EMERGENCY STOP button to Match Control (sends cmd=estop) and a full-screen "EMERGENCY STOPPED — reboot required" overlay driven by the estop state field, with a Reboot button (cmd=reboot).
Cover the phase state machine (full STOP->INIT->TELEOP->STOP path, autonomous entry stamping the start time), the supervisor's status->mode mapping and autonomous-period expiry, and stall detection.
…0.3.0
README/API.md/CHANGELOG: the new lifecycle and loop-return contract
("every iteration must return; bound blocking calls"), halt-safe stall
behavior, emergency stop + PROBOT_ESTOP_ENABLE_PIN, the new macros, and
migration notes. Version 0.2.9 -> 0.3.0 (synced to library.properties,
library.json, idf_component.yml).
The public emergencyStop() ran the full sequence inline, including vTaskDelete(userTask). Called from a user hook (which runs in that very task) it would delete its own caller mid-sequence, so robotEnd never spawned and the status never latched. emergencyStop() now only sets the request flag; the sysloop runs the actual kill/robotEnd/latch sequence (detail::runEmergencyStop), so it never deletes its caller and a team can safely trigger an e-stop from teleopLoop on a detected fault.
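The request-flag split can be sketched as follows. `EstopLatch` and its `service` method are hypothetical; the callback passed in stands in for the kill/robotEnd/latch sequence that the commit attributes to `detail::runEmergencyStop`.

```cpp
#include <atomic>

// Hypothetical sketch: user code only raises a flag; the supervisor task
// runs the actual stop sequence, so a task never deletes itself.
class EstopLatch {
public:
    // Safe to call from any task, including a user hook such as teleopLoop.
    void emergencyStop() { requested_.store(true, std::memory_order_release); }

    // Called periodically by the supervisor. Runs the sequence at most once
    // and then stays latched. Returns the latched state.
    template <typename Sequence>
    bool service(Sequence&& runSequence) {
        if (!latched_ && requested_.load(std::memory_order_acquire)) {
            runSequence();  // kill user task, run robotEnd, etc.
            latched_ = true;
        }
        return latched_;
    }

    bool latched() const { return latched_; }

private:
    std::atomic<bool> requested_{false};
    bool latched_ = false;  // owned by the supervisor task only
};
```

Repeated requests after the first are harmless: the sequence has already run and the latch stays set until reboot.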
The builtin status LED is now driven solely by the library — the manual color API (builtinled::setColor/set/setBrightness) is removed, since the LED color always means something (match phase / stalled / e-stop). One internal render() entry point, called only from the sysloop. Add PROBOT_RSL_PIN: an FRC-RSL-style signal light on a plain digital pin that the library drives — blink while the robot can move (teleop/auton), solid on otherwise. Also NEOPIXEL_BRIGHTNESS macro (default 32). BREAKING: sketches calling builtinled::setColor/set/setBrightness no longer compile; remove those calls (use PROBOT_RSL_PIN for an indicator).
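The signal-light rule (blink while the robot can move, solid on otherwise) reduces to a pure function. The 250 ms half-period is an assumption for illustration, since the commit does not state the blink rate, and `rslLevel` is a hypothetical name.

```cpp
#include <cstdint>

// Returns the pin level for an RSL-style light at time nowMs.
// canMove: true in teleop/auton. halfPeriodMs is an ASSUMED blink timing.
constexpr bool rslLevel(bool canMove, uint32_t nowMs,
                        uint32_t halfPeriodMs = 250) {
    return canMove ? ((nowMs / halfPeriodMs) % 2u == 0u)  // blinking
                   : true;                                // solid on
}
```

On hardware, the supervisor would write this value to the configured pin once per tick, for example with `digitalWrite`.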
…n instead
The library is the communication + lifecycle layer; it shouldn't own/maintain output-hardware wrappers (the same reasoning already applied to motors, which never had a class). probot::devices::Servo is removed; servos are now driven with raw LEDC, documented as the reference pattern. The pattern keeps the jitter fix the class provided: attach the servo on a HIGH LEDC channel (ledcAttachChannel(pin, 50, 14, 7)) so it never shares a timer with analogWrite motor PWM (which uses the low channels); the timer collision is the #1 software cause of servo jitter. ServoTest is rewritten as the worked example; README / API.md / llms.txt updated.
BREAKING: sketches using probot::devices::Servo won't compile; switch to the ledcAttachChannel/ledcWrite pattern in examples/ServoTest.
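For the raw LEDC pattern, the only arithmetic involved is converting a pulse width to a 14-bit duty value at 50 Hz. The helper below is a hypothetical sketch (it is not taken from examples/ServoTest), and the 1000-2000 µs range mentioned in the comments is a typical hobby-servo assumption.

```cpp
#include <cstdint>

constexpr uint32_t kServoFreqHz = 50;     // 20 ms period
constexpr uint32_t kResolutionBits = 14;  // duty range 0..16383

// Convert a pulse width in microseconds to an LEDC duty value.
// Example: 1500 us of a 20000 us period at 14 bits -> 1228.
constexpr uint32_t pulseUsToDuty(uint32_t pulseUs) {
    return (pulseUs * (1u << kResolutionBits)) / (1000000u / kServoFreqHz);
}

// On an ESP32 with arduino-esp32 3.x the calls would look like this
// (servoPin is a placeholder; channel 7 follows the commit message):
//   ledcAttachChannel(servoPin, kServoFreqHz, kResolutionBits, 7);
//   ledcWrite(servoPin, pulseUsToDuty(1500));  // roughly centered
```

Keeping the conversion in a pure function makes it easy to verify on the host before anything is wired to a servo.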
Promotes 0.2.7 → 0.3.0. Safety/lifecycle rework that closes the
issue #21 freeze at the root, plus an emergency stop. Compile-verified
(all examples), host-tested (15/15), and emergency stop verified on
hardware.
Highlights
- Cooperative lifecycle: no per-phase task create/vTaskDelete; one persistent user task + phase state machine. User code is never killed mid-transaction → no orphaned Wire/malloc lock (issue #21 root cause). Halt-safe stall watchdog (no kill/reboot). TWDT watches only the supervisor.
- Emergency stop: cmd=estop → kill user task → robotEnd under a watchdog → latch until reboot. Optional PROBOT_ESTOP_ENABLE_PIN hardware kill line.
- PROBOT_RSL_PIN (FRC-RSL-style signal light).
- probot::devices::Servo removed (library is comms + lifecycle; output hardware is the user's).
Breaking
- probot::devices::Servo removed (use the examples/ServoTest pattern).
- builtinled::setColor/set/setBrightness removed (LED is library-driven).
- A wedged autonomous no longer auto-recovers to teleop (falls to halt-safe).
Full details in CHANGELOG.md.