Release 0.3.0 — cooperative lifecycle + emergency stop#22

Merged
tunapro1234 merged 27 commits into stable from dev
Jun 26, 2026
@tunapro1234

Collaborator

Promotes 0.2.7 → 0.3.0. Safety/lifecycle rework that closes the
issue #21 freeze at the root, plus an emergency stop. Compile-verified
(all examples), host-tested (15/15), and emergency stop verified on
hardware.

Highlights

  • Cooperative single-task lifecycle: no more per-phase vTaskDelete;
    one persistent user task + phase state machine. User code is never
    killed mid-transaction → no orphaned Wire/malloc lock (the root cause
    of issue #21, "Probot Deadline Miss Teleop Blocked Hatası"). Halt-safe
    stall watchdog (no kill/reboot). TWDT watches only
    the supervisor.
  • Emergency stop (terminal): UI button / cmd=estop → kill user
    task → robotEnd under a watchdog → latch until reboot. Optional
    PROBOT_ESTOP_ENABLE_PIN hardware kill line.
  • Status LED is status-only; added PROBOT_RSL_PIN (FRC-RSL-style
    signal light).
  • Servo class removed — servos documented as a raw-LEDC pattern
    (library is comms + lifecycle; output hardware is the user's).

Breaking

  • probot::devices::Servo removed (use the examples/ServoTest pattern).
  • builtinled::setColor/set/setBrightness removed (LED is library-driven).
  • Stop now takes effect at the next loop boundary; a wedged autonomous
    no longer auto-recovers to teleop (falls to halt-safe).

Full details in CHANGELOG.md.

A single failed ws_send_frame_async used to close the WebSocket
connection, which caused spurious disconnects in noisy RF
environments where one lost frame is routine.

Track per-fd consecutive failure counts; close only after 3 in a
row. Successful send resets the counter. Stale fds are pruned on
each tick.
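A minimal sketch of the counting rule described above, assuming a hypothetical SendFailureTracker helper (the class name, method names, and container choice are illustrative and not taken from the library):

```cpp
#include <unordered_map>
#include <unordered_set>

// Hypothetical helper, not the library's actual class.
class SendFailureTracker {
public:
    explicit SendFailureTracker(int maxConsecutive = 3) : max_(maxConsecutive) {}

    // Record one send result. Returns true when the fd should be closed,
    // i.e. only after max_ failures in a row.
    bool record(int fd, bool sendOk) {
        if (sendOk) {            // a successful send resets the counter
            fails_.erase(fd);
            return false;
        }
        int n = ++fails_[fd];
        if (n >= max_) {
            fails_.erase(fd);    // the caller closes the session
            return true;
        }
        return false;
    }

    // Drop counters for fds that are no longer open (run once per tick).
    void prune(const std::unordered_set<int>& liveFds) {
        for (auto it = fails_.begin(); it != fails_.end();) {
            if (liveFds.count(it->first) == 0) it = fails_.erase(it);
            else ++it;
        }
    }

    int failures(int fd) const {
        auto it = fails_.find(fd);
        return it == fails_.end() ? 0 : it->second;
    }

private:
    int max_;
    std::unordered_map<int, int> fails_;
};
```

With this shape, two lost frames followed by one good frame leave the connection open and the counter back at zero.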
Previously /health required the caller to be the DS owner, which
meant judges/referees and side-monitoring stations could not observe
robot liveness while an active driver was connected. They'd either
get 403 or have to steal the ownership slot mid-match.

Drop enforceOwner() from handleHealth and handleInfo. Also remove
the password field from /info since it is now publicly readable by
anything on the AP; passwords should not leak to other teams' scans.

When the DS owner is released (idle timeout, force disconnect, or a
new client claims the slot), the gamepad buffer kept its last
known values. User teleop loops reading getA() / getRawAxis() would
see the last button press frozen, causing motors to keep whatever
command was current when the link died.

Call _gs.write(now, null, 0, null, 0) in releaseOwner() to zero the
state so reads return neutral regardless of the driver's last frame.

Previously DS timeout always forced Status::STOP, killing the teleop
task and requiring a full init+start cycle to resume driving. For
teams that want a softer behavior — keep user loops running while
the gamepad is zeroed, then resume on reconnect without restart —
add PROBOT_DS_TIMEOUT_FORCE_STOP define.

Default is 1 (current behavior, safe). Setting it to 0 just releases
the owner and closes WS sessions; teleop keeps running with neutral
input.

The /joystick WebSocket previously accepted any client at handshake;
owner enforcement only covered HTTP routes. A second driver station
on the same AP could open a parallel WS and inject joystick frames,
silently racing against the legitimate owner.

Add an OwnerAuthorizer callback to WsJoystick. DriverStation wires
it to enforceOwner (in silent mode — we close the socket instead of
emitting an HTTP 403 on a WS frame). Handshake and every subsequent
frame re-check ownership; non-owner clients get their session torn
down immediately.
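The callback wiring can be sketched roughly as follows. The source names OwnerAuthorizer and WsJoystick but does not show their signatures, so the std::function shape, the WsJoystickSketch class, and its method names are all assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Assumed shape: returns true if this fd belongs to the current owner.
using OwnerAuthorizer = std::function<bool(int fd)>;

// Hypothetical stand-in for WsJoystick.
class WsJoystickSketch {
public:
    void setAuthorizer(OwnerAuthorizer a) { authorize_ = std::move(a); }

    // Handshake check: a false result means "close the socket".
    bool onHandshake(int fd) const { return check(fd); }

    // Per-frame check: ownership is re-validated before the payload is used.
    bool onFrame(int fd, const uint8_t* data, size_t len) {
        if (!check(fd)) return false;   // tear the session down
        (void)data;
        (void)len;
        ++accepted_;                    // real code would parse the joystick frame
        return true;
    }

    int acceptedFrames() const { return accepted_; }

private:
    // With no authorizer installed, everything is accepted (the old behavior).
    bool check(int fd) const { return !authorize_ || authorize_(fd); }

    OwnerAuthorizer authorize_;
    int accepted_ = 0;
};
```

Re-checking on every frame matters because ownership can change while a socket stays open: a client that was the owner at handshake stops being accepted as soon as the slot moves.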
Link-layer reliability release:
  * ws_joystick: 3-fail ping tolerance (was single-fail close)
  * /health and /info open for monitoring (no owner required)
  * /info no longer leaks WiFi password
  * gamepad state zeroed on owner release (no stale input)
  * PROBOT_DS_TIMEOUT_FORCE_STOP configurable (default 1 = safe)
  * /joystick WS enforces owner at handshake and per-frame

- README: bilingual rewrite to accurately describe what 0.2.8 ships
  (joystick DS + WS transport), drop references to modules that do
  not exist in this branch (PID/Kalman/LQR/mecanum), document the
  new configuration macros and the "What changed in 0.2.8" section
- CHANGELOG.md: first entry describing the five reliability changes
  and upgrade notes for PROBOT_DS_TIMEOUT_FORCE_STOP
- gitignore: exclude connection-test/ and .claude/

Replace the WS PING with a 2-byte BINARY heartbeat ('H', seq) every 2s.
Browsers auto-pong pings invisibly to JS, so the page could never tell
a live link from a dead one. Server-side 3-fail close logic unchanged.

Client: own sends no longer count as link activity (ws.send() into a
dead TCP socket succeeds silently, masking mid-drive link loss); stale
threshold 3s -> 5s (two missed heartbeats). Also fixes the idle
reconnect churn loop.

UI polling hardened: telemetry 50ms -> 150ms with in-flight guard and
timeout, same guard on /getState, hidden tabs stop polling, dead
Password row removed from Logs.

…lect

Owner fields were touched concurrently by the httpd task and sysloop
without synchronization - all access now goes through a portMUX
critical section, with logging and side effects kept outside the lock.
GamepadService::write gains a writer spinlock (called from both WS/HTTP
handlers and the owner-release zeroing path); readers stay lock-free.

PROBOT_WIFI_AP_CHANNEL 0 now auto-selects the least congested of the
non-overlapping channels (1/5/9/13) by scanning at boot; /info reports
the actual channel. New PROBOT_DS_OWNER_TIMEOUT_MS macro (default 5000).
httpd pinned to core 0 so core 1 stays exclusive to user code. Dead
constants removed from core_config.

probot::devices::Servo drives hobby servos on 50Hz/14-bit LEDC hardware
PWM and allocates channels from the top of the range downward, so a
timer collision with analogWrite motor PWM (the main software cause of
servo jitter) is structurally impossible. No pulses until first write()
to avoid the boot jump. Requires arduino-esp32 3.x; no-op stub off-ESP32.

ServoTest shows jitter-safe servo control from the joystick (including
the separate-BEC power warning); TankDrive is the BTS7960-style dual
motor template teams keep asking for.

Wrap-around, overflow-keeps-newest, oversized single message, clear,
and seq behavior - the trickiest header had zero coverage.

README rewritten minimal and task-oriented: a quick-start that actually
compiles (the old one called nonexistent methods on io::gamepad()),
correct install name ("probot" in Library Manager) and repo URLs
(nfrproducts/probot-lib), competition channel planning (1/5/9/13 +
auto-select), servo jitter guide, status LED table, and an AI-usage
section with a paste-able prompt.

API.md is the single-page full reference (lifecycle, joystick API,
servo, telemetry, macros, HTTP/WS protocol incl. the binary frame and
heartbeat). llms.txt gives AI tools the rules + raw links. keywords.txt
for IDE highlighting. CONTRIBUTING cleaned of stale branches and
removed examples; FUTURE_WORK pruned to communication-only scope.

Version 0.2.9 synced across manifests (idf_component.yml was stuck at
0.2.7). No tag - 0.2.9 stays open for further changes.

GitHub reports nfrproducts/probot-lib as moved; raw links in llms.txt
and the AI prompt must point at the canonical location.

TX power: WIFI_POWER_19_5dBm (=78) quantizes DOWN to 18 dBm in the IDF
API ([72,79]->72); any value >=80 yields the true API max of 20 dBm.
Request WIFI_POWER_21dBm for +2 dB over the previous setting (S3
datasheet PHY caps at 21 dBm on 11b rates; 20 dBm is the API ceiling).
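The quantization arithmetic above, as a small sketch. The API value is in units of 0.25 dBm; only the two ranges quoted in this message are modelled, and the function names are illustrative:

```cpp
#include <cstdint>

// Models ONLY the two top ranges quoted above ([72,79] -> 72, >= 80 -> 80).
// Lower requests are passed through unchanged here, which is NOT how the
// real API behaves; this is a sketch of the arithmetic, not of the driver.
constexpr int8_t quantizeTopRanges(int8_t requestedQuarterDbm) {
    if (requestedQuarterDbm >= 80) return 80;
    if (requestedQuarterDbm >= 72) return 72;
    return requestedQuarterDbm;
}

// The API unit is 0.25 dBm, so 78 means 19.5 dBm and 72 means 18 dBm.
constexpr double toDbm(int8_t quarterDbm) { return quarterDbm / 4.0; }
```

So a request of 78 (nominally 19.5 dBm) ends up at toDbm(72) = 18 dBm, while any request of 80 or more ends up at toDbm(80) = 20 dBm. That is the +2 dB gained by asking for WIFI_POWER_21dBm.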

802.11b rates off by default (PROBOT_WIFI_ENABLE_11B=1 restores):
beacons drop from 1 Mbps DSSS (~2.5% airtime per AP) to 6 Mbps OFDM
(~0.4%) - significant at a venue with dozens of robot APs. OFDM is
mandatory for all 11g+ clients, i.e. any 2010+ tablet.
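A rough check of those airtime figures. The 300-byte beacon, the 102.4 ms beacon interval, and the long DSSS preamble (192 µs) are assumptions chosen for illustration; real beacon sizes vary with the information elements carried. For OFDM at 6 Mbps, each 4 µs symbol carries 24 data bits after a 20 µs preamble, with 16 service bits and 6 tail bits added to the payload:

```latex
\begin{align*}
T_{\text{DSSS}} &= 192\,\mu\text{s} + \frac{300 \times 8\ \text{bits}}{1\ \text{Mbps}}
                 = 2592\,\mu\text{s},
  & \frac{T_{\text{DSSS}}}{102400\,\mu\text{s}} &\approx 2.5\% \\
N_{\text{sym}} &= \left\lceil \frac{16 + 2400 + 6}{24} \right\rceil = 101 \\
T_{\text{OFDM}} &= 20\,\mu\text{s} + 101 \times 4\,\mu\text{s} = 424\,\mu\text{s},
  & \frac{T_{\text{OFDM}}}{102400\,\mu\text{s}} &\approx 0.4\%
\end{align*}
```

Under these assumptions the numbers line up with the ~2.5% and ~0.4% per-AP estimates quoted above.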

httpd: TCP_NODELAY via open_fn (esp_http_server never sets it; Nagle plus
delayed ACK holds small server->client frames for tens of ms), TCP
keepalive 3/2/2 (~7 s dead-client detection at the TCP layer),
send_wait_timeout 5s->2s (bounds stalls on a wedged client),
max_open_sockets 10.
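A sketch of the socket options involved, written against plain BSD sockets so it can be tried on a host. It assumes "3/2/2" means idle 3 s, probe interval 2 s, probe count 2 (which gives the ~7 s figure), and the KeepAlive struct and function names are hypothetical. On the ESP32 the same setsockopt calls would presumably sit in the httpd open_fn callback, which receives the new socket fd:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical parameter bundle; "3/2/2" is read as idle/interval/count.
struct KeepAlive {
    int idleS;      // seconds of silence before the first probe
    int intervalS;  // seconds between probes
    int count;      // unanswered probes before the peer is declared dead
};

// Worst-case time to notice a dead peer at the TCP layer.
constexpr int deadPeerSeconds(KeepAlive k) {
    return k.idleS + k.intervalS * k.count;
}

// Applies TCP_NODELAY and keepalive to one socket.
// Returns true only if every option was accepted.
inline bool applyLowLatencyOptions(int fd, KeepAlive k) {
    const int on = 1;
    bool ok = true;
    // Disable Nagle so small frames are sent immediately.
    ok = (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0) && ok;
    // Enable keepalive probing, then tune its timing.
    ok = (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0) && ok;
    ok = (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &k.idleS, sizeof(int)) == 0) && ok;
    ok = (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &k.intervalS, sizeof(int)) == 0) && ok;
    ok = (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &k.count, sizeof(int)) == 0) && ok;
    return ok;
}
```

deadPeerSeconds(KeepAlive{3, 2, 2}) evaluates to 7, matching the detection time stated above.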

WS sends serialized with a mutex: concurrent httpd_ws_send_frame_async
calls to one fd interleave header/payload and corrupt frames (esp-idf
issues #14495, #5405). Heartbeat skips a round rather than blocking the
timer-service task.
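The two locking behaviors can be sketched with a standard mutex. This is a host-side illustration using std::mutex rather than the FreeRTOS primitive the firmware presumably uses, and SerializedSender with its methods is a hypothetical name:

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical wrapper: every frame goes out under one lock so that header
// and payload bytes from different callers cannot interleave.
class SerializedSender {
public:
    // Normal path: wait for the lock, then write the whole frame.
    void send(const std::vector<uint8_t>& frame) {
        std::lock_guard<std::mutex> lock(mu_);
        writeFrame(frame);
    }

    // Heartbeat path: never block the caller. If another send is in
    // progress, skip this round and report that nothing was sent.
    bool trySendHeartbeat(const std::vector<uint8_t>& frame) {
        std::unique_lock<std::mutex> lock(mu_, std::try_to_lock);
        if (!lock.owns_lock()) return false;
        writeFrame(frame);
        return true;
    }

    std::size_t bytesWritten() const { return wire_.size(); }

private:
    // Stand-in for the real socket write; appends to an in-memory buffer.
    void writeFrame(const std::vector<uint8_t>& frame) {
        wire_.insert(wire_.end(), frame.begin(), frame.end());
    }

    std::mutex mu_;
    std::vector<uint8_t> wire_;
};
```

Skipping a heartbeat is cheap because the next round comes soon, whereas blocking a shared timer task would delay every other timer in the system.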

New macros: PROBOT_INPUT_TIMEOUT_MS (input zeroing window, FRC=125
FTC=300 XRP=500), PROBOT_WIFI_PMF_REQUIRED (deauth-spoof protection,
off by default for old-tablet compat).

…untime channel switch

Push: a dedicated task streams 'S' frames (state+health JSON, on change
or every 1s - doubles as the heartbeat) and 'T' frames (telemetry text,
on change) to all WS clients via mutex-guarded sendToAll. The page no
longer polls /telemetry, /getState or /health over HTTP; it falls back
to 1 Hz HTTP polling only while the WS is down, plus one /health every
10s for the RTT display. The WS stays connected across Init/Stop; with
no gamepad the client sends a 1-byte 'P' keepalive every 2s so the
owner slot and DS activity stay alive. Telemetry ring buffer gained a
spinlock (user-core writer vs network-core reader had none) and a
copyBuffer API; GamepadService gained lastWriteMs for true frame age.
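The "on change or every 1s" rule for 'S' frames might look like this in isolation. StatePusher and its members are hypothetical names, and comparing the serialized JSON string is just one simple way to detect a change:

```cpp
#include <cstdint>
#include <string>

// Hypothetical decision helper for the push task.
class StatePusher {
public:
    static constexpr uint32_t kMaxQuietMs = 1000;  // heartbeat interval from the text

    // Returns true when an 'S' frame should be sent now: either the payload
    // changed, or a full second has passed since the last send.
    bool shouldSend(const std::string& stateJson, uint32_t nowMs) {
        const bool changed = !sentOnce_ || stateJson != last_;
        // Unsigned subtraction stays correct across millis() wraparound.
        const bool due = sentOnce_ && (nowMs - lastSendMs_ >= kMaxQuietMs);
        if (!changed && !due) return false;
        last_ = stateJson;
        lastSendMs_ = nowMs;
        sentOnce_ = true;
        return true;
    }

private:
    std::string last_;
    uint32_t lastSendMs_ = 0;
    bool sentOnce_ = false;
};
```

Because an unchanged state is still resent once per second, the client can treat a gap in 'S' frames as a dead link without a separate heartbeat message.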

Captive portal (PROBOT_CAPTIVE_PORTAL=0 disables): DNS catch-all on the
softAP + spoofed OS connectivity probes (Android generate_204, iOS
hotspot-detect, Windows connecttest/ncsi) pop a landing page that sends
users to the DS in a real browser; wpad.dat answered with a plain 404
to stop proxy-probe storms; all other 404s redirect to the portal.

Diagnostics: AP STA join/leave logged with MAC + IEEE reason code;
/health and the S frame now carry joyAgeMs, sta count and the last
disconnect reason; shown on the Logs page.

Channel: NVS override (UI > macro precedence, 0 = auto-scan) and
/setChannel?ch=N switches 1-13 live via CSA (csa_count=3) so clients
migrate without dropping; ch=0 applies at next boot. Logs page gained a
channel picker; /info reports chSource.

Four parallel reviewers (runtime/concurrency, DS server vs IDF 5.5
sources, web UI JS, docs/examples consistency) audited the whole
library. Everything found is fixed:

CRITICAL: DS-timeout FORCE_STOP never stopped the robot (0.2.8 bug) -
setting lastStatus alongside the status suppressed the STOP transition,
so teleop kept running and robotEnd never ran. CRITICAL: state/gamepad/
telemetry CAS spinlocks livelocked under same-core priority inversion -
all three converted to portMUX critical sections (readers too).
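The first CRITICAL bug is an edge-detection mistake and is easy to reproduce in miniature. Only Status::STOP and lastStatus come from the text above; the struct layout, the other enum values, and the function names are assumptions:

```cpp
// Assumed minimal state: the supervisor detects transitions by comparing
// status against lastStatus.
enum class Status { STOP, INIT, START };

struct StatusStore {
    Status status = Status::STOP;
    Status lastStatus = Status::STOP;
};

// Buggy timeout path: writing lastStatus too means the supervisor never
// sees status != lastStatus, so the STOP transition is never handled.
inline void forceStopBuggy(StatusStore& s) {
    s.status = Status::STOP;
    s.lastStatus = Status::STOP;
}

// Fixed timeout path: only the requested status is written.
inline void forceStopFixed(StatusStore& s) {
    s.status = Status::STOP;
}

// Supervisor tick. Returns true when a transition into STOP was handled
// (this is where the user task would be stopped and robotEnd would run).
inline bool superviseStop(StatusStore& s) {
    if (s.status == s.lastStatus) return false;  // no edge, nothing to do
    s.lastStatus = s.status;
    return s.status == Status::STOP;
}
```

Starting from a running robot (both fields START), forceStopBuggy followed by superviseStop handles nothing, while forceStopFixed followed by superviseStop handles the stop.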

Runtime: init/end workers no longer delete their own handles
(use-after-free); cooperative stop with 60ms grace before vTaskDelete;
sysloop subscribes to the task watchdog (it supervised zero tasks);
deadline-miss warning made one-shot (was ~1000x/s, wiping telemetry);
INITED reported only after robotInit completes; task-create failures
logged; autoLen clamped.

DS server: client-list arrays sized for max sockets (broadcasts died
entirely above 8 fds); broadcast-time owner filter (handshake-time
authorization stops being invoked on newer IDF); WS control-frame
payloads drained/echoed per RFC 6455; oversize frames close cleanly;
>20-axis frames rejected instead of misparsed into phantom buttons;
11b-rate disable moved between wifi stop/start per API contract;
owner-release side effects no longer clobber a concurrent re-acquire;
SSID sanitized into /info JSON and portal HTML; portal buffer sized;
updateController reads split bodies; joyAge clamped.

Servo: channels claimed once per instance - repeated DS Init burned a
fresh LEDC channel per attach until all servos died mid-match.

Web UI: joystick frames clamp to 20 axes/20 buttons (HOTAS-class
controllers were silently ignored); gamepad dropdown rebuilt only on
device changes (60Hz rebuild reset the selection); late onclose from a
dying socket can no longer orphan its replacement; 600ms command grace
window stops stale S frames reverting the button; healthFailCount
reset by WS health; status detail visible; portrait channel row fits.

Examples: TankDrive autonomous rewritten - delay(2000) tripped the
deadline-miss kill on every single run.

Docs/meta: CHANGELOG no longer describes the dead 'H'-frame design;
CI runs on dev/dev-* and compiles all examples; idf_component.yml URL
and library.json name fixed; keywords.txt completed; FUTURE_WORK
updated (incl. OTA verification of 11b/CSA on hardware).

Boot-time auto channel select was reachable via PROBOT_WIFI_AP_CHANNEL 0
and risky for a fleet: robots powered on together all scan an empty band
(none beaconing yet), every candidate scores 0, and they converge on
channel 1 — the opposite of distribution. For a coordinated competition,
deterministic manual assignment is correct; auto removes operator control.
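The convergence is easy to see in a sketch of a least-congested picker. The scoring input and the "first minimum wins" tie-break are assumptions consistent with the behavior described above, not the library's actual code:

```cpp
#include <array>
#include <cstddef>

// Candidate channels named in this PR.
constexpr std::array<int, 4> kCandidates{1, 5, 9, 13};

// scores[i] is the congestion score for kCandidates[i]; lower is better.
// Ties go to the earliest candidate (assumed tie-break).
inline int pickChannel(const std::array<int, 4>& scores) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < scores.size(); ++i) {
        if (scores[i] < scores[best]) best = i;
    }
    return kCandidates[best];
}
```

When every robot boots into an empty band, every score is 0 and pickChannel returns 1 on all of them. A single robot scanning an already busy venue would get a useful answer instead.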

Auto-select is now OFF by default and gated behind a separate, explicit
PROBOT_WIFI_AUTO_CHANNEL macro (single-robot use):

- PROBOT_WIFI_AP_CHANNEL must be 1-13 when auto is off; bare channel 0
  is now a clear compile error pointing to the new flag (was: silent scan).
- NVS ch=0 repurposed to "clear the manual pin, use firmware default"
  (was "auto at boot"); a manual pin 1-13 still wins over auto.
- Channel 13 kept in candidates {1,5,9,13}; docs warn region-locked
  clients may not see ch12-13 — assign 1-11 by hand if a device can't.
- CSA live-switch doc softened (some tablets drop+rescan; don't switch
  mid-match). Serial now prints "Channel: N (macro/nvs/auto)".

Docs/UI/CHANGELOG synced. All 3 examples and the auto-on branch compile;
9/9 host tests pass.

The NeoPixel mutex was taken with portMAX_DELAY and held across the
blocking pixel.show(); a task killed mid-show() orphaned it forever and
could wedge the sysloop (which also drives the LED). setColor/set/
setBrightness now only stash an atomic word; the blocking show() runs
from a single task (the sysloop) via flush(). No mutex, one show()
caller, no reentrancy.

… emergency stop

Root-cause fix for the issue #21 freeze. Replace the per-phase task
create/vTaskDelete model with ONE persistent user task (core 1, created
at boot, never killed in normal operation) running a phase state
machine. Buttons set an atomic "requested mode"; the task performs the
transition at its own loop boundary, so a Stop or phase change can no
longer interrupt user code mid-transaction and orphan a Wire/malloc
lock. The pure logic lives in core/lifecycle.hpp (host unit-testable).
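The request/boundary split can be sketched as below. This is an illustration of the idea only: PhaseMachine, Mode, and the method names are hypothetical, and the real core/lifecycle.hpp presumably also runs the user's end/init hooks during a transition:

```cpp
#include <atomic>

// Hypothetical mode set for illustration.
enum class Mode { STOP, INIT, TELEOP, AUTONOMOUS };

class PhaseMachine {
public:
    // Called from the UI/network side. Only records the request;
    // it never touches the user task.
    void request(Mode m) { requested_.store(m, std::memory_order_release); }

    // Called by the persistent user task between loop iterations.
    // Returns true if the phase changed on this call.
    bool atLoopBoundary() {
        const Mode want = requested_.load(std::memory_order_acquire);
        if (want == current_) return false;
        current_ = want;  // real code would run the phase exit/entry hooks here
        return true;
    }

    Mode current() const { return current_; }

private:
    std::atomic<Mode> requested_{Mode::STOP};
    Mode current_ = Mode::STOP;  // touched only by the user task
};
```

Since the transition happens only inside atLoopBoundary(), user code in the middle of an I2C transaction or a malloc always finishes the current iteration before the phase changes.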

- Stall watchdog is halt-safe: a loop iteration past
  PROBOT_LOOP_DEADLINE_MS zeroes inputs and holds safe (red LED) with NO
  kill and NO reboot, preserving homed/relative mechanism state.
- TWDT watches only the sysloop — a wedge reboots only on a
  supervisor/library fault, never on user code.
- Terminal emergencyStop(): cut the enable pin, kill the user task, run
  robotEnd() in a fresh task under a PROBOT_ESTOP_END_MS watchdog (reboot
  if it hangs on an orphaned bus), then latch until reboot.
- New macros: PROBOT_LOOP_DEADLINE_MS, PROBOT_WDT_TIMEOUT_S,
  PROBOT_ESTOP_END_MS, PROBOT_ESTOP_ENABLE_PIN, USER_LOOP_PERIOD_MS.
  STACK_USER 4096 -> 8192 (one task now hosts all six hooks).
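A sketch of the halt-safe stall check from the first bullet. The 100 ms deadline is a placeholder and not the library default, and StallMonitor, Inputs, and holdSafe are hypothetical names:

```cpp
#include <cstdint>

// Stands in for PROBOT_LOOP_DEADLINE_MS; the value is a placeholder.
constexpr uint32_t kLoopDeadlineMs = 100;

// Hypothetical input snapshot that the user loop reads.
struct Inputs {
    float axes[4];
    bool buttons[4];
};

struct StallMonitor {
    uint32_t iterationStartMs = 0;

    // User task: stamp the start of each loop iteration.
    void beginIteration(uint32_t nowMs) { iterationStartMs = nowMs; }

    // Supervisor: true once the current iteration has run past the deadline.
    // Unsigned subtraction keeps this correct across millis() wraparound.
    bool isStalled(uint32_t nowMs) const {
        return nowMs - iterationStartMs > kLoopDeadlineMs;
    }
};

// Halt-safe response: neutralize inputs, but do not kill the task or reboot.
inline void holdSafe(Inputs& in) { in = Inputs{}; }
```

The supervisor would call holdSafe() when isStalled() reports true, leaving the user task and any homed mechanism state untouched.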

BREAKING: Stop takes effect at the next loop boundary (up to one
iteration of latency); a wedged autonomous no longer auto-recovers to
teleop (it falls to halt-safe). The 6-hook API is source-compatible.

…meout

/robotControl gains cmd=estop (requests the terminal emergency stop, run
by the sysloop) and cmd=reboot (ESP.restart, the way to clear the
latch). While estop-latched, init/start are refused with 409. The estop
flag is published on the 'S' push frame and /getState so the UI can show
a persistent banner. httpd recv_wait_timeout lowered 5s -> 2s, symmetric
with send_wait_timeout, to bound a half-open client's worker hold.
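How the command handling might be structured. Only cmd=estop, cmd=reboot, the init/start refusal, and the 409 come from the text above; the "stop" command, the 200 and 400 codes, and all names here are assumptions:

```cpp
#include <string>

// Hypothetical state shared with the sysloop.
struct ControlState {
    bool estopLatched = false;     // set by the sysloop after the e-stop sequence
    bool estopRequested = false;   // consumed by the sysloop
    bool rebootRequested = false;  // consumed by the sysloop
};

// Returns the HTTP status code for a /robotControl?cmd=... request.
inline int handleRobotControl(ControlState& st, const std::string& cmd) {
    if (cmd == "estop") {
        st.estopRequested = true;  // the sysloop performs the actual stop
        return 200;
    }
    if (cmd == "reboot") {
        st.rebootRequested = true; // the only way to clear the latch
        return 200;
    }
    if (cmd == "init" || cmd == "start") {
        if (st.estopLatched) return 409;  // refused while latched
        return 200;
    }
    if (cmd == "stop") return 200;
    return 400;  // unknown command
}
```

The handler only records requests, so the httpd worker returns quickly and the terminal sequence runs in one place.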
Add a red EMERGENCY STOP button to Match Control (sends cmd=estop) and a
full-screen "EMERGENCY STOPPED — reboot required" overlay driven by the
estop state field, with a Reboot button (cmd=reboot).

Cover the phase state machine (full STOP->INIT->TELEOP->STOP path,
autonomous entry stamping the start time), the supervisor's
status->mode mapping and autonomous-period expiry, and stall detection.

…0.3.0

README/API.md/CHANGELOG: the new lifecycle and loop-return contract
("every iteration must return; bound blocking calls"), halt-safe stall
behavior, emergency stop + PROBOT_ESTOP_ENABLE_PIN, the new macros, and
migration notes. Version 0.2.9 -> 0.3.0 (synced to library.properties,
library.json, idf_component.yml).

The public emergencyStop() ran the full sequence inline, including
vTaskDelete(userTask). Called from a user hook (which runs in that very
task) it would delete its own caller mid-sequence, so robotEnd never
spawned and the status never latched. emergencyStop() now only sets the
request flag; the sysloop runs the actual kill/robotEnd/latch sequence
(detail::runEmergencyStop), so it never deletes its caller and a team can
safely trigger an e-stop from teleopLoop on a detected fault.
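The fix amounts to separating the request from the execution. A minimal sketch, with EstopSketch and superviseTick as hypothetical names (the source names only emergencyStop() and detail::runEmergencyStop):

```cpp
#include <atomic>

class EstopSketch {
public:
    // Safe from any context, including a user hook running in the user
    // task: it only sets a flag and returns to the caller.
    void emergencyStop() { requested_.store(true, std::memory_order_release); }

    // Supervisor (sysloop) tick. Runs the terminal sequence at most once and
    // returns true on the tick that performed it.
    bool superviseTick() {
        if (latched_ || !requested_.load(std::memory_order_acquire)) return false;
        // Per the text, the real sequence would cut the enable pin, kill the
        // user task, and run robotEnd() under a watchdog before latching.
        latched_ = true;
        return true;
    }

    bool latched() const { return latched_; }

private:
    std::atomic<bool> requested_{false};
    bool latched_ = false;  // touched only by the supervisor
};
```

Because the kill now happens in the supervisor's context, the task that called emergencyStop() is never the one executing vTaskDelete on itself.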
The builtin status LED is now driven solely by the library — the manual
color API (builtinled::setColor/set/setBrightness) is removed, since the
LED color always means something (match phase / stalled / e-stop). One
internal render() entry point, called only from the sysloop.

Add PROBOT_RSL_PIN: an FRC-RSL-style signal light on a plain digital pin
that the library drives — blink while the robot can move (teleop/auton),
solid on otherwise. Also NEOPIXEL_BRIGHTNESS macro (default 32).
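The RSL rule reduces to one pure function. The 250 ms half-period is an assumption (the blink rate is not stated here), and rslLevel is a hypothetical name:

```cpp
#include <cstdint>

// Assumed blink timing: 250 ms on, 250 ms off.
constexpr uint32_t kBlinkHalfPeriodMs = 250;

// Returns the pin level for the signal light: blinking while the robot can
// move (teleop/auton), solid on otherwise.
constexpr bool rslLevel(bool canMove, uint32_t nowMs) {
    return canMove ? ((nowMs / kBlinkHalfPeriodMs) % 2 == 0) : true;
}
```

On hardware the result would be written to PROBOT_RSL_PIN from the sysloop.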

BREAKING: sketches calling builtinled::setColor/set/setBrightness no
longer compile; remove those calls (use PROBOT_RSL_PIN for an indicator).

…n instead

The library is the communication + lifecycle layer; it shouldn't own/maintain
output-hardware wrappers (the same reasoning already applied to motors, which
never had a class). probot::devices::Servo is removed; servos are now driven
with raw LEDC, documented as the reference pattern.

The pattern keeps the jitter fix the class provided: attach the servo on a
HIGH LEDC channel (ledcAttachChannel(pin, 50, 14, 7)) so it never shares a
timer with analogWrite motor PWM (which uses the low channels) — the timer
collision is the #1 software cause of servo jitter. ServoTest is rewritten as
the worked example; README / API.md / llms.txt updated.
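The duty arithmetic behind that pattern, as a host-checkable sketch. The 50 Hz / 14-bit / channel 7 values come from the call quoted above; pulseUsToDuty is a hypothetical helper, and the 1000 to 2000 µs pulse range is the common hobby-servo convention rather than something stated in this PR:

```cpp
#include <cstdint>

constexpr uint32_t kServoFreqHz = 50;   // 20 ms period
constexpr uint32_t kServoBits = 14;     // duty range 0..16383
constexpr int8_t kServoChannel = 7;     // high channel, away from analogWrite

// Convert a pulse width in microseconds to a 14-bit LEDC duty value:
// duty = pulse / period * 2^bits, with period = 1e6 / 50 = 20000 us.
constexpr uint32_t pulseUsToDuty(uint32_t pulseUs) {
    return static_cast<uint32_t>(
        (static_cast<uint64_t>(pulseUs) * (1u << kServoBits) * kServoFreqHz) /
        1000000u);
}

// On the ESP32 (arduino-esp32 3.x) the calls would look roughly like:
//   ledcAttachChannel(pin, kServoFreqHz, kServoBits, kServoChannel);
//   ledcWrite(pin, pulseUsToDuty(1500));  // center position
```

A 1500 µs center pulse maps to a duty of 1228 out of 16384. examples/ServoTest remains the reference for the full pattern.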

BREAKING: sketches using probot::devices::Servo won't compile — switch to the
ledcAttachChannel/ledcWrite pattern in examples/ServoTest.

tunapro1234 merged commit e5907a5 into stable on Jun 26, 2026
2 checks passed