fix(firmware): stop ESP32-S3 sendto ENOMEM tight loop - 10 Hz self-ping, 300 ms backoff, 128 TX buffers (#1135) #1142
Open
binhusmachado-code wants to merge 1 commit into
…uvnet#1135)

On a fresh ESP32-S3 the node enters a permanent `sendto ENOMEM` loop from the second CSI callback onward and zero UDP frames ever leave the device (the aggregator stays `esp32:offline`), even though Wi-Fi, DHCP and ICMP are healthy and pkt_yield sits at 0 pps.

Per the analysis in ruvnet#1135, during the first ~1 s after boot the 50 Hz self-ping + mmWave UART probe + ESPNOW init + promiscuous sniffer all contend for the same lwIP pbuf / Wi-Fi dynamic-TX pools; `sendto` returns ENOMEM and the fixed 100 ms backoff from ruvnet#132 is too short to let the pools drain, so it re-fires into a still-full pool every cycle and loops forever. The S3 contends harder for these buffers than the C6 the original 0.6.x/0.7.0 tuning was verified against.

Implements the three mitigations proposed in ruvnet#1135:

* csi_collector.c: self-ping cadence 50 Hz -> 10 Hz (interval_ms 20 -> 100). Cuts ~52 back-to-back boot-time datagrams/s of TX flood while keeping the CSI OFDM source alive. Interval comment, header comment and log string updated.
* stream_sender.c: ENOMEM_COOLDOWN_MS 100 -> 300 so the backoff outlasts the pool pressure instead of re-firing into a still-full pool.
* sdkconfig.defaults: CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM 64 -> 128 (max of the IDF 1..128 range) for TX headroom during the boot contention window. Scoped to the S3: the bump lives in the base sdkconfig.defaults, so to leave the untested C6 build unchanged it is pinned back to 64 in sdkconfig.defaults.esp32c6.

Also tidied a stale "50 Hz" self-ping header comment and a stale "100 ms" backoff comment in adaptive_controller.c so they match the new runtime behavior.

Measured on an ESP32-S3-DevKitC-1-class board (QFN56 rev v0.2, 16MB/8MB, USB-Serial/JTAG, WPA2 2.4 GHz; aggregator UDP :5005 on macOS), built and flashed with ESP-IDF v5.4:

  before: sendto ENOMEM tight loop, yield 0 pps, 0 frames reach the host
  after: yield 9-13 pps, no ENOMEM, 300+ CSI frames/min received, vitals parsing

Fixes the egress/ENOMEM half (bug ruvnet#1) of ruvnet#1135 only; the phantom-LD2410-on-floating-UART detection (bug ruvnet#2) is out of scope and belongs with the ruvnet#1107/ruvnet#1119 mmwave-validation work. Verified on ESP32-S3 only, not on C6.

Refs ruvnet#132, ruvnet#521, ruvnet#954, ruvnet#1107, ruvnet#1119.
A fresh ESP32-S3 node enters a permanent `sendto ENOMEM` loop and never emits a single UDP frame. This applies the three mitigations proposed in #1135 and gets the egress path back to 9–13 pps / 300+ CSI frames/min, measured end-to-end on real S3 hardware.

Addresses #1135: fixes bug #1 (egress ENOMEM) only; bug #2 (phantom LD2410 on a floating UART) is intentionally out of scope (see Testing & scope).
Problem
On a fresh ESP32-S3, from the second CSI callback onward the node spins in a permanent `sendto ENOMEM` loop and zero UDP frames ever leave the device (the aggregator stays `esp32:offline`), even though Wi-Fi, DHCP and ICMP are all healthy. `pkt_yield_per_sec` sits at 0 pps.

This matches the earlier S3 report in #1107, whose ENOMEM/yield-0 half #1119 explicitly deferred as "most likely a separate network-path issue." #1135 pins that separate egress issue down.
Root cause
Per the analysis in #1135: during the first ~1 s after boot, the 50 Hz self-ping + mmWave UART probe + ESPNOW init + promiscuous sniffer all contend for the same lwIP pbuf / Wi-Fi dynamic-TX pools. `sendto` returns `ENOMEM`, and the fixed 100 ms backoff introduced in #132 is too short to let the pools drain, so the backoff re-fires into a still-full pool every cycle and loops forever. The S3 contends harder for these buffers than the C6 the original 0.6.x/0.7.0 tuning was verified against, which is why it surfaces there.

The fix: the three mitigations proposed in #1135
* `main/csi_collector.c`: self-ping cadence 50 Hz → 10 Hz (`cfg.interval_ms` 20 → 100). Removes ~52 back-to-back boot-time datagrams/s of TX flood while keeping the CSI OFDM source alive. Interval comment, block-header comment, and the `ESP_LOGI` startup string updated to `@10Hz`.
* `main/stream_sender.c`: `ENOMEM_COOLDOWN_MS` 100 → 300. The backoff now outlasts the pbuf/mbox pressure instead of re-firing into a still-full pool. (Flat, not exponential; see Testing & scope.)
* `sdkconfig.defaults`: `CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM` 64 → 128. TX headroom for the boot contention window. `128` is the max of the Kconfig `range 1 128` on ESP-IDF v5.4; the project has `CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER=y` (no PSRAM → dynamic TX), so the value is actually applied, not silently dropped.

Scoped to the S3. Because the bump lives in the base `sdkconfig.defaults`, the C6 build would inherit it through the overlay chain. Since this was hardware-tested on the S3 only, I pinned `CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM=64` in `sdkconfig.defaults.esp32c6`, so the C6 build is byte-identical to today. Happy to drop that pin if you'd rather the C6 take the bump too.

These are conservative, reversible tunings: two constants, one buffer-count bump (~few KB heap), one C6 pin, and two stale-comment fixes. No API, protocol, or dependency change.
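The flat cooldown and the cadence change above can be sketched in portable C. This is a minimal illustration under stated assumptions, not the code in `stream_sender.c` or `csi_collector.c`: `enomem_gate_t`, `gate_may_send`, `gate_on_result`, `ping_interval_ms`, and `gate_example` are hypothetical names, and timestamps are passed in by the caller rather than read from a real clock. Only `ENOMEM_COOLDOWN_MS` and the 20 ms / 100 ms intervals come from this PR.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define ENOMEM_COOLDOWN_MS 300 /* value from this PR; previously 100 */

/* Hypothetical state holder; the real sender may track this differently. */
typedef struct {
    bool    cooling;      /* true while sends are being suppressed */
    int64_t resume_at_ms; /* earliest time the next send may be attempted */
} enomem_gate_t;

/* Returns true if a send may be attempted at time now_ms. */
bool gate_may_send(enomem_gate_t *g, int64_t now_ms)
{
    if (g->cooling && now_ms < g->resume_at_ms) {
        return false; /* still inside the cooldown window: skip this frame */
    }
    g->cooling = false;
    return true;
}

/* Records the outcome of a send attempt; err is 0 or an errno value. */
void gate_on_result(enomem_gate_t *g, int64_t now_ms, int err)
{
    if (err == ENOMEM) {
        g->cooling = true;
        g->resume_at_ms = now_ms + ENOMEM_COOLDOWN_MS; /* flat, not exponential */
    }
}

/* Converts a self-ping cadence in Hz to an interval in ms (hz must be nonzero). */
uint32_t ping_interval_ms(uint32_t hz)
{
    return 1000u / hz;
}

/* Walk-through: one ENOMEM at t = 0 suppresses sends until t = 300 ms. */
void gate_example(void)
{
    enomem_gate_t g = { false, 0 };

    assert(gate_may_send(&g, 0));
    gate_on_result(&g, 0, ENOMEM);

    assert(!gate_may_send(&g, 100)); /* the old 100 ms backoff would retry here */
    assert(!gate_may_send(&g, 299));
    assert(gate_may_send(&g, 300));  /* cooldown elapsed */

    assert(ping_interval_ms(50) == 20);  /* old cadence */
    assert(ping_interval_ms(10) == 100); /* new cadence */
}
```

Passing the timestamp in keeps the sketch testable off-target; on the device the caller would supply a monotonic millisecond clock instead.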
Measured results
Built and flashed with ESP-IDF v5.4 on an ESP32-S3-DevKitC-1-class board (QFN56 rev v0.2, 16 MB flash / 8 MB PSRAM, native USB-Serial/JTAG; WPA2 2.4 GHz). Aggregator = a stdlib UDP listener on `:5005` (macOS).

`sendto` ENOMEM: tight loop before, none after.
`pkt_yield_per_sec`: 0 pps before, 9–13 pps after.

The ~10 pps floor stays comfortably above the `min_pkt_yield = 5 pps` DEGRADED gate, so no flapping. The edge pipeline measures its true sample rate from inter-frame timestamps and re-tunes (#987), so dropping the self-ping 5× does not break the vitals/BPM math; 10 Hz is still >5× Nyquist for HR.

Testing & scope (honesty)
* … `min_pkt_yield = 5` DEGRADED threshold, and it needs added state; better as its own hardware-tested follow-up with a capped max.
* … `version.txt`); #1135 was filed against v0.8.1-esp32. The egress root cause and the 100 ms backoff are identical in both, but please confirm it lands cleanly on the current v0.8.x branch.
* … `feature_state`-emit ENOMEM under load (noted in the in-tree `sdkconfig.defaults` comment block) is a separate adaptive-controller emit-cadence follow-up; the 300 ms backoff likely masks part of it but does not address it.

Refs
#132 (the original ENOMEM backoff this extends — the 100 ms value #1135 says is too short on S3), #521 / #954 (self-ping CSI source), #1107 / #1119 (the mmwave-validation cluster + the deferred separate egress half this resolves).