fix(firmware): stop ESP32-S3 sendto ENOMEM tight loop - 10 Hz self-ping, 300 ms backoff, 128 TX buffers (#1135)#1142

Open
binhusmachado-code wants to merge 1 commit into ruvnet:main from binhusmachado-code:fix/s3-sendto-enomem-1135

@binhusmachado-code

A fresh ESP32-S3 node enters a permanent sendto ENOMEM loop and never emits a single UDP frame. This applies the three mitigations proposed in #1135 and gets the egress path back to 9–13 pps / 300+ CSI frames/min, measured end-to-end on real S3 hardware.

Addresses #1135 — fixes bug #1 (egress ENOMEM) only; bug #2 (phantom LD2410 on a floating UART) is intentionally out of scope (see Testing & scope).

Problem

On a fresh ESP32-S3, from the second CSI callback onward the node spins in a permanent sendto ENOMEM loop and zero UDP frames ever leave the device — the aggregator stays esp32:offline — even though Wi-Fi, DHCP and ICMP are all healthy. pkt_yield_per_sec sits at 0 pps.

This matches the earlier S3 report in #1107, whose ENOMEM/yield-0 half #1119 explicitly deferred as "most likely a separate network-path issue." #1135 pins that separate egress issue down.

Root cause

Per the analysis in #1135: during the first ~1 s after boot, the 50 Hz self-ping + mmWave UART probe + ESPNOW init + promiscuous sniffer all contend for the same lwIP pbuf / Wi-Fi dynamic-TX pools. sendto returns ENOMEM, and the fixed 100 ms backoff introduced in #132 is too short to let the pools drain, so the backoff re-fires into a still-full pool every cycle and loops forever. The S3 contends harder for these buffers than the C6 the original 0.6.x/0.7.0 tuning was verified against, which is why it surfaces there.

The fix — the three mitigations proposed in #1135

  1. main/csi_collector.c: self-ping cadence 50 Hz → 10 Hz (cfg.interval_ms 20 → 100). Removes ~52 back-to-back boot-time datagrams/s of TX flood while keeping the CSI OFDM source alive. Interval comment, block-header comment, and the ESP_LOGI startup string updated to @10Hz.
  2. main/stream_sender.c: ENOMEM_COOLDOWN_MS 100 → 300. The backoff now outlasts the pbuf/mbox pressure instead of re-firing into a still-full pool. (Flat, not exponential; see Testing & scope.)
  3. sdkconfig.defaults: CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM 64 → 128. TX headroom for the boot contention window. 128 is the maximum of the Kconfig range 1..128 on ESP-IDF v5.4; the project has CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER=y (no PSRAM → dynamic TX), so the value is actually applied, not silently dropped.

Scoped to the S3. Because the bump lives in the base sdkconfig.defaults, the C6 build would inherit it through the overlay chain. Since this was hardware-tested on the S3 only, I pinned CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM=64 in sdkconfig.defaults.esp32c6, so the C6 build is byte-identical to today. Happy to drop that pin if you'd rather the C6 take the bump too.

These are conservative, reversible tunings: two constants, one buffer-count bump (~few KB heap), one C6 pin, and two stale-comment fixes. No API, protocol, or dependency change.
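
For reviewers who want the backoff behavior in one place, here is a minimal host-side sketch of a flat cooldown gate. It is not the code in stream_sender.c: the struct, the function names and the replay helper are hypothetical, and only ENOMEM_COOLDOWN_MS and its value of 300 come from this PR.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENOMEM_COOLDOWN_MS 300 /* value from this PR; previously 100 */

/* Hypothetical state holder; the real stream_sender.c may track this differently. */
typedef struct {
    int64_t  cooldown_until_ms; /* sends are skipped before this time */
    uint32_t skipped;           /* frames dropped while cooling down */
} sender_gate_t;

/* Returns true if a send may be attempted at now_ms. */
static bool gate_may_send(sender_gate_t *g, int64_t now_ms)
{
    if (now_ms < g->cooldown_until_ms) {
        g->skipped++;
        return false;
    }
    return true;
}

/* Call after sendto() fails with ENOMEM: starts a flat (non-growing) cooldown. */
static void gate_on_enomem(sender_gate_t *g, int64_t now_ms)
{
    g->cooldown_until_ms = now_ms + ENOMEM_COOLDOWN_MS;
}

/* Replays n_callbacks CSI callbacks at a fixed period after a single ENOMEM
 * at t0_ms and returns how many of them were skipped by the gate. */
static uint32_t skipped_after_enomem(int64_t t0_ms, int64_t period_ms, int n_callbacks)
{
    sender_gate_t g = {0, 0};
    gate_on_enomem(&g, t0_ms);
    for (int i = 1; i <= n_callbacks; i++) {
        (void)gate_may_send(&g, t0_ms + (int64_t)i * period_ms);
    }
    return g.skipped;
}
```

With callbacks every 100 ms, skipped_after_enomem(1000, 100, 4) skips the callbacks at 1100 ms and 1200 ms and lets the ones at 1300 ms and 1400 ms through, so a single ENOMEM costs two frames at this cadence.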

Measured results

Built and flashed with ESP-IDF v5.4 on an ESP32-S3-DevKitC-1-class board (QFN56 rev v0.2, 16 MB flash / 8 MB PSRAM, native USB-Serial/JTAG; WPA2 2.4 GHz). Aggregator = a stdlib UDP listener on :5005 (macOS).

| Metric | Before | After |
| --- | --- | --- |
| sendto ENOMEM | tight loop, never drains | none observed |
| pkt_yield_per_sec | 0 pps | 9–13 pps |
| CSI frames reaching host | 0 | 300+ / min |
| Vitals packets | none | parsing on host |

The ~10 pps floor stays comfortably above the min_pkt_yield = 5 pps DEGRADED gate, so no flapping. The edge pipeline measures its true sample rate from inter-frame timestamps and re-tunes (#987), so dropping the self-ping 5× does not break the vitals/BPM math; 10 Hz is still >5× Nyquist for HR.
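
To make the rate argument concrete, the sketch below shows one simple way to estimate a frame rate from inter-frame timestamps and compare it with the 5 pps gate. It is an assumption-laden illustration, not the edge pipeline's code from #987: the function names are hypothetical, and only the threshold value is taken from this PR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_PKT_YIELD_PPS 5.0 /* DEGRADED gate quoted in this PR */

/* Mean frame rate in Hz from n increasing timestamps in microseconds.
 * Returns 0.0 when fewer than two timestamps are given or the span is not positive. */
static double estimate_rate_hz(const int64_t *ts_us, size_t n)
{
    if (n < 2) {
        return 0.0;
    }
    int64_t span_us = ts_us[n - 1] - ts_us[0];
    if (span_us <= 0) {
        return 0.0;
    }
    /* n timestamps bound n - 1 inter-frame intervals. */
    return (double)(n - 1) * 1e6 / (double)span_us;
}

/* Hypothetical gate check: true when the measured yield is below the threshold. */
static bool yield_is_degraded(double pps)
{
    return pps < MIN_PKT_YIELD_PPS;
}
```

Five frames spaced 100 ms apart give four intervals over 400 ms, which this estimator reports as 10 Hz, above the gate.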

Testing & scope (honesty)

  • S3 only, not C6. Verified end-to-end on ESP32-S3 hardware only (no C6 board on hand). This inverts RuView's usual C6-first coverage, so I'm flagging it plainly. The C6 build is left unchanged by this PR (the TX-buffer pin above).
  • Flat 300 ms backoff, not exponential. #1135 also suggested an exponential schedule; I kept it flat because only the flat build was hardware-verified, and with the self-ping flood cut + 128 buffers the backoff now rarely fires (0 ENOMEM observed). Exponential under sustained pressure could grow the cooldown enough to starve CSI sends below the min_pkt_yield = 5 DEGRADED threshold, and it needs added state, so it is better as its own hardware-tested follow-up with a capped max.
  • Version-tree caveat. The patched tree is v0.7.0 (version.txt); #1135 was filed against v0.8.1-esp32. The egress root cause and the 100 ms backoff are identical in both, but please confirm it lands cleanly on the current v0.8.x branch.
  • Not a total ENOMEM elimination. This kills the boot-time loop. A residual airtime-bound feature_state-emit ENOMEM under load (noted in the in-tree sdkconfig.defaults comment block) is a separate adaptive-controller emit-cadence follow-up; the 300 ms backoff likely masks part of it but does not address it.
  • Single board, single network, no long soak. @Gelsoluis offered to test patches on the v0.8.1-esp32 S3 + Docker rig — that validation before merge would de-risk the version-tree and single-board caveats.
  • No CHANGELOG entry is included (the repo CHANGELOG looks release-automated); happy to add one if you'd like.
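
For the flat-versus-exponential point in the second bullet above, this is a rough sketch of what a capped exponential schedule could look like as a follow-up. Nothing here is in this PR: the state struct, the function names and the 1200 ms cap are hypothetical and not hardware-tested; only the 300 ms base matches the new ENOMEM_COOLDOWN_MS.

```c
#include <stdint.h>

#define BACKOFF_BASE_MS 300u  /* matches the flat cooldown in this PR */
#define BACKOFF_MAX_MS  1200u /* hypothetical cap, not a tested value */
#define BACKOFF_MAX_STEPS 8u  /* bounds the shift so the multiply cannot overflow */

/* The extra state an exponential schedule needs and a flat one does not. */
typedef struct {
    uint32_t consecutive_enomem;
} backoff_state_t;

/* Returns the cooldown for the current ENOMEM, then advances the schedule:
 * base, 2x base, 4x base, ... clamped to BACKOFF_MAX_MS. */
static uint32_t backoff_next_ms(backoff_state_t *s)
{
    uint32_t ms = BACKOFF_BASE_MS << s->consecutive_enomem;
    if (ms > BACKOFF_MAX_MS) {
        ms = BACKOFF_MAX_MS;
    }
    if (s->consecutive_enomem < BACKOFF_MAX_STEPS) {
        s->consecutive_enomem++;
    }
    return ms;
}

/* Call after a successful sendto() so the next failure starts from the base again. */
static void backoff_reset(backoff_state_t *s)
{
    s->consecutive_enomem = 0;
}
```

Under these assumed values the cooldown runs 300, 600, 1200, 1200 ms on consecutive failures. A node failing on every retry at the cap would attempt less than one send per second, which is the starvation risk against the 5 pps gate that any capped design would have to be tested for.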

Refs

#132 (the original ENOMEM backoff this extends — the 100 ms value #1135 says is too short on S3), #521 / #954 (self-ping CSI source), #1107 / #1119 (the mmwave-validation cluster + the deferred separate egress half this resolves).

…uvnet#1135)

On a fresh ESP32-S3 the node enters a permanent `sendto ENOMEM` loop from the
second CSI callback onward and zero UDP frames ever leave the device (the
aggregator stays `esp32:offline`), even though Wi-Fi, DHCP and ICMP are healthy
and pkt_yield sits at 0 pps.

Per the analysis in ruvnet#1135, during the first ~1 s after boot the 50 Hz self-ping
+ mmWave UART probe + ESPNOW init + promiscuous sniffer all contend for the same
lwIP pbuf / Wi-Fi dynamic-TX pools; `sendto` returns ENOMEM and the fixed 100 ms
backoff from ruvnet#132 is too short to let the pools drain, so it re-fires into a
still-full pool every cycle and loops forever. The S3 contends harder for these
buffers than the C6 the original 0.6.x/0.7.0 tuning was verified against.

Implements the three mitigations proposed in ruvnet#1135:

* csi_collector.c: self-ping cadence 50 Hz -> 10 Hz (interval_ms 20 -> 100).
  Cuts ~52 back-to-back boot-time datagrams/s of TX flood while keeping the CSI
  OFDM source alive. Interval comment, header comment and log string updated.
* stream_sender.c: ENOMEM_COOLDOWN_MS 100 -> 300 so the backoff outlasts the
  pool pressure instead of re-firing into a still-full pool.
* sdkconfig.defaults: CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM 64 -> 128 (max of
  the IDF 1..128 range) for TX headroom during the boot contention window.

Scoped to the S3: the bump lives in the base sdkconfig.defaults, so to leave the
untested C6 build unchanged it is pinned back to 64 in sdkconfig.defaults.esp32c6.
Also tidied a stale "50 Hz" self-ping header comment and a stale "100 ms" backoff
comment in adaptive_controller.c so they match the new runtime behavior.

Measured on an ESP32-S3-DevKitC-1-class board (QFN56 rev v0.2, 16MB/8MB,
USB-Serial/JTAG, WPA2 2.4 GHz; aggregator UDP :5005 on macOS), built and flashed
with ESP-IDF v5.4:

  before: sendto ENOMEM tight loop, yield 0 pps, 0 frames reach the host
  after:  yield 9-13 pps, no ENOMEM, 300+ CSI frames/min received, vitals parsing

Fixes the egress/ENOMEM half (bug #1) of ruvnet#1135 only; the phantom-LD2410-on-
floating-UART detection (bug #2) is out of scope and belongs with the
ruvnet#1107/ruvnet#1119 mmwave-validation work. Verified on ESP32-S3 only, not on C6.

Refs ruvnet#132, ruvnet#521, ruvnet#954, ruvnet#1107, ruvnet#1119.