stream_sender: sendto ENOMEM in a tight loop on ESP32-S3 (v0.8.1-esp32) — 0 UDP frames ever leave the node #1135

@Gelsoluis

Summary

A fresh ESP32-S3 (8 MB, no peripherals attached) flashed with the v0.8.1-esp32 release assets enters a permanent sendto ENOMEM loop on the second CSI callback. The aggregator's source state never advances past esp32:offline. While diagnosing this I also found what looks like a separate phantom LD2410 detection on a floating UART — I'm filing both here since they were observed in the same run; happy to split into two issues if preferred.

Environment

Board: ESP32-S3, QFN56 rev v0.2, 8 MB embedded PSRAM, USB-Serial/JTAG
Flash: 8 MB, DIO @ 80 MHz
Firmware: v0.8.1-esp32 release assets (bootloader.bin, partition-table.bin, ota_data_initial.bin, esp32-csi-node-s3-8mb.bin) at 0x0 / 0x8000 / 0xf000 / 0x20000
esptool: 5.3.0 (host: macOS)
Aggregator: ruvnet/wifi-densepose:latest in Docker, UDP 0.0.0.0:5005 mapped to host
Wi-Fi: WPA2-PSK, 2.4 GHz, ch 10, RSSI -39 to -42 dBm
Network: ESP32 ↔ host on same /24, ICMP RTT ~12 ms
mmWave sensor: None physically connected (UART1 TX/RX pins floating)
NVS: Wiped before each test (esptool erase-region 0x9000 0x6000) and re-provisioned via the bundled provision.py

What works

  • Wi-Fi associates, DHCP returns an IP, host↔ESP32 ICMP is healthy.
  • CSI capture itself is alive: callbacks fire at expected cadence with valid len and rssi fields.
  • The aggregator's UDP receiver works: injecting a hand-built UDP packet at 127.0.0.1:5005 from the host immediately promotes its source from simulated to esp32. So the failure is strictly on the node's egress path.
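For anyone who wants to repeat the injection check from the last bullet, here is a minimal host-side sketch using plain POSIX sockets. The function name send_udp_datagram is made up for illustration, and the CSI payload layout is not described in this issue, so the caller has to supply bytes the aggregator will accept.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the project): send one UDP datagram
 * to ip:port. Returns the number of bytes sent, or -1 on any error
 * (unparseable IPv4 address, socket() failure, sendto() failure).
 */
long send_udp_datagram(const char *ip, uint16_t port,
                       const void *payload, size_t len)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);

    /* inet_pton returns 1 only for a valid dotted-quad IPv4 string. */
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
        return -1;
    }

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    ssize_t sent = sendto(fd, payload, len, 0,
                          (const struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return (long)sent;
}
```

Calling it with "127.0.0.1", port 5005 and a hand-built frame mirrors the check described above; a successful return only means the datagram left the host's socket layer, not that the aggregator parsed it.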

What fails

Annotated boot log (timestamps in ms since reset; nothing else is talking on this LAN):

I (4124) main: Got IP: 192.168.0.19
I (4124) stream_sender: UDP sender initialized: 192.168.0.25:5005
I (4144) csi_collector: WiFi modem sleep disabled (WIFI_PS_NONE) for CSI capture
I (4154) wifi:ic_enable_sniffer
I (4154) csi_collector: Promiscuous mode enabled (MGMT-only, RuView#396)
I (4164) csi_collector: self-ping started -> 192.168.0.1 @50Hz (CSI OFDM source, fix #521/#954)
I (4184) ESPNOW: espnow [version: 2.0] init
I (4194) edge_proc: Initializing edge processing (tier=2, top_k=8, vital_interval=1000ms, ...)
I (4294) mmwave: Probing UART1 (TX=17, RX=18) for mmWave sensor...
I (4304) mmwave: Probing at 115200 baud (MR60BHA2)...
I (4494) csi_collector: CSI cb #1: len=128 rssi=-25 ch=10        ← first send succeeds (no log = OK)
I (5364) mmwave: Probing at 256000 baud (LD2410)...
I (5544) csi_collector: CSI cb #2: len=128 rssi=-25 ch=10
W (5544) stream_sender: sendto ENOMEM — backing off for 100 ms   ← first failure on CSI cb #2
W (5544) csi_collector: sendto failed (fail #1)
I (5564) mmwave: Detected LD2410 at 256000 baud (caps=0x000c)    ← (see "bug #2" below)
I (5564) mmwave: mmWave UART task started (type=LD2410)
W (5564) stream_sender: sendto suppressed (ENOMEM backoff, 1 dropped)
... (steady-state — every send either ENOMEMs or is suppressed)

The aggregator stays at {"source":"esp32:offline"} indefinitely; zero CSI frames reach it over the network even though L2/L3 is healthy.

Bug #1 — primary: permanent sendto ENOMEM from CSI cb #2 onward

The first stream_sender_send (on CSI cb #1) appears to succeed (no failure log). The very next one fails with ENOMEM and never recovers: every subsequent attempt either returns ENOMEM or is suppressed by the 100 ms backoff. The fixed 100 ms backoff appears to be shorter than the time the underlying pbuf/mbox pressure needs to clear, so the node stays stuck.

The 1050 ms gap between cb #1 and cb #2 is occupied by:

  • the 50 Hz self-ping to the gateway (csi_collector: self-ping started ... @50Hz) — that's ~52 UDP datagrams enqueued back-to-back into LWIP;
  • the MR60BHA2 UART probe at 115200 baud for ~1060 ms;
  • ESPNOW init + c6_espnow tx loop;
  • promiscuous + sniffer RX consuming Wi-Fi RX buffers.

It looks like LWIP pbufs / WiFi dynamic TX buffers / UDP send mbox saturate during that 1 s and never drain. sdkconfig.defaults already mentions a sibling fix for an earlier ENOMEM (note above CONFIG_LWIP_UDP_RECVMBOX_SIZE=32 / CONFIG_LWIP_TCPIP_RECVMBOX_SIZE=64 / CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM=64), but on S3 those values don't appear sufficient — possibly because S3 + Sniffer + 50 Hz self-ping + ESPNOW competes harder for buffers than the C6 target the 0.6.7 build was verified against.

Possible fixes worth considering:

  • Drop the self-ping cadence (50 Hz → 10 Hz?) when the LD2410/mmWave or ESPNOW tasks are also TX-active.
  • Raise CONFIG_ESP_WIFI_DYNAMIC_TX_BUFFER_NUM / CONFIG_LWIP_TCPIP_RECVMBOX_SIZE further in the S3-specific sdkconfig overlays.
  • When stream_sender has been in ENOMEM backoff for >N consecutive cycles, exponentially extend the backoff (the current fixed 100 ms is too short) and emit a single warning instead of one per attempt.
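To make the last suggestion concrete, here is a rough sketch of a capped exponential backoff that also reports only the first failure of a streak. It is plain C with no ESP-IDF dependencies; every name in it (enomem_backoff_t, backoff_init and so on) is hypothetical and not taken from stream_sender, and the base and cap values are placeholders that would need tuning on real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backoff state; none of these names exist in the firmware. */
typedef struct {
    uint32_t base_ms;      /* first backoff interval, e.g. 100 */
    uint32_t max_ms;       /* upper bound for the interval */
    uint32_t current_ms;   /* interval applied on the next failure */
    uint32_t failures;     /* consecutive ENOMEM failures so far */
    uint32_t dropped;      /* sends suppressed while backing off */
    uint64_t resume_at_ms; /* sends are suppressed before this time */
} enomem_backoff_t;

void backoff_init(enomem_backoff_t *b, uint32_t base_ms, uint32_t max_ms)
{
    assert(b != 0 && base_ms > 0 && max_ms >= base_ms);
    b->base_ms = base_ms;
    b->max_ms = max_ms;
    b->current_ms = base_ms;
    b->failures = 0;
    b->dropped = 0;
    b->resume_at_ms = 0;
}

/* True if a send may be attempted now; otherwise counts one drop. */
bool backoff_may_send(enomem_backoff_t *b, uint64_t now_ms)
{
    if (now_ms >= b->resume_at_ms) {
        return true;
    }
    b->dropped++;
    return false;
}

/*
 * Call when sendto() fails with ENOMEM. Schedules the next attempt and
 * doubles the interval up to max_ms. Returns true only for the first
 * failure of a streak, so the caller can log a single warning.
 */
bool backoff_on_enomem(enomem_backoff_t *b, uint64_t now_ms)
{
    bool first_in_streak = (b->failures == 0);
    b->failures++;
    b->resume_at_ms = now_ms + b->current_ms;
    if (b->current_ms > b->max_ms / 2) {
        b->current_ms = b->max_ms;   /* saturate instead of overflowing */
    } else {
        b->current_ms *= 2;
    }
    return first_in_streak;
}

/* Call after a successful send: the streak ends, the interval resets. */
void backoff_on_success(enomem_backoff_t *b)
{
    b->failures = 0;
    b->current_ms = b->base_ms;
    b->resume_at_ms = 0;
}
```

With a 100 ms base, consecutive failures would wait 100, 200, 400 ms and so on up to the cap, which gives the buffers progressively more time to drain instead of re-hitting them every 100 ms. Whether that alone is enough to recover on this board is untested.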

Bug #2 — secondary: phantom LD2410 detection on a floating UART

With no mmWave sensor wired to UART1 (TX=17, RX=18), the firmware still concludes Detected LD2410 at 256000 baud (caps=0x000c) and spawns the LD2410 reader task. The v0.8.1-esp32 release notes specifically called out a fix for "false MR60BHA2 detection → ENOMEM by requiring validated sensor headers instead of accepting bare byte patterns" — the LD2410 path looks like it still accepts loose patterns and so trips on floating-pin noise at 256000 baud.

This isn't the trigger of Bug #1 (the timing rules it out — first ENOMEM at 5544 ms, LD2410 declared at 5564 ms), but the resulting mmWave UART task adds steady load to a system that's already in a fragile buffer state.

Suggested fix: gate mmwave: Detected LD2410 on a validated frame header (length + checksum + magic), matching what was done for MR60BHA2 in v0.8.1.
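A rough sketch of what such a gate could look like, assuming the LD2410 report framing as I understand it from the vendor's serial protocol document: header F4 F3 F2 F1, a 2-byte little-endian payload length, the payload, then footer F8 F7 F6 F5, with 0xAA as the second payload byte and 0x55 0x00 closing the payload. As far as I can tell the report frame carries no real checksum, so these inner markers stand in for one here. The function name and the LD2410_MAX_PAYLOAD cap are hypothetical, and I have not looked at how the existing probe is structured.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed LD2410 report-frame delimiters (from the vendor protocol doc). */
static const uint8_t LD2410_HEAD[4] = {0xF4, 0xF3, 0xF2, 0xF1};
static const uint8_t LD2410_TAIL[4] = {0xF8, 0xF7, 0xF6, 0xF5};

/* Hypothetical sanity cap on the declared payload length. */
#define LD2410_MAX_PAYLOAD 64u

/*
 * Returns true only if buf starts with one complete, well-formed report
 * frame: magic header, plausible length, inner markers, magic footer.
 */
bool ld2410_frame_is_valid(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len < 4 + 2 + 4) {
        return false;
    }
    if (memcmp(buf, LD2410_HEAD, sizeof LD2410_HEAD) != 0) {
        return false;
    }

    /* Payload length is little-endian, directly after the header. */
    size_t payload = (size_t)buf[4] | ((size_t)buf[5] << 8);
    if (payload < 4 || payload > LD2410_MAX_PAYLOAD) {
        return false;
    }
    if (len < 4 + 2 + payload + 4) {
        return false;   /* frame is truncated */
    }

    const uint8_t *p = buf + 6;
    /* Inner markers: 0xAA after the type byte, 0x55 0x00 at the end. */
    if (p[1] != 0xAA || p[payload - 2] != 0x55 || p[payload - 1] != 0x00) {
        return false;
    }

    return memcmp(p + payload, LD2410_TAIL, sizeof LD2410_TAIL) == 0;
}
```

Requiring two or three consecutive valid frames before logging Detected LD2410 would make a false positive from floating-pin noise even less likely, at the cost of a slightly longer probe window.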

What I tried

  1. release_bins/s3-adr110/ in-tree bins — same ENOMEM loop.
  2. release_bins/s3-fair-adr110/ in-tree bins — same.
  3. Fresh download of v0.8.1-esp32 release assets — same.
  4. esptool erase-region 0x9000 0x6000 to wipe NVS, then provision.py --reset --edge-tier 2 --target-ip <host> --target-port 5005 — same.
  5. Confirmed Wi-Fi credentials, IP, gateway, and aggregator IP/port are correct (ping host↔ESP32 OK).
  6. Confirmed the aggregator's UDP receiver works by sending a synthetic CSI packet from the host — source promoted to esp32 immediately, then back to esp32:offline after the synthetic stream stops.

Repro

  1. Flash a bare ESP32-S3 (8 MB, no mmWave sensor connected) with the v0.8.1-esp32 release assets at 0x0 / 0x8000 / 0xf000 / 0x20000.
  2. python3 provision.py --port <port> --chip esp32s3 --ssid <SSID> --password <pw> --target-ip <host> --target-port 5005 --edge-tier 2 --reset.
  3. Run RuView aggregator on <host>:5005.
  4. Watch ESP32 serial: first stream_sender: sendto ENOMEM — backing off for 100 ms appears on CSI cb #2 and never goes away. Phantom mmwave: Detected LD2410 ... appears in the same window.
  5. Watch GET /api/v1/status on the aggregator — stays esp32:offline indefinitely.

Happy to test patches on this board if useful.

Labels: bug, firmware
