fix(nrf52): run app loop on dedicated FreeRTOS task to avoid stack overflow#40
fix(nrf52): run app loop on dedicated FreeRTOS task to avoid stack overflow#40disq wants to merge 4 commits into
Conversation
…erflow The Adafruit_nRF52_Arduino framework runs Arduino loop() on a 4KB task stack (LOOP_STACK_SZ = 256*4, an unconditional #define that build flags cannot override). Heavy LittleFS work from loop() -- notably ClientACL::saveSessionKeys -> Adafruit_LittleFS::open() allocating large lfs_dir/lfs_info structs on the stack (measured peak ~4.7KB) -- overflows the 4KB chunk and corrupts the heap object directly below it (observed: the RadioLib Module*), causing a hardfault/lockup after admin/session activity on repeaters. Move the application loop onto a dedicated FreeRTOS task with an 8KB stack (via new header-only helpers/MeshWorkerTask.h). On nRF52 the Arduino loop() just vTaskDelays; the existing loop body runs unchanged on the worker task, keeping the board.sleep()/hasPendingWork() gating single-threaded. Other platforms are unaffected. Applied to the two examples that hit the saveSessionKeys path: simple_repeater and simple_room_server. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Moves the Arduino loop() body onto a dedicated FreeRTOS task with an 8KB stack on NRF52 platforms, to avoid stack overflows caused by heavy LittleFS work running on the framework's 4KB loop_task stack.
Changes:
- Adds a new
MeshWorkerTask.hhelper that creates a FreeRTOS task running a supplied loop body. - Wires the helper into
simple_room_serverandsimple_repeaterexamples, splittingloop()into a worker-task body and a thin idleloop(). - Guards all new logic behind
NRF52_PLATFORM.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/helpers/MeshWorkerTask.h | New header defining startMeshWorker() and a FreeRTOS task that calls the registered loop body. |
| examples/simple_room_server/main.cpp | Refactors loop() into runMainLoop() started on the mesh worker task for NRF52. |
| examples/simple_repeater/main.cpp | Same refactor as above for the repeater example. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tion Copilot review feedback on the nRF52 mesh-worker fix: - Move the worker task + state out of the header into MeshWorkerTask.cpp so there is a single shared definition (the header was defining non-inline static globals/functions, fragile across translation units; inline-variable fix isn't viable since nRF52 builds aren't guaranteed C++17). Header now declares only startMeshWorker(). - Guard the loop-body pointer: the worker task does 'if (_meshLoopBody) _meshLoopBody();' so it can never deref a null pointer. - Check the xTaskCreate() return value and MESH_DEBUG_PRINTLN + return false on failure (8KB stack alloc out of FreeRTOS heap) instead of silently appearing to boot with no mesh loop running. - Reduce per-example boilerplate: the loop body is now a fixed-name meshAppLoop() on all platforms (no #if around the function signature); loop() either yields (nRF52) or calls meshAppLoop() directly (others). The shared .cpp is guarded by NRF52_PLATFORM, so it compiles empty on other platforms; verified it links cleanly into a non-modified nRF52 example (RAK_4631_companion_radio_usb) that neither defines meshAppLoop nor calls startMeshWorker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous commit made startMeshWorker() return bool but the call sites discarded it, so worker-task creation failure was still only visible under MESH_DEBUG. Act on the result: on failure, log and halt() -- mirroring the existing radio_init() failure handling in setup() -- so the node fails loudly instead of booting with no mesh loop running. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address two further review points: - Worker yields with vTaskDelay(1) instead of vTaskDelay(pdMS_TO_TICKS(1)) so it always yields at least one tick regardless of configTICK_RATE_HZ (on this framework the tick rate is 1024 Hz so the old form was already 1 tick, but the explicit form is tick-rate independent). - startMeshWorker() now rejects a null loopBody and a second call (which would leak the previous task by overwriting _meshTaskHandle), returning false and logging instead of silently misbehaving. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Seems this is also affecting other nRF such as companion BLE, so I'm going to open another PR on Adafruit_nRF52_Arduino to relax the |
|
Superseded by #41, which fixes the same nRF52 |
Problem
nRF52 (Adafruit_nRF52_Arduino) repeaters crash / go silent after admin or session activity — reproducible on
dev_pluswithMESH_PACKET_LOGGING+MESH_DEBUGenabled on a RAK3401 repeater. It is a stack overflow on the framework's Arduinoloop_task:loop()onloop_taskwith a 4KB stack —#define LOOP_STACK_SZ (256*4)incores/nRF5/main.cpp. It's an unconditional#define, so-DLOOP_STACK_SZ=...fromplatformio.inicannot override it (it just emits-Wmacro-redefinedand is discarded).the_mesh.loop()→ClientACL::saveSessionKeys→Adafruit_LittleFS::open("/s_sess_keys")allocates largelfs_dir/lfs_infostructs on the stack. Measured peak usage is ~4.7KB, overrunning the 4KB chunk by a few hundred bytes and scribbling on whatever heap object sits directly below the stack — observed in a crash dump as the RadioLibModule*, whose smashedhalpointer then caused a hardfault/lockup on the next radio call.Why
dev_plusand notdev— the LittleFS upgrade is the triggerdev_pluspins the LittleFS v1.7.2 framework (weebl2000/Adafruit_nRF52_Arduinomeshcore-patches/ tag1.10700.1, which is the head commit724e00aof meshcore-dev/Adafruit_nRF52_Arduino PR #1).devpins that same fork'smaster(d541301), which does not include PR #1 and still runs LittleFS v1.6.The v1.6 → v1.7.2 upgrade enlarged the stack footprint of the file-open path (
lfs_dir_find/lfs_stat) enough to push thesaveSessionKeyscall from under 4KB to ~4.7KB — i.e. across theloop_taskstack limit. That's why the crash reproduces ondev_plusbut not ondev. The small 4KB loop-task stack is the underlying constraint; the LittleFS upgrade is what crossed the line. (Reverting the upgrade is not the goal — it brings real fixes; this PR gives the file-open path enough stack to run safely.)Fix
Run the application loop on a dedicated FreeRTOS task with an 8KB stack (measured peak ~4.7KB + ~3KB headroom) instead of the framework's 4KB
loop_task. No framework edits, noplatformio.inihacks, nRF52-only.src/helpers/MeshWorkerTask.hprovidesstartMeshWorker(body), whichxTaskCreates a"mesh"task (2048 words / 8KB) atTASK_PRIO_LOW— matching the framework's own loop task priority — that calls the body in avTaskDelay(1)loop.loop()body is renamed torunMainLoop()(unchanged, including the leadingboard.loop();) and started on the worker; on nRF52 the Arduinoloop()justvTaskDelay(1000)s. Other platforms callrunMainLoop()directly — no behavioral change off nRF52.The entire loop body moves (not just
the_mesh.loop()) sosimple_repeater'sboard.sleep()/hasPendingWork()gating stays single-threaded and correctly ordered — it's the same code, just on a bigger stack.Scope
Applied to the two examples that hit the
ClientACL::saveSessionKeyscrash path:simple_repeaterandsimple_room_server. Other nRF52 examples share the same latent 4KB-stack risk but are left out of this PR.Verification
pio run -e RAK_3401_repeater— the exactMESH_PACKET_LOGGING+MESH_DEBUGconfig that reproduces the crash; clean buildpio run -e RAK_4631_room_server— clean build-Wmacro-redefined(we never touchLOOP_STACK_SZ)On-hardware: flash a RAK3401/RAK4631 repeater and drive the previously-crashing workload (admin sync, neighbour discover, repeated LoRa status requests) for several minutes — expect no LoRa-silence / lockup. Optional gdb check: walk the
0xa5a5a5a5canary fill from((TCB_t*)_meshTaskHandle)->pxStackupward — expect peak ~4.7KB used of 8KB, ~3KB free.Note for reviewers
An alternative/complementary fix is a one-line
#ifndef LOOP_STACK_SZguard in the pinnedAdafruit_nRF52_Arduinoframework fork so-DLOOP_STACK_SZworks fromplatformio.ini— that would cover all nRF52 builds at once. Sincedev_plusalready pins a MeshCore framework fork, that change is readily available, but it's left as a separate, independent change.🤖 Generated with Claude Code