Skip to content

fix(nrf52): run app loop on dedicated FreeRTOS task to avoid stack overflow#40

Closed
disq wants to merge 4 commits into
weebl2000:dev_plusfrom
disq:fix/nrf52-mesh-worker-task-stack-devplus
Closed

fix(nrf52): run app loop on dedicated FreeRTOS task to avoid stack overflow#40
disq wants to merge 4 commits into
weebl2000:dev_plusfrom
disq:fix/nrf52-mesh-worker-task-stack-devplus

Conversation

@disq
Copy link
Copy Markdown

@disq disq commented May 30, 2026

Problem

nRF52 (Adafruit_nRF52_Arduino) repeaters crash / go silent after admin or session activity — reproducible on dev_plus with MESH_PACKET_LOGGING + MESH_DEBUG enabled on a RAK3401 repeater. It is a stack overflow on the framework's Arduino loop_task:

  • The framework runs Arduino loop() on loop_task with a 4KB stack#define LOOP_STACK_SZ (256*4) in cores/nRF5/main.cpp. It's an unconditional #define, so -DLOOP_STACK_SZ=... from platformio.ini cannot override it (it just emits -Wmacro-redefined and is discarded).
  • the_mesh.loop()ClientACL::saveSessionKeysAdafruit_LittleFS::open("/s_sess_keys") allocates large lfs_dir / lfs_info structs on the stack. Measured peak usage is ~4.7KB, overrunning the 4KB chunk by a few hundred bytes and scribbling on whatever heap object sits directly below the stack — observed in a crash dump as the RadioLib Module*, whose smashed hal pointer then caused a hardfault/lockup on the next radio call.

Why dev_plus and not dev — the LittleFS upgrade is the trigger

dev_plus pins the LittleFS v1.7.2 framework (weebl2000/Adafruit_nRF52_Arduino meshcore-patches / tag 1.10700.1, which is the head commit 724e00a of meshcore-dev/Adafruit_nRF52_Arduino PR #1). dev pins that same fork's master (d541301), which does not include PR #1 and still runs LittleFS v1.6.

The v1.6 → v1.7.2 upgrade enlarged the stack footprint of the file-open path (lfs_dir_find / lfs_stat) enough to push the saveSessionKeys call from under 4KB to ~4.7KB — i.e. across the loop_task stack limit. That's why the crash reproduces on dev_plus but not on dev. The small 4KB loop-task stack is the underlying constraint; the LittleFS upgrade is what crossed the line. (Reverting the upgrade is not the goal — it brings real fixes; this PR gives the file-open path enough stack to run safely.)

Fix

Run the application loop on a dedicated FreeRTOS task with an 8KB stack (measured peak ~4.7KB + ~3KB headroom) instead of the framework's 4KB loop_task. No framework edits, no platformio.ini hacks, nRF52-only.

  • New header-only src/helpers/MeshWorkerTask.h provides startMeshWorker(body), which xTaskCreates a "mesh" task (2048 words / 8KB) at TASK_PRIO_LOW — matching the framework's own loop task priority — that calls the body in a vTaskDelay(1) loop.
  • In each affected example, the existing loop() body is renamed to runMainLoop() (unchanged, including the leading board.loop();) and started on the worker; on nRF52 the Arduino loop() just vTaskDelay(1000)s. Other platforms call runMainLoop() directly — no behavioral change off nRF52.

The entire loop body moves (not just the_mesh.loop()) so simple_repeater's board.sleep() / hasPendingWork() gating stays single-threaded and correctly ordered — it's the same code, just on a bigger stack.

Scope

Applied to the two examples that hit the ClientACL::saveSessionKeys crash path: simple_repeater and simple_room_server. Other nRF52 examples share the same latent 4KB-stack risk but are left out of this PR.

Verification

  • pio run -e RAK_3401_repeater — the exact MESH_PACKET_LOGGING + MESH_DEBUG config that reproduces the crash; clean build
  • pio run -e RAK_4631_room_server — clean build
  • No -Wmacro-redefined (we never touch LOOP_STACK_SZ)

On-hardware: flash a RAK3401/RAK4631 repeater and drive the previously-crashing workload (admin sync, neighbour discover, repeated LoRa status requests) for several minutes — expect no LoRa-silence / lockup. Optional gdb check: walk the 0xa5a5a5a5 canary fill from ((TCB_t*)_meshTaskHandle)->pxStack upward — expect peak ~4.7KB used of 8KB, ~3KB free.

Note for reviewers

An alternative/complementary fix is a one-line #ifndef LOOP_STACK_SZ guard in the pinned Adafruit_nRF52_Arduino framework fork so -DLOOP_STACK_SZ works from platformio.ini — that would cover all nRF52 builds at once. Since dev_plus already pins a MeshCore framework fork, that change is readily available, but it's left as a separate, independent change.

🤖 Generated with Claude Code

…erflow

The Adafruit_nRF52_Arduino framework runs Arduino loop() on a 4KB task
stack (LOOP_STACK_SZ = 256*4, an unconditional #define that build flags
cannot override). Heavy LittleFS work from loop() -- notably
ClientACL::saveSessionKeys -> Adafruit_LittleFS::open() allocating large
lfs_dir/lfs_info structs on the stack (measured peak ~4.7KB) -- overflows
the 4KB chunk and corrupts the heap object directly below it (observed:
the RadioLib Module*), causing a hardfault/lockup after admin/session
activity on repeaters.

Move the application loop onto a dedicated FreeRTOS task with an 8KB
stack (via new header-only helpers/MeshWorkerTask.h). On nRF52 the
Arduino loop() just vTaskDelays; the existing loop body runs unchanged on
the worker task, keeping the board.sleep()/hasPendingWork() gating
single-threaded. Other platforms are unaffected. Applied to the two
examples that hit the saveSessionKeys path: simple_repeater and
simple_room_server.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 30, 2026 21:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Moves the Arduino loop() body onto a dedicated FreeRTOS task with an 8KB stack on NRF52 platforms, to avoid stack overflows caused by heavy LittleFS work running on the framework's 4KB loop_task stack.

Changes:

  • Adds a new MeshWorkerTask.h helper that creates a FreeRTOS task running a supplied loop body.
  • Wires the helper into simple_room_server and simple_repeater examples, splitting loop() into a worker-task body and a thin idle loop().
  • Guards all new logic behind NRF52_PLATFORM.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
src/helpers/MeshWorkerTask.h New header defining startMeshWorker() and a FreeRTOS task that calls the registered loop body.
examples/simple_room_server/main.cpp Refactors loop() into runMainLoop() started on the mesh worker task for NRF52.
examples/simple_repeater/main.cpp Same refactor as above for the repeater example.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/helpers/MeshWorkerTask.h Outdated
Comment thread src/helpers/MeshWorkerTask.h Outdated
Comment thread src/helpers/MeshWorkerTask.h Outdated
Comment thread examples/simple_room_server/main.cpp Outdated
Comment thread examples/simple_repeater/main.cpp Outdated
disq and others added 2 commits May 30, 2026 23:08
…tion

Copilot review feedback on the nRF52 mesh-worker fix:

- Move the worker task + state out of the header into MeshWorkerTask.cpp so
  there is a single shared definition (the header was defining non-inline
  static globals/functions, fragile across translation units; inline-variable
  fix isn't viable since nRF52 builds aren't guaranteed C++17). Header now
  declares only startMeshWorker().
- Guard the loop-body pointer: the worker task does
  'if (_meshLoopBody) _meshLoopBody();' so it can never deref a null pointer.
- Check the xTaskCreate() return value and MESH_DEBUG_PRINTLN + return false
  on failure (8KB stack alloc out of FreeRTOS heap) instead of silently
  appearing to boot with no mesh loop running.
- Reduce per-example boilerplate: the loop body is now a fixed-name
  meshAppLoop() on all platforms (no #if around the function signature);
  loop() either yields (nRF52) or calls meshAppLoop() directly (others).

The shared .cpp is guarded by NRF52_PLATFORM, so it compiles empty on other
platforms; verified it links cleanly into a non-modified nRF52 example
(RAK_4631_companion_radio_usb) that neither defines meshAppLoop nor calls
startMeshWorker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous commit made startMeshWorker() return bool but the call sites
discarded it, so worker-task creation failure was still only visible under
MESH_DEBUG. Act on the result: on failure, log and halt() -- mirroring the
existing radio_init() failure handling in setup() -- so the node fails loudly
instead of booting with no mesh loop running.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 30, 2026 22:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread src/helpers/MeshWorkerTask.cpp
Comment thread src/helpers/MeshWorkerTask.cpp
Comment thread examples/simple_room_server/main.cpp
Comment thread examples/simple_room_server/main.cpp
Address two further review points:
- Worker yields with vTaskDelay(1) instead of vTaskDelay(pdMS_TO_TICKS(1)) so
  it always yields at least one tick regardless of configTICK_RATE_HZ (on this
  framework the tick rate is 1024 Hz so the old form was already 1 tick, but
  the explicit form is tick-rate independent).
- startMeshWorker() now rejects a null loopBody and a second call (which would
  leak the previous task by overwriting _meshTaskHandle), returning false and
  logging instead of silently misbehaving.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@disq
Copy link
Copy Markdown
Author

disq commented May 31, 2026

Seems this is also affecting other nRF such as companion BLE, so I'm going to open another PR on Adafruit_nRF52_Arduino to relax the LOOP_STACK_SZ defn.

@disq
Copy link
Copy Markdown
Author

disq commented May 31, 2026

Superseded by #41, which fixes the same nRF52 loop_task stack overflow at the framework level (enlarging LOOP_STACK_SZ via build flag) instead of per-example worker tasks — covering all nRF52 builds, BLE companion included, with no per-example boilerplate.

@disq disq closed this May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants