fix: power brownout causing early shutdown #2627
This is probably another "needs soaking" fix as it touches power.
Backstory on this one:
I noticed the sensor firmware build was aggressively sending
"Battery is low" messages constantly when a RAK19007 was below
50%. (These messages only show up in third party clients to all
node admins, as the stock MeshCore mobile client doesn't let
one see messages from a sensor node. Sending that message looks like
another power draw, but fixing that is not part of this PR.)
Then people on the local mesh have been on and off talking
about certain nodes randomly losing their contact lists on
some node types, and others were talking about Heltec V4
brownouts. I also observed Heltec v4 die prematurely around
50% and I started thinking they were all related.
Started digging into the code and found a few potential leads:
- MeshCore does a "lazy" write on `dirty_contacts_expiry`
in a 5 second window.
- The shutdown/restart paths do not flush this pending write
- Low battery check is a poll every 8 seconds with
no awareness of other things going on in the node
**On the power piece:**
Heltec V4 and other higher-powered nodes can hit the battery
harder when transmitting. Below 50%, lithium batteries sag
more dramatically under load than they do at higher charge states.
If the power check happens at the same time as transmit,
the shutdown code gets called prematurely and shuts down
the node.
**On the file write piece:**
If the shutdown or restart paths are called, the code just calls
`shutdown()` or `reboot()` without checking and calling
`saveContacts()`. There do not appear to be any other file writes
that act this way.
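
To make the lazy write and the missing flush concrete, here is a minimal sketch of the pattern as I understand it. Only `dirty_contacts_expiry`, `saveContacts()` and the 5 second window come from the code; `ContactStore`, `markDirty()`, `loop()` and `flushBeforePowerOff()` are hypothetical names for illustration, not the actual MeshCore classes.

```cpp
#include <cstdint>

// Assumption: 5 second lazy-write window, as described above.
constexpr uint32_t LAZY_WRITE_MS = 5000;

// Hypothetical stand-in for the node's contact storage.
struct ContactStore {
    uint32_t dirty_contacts_expiry = 0;  // 0 means no write is pending
    int saves = 0;                       // counts writes, for illustration only

    // A contact changed: schedule a write instead of writing immediately.
    void markDirty(uint32_t now_ms) {
        if (dirty_contacts_expiry == 0) {
            dirty_contacts_expiry = now_ms + LAZY_WRITE_MS;
        }
    }

    // Stand-in for the real file write.
    void saveContacts() {
        saves++;
        dirty_contacts_expiry = 0;
    }

    // Main loop: write only once the window has elapsed (wrap-safe compare).
    void loop(uint32_t now_ms) {
        if (dirty_contacts_expiry != 0 &&
            static_cast<int32_t>(now_ms - dirty_contacts_expiry) >= 0) {
            saveContacts();
        }
    }

    // The missing step: call this on every shutdown/reboot path.
    void flushBeforePowerOff() {
        if (dirty_contacts_expiry != 0) {
            saveContacts();
        }
    }
};
```

Without the last method, a shutdown that lands inside the 5 second window drops whatever was waiting to be written.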
**The Fix**
The change keeps using `AUTO_SHUTDOWN_MILLIVOLTS` so it respects
previous power threshold decisions across all node types.
With this change, all restart or shutdown paths will make
sure to call `saveContacts()` before shutting down to stop
the list from becoming corrupted.
It also suspends reading battery level for 250ms during transmit
(adjustable) so a power sag doesn't trigger an early shutdown.
On Heltec V4 at least, the MeshCore software power threshold is
much higher than the board's internal brownout/shutdown threshold.
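
Roughly, the transmit hold-off works like the sketch below. This is a simplified illustration and not the code in the diff: `BatteryGuard`, `onTransmitStart()` and `shouldShutdown()` are hypothetical names, the 3300 mV value is a placeholder (the real `AUTO_SHUTDOWN_MILLIVOLTS` is set per board), and it assumes the 250ms window starts when the transmit starts.

```cpp
#include <cstdint>

// Placeholder threshold; the real value is board-specific.
constexpr uint16_t AUTO_SHUTDOWN_MILLIVOLTS = 3300;
// Hold-off after a transmit starts (the adjustable 250 ms).
constexpr uint32_t TX_HOLDOFF_MS = 250;

// Hypothetical guard around the periodic low-battery poll.
class BatteryGuard {
    uint32_t holdoff_until_ms_ = 0;
    bool holdoff_active_ = false;

public:
    // Call when the radio begins transmitting.
    void onTransmitStart(uint32_t now_ms) {
        holdoff_until_ms_ = now_ms + TX_HOLDOFF_MS;
        holdoff_active_ = true;
    }

    // Call from the poll; returns true only for a reading taken outside
    // the hold-off window that is below the shutdown threshold.
    bool shouldShutdown(uint32_t now_ms, uint16_t battery_mv) {
        if (holdoff_active_) {
            if (static_cast<int32_t>(now_ms - holdoff_until_ms_) < 0) {
                return false;  // reading may reflect transmit sag, ignore it
            }
            holdoff_active_ = false;
        }
        return battery_mv < AUTO_SHUTDOWN_MILLIVOLTS;
    }
};
```

The threshold itself is untouched; only readings that coincide with a transmit are skipped, so a battery that is really low still triggers shutdown on the next poll outside the window.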
**Tested on**
- Heltec v4
- RAK 19007
- Heltec T096
- RAK 19003
On the Heltec V4, I can now pass 50% and get down to 36% before it shuts
down, although the voltage reported as 36% probably corresponds to about 5%
[based on some voltage curve sites like this one](https://voltagebasics.com/lithium-polymer-battery-voltage-chart/).
That is probably an idea for future mobile app improvements: the app
itself could calculate the battery percentage from the MCU temperature and
battery voltage. That would likely seem a bit more "accurate"
on all board types without having to add math in the node code.
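
As a rough illustration of the app-side idea, a voltage-to-percent lookup could be a piecewise-linear interpolation over a discharge curve. Everything here is hypothetical: `CurvePoint`, `kCurve` and `percentFromMillivolts()` are made-up names, the curve points are placeholders and not values from the linked chart or any measurement, and temperature compensation is left out. It is written in C++ only to match the rest of the repo; the app would use its own language.

```cpp
#include <cstddef>
#include <cstdint>

struct CurvePoint {
    uint16_t millivolts;
    uint8_t percent;
};

// Placeholder points only; a real table would come from a measured
// discharge curve for the specific cell, sorted by ascending voltage.
constexpr CurvePoint kCurve[] = {
    {3300, 0}, {3600, 10}, {3800, 50}, {4200, 100},
};
constexpr std::size_t kCurveLen = sizeof(kCurve) / sizeof(kCurve[0]);

// Linear interpolation between the two surrounding curve points,
// clamped to the first and last entries.
uint8_t percentFromMillivolts(uint16_t mv) {
    if (mv <= kCurve[0].millivolts) {
        return kCurve[0].percent;
    }
    for (std::size_t i = 1; i < kCurveLen; ++i) {
        if (mv <= kCurve[i].millivolts) {
            const CurvePoint& lo = kCurve[i - 1];
            const CurvePoint& hi = kCurve[i];
            const uint32_t span_mv = hi.millivolts - lo.millivolts;
            const uint32_t span_pct = hi.percent - lo.percent;
            const uint32_t offset_mv = mv - lo.millivolts;
            return static_cast<uint8_t>(lo.percent + offset_mv * span_pct / span_mv);
        }
    }
    return kCurve[kCurveLen - 1].percent;
}
```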
|
From other PR research, I looked to see if it was possible the shutdown or restart paths might get triggered in "bad" states and edge cases that could lead to file corruption. It appears that the code path can't be called if it's a board only powered by USB and unplugged because unplugged means no power to execute code. It also looks like the |
|
I recently modified the hasPendingWork() rule to include dirty_contacts_expiry != 0. |
|
Oooh cool, will check that out and see! |
|
Yeah, it looks like hasPendingWork could replace the ShutdownHandler class in my PR entirely. It also looks like I missed adding any shutdown handling to Question, since this is your party and I'm just sampling the whiskey: do you think it's reasonable to have an up to 5 second delay on companion power off? And not trying to lead like a greasy car salesman. If I'm reading it right, waiting for hasPendingWork during shutdown via button press/etc. would result in shutdown message and/or buzzer, then 0 to 5 seconds of arbitrary wait time if the loop has to loop, and only then reboot or display off, radio off, board off. Code-wise, likely cleaner to manage. User experience might seem random, or user might think it hung. Or I could probably rig it so regardless if it has to wait for a work loop or not, it always takes 5 seconds to turn off. |
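
For discussion, the "wait for pending work, but never longer than 5 seconds" option could look something like this. It is only a sketch: `drainBeforePowerOff()` and its callback parameters are hypothetical, the real `hasPendingWork()` is passed in as a callable so the example stands alone, and `std::function` is used for brevity, not because it is the right choice on the MCU.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper: keep running the work loop until nothing is pending
// or the timeout elapses. Returns true if the work drained in time.
bool drainBeforePowerOff(const std::function<bool()>& hasPendingWork,
                         const std::function<void()>& loopOnce,
                         const std::function<uint32_t()>& nowMs,
                         uint32_t timeout_ms) {
    const uint32_t start = nowMs();
    while (hasPendingWork()) {
        // Unsigned subtraction keeps this correct across millis() wraparound.
        if (nowMs() - start >= timeout_ms) {
            return false;  // give up and power off anyway
        }
        loopOnce();
    }
    return true;
}
```

With nothing pending this returns immediately, which is where the "0 to 5 seconds" variability on power off comes from.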
|
I have another thought that might solve the code and user problems both in one, as this PR is actually two features not one.
I can start a new PR based on some of my shutdown-write code, but with the below, and discussion can continue there:
|
- This PR is now just brownout protection for sensor and
companion
- Will open another PR with dirty write flush changes per PR
review, that PR can sort how the write improvements
Update: Removed the contact expiration ShutdownHandler and associated code so this PR is just the brownout fix. I will open another PR that is just about contact flushing based on comments in this PR.