fix: reconnect(): respect QoS and fail-safe #254
Conversation
vladak
left a comment
Stashing the topics is fine, to me. I am not comfortable with the broad catch, though.
```python
        while subscribed_topics:
            feed = subscribed_topics.pop()
            self.subscribe(*feed)
    except Exception:
```
I wonder if the broad exception could be reduced to the MQTT exception ?
To be clear, no matter what the exception is, we re-raise it (see the bare raise a few lines below). This is the moral equivalent of a finally or defer clause; we are neither masking nor handling the exception, merely pausing its propagation long enough to make sure our object is left in a sane state. In fact, we could do it with finally, if you'd prefer.
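To illustrate the shape being discussed, here is a minimal sketch of catch, restore, and bare re-raise (`guarded`, `action`, and `cleanup` are hypothetical names, not MiniMQTT code):

```python
def guarded(action, cleanup):
    """Catch broadly, run cleanup, then re-raise the ORIGINAL exception.

    Nothing is masked or handled; propagation is merely paused long
    enough to restore invariants.
    """
    try:
        return action()
    except Exception:
        cleanup()
        raise  # bare raise: type, message, and traceback are untouched


# The caller still sees the original exception type, and cleanup ran first.
restored = []


def failing():
    raise IndexError("partway through a re-subscription")


try:
    guarded(failing, lambda: restored.append("state restored"))
except IndexError:
    pass

print(restored)  # -> ['state restored']
```

Note the asymmetry with `finally`: this cleanup only fires on the failure path, whereas a `finally` clause would also run after success (which is harmless here, since there would be nothing left to restore).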
As I see it, there are three options here, in addition to what I've implemented:

- Track `_original_subscriptions` (or `_remaining_original_subscriptions`) in the object. This costs some space and complexity, but isn't otherwise too terrible. It does introduce an edge case where we would potentially subscribe to a topic twice, but that's probably not too awful.
- Narrow the scope of the exception being caught. This reduces the likelihood that we stomp on someone else's toes, but reintroduces the risk that an error that is not within our caught scope (even something so prosaic as an `IndexError` arising partway through a re-subscription) could cause us to violate our API contract and not fully re-subscribe upon reconnect.
- We could leave it to the caller to identify this situation. This feels like the worst option; it requires the caller either to issue spurious `subscribe()`s, or to look at our private class vars (`_subscribed_topics`). Plus, it means our guarantee of re-subscription upon reconnect cannot be relied upon.
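The first option could look roughly like this (a sketch with invented names; MiniMQTT's real attributes and `subscribe()` signature may differ):

```python
class SketchClient:
    """Option 1: snapshot the full subscription set so a retried
    reconnect() always starts from the complete list. The
    double-subscribe edge case arises when a retry re-subscribes
    topics that already succeeded on the first attempt."""

    def __init__(self):
        self._subscribed_topics = []
        self._original_subscriptions = []

    def subscribe(self, topic, qos=0):
        self._subscribed_topics.append((topic, qos))

    def reconnect(self):
        # Snapshot once; leave it alone so a later retry sees everything.
        if not self._original_subscriptions:
            self._original_subscriptions = list(self._subscribed_topics)
        self._subscribed_topics = []
        for feed in self._original_subscriptions:
            self.subscribe(*feed)
        # Only clear the snapshot once every topic went through.
        self._original_subscriptions = []
```

The extra list is the "costs some space" part; the retained snapshot across a failed call is the "complexity" part.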
Thanks for the detailed information about the thought process, really appreciated.
I think the key question here is whether anything besides MMQTTException being raised from the depths of the library code is expected to be recoverable (in general and also w.r.t. the internal MQTT object state). My take on this is that if there is, it should be wrapped in MMQTTException, i.e. I do not see the need for the broad exception catch.
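The wrapping being suggested might be sketched like so (illustrative only; `MMQTTException` here is a local stand-in rather than the real MiniMQTT import, and `run_reconnect_step` is an invented helper):

```python
class MMQTTException(Exception):
    """Local stand-in for MiniMQTT's exception type."""


def run_reconnect_step(step):
    """Anything unexpected escaping library internals is wrapped, so a
    caller only ever needs to catch MMQTTException."""
    try:
        return step()
    except MMQTTException:
        raise  # already the library's type; pass it through
    except Exception as exc:
        raise MMQTTException(f"unexpected error during reconnect: {exc}") from exc


def bad_step():
    raise IndexError("oops")


try:
    run_reconnect_step(bad_step)
except MMQTTException as e:
    print(type(e.__cause__).__name__)  # -> IndexError
```

Using `raise ... from exc` preserves the original error as `__cause__`, so no debugging information is lost by the wrapping.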
Ah, I see. Thank you for laying that out so clearly!
To work out the best path, I think it's helpful to have a real scenario. One of the lines within the try-catch is:
```python
self.logger.debug("Reconnected with broker")
```

Let us imagine that a custom global logging handler has a PotM bug, and that an exception will be raised from self.logger.debug if it is called at 4:56 A.M. on any Tuesday in March of 2026.
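Such a handler might look like this (a hypothetical sketch; `CustomLoggingException`, `PotMHandler`, and the injectable clock are invented for the scenario):

```python
import logging
import time


class CustomLoggingException(Exception):
    pass


class PotMHandler(logging.Handler):
    """Hypothetical handler with the bug described above: it blows up
    only at 4:56 A.M. on a Tuesday in March 2026. The clock function is
    injectable so the bug can be reproduced on demand."""

    def __init__(self, clock=time.localtime):
        super().__init__()
        self._clock = clock

    def emit(self, record):
        t = self._clock()
        # tm_wday == 1 is Tuesday
        if (t.tm_year, t.tm_mon, t.tm_wday, t.tm_hour, t.tm_min) == (2026, 3, 1, 4, 56):
            raise CustomLoggingException("PotM bug triggered")
```

An exception raised from a custom `emit` propagates out of `logger.debug()`: the stock handlers catch their own errors and route them to `handleError`, but this buggy one deliberately does not.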
I put myself in the position of an engineer (who doesn't control the MiniMQTT library) who became aware of this bug when it triggered last week, but doesn't quite know how to reproduce it yet. Helpfully, the custom logger handler throws a corresponding CustomLoggingException, and my kernel's a nice, clean loop, so I can do something like:
```python
current_state = State()
while True:
    try:
        current_state.run_main_loop_once()
    except CustomLoggingException as e:
        # upload lots of debugging info, then...
        pass
```

And inside of `run_main_loop_once`, we already had something like:

```python
try:
    mqtt_client.ping()
except MMQTTException:
    # Per docs for MMQTTException, "In general, the robust way to recover is to call reconnect()."
    mqtt_client.reconnect()
```
Perfect! Now I've got resilient code that won't crash in the face of a CLE, but will give me lots of debugging info.
The problem is: if we happen to fail our ping right around 4:55 A.M., and the time ticking over to 4:56 A.M. happens to occur partway through the reconnect loop, and the next debug that gets called happens to be the one inside of the MiniMQTT library's reconnect(), then when I resume after that CLE, I will only be subscribed to a fraction of my topics. As you can see, that's quite a difficult scenario to debug.
However, I think the bigger issue is that, if that bug is found a different way that doesn't result in a partial re-subscription, then even given a very skilled programmer who is tasked with working around that bug (and, let us say, is somehow prevented from directly addressing the bug itself), their solution almost certainly would rely on reconnect()'s apparent semantics and thus would introduce a new, far more subtle bug that's incredibly difficult to reproduce. Indeed, even given an omniscient programmer who foresaw how reconnect() would be affected, their only options to handle it cleanly are:
- Reach inside of their client to query `_subscribed_topics` and compare that to a locally-kept complete list, or
- Tear down their client entirely and rebuild it from scratch any time an error occurs in an MQTT function.
Both of these require a lot more (branching) code, and both carry costs in at least two of the three categories of CPU, memory, and/or network traffic. Further, they require a level of defensive coding that seems unreasonable to expect from a consumer of this library.
Put simply, our API contract isn't supposed to require this sort of legwork from our upstream consumer. They were told that the resub_topics parameter worked in a particular way. I think it'd therefore be a bug if reconnect(resub_topics=True) didn't result in a full resubscription if called twice, even if the first call threw an exception of some kind, so long as the second of those calls succeeded. The whole idea of "reconnect and resubscribe" is to restore a known-good state. Let's do that.
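The contract argued for here can be modeled in miniature (a toy sketch; `_subscribed_topics` mirrors the discussion, but none of this is MiniMQTT's actual code, and the injected one-time failure stands in for any mid-resubscribe error):

```python
class ToyClient:
    """Even if the first reconnect() dies partway through, a second
    successful call must restore the FULL subscription set."""

    def __init__(self, topics):
        self._subscribed_topics = list(topics)
        self._fail_once = True  # simulate one mid-resubscribe failure

    def reconnect(self):
        pending = self._subscribed_topics
        self._subscribed_topics = []
        try:
            while pending:
                if self._fail_once and self._subscribed_topics:
                    self._fail_once = False
                    raise RuntimeError("broker hiccup mid-resubscribe")
                self._subscribed_topics.append(pending.pop())
        except Exception:
            # Re-stash everything not yet (re)subscribed so the next
            # call still starts from the complete original set.
            pending.extend(self._subscribed_topics)
            self._subscribed_topics = pending
            raise
```

Called twice, the second reconnect() comes back with the full set, which is exactly the guarantee a `resub_topics=True` caller should be able to rely on.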
Here's a test case, primarily meant for #253, however it can serve as a test for #252 as well:

```diff
diff --git a/tests/test_reconnect.py b/tests/test_reconnect.py
index 52b8c76..f5f73fe 100644
--- a/tests/test_reconnect.py
+++ b/tests/test_reconnect.py
@@ -237,3 +237,71 @@ def test_reconnect_not_connected() -> None:
     assert user_data.get("disconnect") == False
     assert mqtt_client._connection_manager.close_cnt == 0
+
+
+def test_reconnect_subscribe_failure() -> None:
+    """
+    Test reconnect() will not lose previously subscribed topics on subscribe
+    failure inside reconnect().
+
+    This is a bit finicky as it relies on reconnect() calling subscribe() for each
+    topic separately and in reverse order. Also, it checks the internal
+    _subscribed_topics variable and assumes it stores the topics-to-be-subscribed
+    rather than already subscribed topics.
+    """
+    logging.basicConfig()
+    logger = logging.getLogger(__name__)
+    logger.setLevel(logging.DEBUG)
+
+    host = "localhost"
+    port = 1883
+
+    mqtt_client = MQTT.MQTT(
+        broker=host,
+        port=port,
+        ssl_context=ssl.create_default_context(),
+        connect_retries=1,
+    )
+
+    mocket = Mocket(
+        bytearray([
+            0x20,  # CONNACK
+            0x02,
+            0x00,
+            0x00,
+            0x90,  # SUBACK
+            0x03,
+            0x00,
+            0x01,
+            0x00,
+            0x00,
+            0x20,  # CONNACK
+            0x02,
+            0x00,
+            0x00,
+            0x90,  # SUBACK
+            0x02,
+            0x00,
+            0x02,
+            0x00,
+            0x90,  # SUBACK to make subscribe to bar fail
+            0x02,
+            0x00,
+            0x03,
+            0x80,
+        ])
+    )
+    mqtt_client._connection_manager = FakeConnectionManager(mocket)
+    mqtt_client.connect()
+
+    mqtt_client.logger = logger
+
+    topics = [("bar", 0), ("foo", 0)]
+    logger.info(f"subscribing to {topics}")
+    mqtt_client.subscribe(topics)
+
+    with pytest.raises(MQTT.MMQTTException):
+        logger.info("reconnecting")
+        mqtt_client.reconnect()
+
+    assert set(mqtt_client._subscribed_topics) == set(topics)
```
See the two attached issues for more information.
The alternative to the overly-broad `except` would be to burn some RAM on storing a "true" copy separately somewhere in the class. This felt like a reasonable compromise to avoid that.

Closes #252
Closes #253
See #255 for a version of this PR that's more limited in scope, if you'd rather.