
fix: reconnect(): respect QoS and fail-safe#254

Open
BMDan wants to merge 3 commits into adafruit:main from BMDan:fix/reconnect_qos_and_drops

Conversation


@BMDan BMDan commented Feb 12, 2026

See the two attached issues for more information.

The alternative to the overly-broad except would be to burn some RAM on storing a "true" copy separately somewhere in the class. This felt like a reasonable compromise to avoid that.

Closes #252
Closes #253

See #255 for a version of this PR that's more limited in scope, if you'd rather.

@BMDan BMDan force-pushed the fix/reconnect_qos_and_drops branch from 819bb6d to a5a620b on February 12, 2026 at 05:40
@BMDan BMDan force-pushed the fix/reconnect_qos_and_drops branch from a5a620b to 2451d2a on February 12, 2026 at 05:45
Contributor

@vladak vladak left a comment


Stashing the topics is fine, to me. I am not comfortable with the broad catch, though.

Comment thread on adafruit_minimqtt/adafruit_minimqtt.py (outdated):

            while subscribed_topics:
                feed = subscribed_topics.pop()
                self.subscribe(*feed)
        except Exception:
Contributor (@vladak):

I wonder if the broad exception could be reduced to the MQTT exception?

Author (@BMDan):

To be clear, no matter what the exception is, we re-raise it (see the bare raise a few lines below). This is the moral equivalent of a finally or defer clause; we aren't masking or handling the exception, merely pausing its propagation long enough to make sure our object is left in a sane state. In fact, we could do it with finally, if you'd prefer.
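A minimal sketch of that pause-and-re-raise pattern, using a toy class rather than MiniMQTT's actual code (`ToyClient` and `_subscribe_one` are hypothetical names for illustration):

```python
class ToyClient:
    """Toy stand-in for an MQTT client; not the real MiniMQTT API."""

    def __init__(self, topics):
        self._subscribed_topics = list(topics)

    def _subscribe_one(self, topic):
        # Pretend the broker rejects one specific topic.
        if topic == "boom":
            raise RuntimeError("subscribe failed")

    def resubscribe(self):
        remaining = self._subscribed_topics
        self._subscribed_topics = []
        try:
            while remaining:
                self._subscribe_one(remaining[-1])
                self._subscribed_topics.append(remaining.pop())
        except Exception:
            # Not masking or handling the error: restore the topics that
            # were not yet re-subscribed so the object stays sane, then
            # let the original exception keep propagating via the bare raise.
            self._subscribed_topics.extend(remaining)
            raise
```

After a failure, `_subscribed_topics` still contains every topic, so a later retry can finish the job, while the caller still sees the original exception unchanged.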

As I see it, there are three options here, in addition to what I've implemented:

  1. Track _original_subscriptions (or _remaining_original_subscriptions) in the object. This costs some space and complexity, but isn't otherwise too terrible. It does introduce an edge case where we would potentially subscribe to a topic twice, but that's probably not too awful.
  2. Narrow the scope of the exception being caught. This reduces the likelihood that we stomp on someone else's toe, but reintroduces the risk that an error that is not within our caught scope (even something so prosaic as an IndexError arising partway through a re-subscription) could cause us to violate our API contract and not fully re-subscribe upon reconnect.
  3. We could leave it to the caller to identify this situation. This feels like the worst option; it requires the caller either to issue spurious subscribe()s, or to look at our private class vars (_subscribed_topics). Plus, it means our guarantee of re-subscription upon reconnect cannot be relied upon.
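For comparison, option 1 might look roughly like this (attribute names such as `_original_subscriptions` are hypothetical; this is a sketch, not a proposed patch):

```python
class ToyClient:
    """Illustrates option 1: keep a separate master copy of subscriptions."""

    def __init__(self):
        self._subscribed_topics = []
        # The extra RAM cost: a second, authoritative list of every feed.
        self._original_subscriptions = []

    def subscribe(self, topic, qos=0):
        feed = (topic, qos)
        if feed not in self._original_subscriptions:
            self._original_subscriptions.append(feed)
        self._subscribed_topics.append(feed)

    def resubscribe(self):
        self._subscribed_topics = []
        for feed in self._original_subscriptions:
            # If this raises partway through, _original_subscriptions is
            # intact, so a retry sees the full list -- but topics that
            # succeeded on the first pass would be SUBSCRIBE'd again at
            # the broker, which is the double-subscribe edge case.
            self.subscribe(*feed)
```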

Contributor (@vladak):

Thanks for the detailed information about the thought process, really appreciated.

I think the key question here is whether anything besides MMQTTException being raised from the depths of the library code is expected to be recoverable (in general and also w.r.t. the internal MQTT object state). My take on this is that if there is, it should be wrapped in MMQTTException, i.e. I do not see the need for the broad exception catch.
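One way to read that suggestion: internal failures get normalized into MMQTTException at the library boundary. A hedged sketch of that idea (the `normalize_errors` decorator is hypothetical, and `MMQTTException` here is a local stand-in for the real class):

```python
class MMQTTException(Exception):
    """Local stand-in for adafruit_minimqtt's MMQTTException."""


def normalize_errors(func):
    """Re-raise unexpected errors as MMQTTException, chaining the
    original so callers can still inspect it via __cause__."""

    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except MMQTTException:
            raise  # already the expected type
        except Exception as exc:
            raise MMQTTException(f"internal error: {exc!r}") from exc

    return wrapper


@normalize_errors
def flaky_internal_step():
    # e.g. an IndexError arising partway through a re-subscription
    raise IndexError("boom")
```

With something like this in place, a narrow `except MMQTTException` inside reconnect() would once again cover every internal failure mode.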

Author (@BMDan):

Ah, I see. Thank you for laying that out so clearly!

To work out the best path, I think it's helpful to have a real scenario. One of the lines within the try-catch is:

        self.logger.debug("Reconnected with broker")

Let us imagine that a custom global logging handler has a phase-of-the-moon (PotM) bug: an exception will be raised from self.logger.debug if it is called at 4:56 A.M. on any Tuesday in March of 2026.

I put myself in the position of an engineer (who doesn't control the MiniMQTT library) who became aware of this bug when it triggered last week, but doesn't quite know how to reproduce it yet. Helpfully, the custom logger handler throws a corresponding CustomLoggingException, and my kernel is a nice, clean loop, so I can do something like:

current_state = State()

while True:
    try:
        current_state.run_main_loop_once()
    except CustomLoggingException as e:
        # upload lots of debugging info, then...
        pass

And inside of run_main_loop_once, we already had something like:

try:
    mqtt_client.ping()
except MMQTTException:
    # Per docs for MMQTTException, "In general, the robust way to recover is to call reconnect()."
    mqtt_client.reconnect()

Perfect! Now I've got resilient code that won't crash in the face of a CLE, but will give me lots of debugging info.

The problem is: if we happen to fail our ping right around 4:55 A.M., and the time ticks over to 4:56 A.M. partway through the reconnect loop, and the next debug call happens to be the one inside the MiniMQTT library's reconnect(), then when I resume after that CLE, I will only be subscribed to a fraction of my topics. As you can see, that's quite a difficult scenario to debug.

However, I think the bigger issue is that, if that bug is found a different way that doesn't result in a partial re-subscription, then even given a very skilled programmer who is tasked with working around that bug (and, let us say, is somehow prevented from directly addressing the bug itself), their solution almost certainly would rely on reconnect()'s apparent semantics and thus would introduce a new, far more subtle bug that's incredibly difficult to reproduce. Indeed, even given an omniscient programmer who foresaw how reconnect() would be affected, their only options to handle it cleanly are:

  1. Reach inside of their client to query _subscribed_topics and compare that to a locally-kept complete list, or
  2. Tear down their client entirely and rebuild it from scratch any time an error occurs in an MQTT function.

Both of these require a lot more (branching) code, and both carry costs in at least two of the three categories of CPU, memory, and/or network traffic. Further, they require a level of defensive coding that seems unreasonable to expect from a consumer of this library.

Put simply, our API contract isn't supposed to require this sort of legwork from our upstream consumer. They were told that the resub_topics parameter worked in a particular way. I think it'd therefore be a bug if reconnect(resub_topics=True) didn't result in a full resubscription if called twice, even if the first call threw an exception of some kind, so long as the second of those calls succeeded. The whole idea of "reconnect and resubscribe" is to restore a known-good state. Let's do that.
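The contract being argued for can be stated as a runnable property, using a toy client whose first reconnect fails partway through (everything below is illustrative, not the library's code; `_fail_once` simulates the mid-loop exception):

```python
class ToyClient:
    """Toy model of the desired reconnect(resub_topics=True) contract."""

    def __init__(self, topics):
        self._subscribed_topics = list(topics)
        self.active = []         # topics the broker currently knows about
        self._fail_once = True   # simulate one mid-loop exception

    def reconnect(self):
        self.active = []
        pending = self._subscribed_topics
        self._subscribed_topics = []
        try:
            while pending:
                if self._fail_once and self.active:
                    self._fail_once = False
                    raise RuntimeError("broker hiccup mid-resubscribe")
                self.active.append(pending[-1])
                self._subscribed_topics.append(pending.pop())
        except Exception:
            # Restore the un-resubscribed topics before re-raising, so a
            # second call can still restore the known-good state.
            self._subscribed_topics.extend(pending)
            raise


# First call raises partway through; the second call restores everything.
client = ToyClient(["foo", "bar", "baz"])
try:
    client.reconnect()
except RuntimeError:
    pass
client.reconnect()
assert set(client.active) == {"foo", "bar", "baz"}
```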

@vladak
Contributor

vladak commented Feb 23, 2026

Here's a test case, primarily meant for #253; however, it can serve as a test for #252 as well:

diff --git a/tests/test_reconnect.py b/tests/test_reconnect.py
index 52b8c76..f5f73fe 100644
--- a/tests/test_reconnect.py
+++ b/tests/test_reconnect.py
@@ -237,3 +237,71 @@ def test_reconnect_not_connected() -> None:
 
     assert user_data.get("disconnect") == False
     assert mqtt_client._connection_manager.close_cnt == 0
+
+
+def test_reconnect_subscribe_failure() -> None:
+    """
+    Test reconnect() will not lose previously subscribed topics on subscribe
+    failure inside reconnect().
+
+    This is a bit finicky as it relies on reconnect() calling subscribe() for each
+    topic separately and in reverse order. Also, it checks the internal
+    _subscribed_topics variable and assumes it stores the topics-to-be-subscribed
+    rather than already subscribed topics.
+    """
+    logging.basicConfig()
+    logger = logging.getLogger(__name__)
+    logger.setLevel(logging.DEBUG)
+
+    host = "localhost"
+    port = 1883
+
+    mqtt_client = MQTT.MQTT(
+        broker=host,
+        port=port,
+        ssl_context=ssl.create_default_context(),
+        connect_retries=1,
+    )
+
+    mocket = Mocket(
+        bytearray([
+            0x20,  # CONNACK
+            0x02,
+            0x00,
+            0x00,
+            0x90,  # SUBACK
+            0x03,
+            0x00,
+            0x01,
+            0x00,
+            0x00,
+            0x20,  # CONNACK
+            0x02,
+            0x00,
+            0x00,
+            0x90,  # SUBACK
+            0x02,
+            0x00,
+            0x02,
+            0x00,
+            0x90,  # SUBACK to make subscribe to bar fail
+            0x02,
+            0x00,
+            0x03,
+            0x80,
+        ])
+    )
+    mqtt_client._connection_manager = FakeConnectionManager(mocket)
+    mqtt_client.connect()
+
+    mqtt_client.logger = logger
+
+    topics = [("bar", 0), ("foo", 0)]
+    logger.info(f"subscribing to {topics}")
+    mqtt_client.subscribe(topics)
+
+    with pytest.raises(MQTT.MMQTTException):
+        logger.info("reconnecting")
+        mqtt_client.reconnect()
+
+    assert set(mqtt_client._subscribed_topics) == set(topics)


Development

Successfully merging this pull request may close these issues.

  - reconnect() can produce partial resubscriptions
  - reconnect() loses QoS on subscribed topics

2 participants