Skip to content

[Bug] TOCTOU race condition in MQClientInstance.brokerVersionTable access #10214

@daguimu

Description

@daguimu

Before Creating the Bug Report

  • I found a bug, not just asking a question, which should be created in GitHub Discussions.

  • I have searched the GitHub Issues and GitHub Discussions of this repository and believe that this is not a duplicate.

  • I have confirmed that this bug belongs to the current repository, not other repositories of RocketMQ.

Runtime platform environment

All platforms

RocketMQ version

develop branch (latest)

JDK Version

JDK 8+

Describe the Bug

In MQClientInstance.java, the brokerVersionTable (a ConcurrentHashMap) is accessed using a non-atomic check-then-act pattern in multiple locations:

Location 1 - sendHeartbeatToBroker() (line ~657):

if (!this.brokerVersionTable.containsKey(brokerName)) {
    this.brokerVersionTable.put(brokerName, new ConcurrentHashMap<>(4));
}
this.brokerVersionTable.get(brokerName).put(addr, version);

Location 2 - sendHeartbeatToAllBrokerV2() (line ~737):
Same pattern as Location 1.

Location 3 - findBrokerVersion() (line ~1304):

if (this.brokerVersionTable.containsKey(brokerName)) {
    if (this.brokerVersionTable.get(brokerName).containsKey(brokerAddr)) {
        return this.brokerVersionTable.get(brokerName).get(brokerAddr);
    }
}

Between containsKey() and get(), another thread could remove the entry, causing a NullPointerException when the result of get() is dereferenced.

Notably, a similar operation in the same class at line ~814 already uses the correct pattern:

ConcurrentHashMap<String, Integer> inner = MQClientInstance.this.brokerVersionTable
    .computeIfAbsent(brokerName, k -> new ConcurrentHashMap<>(4));
inner.put(brokerAddr, version);

Steps to Reproduce

The race condition occurs under concurrent heartbeat sending and broker disconnection. When a broker is removed from the version table concurrently with a heartbeat send or version lookup, a NullPointerException can occur.

What Did You Expect to See?

Thread-safe access to brokerVersionTable using atomic operations like computeIfAbsent and local variable caching.

What Did You See Instead?

Non-atomic check-then-act pattern that can result in NullPointerException under concurrent access.

Additional Context

Related to previously reported #8743 (closed by stale bot without fix).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions