
[Bug] A deadlock in ServiceInstancesChangedListener can prevent the consumer from using the latest provider instance list #16149

@mz0113

Description


Pre-check

  • I am sure that all the content I provide is in English.

Search before asking

  • I had searched in the issues and found no similar issues.

Apache Dubbo Component

Java SDK (apache/dubbo)

Dubbo Version

Both 3.0.12 and 3.2.18 have this bug.
JDK 1.8
Windows 11

Steps to reproduce this issue

(Thread dump screenshots are attached to the original issue.)

Service providers going offline, or changing IP addresses after a restart, can occasionally trigger this deadlock. The symptom is that the consumer side fails to detect changes in provider nodes, resulting in "no provider" errors. In the thread dump, the nacos.publisher-com.alibaba.nacos.client.naming.event.InstancesChangeEvent thread waits indefinitely to acquire the metadata lock, which is monopolized by the metadata retry thread. Each metadata retry blocks for several seconds, since the target provider is already offline and unreachable. As a result, InstancesChangeEvent processing cannot proceed, and the Nacos event queue accumulates thousands or even tens of thousands of pending events that are never drained.
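The contention described above can be reduced to a minimal sketch: one thread holds a shared lock for the full duration of a slow, doomed call, while another thread needs the same lock to apply each incoming event. The class and method names below are illustrative, not Dubbo's actual code; the lock simply stands in for the metadata lock the two threads fight over.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class MetadataLockStarvation {

    /**
     * Simulates the contention: a "retry" thread holds the metadata lock for
     * holdMs (a slow call to an offline provider), while the "event" thread
     * waits up to waitMs to take the same lock and apply an
     * InstancesChangeEvent. Returns whether the event thread got the lock.
     */
    public static boolean eventCanProceed(long holdMs, long waitMs) {
        ReentrantLock metadataLock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread retry = new Thread(() -> {
            metadataLock.lock();
            lockHeld.countDown();
            try {
                Thread.sleep(holdMs);        // the doomed RPC to a dead node
            } catch (InterruptedException ignored) {
            } finally {
                metadataLock.unlock();
            }
        });
        retry.start();

        try {
            lockHeld.await();                // retry thread owns the lock now
            boolean got = metadataLock.tryLock(waitMs, TimeUnit.MILLISECONDS);
            if (got) {
                metadataLock.unlock();
            }
            retry.join();
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Retry holds the lock far longer than the event thread will wait:
        System.out.println("event applied: " + eventCanProceed(500, 100));
        // With a short hold, the event thread gets through:
        System.out.println("event applied: " + eventCanProceed(50, 1000));
    }
}
```

In the real scenario the retry thread re-enters this critical section every cycle, so the brief window between retries is the only chance the publisher thread ever gets, and the queue behind it backs up.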

This issue occurs probabilistically but tends to manifest at least once every few days, especially in offline testing environments where service changes are frequent.

From code analysis, the metadata retry thread fetches metadata using the org.apache.dubbo.registry.client.event.listener.ServiceInstancesChangedListener#allInstances field, which is supposed to hold the latest provider list. The crux lies exactly here: a "virtual" deadlock. Because the InstancesChangeEvent thread cannot acquire the lock, allInstances is never updated and still contains the offline provider nodes. The retry thread therefore keeps retrying those defunct nodes, repeatedly re-acquiring the metadata lock in the process. The system settles into a self-sustaining deadlock loop with no possibility of recovering on its own.
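One way to break this feedback loop, sketched below under my own assumptions rather than as Dubbo's actual fix, is to publish the instance snapshot through a lock-free reference: the event thread can then swap in a fresh list without contending for the metadata lock, and the retry loop, which re-reads the snapshot each attempt, stops targeting offline nodes. `StaleSnapshotDemo` and its members are hypothetical stand-ins for `ServiceInstancesChangedListener#allInstances`.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class StaleSnapshotDemo {
    // Hypothetical stand-in for ServiceInstancesChangedListener#allInstances:
    // published via an AtomicReference so readers never need the contended lock.
    private final AtomicReference<List<String>> allInstances =
            new AtomicReference<>(List.of("provider-A(offline)"));

    /** What the metadata retry loop would target on its next attempt. */
    public List<String> nextRetryTargets() {
        return allInstances.get();
    }

    /** The event thread publishes a fresh list without taking any lock. */
    public void onInstancesChanged(List<String> latest) {
        allInstances.set(latest);
    }

    public static void main(String[] args) {
        StaleSnapshotDemo demo = new StaleSnapshotDemo();
        System.out.println(demo.nextRetryTargets());     // stale: offline node
        demo.onInstancesChanged(List.of("provider-B"));
        System.out.println(demo.nextRetryTargets());     // fresh list is visible
    }
}
```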

What you expected to happen

The consumer should not deadlock; it should always pick up the latest provider instance list.

Anything else

No response

Do you have a (mini) reproduction demo?

  • Yes, I have a minimal reproduction demo to help resolve this issue more effectively!

Are you willing to submit a pull request to fix on your own?

  • Yes I am willing to submit a pull request on my own!
