fix: deadlock when increasing partitioned consumers by BewareMyPower · Pull Request #1500 · apache/pulsar-client-go

BewareMyPower · 2026-05-19T15:19:57Z

Motivation

#1494 fixed a thread-safety issue but introduced a possible deadlock.

To fix the thread-safety issue, every API that needs to find a specific sub-consumer, such as Ack and Seek, now has to lock the parent consumer. However, internalTopicSubscribeToPartitions waits until all sub-consumers have been created:

pulsar-client-go/pulsar/consumer_impl.go

Lines 414 to 416 in 87ce8f9

    
cons, err := newPartitionConsumer(c, c.client, opts, c.messageCh, c.dlq, c.metrics)
ch <- ConsumerError{
	err:       err,

pulsar-client-go/pulsar/consumer_impl.go

Line 428 in 87ce8f9

for ce := range ch {

range ch blocks until the err returned by newPartitionConsumer has been sent to ch.

However, newPartitionConsumer could itself be blocked by grabConn:

pulsar-client-go/pulsar/consumer_partition.go

Line 434 in 87ce8f9

err := pc.grabConn("")

This could start a Subscribe RPC on a connection that might also receive the user's Ack or Seek requests.

Modifications

Add TestInternalTopicSubscribeToPartitionsDoesNotBlockExistingPartitionLookup to reproduce the deadlock.
Adopt a lock-free implementation to manage consumers; the test above passes with this change.
Delay dispatching until the new sub-consumers have been added to c.consumers. Otherwise, a message could be received from a new sub-consumer and acknowledged before that consumer is added to c.consumers.
Add TestInternalTopicSubscribeToPartitionsPublishesConsumersBeforeDispatchingMessages to verify this ordering.
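The publish-before-dispatch ordering can be sketched as follows. This is a minimal illustration, not the PR's implementation: `subConsumer`, `parent`, `addPartition`, and `find` are hypothetical names, and a channel that is closed by `startDispatcher` stands in for the real dispatcher start.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// subConsumer is a hypothetical stand-in for a partition consumer.
type subConsumer struct {
	id    int
	start chan struct{} // closed by startDispatcher
}

// newSubConsumer creates the consumer with its dispatcher parked.
func newSubConsumer(id int, out chan<- int) *subConsumer {
	s := &subConsumer{id: id, start: make(chan struct{})}
	go func() {
		<-s.start   // no message is dispatched before startDispatcher
		out <- s.id // the "message" carries the id of its sub-consumer
	}()
	return s
}

func (s *subConsumer) startDispatcher() { close(s.start) }

type parent struct {
	consumers atomic.Pointer[[]*subConsumer]
	messageCh chan int
}

// find reports whether a sub-consumer with the given id has been published.
func (p *parent) find(id int) bool {
	list := p.consumers.Load()
	if list == nil {
		return false
	}
	for _, s := range *list {
		if s.id == id {
			return true
		}
	}
	return false
}

// addPartition assumes a single writer, as in a discovery loop.
func (p *parent) addPartition(id int) {
	s := newSubConsumer(id, p.messageCh)
	var next []*subConsumer
	if old := p.consumers.Load(); old != nil {
		next = append(next, *old...)
	}
	next = append(next, s)
	p.consumers.Store(&next) // 1. publish the new list
	s.startDispatcher()      // 2. only now may messages flow
}

func main() {
	p := &parent{messageCh: make(chan int)}
	p.addPartition(7)
	id := <-p.messageCh
	fmt.Println("sub-consumer found for received message:", p.find(id))
}
```

Because the list is stored before the dispatcher is released, any code that receives a message can already look up the sub-consumer that produced it.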


Copilot

Pull request overview

This PR addresses a deadlock risk introduced by prior thread-safety changes when partitioned consumers auto-discover and subscribe to new partitions, by switching partition-consumer management to a lock-free publication approach and delaying dispatcher startup until after the new consumer list is published.

Changes:

Replace the partition-consumer container on consumer with an atomic.Value-backed, copy-on-write list and update call sites to read via partitionConsumers().
Delay starting newly-created partition dispatchers until after the updated partition-consumer list is published.
Add targeted regression tests to reproduce the deadlock and to ensure consumers are published before dispatching begins.
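As a rough illustration of the atomic.Value-backed, copy-on-write list mentioned above, here is a minimal sketch. It assumes a single writer (for example, one discovery loop); `appendPartitions` is a hypothetical helper, and the struct fields are simplified stand-ins rather than the real types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// partitionConsumer is a simplified stand-in for the real type.
type partitionConsumer struct{ partition int }

type consumer struct {
	// consumers always holds a []*partitionConsumer; atomic.Value requires
	// every Store to use the same concrete type.
	consumers atomic.Value
}

// partitionConsumers returns the current snapshot without taking a lock.
// Callers must treat the returned slice as read-only.
func (c *consumer) partitionConsumers() []*partitionConsumer {
	list, _ := c.consumers.Load().([]*partitionConsumer)
	return list
}

// appendPartitions (hypothetical) copies the current list, appends to the
// copy, and publishes the copy. It assumes there is only one writer.
func (c *consumer) appendPartitions(added ...*partitionConsumer) {
	old := c.partitionConsumers()
	next := make([]*partitionConsumer, 0, len(old)+len(added))
	next = append(next, old...)
	next = append(next, added...)
	c.consumers.Store(next)
}

func main() {
	c := &consumer{}
	c.appendPartitions(&partitionConsumer{partition: 0})

	snapshot := c.partitionConsumers() // e.g. taken by Ack or Seek
	c.appendPartitions(&partitionConsumer{partition: 1}, &partitionConsumer{partition: 2})

	fmt.Println("old snapshot:", len(snapshot), "current:", len(c.partitionConsumers()))
}
```

Readers never block and never observe a half-updated list; the trade-off is that a snapshot taken before an expansion does not include partitions added afterwards.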

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| pulsar/reader_test.go | Updates tests to use `partitionConsumers()` instead of direct `consumers` indexing. |
| pulsar/reader_impl.go | Uses `partitionConsumers()` snapshot when checking/using last message ID. |
| pulsar/message_chunking_test.go | Updates tests to use `partitionConsumers()` accessors. |
| pulsar/consumer_zero_queue.go | Updates `newPartitionConsumer` call to pass the new dispatcher-start flag. |
| pulsar/consumer_test.go | Adds regression tests + helpers for partition expansion deadlock and publication ordering. |
| pulsar/consumer_partition.go | Adds `startDispatcher` parameter and defers dispatcher startup accordingly; buffers `connectedCh`. |
| pulsar/consumer_impl.go | Reworks partition subscription to publish consumers atomically, updates partition lookups to use snapshots, and introduces `partitionConsumers()` helper. |

Comments suppressed due to low confidence (1)

pulsar/consumer_impl.go:770

Seek/SeekByTime no longer coordinate with background partition discovery/expansion. If internalTopicSubscribeToPartitions publishes/starts dispatchers concurrently, newly-added partitions won’t be paused and can dispatch into messageCh while these methods are draining it, leading to inconsistent seek results and possible busy-drain loops. Consider introducing a lightweight barrier (e.g., an RWMutex or atomic ‘update in progress’ gate) so seek operations can prevent partition publication/dispatcher start during their critical sections, without holding a lock across RPCs.

func (c *consumer) Seek(msgID MessageID) error {
	consumers := c.partitionConsumers()

	if len(consumers) > 1 {
		return newError(SeekFailed, "for partition topic, seek command should perform on the individual partitions")
	}

	consumer, err := findPartitionConsumer(consumers, msgID)
	if err != nil {
		return err
	}
	consumer.pauseDispatchMessage()
	// clear messageCh
	for len(c.messageCh) > 0 {
		<-c.messageCh
	}

	return consumer.Seek(msgID)
}

func (c *consumer) SeekByTime(time time.Time) error {
	var errs error
	consumers := c.partitionConsumers()

	for _, cons := range consumers {
		cons.pauseDispatchMessage()
	}
	// clear messageCh
	for len(c.messageCh) > 0 {
		<-c.messageCh
	}

	// run SeekByTime on every partition of topic
	for _, cons := range consumers {
		if err := cons.SeekByTime(time); err != nil {
			msg := fmt.Sprintf("unable to SeekByTime for topic=%s subscription=%s", c.topic, c.Subscription())
			errs = pkgerrors.Wrap(newError(SeekFailed, err.Error()), msg)
		}
	}
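One possible shape for the lightweight barrier suggested in this comment is sketched here. It is an assumption-laden illustration, not something the PR is known to contain: `seekGate`, `seekSection`, `publishSection`, and `gateDemo` are hypothetical names, and the closures only record events instead of pausing dispatchers or draining a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// seekGate is a hypothetical barrier between seek's pause-and-drain step and
// partition publication. It is never held across an RPC in this sketch.
type seekGate struct {
	mu sync.RWMutex
}

// seekSection runs fn exclusively: no publication can run concurrently.
func (g *seekGate) seekSection(fn func()) {
	g.mu.Lock()
	defer g.mu.Unlock()
	fn()
}

// publishSection runs fn under the shared side of the gate, so it waits for
// any seek section that is in progress.
func (g *seekGate) publishSection(fn func()) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	fn()
}

// gateDemo starts a publication while a seek section is running and returns
// the order in which the two steps actually executed.
func gateDemo() []string {
	var g seekGate
	var events []string
	var wg sync.WaitGroup

	g.seekSection(func() {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Blocks until the seek section has released the gate.
			g.publishSection(func() {
				events = append(events, "publish+start")
			})
		}()
		events = append(events, "pause+drain")
	})

	wg.Wait()
	return events
}

func main() {
	fmt.Println(gateDemo())
}
```

Slow work such as creating sub-consumers would stay outside both sections, so only the short publish-and-start step waits for a seek in progress.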


RobertIndie

Please take a look at Copilot's comments. The rest looks good to me.

BewareMyPower · 2026-05-20T12:16:08Z

@RobertIndie addressed, PTAL again

nodece · 2026-05-21T02:40:34Z

Great fix on the deadlock path from #1494 — the lock-free partitionConsumers() snapshot + delayed startDispatcher() is a solid direction for the original blocking issue.

I think there is still a new concurrency window introduced by removing the old mutex guarding consumers.

In the new code:

internalTopicSubscribeToPartitions() builds newConsumers, then publishes via c.consumers.Store(...), then starts dispatchers for new partitions.
closeWithCause() closes only one snapshot (consumers := c.partitionConsumers()) and no longer serializes with partition expansion.

So this interleaving seems possible:

Goroutine A enters internalTopicSubscribeToPartitions() and starts creating new partition consumers.
Goroutine B calls closeWithCause(), takes an old snapshot, closes those consumers, and continues close flow.
Goroutine A continues and still executes c.consumers.Store(...) + startDispatcher() for newly created consumers.

Result: newly added partition consumers may be published/started after close has already closed only the old snapshot.

Before this PR, the parent lock serialized expansion and close-related operations over c.consumers; removing that lock fixes deadlock, but also removes that lifecycle serialization.

Could we gate expansion with context cancellation or a closing flag (checked before create and before Store/startDispatcher), so in-flight expansion stops/cleans up once close starts?

BewareMyPower · 2026-05-21T06:49:12Z

@nodece closeWithCause calls c.stopDiscovery(), which stops the ticker and then waits, via wg.Wait(), for any in-flight internalTopicSubscribeToPartitions call to finish.

pulsar-client-go/pulsar/consumer_impl.go

Lines 346 to 349 in a1765ef

    
return func() {
	ticker.Stop()
	close(stopDiscoveryCh)
	wg.Wait()

BewareMyPower added 4 commits May 19, 2026 21:59

add test to reproduce

3ec66f1

fix deadlock

dbba0c5

address issue that consumer is not added while the message could be r…

6a67b11

…eceived and acknowledged

remove mutex

dd17a2b

BewareMyPower requested review from RobertIndie, crossoverJie, mattisonchao and nodece May 19, 2026 15:21

BewareMyPower self-assigned this May 19, 2026

BewareMyPower added this to the 0.20.0 milestone May 19, 2026

BewareMyPower added 3 commits May 20, 2026 11:09

fix lint error

9e1a7ad

reduce duplicated code

aee60c1

apply gofmt

e0ca1e0

RobertIndie requested a review from Copilot May 20, 2026 07:06

Copilot started reviewing on behalf of RobertIndie May 20, 2026 07:06

Copilot AI reviewed May 20, 2026

Comment thread pulsar/consumer_impl.go Outdated

Comment thread pulsar/consumer_impl.go

Comment thread pulsar/consumer_test.go Outdated

RobertIndie reviewed May 20, 2026

BewareMyPower added 2 commits May 20, 2026 20:13

Merge branch 'master' into bewaremypower/fix-increase-deadlock

1c84273

address copilot comments

2c616f7

nodece approved these changes May 21, 2026

BewareMyPower merged commit eade693 into apache:master May 21, 2026
11 of 12 checks passed

BewareMyPower deleted the bewaremypower/fix-increase-deadlock branch May 21, 2026 08:48

BewareMyPower merged 9 commits into apache:master from BewareMyPower:bewaremypower/fix-increase-deadlock
