fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry by dkasimovskiy · Pull Request #100 · tarantool/tarantool-java-sdk

dkasimovskiy · 2026-06-19T08:11:59Z

Summary

PoolEntry had several races between netty IO threads, the HashedWheelTimer heartbeat worker, and reconnect calls, manifesting as ABBA deadlocks, NPEs on inline connect failures, and leaked connections after a KILL/reconnect cycle in CI.

Production fixes

PoolEntry

Synchronize state mutations; mark fields volatile/AtomicBoolean for cross-thread visibility.
Narrow monitor critical sections to field mutations; call client.close() and emit() outside the monitor to break the ABBA deadlock between ConnectionImpl and PoolEntry monitors that hung DistributingRoundRobinBalancerTest.
Return the local chain future from internalConnect() so an inline connect failure cannot leave the caller observing a null connectFuture after handleConnectError() nulls it for reconnect.
Keep client.close() in shutdown() on every invocation (idempotent via closeChannel no-op) but guard only the onConnectionClosed emit, so a KILL-then-reconnect cycle cannot leak the new connection when its auth/ping subsequently fails.
Serialize connectAfter() reconnect-task scheduling to avoid double scheduling; double-check connectFuture in internalConnect() to return the in-flight future instead of starting a new connect.

IProtoClientPoolImpl

Synchronize forEach() on connectionPoolLock to avoid CME under concurrent setGroups().

Test fixes

Wait for box.stat.net().CONNECTIONS.current to stabilise in BasePoolTest.getActiveConnectionsCount: the IProto worker updates it asynchronously, so a single read often lags by 5-15 connections when 20+ are opened in a burst.
Collapse the wait-for-stable Lua script to a single line so tarantool emits one YAML document (SnakeYAML rejects multi-document streams).
Wait for the active connection count to reach the expected value in ConnectionPoolReconnectsTest post-reconnect assertions via a new waitForActiveConnections() helper.

I haven't forgotten about:

Related issues:

…oolEntry PoolEntry state transitions had several races between netty IO threads, the HashedWheelTimer heartbeat worker, and user-facing reconnect calls that surfaced as ABBA deadlocks, NPEs on inline connect failures, and leaked connections after a KILL/reconnect cycle in CI. PoolEntry: - Synchronize state mutations and mark connectFuture, heartbeatTask, reconnectTask, lastHeartbeatEvent, isLocked, and isShutdown volatile/AtomicBoolean for cross-thread visibility. - Narrow the entry-monitor critical sections to field mutations; call client.close() and emit() outside the monitor to break the ABBA deadlock between ConnectionImpl and PoolEntry monitors that hung DistributingRoundRobinBalancerTest. - Return the local connect future from internalConnect() so an inline connect failure cannot leave the caller observing a null connectFuture after handleConnectError() nulls it for reconnect. - Keep client.close() in shutdown() on every invocation (closeChannel is idempotent if already closed) but guard only the onConnectionClosed emit, so a KILL-then-reconnect cycle cannot leak the new connection when its auth/ping subsequently fails. - Serialize connectAfter() reconnect-task scheduling to avoid double scheduling; add a double-check of connectFuture in internalConnect() to return the in-flight future instead of starting a new connect. IProtoClientPoolImpl: - Synchronize forEach() on connectionPoolLock to avoid CME under concurrent setGroups(). Tests (BasePoolTest / ConnectionPoolReconnectsTest): - Wait for box.stat.net().CONNECTIONS.current to stabilise in getActiveConnectionsCount: the IProto worker updates it asynchronously, so a single read often lags by 5-15 connections when 20+ are opened in a burst. - Collapse the wait-for-stable Lua script to a single line so tarantool emits one YAML document (SnakeYAML rejects multi-document streams). - Wait for the active connection count to reach the expected value in ConnectionPoolReconnectsTest post-reconnect assertions via a new waitForActiveConnections() helper. Verified locally on 3.5.0 and 2.11.8: ConnectionPoolReconnectsTest, ConnectionPoolTest, ConnectionPoolHeartbeatTest, DistributingRoundRobinBalancerTest, and unit tests all pass consistently where they were previously flaky.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…oolEntry Fix a relocated ABBA deadlock: `PoolEntry.connect()`/`internalConnect()` no longer hold the entry monitor across `client.connect()`, which would deadlock with the Netty close-callback path on the shared `ConnectionImpl` monitor during a close/reconnect overlap. Fix `onConnectionClosed` accounting: the `PoolEntry` shutdown idempotency flag is now reset per connection generation, so a KILL/reconnect cycle emits one close event per generation instead of suppressing all closes after the first. `IProtoClientPoolImpl.forEach()` snapshots clients under the pool lock and invokes the action outside it, avoiding `ConcurrentModificationException` under concurrent `setGroups()` without holding the lock across a user callback.

bitgorbovsky · 2026-06-22T08:25:40Z

   * <p>Also decrements count of unavailable clients and cancels reconnect task.
   */
-  public void unlock() {
+  public synchronized void unlock() {


Хорошо бы отразить здесь в комментарии, из каких потоков может быть вызован этот метод.

Из потока пользовательского кода: при вызове функции изменения настроек обоймы подключений может вызываться unlock(), когда закрываются лишние). Не вызывается практически никогда. (Событие А).

Из потока ввода/вывода Netty при обработке ответов на запросы IPROTO_PING (или другого метода для проверки): при анализе ответа может вызываться как lock, так и unlock. (Событие Б).

Из потока ввода/вывода Netty при обработке события успешного подключения: вызывается unlock. (Событие В).

Здесь указанные события одновременно произойти не могут:

Событием А можно пренебречь.

Событие Б не может произойти одновременно с событием В, так как событие Б случается только после того, как успешно подключились и запустили задачу проверки соединения.

bitgorbovsky · 2026-06-22T08:34:15Z

   * <p>Also increments count of unavailable clients.
   */
-  public void lock() {
+  public synchronized void lock() {


Тоже давай отразим в комментарии, из каких потоков может быть вызван этот метод, и что может случиться:

Из потока ввода/вывода Netty в случае, когда подключение не удаётся: вызывается handleConnectError. (обозначим как событие A).

Из потока ввода/вывода Netty в случае, когда происходит обрыв подключения: также вызывается handleConnectError. (обозначим как событие Б).

Из потока ввода/вывода Netty в случае, когда происходит анализ ответа на запрос PING и принимается решение заблокировать соединение, чтобы вывести его из распределения нагрузки. (обозначим как событие В).

Есть вероятность, что могут случиться одновременно события Б и В (решили вывести соединение из под нагрузки и тут же прилетает его завершение).2

В этом случае может увеличиться дважды счётчик unavailable.

Какие еще события могут произойти?

bitgorbovsky · 2026-06-22T08:41:51Z

+   * with the Netty close-callback path, which takes the {@code ConnectionImpl} monitor first and
+   * then re-enters {@link #handleConnectError(Object, Throwable)} on the entry.
+   */
  public void shutdown() {


shutdown() также надо проанализировать, из каких потоков он может быть вызван.
Как минимум, из потока пользовательского кода (закрытие обоймы соединений) ии... каких еще?

bitgorbovsky · 2026-06-22T08:42:57Z

-      return (Integer) result.get(0) - 1;
+      // box.stat.net().CONNECTIONS.current is updated asynchronously by the IProto worker;
+      // the loop's fiber.sleep lets it drain pending connections before we read.
+      String lua =


Мне кажется, что именно эти изменения, вкупе с изменениями ниже (waitForActiveConnections) и дают наибольшй эффект, нежели попытки обмазать всё блокировками в PoolEntry

bitgorbovsky · 2026-06-22T08:44:21Z

   * will not be returned to outer client.
   */
-  private boolean isLocked;
+  private volatile boolean isLocked;


Т.к. подключение в обойме может быть затронуто из нескольких потоков, для его блокировки и разблокировки, то вероятно, полезнее будет сделать isLocked как AtomicBoolean и переделать методы блокировки через него. Возможно, этого изменения будет достаточно

dkasimovskiy and others added 2 commits June 19, 2026 11:08

docs(changelog): add Pooling bug-fix entry for race/deadlock/leak fixes

6ba4d21

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

dkasimovskiy requested a review from bitgorbovsky June 19, 2026 08:45

dkasimovskiy added 2 commits June 19, 2026 17:28

bitgorbovsky requested changes Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100

fix(pooling): fix race conditions, deadlock, and connection leak in PoolEntry#100
dkasimovskiy wants to merge 4 commits into
masterfrom
fix/connection-pool-flaky-tests

dkasimovskiy commented Jun 19, 2026 •

edited

Loading

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

bitgorbovsky Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dkasimovskiy commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Production fixes

Test fixes

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

bitgorbovsky Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dkasimovskiy commented Jun 19, 2026 •

edited

Loading