feat(ssh): add SSH keepalive and auto-reconnect for proxy #37

Merged
zx06 merged 1 commit into main from feature/proxy-reconnect
Mar 25, 2026

zx06 commented Mar 25, 2026

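Problem

After xsql proxy starts, if the network between the SSH tunnel and the remote end drops, the proxy cannot recover on its own. Users only see a timeout error when they retry the connection:

[proxy] failed to dial remote 10.0.0.1:63306: read tcp ... read: connection timed out

Root cause

  1. The SSH client is created once and never rebuilt
  2. There is no keepalive mechanism, so dead connections are never detected proactively
  3. There is no auto-reconnect logic

Solution

Introduce two mechanisms, SSH keepalive and ReconnectDialer:

SSH Keepalive

  • Periodically sends keepalive@openssh.com requests to probe whether the connection is alive (default interval: 30s)
  • Treats the connection as dead after 3 consecutive failures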
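
The keepalive rule above (a probe every interval, dead after 3 consecutive failures) can be sketched as a small failure counter. This is an illustrative sketch only, not the code from this PR: `prober`, `failureTracker`, `probeUntilDead`, and `scripted` are hypothetical names, and the probe is a stand-in for whatever `SendKeepalive()` does in internal/ssh/client.go. A real monitor would call the probe from a `time.Ticker` loop at the configured interval instead of in a tight loop.

```go
package main

import "fmt"

// prober is a hypothetical stand-in for one keepalive round trip.
type prober func() error

// failureTracker counts consecutive probe failures.
type failureTracker struct {
	countMax int // consecutive failures that mark the connection dead
	failures int
}

// observe records one probe result and reports whether the connection
// should now be treated as dead. Any success resets the counter.
func (t *failureTracker) observe(err error) bool {
	if err == nil {
		t.failures = 0
		return false
	}
	t.failures++
	return t.failures >= t.countMax
}

// probeUntilDead runs at most maxProbes probes and returns the 1-based
// index of the probe that marked the connection dead, or 0 if none did.
func probeUntilDead(probe prober, countMax, maxProbes int) int {
	t := &failureTracker{countMax: countMax}
	for i := 1; i <= maxProbes; i++ {
		if t.observe(probe()) {
			return i
		}
	}
	return 0
}

// scripted builds a fake prober from a pattern: 'x' fails, anything else
// succeeds. Once the pattern is exhausted every probe succeeds.
func scripted(pattern string) prober {
	i := 0
	return func() error {
		if i >= len(pattern) {
			return nil
		}
		c := pattern[i]
		i++
		if c == 'x' {
			return fmt.Errorf("keepalive probe %d failed", i)
		}
		return nil
	}
}

func main() {
	// fail, ok, fail, fail, fail: the success resets the counter,
	// so the connection is declared dead on the fifth probe.
	fmt.Println(probeUntilDead(scripted("x.xxx"), 3, 10)) // prints 5
}
```

Counting consecutive failures rather than total failures means a single lost probe on a flaky link does not trigger a reconnect.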

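ReconnectDialer

  • Wraps the SSH client and reconnects automatically when a dial fails or keepalive detects a dead connection
  • Retries with exponential backoff
  • Thread-safe; supports concurrent dials
  • Status callbacks write to stderr:
    [proxy] SSH connection lost: <error>
    [proxy] reconnecting to SSH server...
    [proxy] SSH reconnected successfully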
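
As a rough illustration of retrying with exponential backoff, the sketch below doubles the delay after each failed attempt up to a cap and stops early if the context is cancelled. The names `backoffDelay` and `retryConnect`, the attempt limit, and the delay values are all assumptions made for this example; the PR description does not state which parameters ReconnectDialer actually uses.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backoffDelay returns base doubled `attempt` times, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// retryConnect calls connect until it succeeds, maxAttempts is reached,
// or ctx is done. It returns the number of attempts that were made.
func retryConnect(ctx context.Context, connect func() error,
	maxAttempts int, base, max time.Duration) (int, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if lastErr = connect(); lastErr == nil {
			return attempt + 1, nil
		}
		if attempt == maxAttempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return attempt + 1, ctx.Err()
		case <-time.After(backoffDelay(attempt, base, max)):
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	connect := func() error { // hypothetical connect: fails twice, then succeeds
		calls++
		if calls < 3 {
			return fmt.Errorf("dial failed (call %d)", calls)
		}
		return nil
	}
	n, err := retryConnect(context.Background(), connect, 5,
		10*time.Millisecond, 80*time.Millisecond)
	fmt.Println(n, err) // prints 3 <nil>
}
```

Capping the delay keeps recovery time bounded after a long outage, and selecting on `ctx.Done()` lets a shutdown interrupt the wait between attempts.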
    

Changed files

| File | Change |
| --- | --- |
| internal/ssh/options.go | Add KeepaliveInterval and KeepaliveCountMax fields |
| internal/ssh/client.go | Add SendKeepalive(), Alive(), and a nil-safe DialContext() |
| internal/ssh/reconnect.go | New ReconnectDialer implementation |
| internal/app/conn.go | Add ResolveReconnectableSSH; refactor resolveSSHOptions |
| cmd/xsql/proxy.go | Integrate ReconnectDialer and status logging |
| docs/ssh-proxy.md | Documentation update: keepalive and auto-reconnect |

Test coverage

| Package | Coverage |
| --- | --- |
| internal/ssh | 87.8% |
| internal/app | 85.1% |
| internal/proxy | 94.2% |

New tests:

  • 17 ReconnectDialer unit tests
  • 4 SSH client keepalive tests
  • 4 app-layer reconnectable SSH tests
  • Fixed timeouts in the TestConnect_DefaultPort/DefaultUser tests

Unchanged

  • Error codes and exit codes are unchanged
  • JSON output format is unchanged
  • xsql query / xsql schema dump are unaffected (short-lived connections)


codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 78.72340% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.97%. Comparing base (add3ea7) to head (01bba5e).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/ssh/reconnect.go | 85.71% | 13 Missing and 5 partials ⚠️ |
| cmd/xsql/proxy.go | 6.66% | 13 Missing and 1 partial ⚠️ |
| internal/ssh/client.go | 61.53% | 5 Missing ⚠️ |
| internal/app/conn.go | 88.46% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #37      +/-   ##
==========================================
+ Coverage   80.92%   80.97%   +0.04%     
==========================================
  Files          40       41       +1     
  Lines        2726     2901     +175     
==========================================
+ Hits         2206     2349     +143     
- Misses        397      423      +26     
- Partials      123      129       +6     

| Flag | Coverage Δ |
| --- | --- |
| e2e | 44.09% <ø> (ø) |
| integration | 44.09% <ø> (ø) |
| unittests | 69.04% <78.72%> (+0.81%) ⬆️ |

Problem: When running 'xsql proxy', if the SSH tunnel connection to the
remote server is interrupted (e.g., network outage), the proxy becomes
unusable. Users only discover this when they attempt to use the proxy and
receive timeout errors.

Solution:
- Add SSH keepalive mechanism (keepalive@openssh.com probes) to detect
  dead connections proactively
- Introduce ReconnectDialer that wraps SSH Client with automatic
  reconnection on connection failure or keepalive death detection
- Integrate ReconnectDialer into the proxy command for seamless recovery
- Emit status events (connected/disconnected/reconnecting/reconnected)
  to stderr for user visibility

Changes:
- internal/ssh/options.go: Add KeepaliveInterval and KeepaliveCountMax
- internal/ssh/client.go: Add SendKeepalive(), Alive(), nil-safe DialContext
- internal/ssh/reconnect.go: New ReconnectDialer with keepalive monitoring
- internal/app/conn.go: Add ResolveReconnectableSSH, refactor resolveSSHOptions
- cmd/xsql/proxy.go: Use ReconnectDialer with status logging
- docs/ssh-proxy.md: Document keepalive and auto-reconnect behavior

Testing:
- 17 new tests for ReconnectDialer (reconnect_test.go)
- 4 new tests for SSH client keepalive methods
- 4 new tests for app layer reconnectable SSH
- Fix pre-existing test timeouts (TestConnect_DefaultPort/User)
- New code coverage: ssh 87.8%, app 85.1%, proxy 94.2%

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
zx06 force-pushed the feature/proxy-reconnect branch from fb7a2d9 to 01bba5e March 25, 2026 03:15
@sonarqubecloud

Quality Gate failed

Failed conditions
3.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

zx06 merged commit a18431f into main Mar 25, 2026
13 of 14 checks passed
zx06 deleted the feature/proxy-reconnect branch March 25, 2026 03:26
zx06 pushed a commit that referenced this pull request Mar 25, 2026
…nect coalescing

Fix remaining issues from PR #37 (SSH keepalive & auto-reconnect):

1. Eliminate code duplication: ResolveConnection() now uses resolveSSHOptions()
   instead of duplicating SSH options construction (fixes SonarQube 3.1% > 3%)

2. Context-aware SSH Connect: Replace ssh.Dial() with net.DialContext() +
   ssh.NewClientConn() so context cancellation/timeout interrupts both
   TCP connection and SSH handshake phases

3. Reconnect coalescing: Multiple concurrent DialContext/keepalive failures
   trigger a single reconnect instead of racing. Lock released during retry
   loop to avoid blocking other DialContext callers

4. Keepalive recovery: Keepalive monitoring restarts after failed reconnect
   attempts, preventing permanent loss of health detection

5. In-process SSH test server: testutil_test.go provides a real SSH server
   (using golang.org/x/crypto/ssh) for testing connect, keepalive, tunnel
   forwarding, and reconnection — runs on all platforms without Docker
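
Item 2 above (context-aware connect) follows a general pattern: dial with a context, then close the connection if the context ends while the handshake is still blocked on I/O. The sketch below shows that pattern using only the standard library, so the handshake is passed in as a plain function. In the follow-up commit that role is played by ssh.NewClientConn() from golang.org/x/crypto/ssh. `dialAndHandshake` and `handshakeInterrupted` are hypothetical names, and `context.AfterFunc` assumes Go 1.21 or later. In the standard library, DialContext is a method on `net.Dialer`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialAndHandshake dials addr under ctx, then runs handshake on the
// connection. If ctx ends while handshake is blocked, the connection is
// closed, which makes the blocked read or write return an error.
func dialAndHandshake(ctx context.Context, addr string,
	handshake func(net.Conn) error) (net.Conn, error) {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return nil, err
	}
	stop := context.AfterFunc(ctx, func() { conn.Close() })
	err = handshake(conn)
	if !stop() {
		// The callback already started, so ctx is done and conn is closed.
		return nil, ctx.Err()
	}
	if err != nil {
		conn.Close()
		return nil, fmt.Errorf("handshake with %s: %w", addr, err)
	}
	return conn, nil
}

// handshakeInterrupted connects to a local server that never sends a
// banner and reports whether the timeout interrupted the handshake.
func handshakeInterrupted(timeout time.Duration) bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	defer ln.Close()
	go func() { // silent server: accept, then wait for the peer to close
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		c.Read(make([]byte, 1))
	}()
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_, err = dialAndHandshake(ctx, ln.Addr().String(), func(c net.Conn) error {
		_, err := c.Read(make([]byte, 1)) // waits for a banner that never arrives
		return err
	})
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(handshakeInterrupted(200 * time.Millisecond))
}
```

Checking the result of `stop()` covers the case where the context expires just as the handshake returns, so a connection that is already being closed is never handed back as healthy.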

Test coverage: internal/ssh 87.6% → 93.9%, internal/app 85.1% → 85.5%
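
Reconnect coalescing, as described in item 3 of the list above, can be sketched with a mutex that guards only the bookkeeping and is released while the slow reconnect runs. This is an assumption-laden illustration rather than the code in internal/ssh/reconnect.go: `flight`, `coalescer`, `waiting`, and `countReconnects` are hypothetical names, and the `waiters` counter exists only so the demo can tell when every caller has joined.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// flight tracks one in-progress reconnect and the callers waiting on it.
type flight struct {
	done    chan struct{}
	err     error
	waiters int
}

// coalescer lets many callers share a single reconnect attempt.
type coalescer struct {
	mu      sync.Mutex
	current *flight
}

// reconnect runs fn unless a reconnect is already in progress, in which
// case the caller waits for that attempt and shares its result.
func (c *coalescer) reconnect(fn func() error) error {
	c.mu.Lock()
	if f := c.current; f != nil {
		f.waiters++
		c.mu.Unlock()
		<-f.done // f.err is written before done is closed
		return f.err
	}
	f := &flight{done: make(chan struct{})}
	c.current = f
	c.mu.Unlock() // not held during the slow part, so others can join

	if err := fn(); err != nil {
		f.err = fmt.Errorf("reconnect failed: %w", err)
	}

	c.mu.Lock()
	c.current = nil
	c.mu.Unlock()
	close(f.done)
	return f.err
}

// waiting reports how many callers are parked on the current attempt.
func (c *coalescer) waiting() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.current == nil {
		return 0
	}
	return c.current.waiters
}

// countReconnects fires `callers` concurrent requests and returns how
// many times the reconnect function actually ran.
func countReconnects(callers int) int {
	var c coalescer
	var wg sync.WaitGroup
	release := make(chan struct{})
	runs := 0
	fn := func() error {
		runs++
		<-release // hold the attempt open until everyone has joined
		return nil
	}
	for i := 0; i < callers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.reconnect(fn)
		}()
	}
	for c.waiting() < callers-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()
	return runs
}

func main() {
	fmt.Println(countReconnects(8)) // prints 1
}
```

In this sketch a caller that arrives after the attempt has finished would start a new one; a real dialer would first check whether the client is healthy again before reconnecting.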

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
zx06 added a commit that referenced this pull request Mar 25, 2026