Skip to content

feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429

Open
chengxilo wants to merge 12 commits into
apache:masterfrom
chengxilo:client-timeout
Open

feat(sdk): support per-request timeout to TCP/QUIC/WebSocket clients#3429
chengxilo wants to merge 12 commits into
apache:masterfrom
chengxilo:client-timeout

Conversation

@chengxilo

@chengxilo chengxilo commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR address?

Closes #3419

Rationale

Clients had no deadline on individual request, so a stalled server could block a caller forever.

What changed?

send_raw would be blocked forever if server doesn't response. I added a request_timeout field (IggyDuration, default 300s) to TcpClientConfig, QuicClientConfig, and WebSocketClientConfig. Each client's send_raw now wraps its I/O in tokio::time::timeout

Local Execution

  • Passed
  • Pre-commit hooks ran

AI Usage

  1. Claude Opus 4.6
  2. implementation
  3. manually reviewed
  4. Yes

Change Log

By default, Iggy Rust SDK would have a 30s timeout for each request.

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /author - flip to S-waiting-on-author while you finish changes
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 6, 2026
@codecov

codecov Bot commented Jun 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.31834% with 80 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.44%. Comparing base (74d62eb) to head (1c827e8).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
core/sdk/src/quic/quic_client.rs 57.89% 13 Missing and 3 partials ⚠️
core/sdk/src/websocket/websocket_client.rs 75.00% 10 Missing and 5 partials ⚠️
core/sdk/src/client_provider.rs 65.78% 5 Missing and 8 partials ⚠️
core/sdk/src/clients/client_builder.rs 0.00% 12 Missing ⚠️
core/sdk/src/tcp/tcp_client.rs 82.25% 4 Missing and 7 partials ⚠️
...tion/quic_config/quic_connection_string_options.rs 69.23% 2 Missing and 2 partials ⚠️
...ration/tcp_config/tcp_connection_string_options.rs 69.23% 2 Missing and 2 partials ⚠️
core/common/src/types/args/mod.rs 25.00% 1 Missing and 2 partials ⚠️
...cket_config/websocket_connection_string_options.rs 71.42% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3429      +/-   ##
============================================
- Coverage     74.41%   72.44%   -1.98%     
  Complexity      937      937              
============================================
  Files          1243     1243              
  Lines        125987   121394    -4593     
  Branches     101854    97305    -4549     
============================================
- Hits          93756    87942    -5814     
- Misses        29218    30167     +949     
- Partials       3013     3285     +272     
Components Coverage Δ
Rust Core 72.76% <72.31%> (-2.41%) ⬇️
Java SDK 62.44% <ø> (ø)
C# SDK 71.41% <ø> (-0.70%) ⬇️
Python SDK 88.88% <ø> (ø)
PHP SDK 84.29% <ø> (ø)
Node SDK 91.13% <ø> (ø)
Go SDK 40.14% <ø> (ø)
Files with missing lines Coverage Δ
core/common/src/error/iggy_error.rs 100.00% <ø> (ø)
...es/configuration/quic_config/quic_client_config.rs 100.00% <100.00%> (ø)
...guration/quic_config/quic_client_config_builder.rs 36.60% <100.00%> (+7.60%) ⬆️
...ypes/configuration/tcp_config/tcp_client_config.rs 100.00% <100.00%> (ø)
...figuration/tcp_config/tcp_client_config_builder.rs 71.55% <100.00%> (+3.51%) ⬆️
...ration/websocket_config/websocket_client_config.rs 62.35% <100.00%> (+0.90%) ⬆️
...ebsocket_config/websocket_client_config_builder.rs 37.96% <100.00%> (+7.75%) ⬆️
...cket_config/websocket_connection_string_options.rs 58.51% <71.42%> (+0.70%) ⬆️
core/common/src/types/args/mod.rs 45.51% <25.00%> (-0.54%) ⬇️
...tion/quic_config/quic_connection_string_options.rs 69.07% <69.23%> (+0.01%) ⬆️
... and 6 more

... and 121 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread core/common/src/types/configuration/quic_config/quic_client_config.rs Outdated
Comment thread core/sdk/src/tcp/tcp_client.rs Outdated
Comment thread core/sdk/src/websocket/websocket_client.rs Outdated
Comment thread core/sdk/src/tcp/tcp_client.rs
Comment thread core/sdk/src/quic/quic_client.rs Outdated
Comment thread core/common/src/types/configuration/tcp_config/tcp_client_config.rs Outdated
Comment thread core/sdk/src/tcp/tcp_client.rs Outdated
Comment thread core/sdk/src/client_provider.rs Outdated
Comment thread core/common/src/types/configuration/tcp_config/tcp_client_config.rs Outdated
Comment thread core/common/src/types/configuration/tcp_config/tcp_client_config.rs
@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 9, 2026
@chengxilo chengxilo requested a review from hubcio June 17, 2026 01:08
@chengxilo

Copy link
Copy Markdown
Contributor Author

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-waiting-on-author PR is waiting on author response labels Jun 17, 2026
@chengxilo

Copy link
Copy Markdown
Contributor Author

/author

@chengxilo

Copy link
Copy Markdown
Contributor Author

After merging there is something need to be changed.

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 21, 2026
@chengxilo

Copy link
Copy Markdown
Contributor Author

/ready

@github-actions github-actions Bot added S-waiting-on-review PR is waiting on a reviewer and removed S-waiting-on-author PR is waiting on author response labels Jun 21, 2026

@hubcio hubcio left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few findings that don't map onto changed lines:

  • the consumer poll loop error classifier (consumer.rs) only treats Disconnected | Unauthenticated | StaleClient as connection errors, so a RequestTimeout (and NotConnected) falls through as a raw error with can_poll left true and no backoff, on every transport. this is the real root of the quic no-recovery behavior noted in the quic comment. consumer.rs isn't touched by this PR so flagging it here.
  • the PR description says default 300s but the code is 30s everywhere (a later commit reduced it). update the description.
  • no test for the is_zero() no-timeout branch, nor for the vsr session-reset-on-timeout path.
  • the timeout is rust-sdk only; java/.net/python/go/c++/node don't get it - worth filing parity issues later on

if request_timeout.is_zero() {
io.await
} else {
match tokio::time::timeout(request_timeout.get_duration(), io).await {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the timeout wraps the whole io future including stream.lock().await, so the lock-acquire wait counts against the request budget. on master the read deadline was set after lock+write+flush, timing only the response read. since the heartbeat ping shares this same per-transport stream mutex, a request queued behind a stalled in-flight one can hit RequestTimeout on lock-wait alone and then tear down the shared connection, cascading to every queued request. acquire the stream lock outside the timed region and time only the io. same in quic_client.rs and websocket_client.rs.

})?;

if matches!(result, Err(IggyError::Disconnected)) {
if matches!(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this teardown nulls the stream and sets Disconnected directly, bypassing disconnect(), so publish_event(DiagnosticEvent::Disconnected) never fires on a timeout - diagnostic subscribers miss every timeout disconnect. routing the timeout teardown through disconnect() would emit the event from one shared path. same gap on websocket (set_state in the timeout arm) and quic (no set_state at all).

Ok(result) => result,
Err(_) => {
// Reset to prevent response desync on the shared stream.
*stream.lock().await = None;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this inner *stream.lock().await = None is redundant with the outer self.stream.lock().await.take() that already runs for RequestTimeout(_). only the consensus_session reset is unique to this block. dropping the inner null and letting the single outer teardown handle stream + state removes the split-lock window and the duplicate work.

*stream.lock().await = None;
#[cfg(feature = "vsr")]
{
*consensus_session

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resetting consensus_session mints a fresh client_id. replicated metadata ops are deduped server-side on (client_id, session, request), so a mutation retried after a timed-out-but-committed op runs under the new client_id, misses dedup, and can double-apply. send_messages/partition writes are unaffected (deduped by message id). preserve the client_id across the timeout-reconnect, or suppress transparent retry of replicated mutations after a session reset. vsr-gated so not a blocker for the non-vsr build, but worth fixing before vsr ships.

if request_timeout.is_zero() {
io.await
} else {
match tokio::time::timeout(request_timeout.get_duration(), io).await {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike tcp/ws, the quic timeout path resets consensus_session (vsr) but never calls set_state(Disconnected) or closes the connection - state stays Connected. RequestTimeout is also absent from the quic retry list, so a quic client never auto-reconnects after a timeout (tcp/ws recover because their next call sees NotConnected). under a persistent server stall the consumer just re-opens open_bi() and re-times-out every request_timeout, surfacing the error each cycle without recovering. fix: set_state(Disconnected) here too, or add RequestTimeout to the retry list.

request_timeout: parse_duration(&args.request_timeout)?,
}));
}
TransportProtocol::Http => {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the http arm builds HttpClientConfig without request_timeout and sets no reqwest timeout, so --request-timeout is silently dropped when --transport http is used, while it applies to the other three transports. wire it into the reqwest client or reject the combination explicitly.

#[serde(skip_serializing_if = "Option::is_none")]
pub websocket_reconnection_interval: Option<String>,

/// The optional per-request timeout for send/receive operations

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth a note here that request_timeout values "0", "unlimited", "disabled", "none" all map to no timeout (infinite wait), since the is_zero() fast-path skips the timeout entirely. "0" reading as infinite rather than instant is the opposite of what most people expect.

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author PR is waiting on author response

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support per-request timeout to TCP/QUIC/WebSocket clients

2 participants