Skip to content

test(client_server): isolate parallel tests via passive servers + dir…#84

Draft
JustinKovacich wants to merge 1 commit into
mainfrom
fix/client_server_test_fixture_collision
Draft

test(client_server): isolate parallel tests via passive servers + dir…#84
JustinKovacich wants to merge 1 commit into
mainfrom
fix/client_server_test_fixture_collision

Conversation

@JustinKovacich

Copy link
Copy Markdown
Contributor

…ect subscriber registration

The integration tests in tests/client_server.rs failed intermittently under cargo test's default parallel execution — 6 of 11 tests flaked with the exact subset varying run-to-run. Each test passed cleanly in isolation, which pointed at a shared-resource problem rather than a real concurrency bug inside the tests.

Root cause is structural, not application-level:

  • The SOME/IP Service Discovery port 30490 is spec-fixed, and the production SD binding uses SO_REUSEADDR + SO_REUSEPORT so that multiple endpoints on the same host can coexist.
  • Under cargo test, every Server::new and every Client discovery socket joins the same SO_REUSEPORT group at 0.0.0.0:30490. The Linux kernel hash-distributes each incoming SD datagram to exactly one socket in the group. A Subscribe sent by Test-A's Client can therefore be delivered to Test-B's Server, which filters it out by service-id and drops it. Test-A then times out waiting for a subscriber that the kernel routed elsewhere.
  • Unique per-test service_ids alone do not help — they let the wrong server discard the packet cleanly, but the right server still never sees it.

The permanent fix has three parts, all test-side:

  1. Every Server is now built via Server::new_passive. Passive servers bind SD to an ephemeral port instead of 30490 and are therefore not in the SO_REUSEPORT group. This eliminates the cross-test SD delivery lottery entirely. The crate already documents this as the supported pattern for consumers who own their own SD dispatcher (see Server::new_passive rustdoc).

  2. Subscriber bookkeeping is done directly via EventPublisher::register_subscriber, bypassing the SD handshake. A new bind_and_register_client helper binds the client's unicast socket on a test-specific port (via add_endpoint + a throwaway send_to_service, which exercises the unicast-bind side effect without touching the discovery socket) and then registers the bound address on the publisher. No test client joins the 30490 group anymore except the two tests whose assertion is specifically about SD lifecycle.

  3. A recv_next_unicast helper wraps updates.recv() for tests that do still bind discovery (the two SD-lifecycle tests), filtering out any DiscoveryUpdated cross-talk that leaks from other parallel tests' SD traffic. The filter is cheap and deterministic, and keeps the SD-exercising tests meaningful.

Secondary hygiene: each test now carries a unique SERVICE_ID const (0x5B01..0x5B0A) and a unique CLIENT_PORT const. Service-id uniqueness is defense-in-depth against any residual cross-talk; unicast-port uniqueness is required because UnicastSocket sets only SO_REUSEADDR (not SO_REUSEPORT) and colliding binds would fail.

Result: 10 consecutive clean parallel runs of
cargo test --all-features --test client_server
with 11/11 passing. Runtime also drops from ~2.3s (6-failure flake) to ~0.15s because passive servers skip the SD-group bind round-trip and tests no longer wait on wait_for_subscribers. Full suite (lib 409, integration 11, doc 3) all green.

Tests now passing reliably in parallel:

  • test_add_endpoint_and_send_to_service
  • test_client_request_resolves_via_unicast_reply
  • test_client_server_subscribe_and_receive_event
  • test_e2e_protect_on_publish_and_check_on_receive
  • test_multiple_subscribers_receive_events
  • test_subscribe_auto_binds_discovery
  • (the other five tests also updated for consistency; they were passing before but only by luck)

…ect subscriber registration

The integration tests in tests/client_server.rs failed intermittently
under `cargo test`'s default parallel execution — 6 of 11 tests flaked
with the exact subset varying run-to-run. Each test passed cleanly in
isolation, which pointed at a shared-resource problem rather than a
real concurrency bug inside the tests.

Root cause is structural, not application-level:

 * The SOME/IP Service Discovery port 30490 is spec-fixed, and the
   production SD binding uses `SO_REUSEADDR` + `SO_REUSEPORT` so that
   multiple endpoints on the same host can coexist.
 * Under `cargo test`, every `Server::new` and every Client discovery
   socket joins the same `SO_REUSEPORT` group at `0.0.0.0:30490`. The
   Linux kernel hash-distributes each incoming SD datagram to exactly
   one socket in the group. A `Subscribe` sent by Test-A's Client can
   therefore be delivered to Test-B's Server, which filters it out by
   service-id and drops it. Test-A then times out waiting for a
   subscriber that the kernel routed elsewhere.
 * Unique per-test `service_id`s alone do not help — they let the
   *wrong* server discard the packet cleanly, but the *right* server
   still never sees it.

The permanent fix has three parts, all test-side:

1. Every Server is now built via `Server::new_passive`. Passive servers
   bind SD to an ephemeral port instead of 30490 and are therefore not
   in the `SO_REUSEPORT` group. This eliminates the cross-test SD
   delivery lottery entirely. The crate already documents this as the
   supported pattern for consumers who own their own SD dispatcher
   (see `Server::new_passive` rustdoc).

2. Subscriber bookkeeping is done directly via
   `EventPublisher::register_subscriber`, bypassing the SD handshake.
   A new `bind_and_register_client` helper binds the client's unicast
   socket on a test-specific port (via `add_endpoint` + a throwaway
   `send_to_service`, which exercises the unicast-bind side effect
   *without* touching the discovery socket) and then registers the
   bound address on the publisher. No test client joins the 30490
   group anymore except the two tests whose assertion is specifically
   about SD lifecycle.

3. A `recv_next_unicast` helper wraps `updates.recv()` for tests that
   do still bind discovery (the two SD-lifecycle tests), filtering out
   any `DiscoveryUpdated` cross-talk that leaks from other parallel
   tests' SD traffic. The filter is cheap and deterministic, and keeps
   the SD-exercising tests meaningful.

Secondary hygiene: each test now carries a unique `SERVICE_ID` const
(0x5B01..0x5B0A) and a unique `CLIENT_PORT` const. Service-id
uniqueness is defense-in-depth against any residual cross-talk;
unicast-port uniqueness is required because `UnicastSocket` sets only
`SO_REUSEADDR` (not `SO_REUSEPORT`) and colliding `bind`s would fail.

Result: 10 consecutive clean parallel runs of
  `cargo test --all-features --test client_server`
with 11/11 passing. Runtime also drops from ~2.3s (6-failure flake)
to ~0.15s because passive servers skip the SD-group bind round-trip
and tests no longer wait on `wait_for_subscribers`. Full suite
(lib 409, integration 11, doc 3) all green.

Tests now passing reliably in parallel:
  * test_add_endpoint_and_send_to_service
  * test_client_request_resolves_via_unicast_reply
  * test_client_server_subscribe_and_receive_event
  * test_e2e_protect_on_publish_and_check_on_receive
  * test_multiple_subscribers_receive_events
  * test_subscribe_auto_binds_discovery
  * (the other five tests also updated for consistency; they were
     passing before but only by luck)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 20:45
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   96.55%   96.68%   +0.12%     
==========================================
  Files          30       30              
  Lines        7993     7993              
==========================================
+ Hits         7718     7728      +10     
+ Misses        275      265      -10     

see 3 files with indirect coverage changes

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Stabilizes tests/client_server.rs under cargo test parallel execution by avoiding shared SOME/IP-SD port (30490) contention and reducing SD cross-talk between concurrently running tests.

Changes:

  • Switches integration tests to construct servers via Server::new_passive (ephemeral SD port) to avoid SO_REUSEPORT SD datagram hash distribution across tests.
  • Adds helpers to bind client unicast deterministically and register subscribers directly (bind_and_register_client), bypassing the SD subscribe handshake in most tests.
  • Adds recv_next_unicast helper to filter out stray DiscoveryUpdated events in tests that still bind discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/client_server.rs
Comment on lines +84 to +93
client
.add_endpoint(service_id, instance_id, server_addr, client_port)
.await
.expect("add_endpoint failed");
let dummy = Message::<RawPayload>::new_sd(0x0001, &empty_sd_header());
let _ = client.send_to_service(service_id, instance_id, dummy).await;
let client_addr = SocketAddrV4::new(Ipv4Addr::LOCALHOST, client_port);
publisher
.register_subscriber(service_id, instance_id, event_group_id, client_addr)
.await;

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bind_and_register_client ignores the result of the dummy send_to_service call. If unicast binding fails (e.g., port unexpectedly in use) or the UDP send fails, the helper will still register the subscriber and the test will later time out in a much less actionable way. Capture the returned PendingResponse and assert the call succeeded (then drop it) so failures surface immediately and ensure the unicast socket is actually bound before registering the subscriber.

Copilot uses AI. Check for mistakes.
Comment thread tests/client_server.rs
Comment on lines +100 to +102
/// (e.g. tests whose assertion is about the SD lifecycle itself). The
/// outer `Result` is the timeout outcome; the inner `Option` is the
/// stream state. Returns `None` if the stream closes.

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for recv_next_unicast says the function returns an outer Result for the timeout outcome, but the implementation uses tokio::time::timeout(...).await.expect(...) and therefore panics on timeout instead of returning it. Please update the docstring to match the actual behavior, or change the function signature to return a Result<Option<ClientUpdate<_>>, Elapsed> and let callers decide how to handle timeouts.

Suggested change
/// (e.g. tests whose assertion is about the SD lifecycle itself). The
/// outer `Result` is the timeout outcome; the inner `Option` is the
/// stream state. Returns `None` if the stream closes.
/// (e.g. tests whose assertion is about the SD lifecycle itself).
/// Returns `None` if the stream closes, and panics if `timeout`
/// elapses before a non-`DiscoveryUpdated` update arrives.

Copilot uses AI. Check for mistakes.
Comment thread tests/client_server.rs
//! Test-A's Client can land on Test-C's Server socket, Test-C filters
//! it out by service-id, and Test-A's Server never registers the
//! subscriber — the test then times out waiting for a subscriber that
//! was delivered to the wrong process.

Copilot AI Apr 24, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module-level docs describe packets being delivered to the wrong "process", but these integration tests run within a single test process (multiple threads/tasks). It would be more accurate to describe delivery to the wrong socket/server instance (or "test") to avoid confusion when diagnosing SD routing behavior under cargo test parallelism.

Suggested change
//! was delivered to the wrong process.
//! was delivered to the wrong server socket/test instance.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants