test(client_server): isolate parallel tests via passive servers + dir…#84
test(client_server): isolate parallel tests via passive servers + dir…#84JustinKovacich wants to merge 1 commit into
Conversation
…ect subscriber registration
The integration tests in tests/client_server.rs failed intermittently
under `cargo test`'s default parallel execution — 6 of 11 tests flaked
with the exact subset varying run-to-run. Each test passed cleanly in
isolation, which pointed at a shared-resource problem rather than a
real concurrency bug inside the tests.
Root cause is structural, not application-level:
* The SOME/IP Service Discovery port 30490 is spec-fixed, and the
production SD binding uses `SO_REUSEADDR` + `SO_REUSEPORT` so that
multiple endpoints on the same host can coexist.
* Under `cargo test`, every `Server::new` and every Client discovery
socket joins the same `SO_REUSEPORT` group at `0.0.0.0:30490`. The
Linux kernel hash-distributes each incoming SD datagram to exactly
one socket in the group. A `Subscribe` sent by Test-A's Client can
therefore be delivered to Test-B's Server, which filters it out by
service-id and drops it. Test-A then times out waiting for a
subscriber that the kernel routed elsewhere.
* Unique per-test `service_id`s alone do not help — they let the
*wrong* server discard the packet cleanly, but the *right* server
still never sees it.
The permanent fix has three parts, all test-side:
1. Every Server is now built via `Server::new_passive`. Passive servers
bind SD to an ephemeral port instead of 30490 and are therefore not
in the `SO_REUSEPORT` group. This eliminates the cross-test SD
delivery lottery entirely. The crate already documents this as the
supported pattern for consumers who own their own SD dispatcher
(see `Server::new_passive` rustdoc).
2. Subscriber bookkeeping is done directly via
`EventPublisher::register_subscriber`, bypassing the SD handshake.
A new `bind_and_register_client` helper binds the client's unicast
socket on a test-specific port (via `add_endpoint` + a throwaway
`send_to_service`, which exercises the unicast-bind side effect
*without* touching the discovery socket) and then registers the
bound address on the publisher. No test client joins the 30490
group anymore except the two tests whose assertion is specifically
about SD lifecycle.
3. A `recv_next_unicast` helper wraps `updates.recv()` for tests that
do still bind discovery (the two SD-lifecycle tests), filtering out
any `DiscoveryUpdated` cross-talk that leaks from other parallel
tests' SD traffic. The filter is cheap and deterministic, and keeps
the SD-exercising tests meaningful.
Secondary hygiene: each test now carries a unique `SERVICE_ID` const
(0x5B01..0x5B0A) and a unique `CLIENT_PORT` const. Service-id
uniqueness is defense-in-depth against any residual cross-talk;
unicast-port uniqueness is required because `UnicastSocket` sets only
`SO_REUSEADDR` (not `SO_REUSEPORT`) and colliding `bind`s would fail.
Result: 10 consecutive clean parallel runs of
`cargo test --all-features --test client_server`
with 11/11 passing. Runtime also drops from ~2.3s (6-failure flake)
to ~0.15s because passive servers skip the SD-group bind round-trip
and tests no longer wait on `wait_for_subscribers`. Full suite
(lib 409, integration 11, doc 3) all green.
Tests now passing reliably in parallel:
* test_add_endpoint_and_send_to_service
* test_client_request_resolves_via_unicast_reply
* test_client_server_subscribe_and_receive_event
* test_e2e_protect_on_publish_and_check_on_receive
* test_multiple_subscribers_receive_events
* test_subscribe_auto_binds_discovery
* (the other five tests also updated for consistency; they were
passing before but only by luck)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## main #84 +/- ##
==========================================
+ Coverage 96.55% 96.68% +0.12%
==========================================
Files 30 30
Lines 7993 7993
==========================================
+ Hits 7718 7728 +10
+ Misses 275 265 -10 |
There was a problem hiding this comment.
Pull request overview
Stabilizes tests/client_server.rs under cargo test parallel execution by avoiding shared SOME/IP-SD port (30490) contention and reducing SD cross-talk between concurrently running tests.
Changes:
- Switches integration tests to construct servers via
Server::new_passive(ephemeral SD port) to avoidSO_REUSEPORTSD datagram hash distribution across tests. - Adds helpers to bind client unicast deterministically and register subscribers directly (
bind_and_register_client), bypassing the SD subscribe handshake in most tests. - Adds
recv_next_unicasthelper to filter out strayDiscoveryUpdatedevents in tests that still bind discovery.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| client | ||
| .add_endpoint(service_id, instance_id, server_addr, client_port) | ||
| .await | ||
| .expect("add_endpoint failed"); | ||
| let dummy = Message::<RawPayload>::new_sd(0x0001, &empty_sd_header()); | ||
| let _ = client.send_to_service(service_id, instance_id, dummy).await; | ||
| let client_addr = SocketAddrV4::new(Ipv4Addr::LOCALHOST, client_port); | ||
| publisher | ||
| .register_subscriber(service_id, instance_id, event_group_id, client_addr) | ||
| .await; |
There was a problem hiding this comment.
bind_and_register_client ignores the result of the dummy send_to_service call. If unicast binding fails (e.g., port unexpectedly in use) or the UDP send fails, the helper will still register the subscriber and the test will later time out in a much less actionable way. Capture the returned PendingResponse and assert the call succeeded (then drop it) so failures surface immediately and ensure the unicast socket is actually bound before registering the subscriber.
| /// (e.g. tests whose assertion is about the SD lifecycle itself). The | ||
| /// outer `Result` is the timeout outcome; the inner `Option` is the | ||
| /// stream state. Returns `None` if the stream closes. |
There was a problem hiding this comment.
The doc comment for recv_next_unicast says the function returns an outer Result for the timeout outcome, but the implementation uses tokio::time::timeout(...).await.expect(...) and therefore panics on timeout instead of returning it. Please update the docstring to match the actual behavior, or change the function signature to return a Result<Option<ClientUpdate<_>>, Elapsed> and let callers decide how to handle timeouts.
| /// (e.g. tests whose assertion is about the SD lifecycle itself). The | |
| /// outer `Result` is the timeout outcome; the inner `Option` is the | |
| /// stream state. Returns `None` if the stream closes. | |
| /// (e.g. tests whose assertion is about the SD lifecycle itself). | |
| /// Returns `None` if the stream closes, and panics if `timeout` | |
| /// elapses before a non-`DiscoveryUpdated` update arrives. |
| //! Test-A's Client can land on Test-C's Server socket, Test-C filters | ||
| //! it out by service-id, and Test-A's Server never registers the | ||
| //! subscriber — the test then times out waiting for a subscriber that | ||
| //! was delivered to the wrong process. |
There was a problem hiding this comment.
The module-level docs describe packets being delivered to the wrong "process", but these integration tests run within a single test process (multiple threads/tasks). It would be more accurate to describe delivery to the wrong socket/server instance (or "test") to avoid confusion when diagnosing SD routing behavior under cargo test parallelism.
| //! was delivered to the wrong process. | |
| //! was delivered to the wrong server socket/test instance. |
…ect subscriber registration
The integration tests in tests/client_server.rs failed intermittently under
cargo test's default parallel execution — 6 of 11 tests flaked with the exact subset varying run-to-run. Each test passed cleanly in isolation, which pointed at a shared-resource problem rather than a real concurrency bug inside the tests.Root cause is structural, not application-level:
SO_REUSEADDR+SO_REUSEPORTso that multiple endpoints on the same host can coexist.cargo test, everyServer::newand every Client discovery socket joins the sameSO_REUSEPORTgroup at0.0.0.0:30490. The Linux kernel hash-distributes each incoming SD datagram to exactly one socket in the group. ASubscribesent by Test-A's Client can therefore be delivered to Test-B's Server, which filters it out by service-id and drops it. Test-A then times out waiting for a subscriber that the kernel routed elsewhere.service_ids alone do not help — they let the wrong server discard the packet cleanly, but the right server still never sees it.The permanent fix has three parts, all test-side:
Every Server is now built via
Server::new_passive. Passive servers bind SD to an ephemeral port instead of 30490 and are therefore not in theSO_REUSEPORTgroup. This eliminates the cross-test SD delivery lottery entirely. The crate already documents this as the supported pattern for consumers who own their own SD dispatcher (seeServer::new_passiverustdoc).Subscriber bookkeeping is done directly via
EventPublisher::register_subscriber, bypassing the SD handshake. A newbind_and_register_clienthelper binds the client's unicast socket on a test-specific port (viaadd_endpoint+ a throwawaysend_to_service, which exercises the unicast-bind side effect without touching the discovery socket) and then registers the bound address on the publisher. No test client joins the 30490 group anymore except the two tests whose assertion is specifically about SD lifecycle.A
recv_next_unicasthelper wrapsupdates.recv()for tests that do still bind discovery (the two SD-lifecycle tests), filtering out anyDiscoveryUpdatedcross-talk that leaks from other parallel tests' SD traffic. The filter is cheap and deterministic, and keeps the SD-exercising tests meaningful.Secondary hygiene: each test now carries a unique
SERVICE_IDconst (0x5B01..0x5B0A) and a uniqueCLIENT_PORTconst. Service-id uniqueness is defense-in-depth against any residual cross-talk; unicast-port uniqueness is required becauseUnicastSocketsets onlySO_REUSEADDR(notSO_REUSEPORT) and collidingbinds would fail.Result: 10 consecutive clean parallel runs of
cargo test --all-features --test client_serverwith 11/11 passing. Runtime also drops from ~2.3s (6-failure flake) to ~0.15s because passive servers skip the SD-group bind round-trip and tests no longer wait on
wait_for_subscribers. Full suite (lib 409, integration 11, doc 3) all green.Tests now passing reliably in parallel: