#15 - Add Spark socket wire benchmark notes #110
Signed-off-by: Cliff Burdick <cburdick@nvidia.com>
| Filename | Overview |
|---|---|
| SPARK_SOCKET_WIRE_README.md | New documentation file describing Linux namespace setup and PHY-counter verification for Spark socket wire benchmarks; no code logic changes. |
| src/managers/socket/daqiri_socket_mgr.cpp | Three focused fixes: UDP payload size cap at 65507 bytes, dead-connection guard added to send_tx_burst TCP path (conn->running.load()), and TCP connection reuse with live-state check in socket_connect_to_server; removal of self-erase from tcp_rx_loop is compensated by the reuse guard. Burst is freed unconditionally at end of send_tx_burst even when validation short-circuits. |
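The 65507-byte cap in the summary above is consistent with the usual UDP-over-IPv4 limit: the 16-bit IPv4 total-length field allows 65535 bytes, less a 20-byte IPv4 header without options and an 8-byte UDP header. A minimal sketch of that arithmetic and of a pre-send validation pass follows. The helper name `udp_burst_fits` and the constants are hypothetical and are not taken from `daqiri_socket_mgr.cpp`; the sketch also assumes IPv4, which the summary does not state.

```cpp
#include <cstddef>
#include <vector>

// Assumption: IPv4 with no header options.
constexpr std::size_t kMaxIpv4Datagram = 65535;  // 16-bit total-length field
constexpr std::size_t kIpv4HeaderLen = 20;
constexpr std::size_t kUdpHeaderLen = 8;
constexpr std::size_t kMaxUdpPayload =
    kMaxIpv4Datagram - kIpv4HeaderLen - kUdpHeaderLen;
static_assert(kMaxUdpPayload == 65507, "cap named in the review summary");

// Hypothetical helper: true only if every packet in the burst fits in a
// single UDP datagram. A burst that fails this check would be rejected
// before any send call is made.
inline bool udp_burst_fits(const std::vector<std::size_t>& pkt_lens) {
  for (std::size_t len : pkt_lens) {
    if (len > kMaxUdpPayload) {
      return false;
    }
  }
  return true;
}
```

Checking the whole burst before sending means an oversized packet fails the burst as a unit, rather than leaving it partially transmitted.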
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[send_tx_burst] --> B{protocol?}
B -- UDP --> C[send_udp_burst]
C --> D{any pkt_len > 65507?}
D -- yes --> E[return false / CONNECT_FAILURE]
D -- no --> F[sendmmsg loop]
F --> G[update tx_pkts / tx_bytes]
B -- TCP --> H{conn == null OR NOT running?}
H -- yes --> I[CONNECT_FAILURE]
H -- no --> J[send_tcp_burst]
J --> G
G --> K[free_all_packets + free_tx_burst]
E --> K
I --> K
subgraph socket_connect_to_server
L[primary_conn_id set AND IP/port match?] -- yes --> M{connections_ find + running.load?}
M -- alive --> N[return existing conn_id]
M -- dead or missing --> O[create_tcp_client_connection]
L -- no --> O
end
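The reuse guard in the `socket_connect_to_server` subgraph can be sketched as below. This is an illustration under assumptions, not the code from the PR: `Conn`, `ConnTable`, `mark_dead`, and `can_send` are hypothetical names, and only the `running` flag and the "reuse only if found and alive" rule come from the flowchart.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical connection record; `running` mirrors conn->running.load().
struct Conn {
  std::string ip;
  std::uint16_t port = 0;
  std::atomic<bool> running{true};
};

class ConnTable {
 public:
  // Reuse the primary connection only when it targets the same endpoint,
  // is still present in the table, and is still running. Otherwise create
  // a fresh connection and make it the new primary.
  int connect_to_server(const std::string& ip, std::uint16_t port) {
    if (primary_id_ >= 0) {
      auto it = conns_.find(primary_id_);
      if (it != conns_.end() && it->second->ip == ip &&
          it->second->port == port && it->second->running.load()) {
        return primary_id_;
      }
    }
    const int id = next_id_++;
    auto conn = std::make_shared<Conn>();
    conn->ip = ip;
    conn->port = port;
    conns_[id] = conn;
    primary_id_ = id;
    return id;
  }

  // Stand-in for the receive loop noticing a closed socket. The entry is
  // not erased here, so the reuse guard above has to check liveness.
  void mark_dead(int id) {
    auto it = conns_.find(id);
    if (it != conns_.end()) it->second->running.store(false);
  }

  // Send-path guard: a missing or stopped connection must not be used.
  bool can_send(int id) const {
    auto it = conns_.find(id);
    return it != conns_.end() && it->second->running.load();
  }

 private:
  std::unordered_map<int, std::shared_ptr<Conn>> conns_;
  int primary_id_ = -1;
  int next_id_ = 0;
};
```

In this sketch a dead entry stays in the table until something else removes it, which is why both the connect path and the send path check `running` instead of relying on a lookup alone.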
Reviews (2): Last reviewed commit: "#15 - Guard TCP socket connection reuse"
dleshchev
left a comment
Looks like dead connections can still be an issue; also, the .md file needs to be placed somewhere more appropriate.
@@ -0,0 +1,414 @@
# Spark Socket Wire Benchmark Notes
should this file live in tutorials?
CLIENT_IF=enp1s0f0np0
SERVER_IF=enp1s0f1np1

CLIENT_IP=10.250.0.1
Maybe this needs a note that it is device-specific?
This tutorial is specific to Spark, if that's what you mean.
The socket benchmark config must use the namespace IPs. Use a large iteration
count because current `daqiri_bench_socket` treats `iterations: 0` as zero work,
not as "run until --seconds expires".
probably need an issue for that
Added into PR #114
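The `iterations: 0` behavior raised in this thread comes down to how the benchmark loop is bounded. The sketch below contrasts the two readings. It is hypothetical: `ZeroIterations` and `should_continue` are illustrative names, and nothing here is taken from the `daqiri_bench_socket` source.

```cpp
#include <cstdint>

// Two possible meanings for an iteration count of zero.
enum class ZeroIterations {
  kNoWork,         // behavior described in the notes: zero means do nothing
  kUntilDeadline,  // alternative: zero means run until the time limit expires
};

// Hypothetical loop predicate: returns true while another iteration
// should run. `deadline_expired` stands in for the --seconds limit.
inline bool should_continue(std::uint64_t done, std::uint64_t iterations,
                            bool deadline_expired, ZeroIterations policy) {
  if (deadline_expired) {
    return false;
  }
  if (iterations == 0) {
    return policy == ZeroIterations::kUntilDeadline;
  }
  return done < iterations;
}
```

Under the `kNoWork` reading, a time-bounded run needs an iteration count large enough that the deadline is reached first, which matches the workaround in the notes.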
This PR documents how to run Linux networking loopback tests (RDMA/Sockets) on the DGX Spark platform. Without this configuration, the benchmarks are not realistic, because the traffic is looped back through Linux rather than leaving on the wire.