The Cisco TRex download site (trex-tgn.cisco.com) has known SSL certificate issues causing curl exit code 60. Add -k flag to curl commands to skip certificate verification (standard practice for this site, matching official docs' --no-check-certificate). Also update TRex from v3.06 to v3.08 to fix Python 3.12 incompatibility issues on Amazon Linux 2023. https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
Three critical bugs in generate_trex_config():
1. GATEWAY_MAC and TREX_DATA_MAC variables were never set, so the TRex config always got the broadcast MAC (ff:ff:ff:ff:ff:ff), which AWS VPC drops for unicast traffic
2. Gateway MAC retry logic discovered the MAC but never assigned it back to the variable
3. ens6 is bound to vfio-pci after boot, so ip link/neigh commands fail; now uses IMDS for source MAC and temporarily unbinds ENI for gateway MAC discovery via ARP

Also store discovered MACs as script-level variables so run_benchmark_for_config can reuse them without re-discovery.
https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
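For readers following bug 2 above, here is a minimal Python sketch of a retry loop that assigns the discovered MAC back on every attempt. It is not the script's actual code (the script appears to be shell); `parse_neigh_mac`, `discover_gateway_mac`, and the injected `read_neigh` callable are hypothetical names, and the input is assumed to look like `ip neigh show` output.

```python
import re
import time
from typing import Callable, Optional

# Matches the "lladdr aa:bb:cc:dd:ee:ff" field printed by `ip neigh show`.
_LLADDR_RE = re.compile(r"\blladdr\s+((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b", re.IGNORECASE)


def parse_neigh_mac(neigh_output: str, gateway_ip: str) -> Optional[str]:
    """Return the MAC recorded for gateway_ip, or None if it is not resolved yet."""
    for line in neigh_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != gateway_ip:
            continue
        match = _LLADDR_RE.search(line)
        if match:
            return match.group(1).lower()
    return None


def discover_gateway_mac(
    read_neigh: Callable[[], str],  # hypothetical: returns neighbour-table text
    gateway_ip: str,
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry until the gateway MAC resolves; never fall back to broadcast."""
    for attempt in range(attempts):
        # The result is assigned on every attempt, including retries.
        gateway_mac = parse_neigh_mac(read_neigh(), gateway_ip)
        if gateway_mac:
            return gateway_mac
        if attempt < attempts - 1:
            sleep(delay_s)
    raise RuntimeError(f"gateway MAC for {gateway_ip} not found after {attempts} attempts")
```

Raising instead of returning ff:ff:ff:ff:ff:ff makes a failed discovery visible at config time rather than as silently dropped traffic later.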
[CI] Stage: Deploy
Infrastructure ready.

[Perf] Stage: Deploy
Deploying

[Perf] Stage: Instances Ready

[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)
Performance Test Failure (Run 22813349704)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 2",
  "exit_code": 2,
  "timestamp": "2026-03-08T04:00:52.884941Z",
  "trex_instance_id": "i-048452886b6bb870e",
  "dut_instance_id": "i-077e0512fc0b3e003",
  "commit": "62d0d68bca3da8df7054f111a72866767572998f",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22813349704"
}
```
<details><summary>dut-environment.txt</summary>
(failed to collect)
(failed to collect)
=== Interface State ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Ethtool Stats (ens6) ===
=== Interface State ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Ethtool Stats (ens6) ===
=== Interface State ===
=== Interface State ===
Compiling smallvec v1.15.1
warning:
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== DUT instance ready ===
=== TRex user-data starting at Sun Mar 8 03:56:50 UTC 2026 ===
</details>
✅ Integration Tests Passed (Run 22813195574)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log
Root cause: secondary ENI attachments are separate CloudFormation resources that complete after instance boot. User-data loops 60s looking for device-number=1 in IMDS but the ENI isn't attached yet.

Changes:
- Extend ENI wait to 180s in user-data (non-fatal if not found)
- Add wait_and_bind_eni() in orchestrator (Phase 2b) that waits for ENI attachment via SSM after CFN deploy completes
- TRex user-data no longer tries dpdk-devbind.py (not installed on TRex AMI); TRex binds the NIC internally when started
- generate_trex_config() discovers MACs while ENI is in kernel mode, then takes ens6 down so TRex can claim it
- DUT ENI starts in kernel mode; orchestrator binds per config
https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
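The extended, non-fatal ENI wait described above can be sketched as a bounded polling loop. This is an illustrative Python sketch, not the user-data script itself; `wait_for_eni` and the injected `probe` callable are hypothetical, and in practice the probe would be whatever check reports an interface with device-number=1 in IMDS.

```python
import time
from typing import Callable, Optional


def wait_for_eni(
    probe: Callable[[], Optional[str]],  # hypothetical: returns the ENI's MAC once attached
    timeout_s: float = 180.0,            # the commit extends the wait from 60s to 180s
    interval_s: float = 5.0,             # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the secondary ENI shows up, returning None on timeout.

    Returning None instead of raising keeps the wait non-fatal, so a later
    phase (such as the orchestrator) can still retry the attachment check.
    """
    deadline = clock() + timeout_s
    while True:
        mac = probe()
        if mac:
            return mac
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real delays; a caller that gets None can log a warning and continue booting.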
[Perf] Stage: Deploy
Deploying

[CI] Stage: Deploy
Infrastructure ready.

[Perf] Stage: Instances Ready
Performance Test Failure (Run 22813704578)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 2",
  "exit_code": 2,
  "timestamp": "2026-03-08T04:26:26.283458Z",
  "trex_instance_id": "i-0498164fb0a19bb99",
  "dut_instance_id": "i-0f714347ef32816b8",
  "commit": "6fcf835e80fe9e2613dfa449899d18f807bdda6b",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22813704578"
}
```
<details><summary>dut-networking-diag-failure.txt</summary>
=== Interface State ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Ethtool Stats (ens6) ===
=== Interface State ===
Compiling pkg-config v0.3.32
warning:
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== DUT instance ready ===
=== TRex user-data starting at Sun Mar 8 04:21:23 UTC 2026 ===
</details>
TRex AMI doesn't have system dpdk-devbind.py, but TRex ships with dpdk_nic_bind.py and dpdk_setup_ports.py. Use these to unbind the data ENI from kernel ena driver before starting TRex.

Also:
- Increase TRex start timeout from 60s to 120s
- Simplify TRex user-data to not attempt vfio-pci binding (no dpdk-devbind.py available)
- generate_trex_config discovers gateway MAC while ENI is in kernel mode, then uses TRex's tools to bind to vfio-pci
https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
[Perf] Stage: Deploy
Deploying

[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)

✅ Integration Tests Passed (Run 22813702773)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log

[Perf] Stage: Instances Ready
Performance Test Failure (Run 22813963652)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 2",
  "exit_code": 2,
  "timestamp": "2026-03-08T04:43:51.359578Z",
  "trex_instance_id": "i-0a217ae0546879480",
  "dut_instance_id": "i-09685f0e124d98b6f",
  "commit": "fe2e7c1fd1927368985b0ac385ed6814ab88840e",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22813963652"
}
```
<details><summary>dut-networking-diag-failure.txt</summary>
=== Interface State ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Ethtool Stats (ens6) ===
=== Interface State ===
Compiling heck v0.5.0
warning:
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== DUT instance ready ===
=== TRex user-data starting at Sun Mar 8 04:38:48 UTC 2026 ===
</details>
[CI] Stage: Deploy
Infrastructure ready.
Post progress comments to PR at each Phase 4 step so we can see exactly which step fails without needing artifact downloads. https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
[Perf] Stage: Deploy
Deploying

[CI] Stage: Summary
Some tests FAILED (exit code: 1). ARP seeding: kernel /proc/net/arp (automatic)

[Perf] Stage: Instances Ready

Integration Test Failure (Run 22813963326)
Branch:
No failure-summary.json found
receiver-user-data.log (8196 bytes, last 80 lines)
sender-user-data.log (8196 bytes, last 80 lines)
receiver-console-output.log (31750 bytes, last 80 lines)
sender-console-output.log (31762 bytes, last 80 lines)
All instance-logs files

❌ Integration Tests Failed (Run 22813963326)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-test-client.log
Performance Test Failure (Run 22814220033)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 2",
  "exit_code": 2,
  "timestamp": "2026-03-08T05:00:49.585007Z",
  "trex_instance_id": "i-09366c7abf06d88f6",
  "dut_instance_id": "i-07eb3426b4daf7ce5",
  "commit": "e914f9c98c09ae4539c3f05563a855894703eeb5",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22814220033"
}
```
<details><summary>dut-networking-diag-failure.txt</summary>
=== Interface State ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Ethtool Stats (ens6) ===
=== Interface State ===
Compiling pkg-config v0.3.32
warning:
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== DUT instance ready ===
=== TRex user-data starting at Sun Mar 8 04:55:53 UTC 2026 ===
</details>
[CI] Stage: Deploy
Infrastructure ready.
SSM send-command requires valid JSON in --parameters. Commands containing double quotes (wait_and_bind_eni, collect_environment_info, generate_trex_config) produced invalid JSON with the previous string interpolation approach. Use python3 json.dumps to properly escape command strings before passing to aws ssm send-command. This fixes both ssm_run_command and ssm_run_command_fire_and_forget. https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
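To illustrate why string interpolation broke here, the sketch below contrasts a naive template with json.dumps. It assumes the AWS-RunShellScript document, whose parameters take a `commands` list; the function names are hypothetical and the real helpers in the script may be structured differently.

```python
import json


def naive_parameters(command: str) -> str:
    """The failing approach: raw interpolation leaves inner double quotes unescaped."""
    return '{"commands":["%s"]}' % command


def ssm_parameters(command: str) -> str:
    """Build the --parameters value with json.dumps so quotes and backslashes are escaped."""
    return json.dumps({"commands": [command]})


if __name__ == "__main__":
    cmd = 'echo "ens6 bound" && ip -o link show'
    print(naive_parameters(cmd))  # not valid JSON: the inner quotes end the string early
    print(ssm_parameters(cmd))    # valid JSON with the inner quotes escaped as \"
```

The resulting string can then be passed as a single argument to `aws ssm send-command --parameters`, provided the shell invocation itself quotes it correctly.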
[Perf] Stage: Deploy
Deploying

[Perf] Benchmark Diag:

[Perf] Stage: Benchmark (2/4)
Running

[Perf] Stage: Results
[04:40:21] INFO Generating markdown summary...
Performance Test Results - c5n.2xlarge
Commit:
512B packets
64B packets
Failed Configs
Only the first benchmark config succeeds because dut_bind_kernel always does a full unbind→rebind→dhclient cycle, even when ens6 is already on the ena driver. dhclient adds a default route through ens6 that can steal SSM traffic from ens5, breaking all subsequent SSM commands.

Changes:
- Skip unbind/rebind when already bound to ena (just ensure IP)
- Remove dhclient entirely (use static IP from CDK)
- Delete any default route through ens6 after rebind
- Add plain-echo to cleanup pkill list
https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
Same optimization as dut_bind_kernel: with config order kernel,kernel,DPDK,DPDK, the second DPDK config doesn't need to rebind. Also add plain-echo to DPDK cleanup pkill list. https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
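The skip-if-already-bound logic from these two commits can be modelled as a small planning function. This is a hedged Python sketch of the decision only; `plan_bind` and the step strings are hypothetical, and removing the default route whenever one exists through ens6 (not only after a rebind) is an assumption of this sketch.

```python
from typing import List, Sequence


def plan_bind(
    current_driver: str,
    target_driver: str,
    default_route_devs: Sequence[str] = (),
    data_if: str = "ens6",
) -> List[str]:
    """Return the steps needed to move the data ENI to target_driver.

    When the ENI is already on the target driver, the unbind/rebind cycle
    is skipped entirely, which avoids disturbing the management path.
    """
    steps: List[str] = []
    if current_driver != target_driver:
        steps.append(f"unbind from {current_driver}")
        steps.append(f"bind to {target_driver}")
    if target_driver == "ena":
        # Kernel mode: static IP only, no dhclient.
        steps.append("ensure static IP")
        if data_if in default_route_devs:
            # A default route via the data interface can steal SSM traffic.
            steps.append(f"delete default route via {data_if}")
    return steps
```

With the config order kernel, kernel, DPDK, DPDK, the second config of each pair plans no unbind or bind step, which is the optimization both commits describe.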
Performance Test Failure (Run 22885883193)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 1",
  "exit_code": 2,
  "timestamp": "2026-03-10T04:41:06.014198Z",
  "trex_instance_id": "i-03239111e1a98f452",
  "dut_instance_id": "i-0dceba0226a51d376",
  "commit": "884da9464b3df564e5f38766a0edfe86ea42cf4d",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22885883193"
}
```
<details><summary>dut-environment.txt</summary>
=== System Info ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Network Interfaces ===
=== System Info ===
test duration : 0.0 sec
</details>
[CI] Stage: Deploy
Infrastructure ready.

[Perf] Stage: Deploy
Deploying

[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)

✅ Integration Tests Passed (Run 22887847524)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log

[CI] Stage: Deploy
Infrastructure ready.
Performance Test Failure (Run 22887928197)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 2",
  "exit_code": 2,
  "timestamp": "2026-03-10T05:34:22.853856Z",
  "trex_instance_id": "",
  "dut_instance_id": "",
  "commit": "396e0b2a384340d4be5ecac2ee825e738c28ead3",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22887928197"
}
```
<details><summary>Application Logs</summary>
</details>
<details><summary>Network & PCI State</summary>
</details>
<details><summary>Kernel Console (dmesg)</summary>
</details>
[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)

✅ Integration Tests Passed (Run 22887902423)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log
When a previous run leaves PerfTestStack in DELETE_FAILED state (e.g. ENI detachment timeout), the cleanup loop was spinning for 600s waiting for a state that never changes, then cdk deploy failed with "Stack is in DELETE_FAILED state and can not be updated." Fix: detect DELETE_FAILED as a terminal state and retry the destroy up to 3 times. On the final retry, use --retain-resources to skip the stuck ENI attachments (they get cleaned up when instances terminate anyway). https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8
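The retry behaviour described above can be sketched as follows. This is an illustrative Python model, not the workflow's actual cleanup code; `destroy_with_retries`, `get_status`, and `destroy` are hypothetical, with `get_status` assumed to return the stack status string, or None once the stack no longer exists.

```python
import time
from typing import Callable, Optional


def destroy_with_retries(
    get_status: Callable[[], Optional[str]],  # hypothetical: stack status, None when gone
    destroy: Callable[[bool], None],          # hypothetical: destroy(retain_stuck_resources)
    max_attempts: int = 3,
    poll_s: float = 10.0,
    timeout_s: float = 600.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Delete a stack, treating DELETE_FAILED as terminal instead of waiting it out."""
    for attempt in range(1, max_attempts + 1):
        # Only the final attempt retains the stuck resources (the commit
        # uses --retain-resources for the ENI attachments at that point).
        destroy(attempt == max_attempts)
        deadline = clock() + timeout_s
        while clock() < deadline:
            status = get_status()
            if status is None:
                return True   # stack is gone
            if status == "DELETE_FAILED":
                break         # terminal state: stop polling and retry the destroy
            sleep(poll_s)
    return False
```

Breaking out of the poll loop on DELETE_FAILED is what avoids the 600s spin: the state cannot change by itself, so the only useful action is another destroy.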
[Perf] Stage: Deploy
Deploying

[CI] Stage: Deploy
Infrastructure ready.

[Perf] Stage: Instances Ready

[Perf] Stage: TRex Config
Starting TRex configuration (MAC discovery + NIC binding)...

[Perf] Stage: TRex Config OK

[Perf] Stage: TRex Started
TRex server running. Beginning benchmarks...

[Perf] DUT Ready
DUT instance

[Perf] Stage: Benchmark (1/4)
Running

[Perf] Benchmark Diag:

[CI] Stage: Summary
All tests PASSED. ARP seeding: kernel /proc/net/arp (automatic)

[Perf] Benchmark Diag:

[Perf] Stage: Benchmark (2/4)
Running

✅ Integration Tests Passed (Run 22889338303)
Branch:
Test Results
Application Logs: receiver-echo-server.log, sender-echo-server.log, sender-test-client.log, receiver-test-client-iperf.log, sender-test-client-iperf.log

[Perf] Stage: Results
[06:26:06] INFO Generating markdown summary...
Performance Test Results - c5n.2xlarge
Commit:
512B packets
64B packets
Failed Configs
Performance Test Failure (Run 22889409785)
Branch:
failure-summary.json
```
{
  "failed_step": "perf-test",
  "error": "Script exited with code 1",
  "exit_code": 2,
  "timestamp": "2026-03-10T06:26:55.643679Z",
  "trex_instance_id": "i-0d51da559c22cf2ab",
  "dut_instance_id": "i-00c895a1725fd48ec",
  "commit": "9831d63b8219fe46bef9da94156e82d40d947b47",
  "run_url": "https://github.com/gspivey/dpdk-stdlib-rust/actions/runs/22889409785"
}
```
<details><summary>dut-environment.txt</summary>
=== System Info ===
Network devices using kernel driver
0000:00:05.0 'Elastic Network Adapter (ENA) ec20' if=ens5 drv=ena unused=vfio-pci Active
No 'Baseband' devices detected
No 'Crypto' devices detected
No 'DMA' devices detected
No 'Eventdev' devices detected
No 'Mempool' devices detected
No 'Compress' devices detected
No 'Misc (rawdev)' devices detected
No 'Regex' devices detected
=== Network Interfaces ===
=== System Info ===
Cpu Utilization : 0.0 %
Expected-PPS : 0.00 pps
Active-flows : 0 Clients : 0 Socket-util : 0.0000 %
</details>
Summary
- -k flag to curl
- generate_trex_config() where GATEWAY_MAC and TREX_DATA_MAC were never set, causing TRex to use broadcast MAC which AWS VPC drops

Test plan
https://claude.ai/code/session_015n3n3sffdZqXVnQSudVRH8