This driver allows per-CPU-core latency testing against PCIe devices.
It is designed so device-specific operations can be integrated easily to extend its testing capabilities. Currently, it supports IOMem region read testing on any available BAR of the device, and it can time interrupt dispatch/servicing by enabling and generating interrupts on supported cards.
The concept was originally inspired by pcie-lat by Andre Richter (https://github.com/andre-richter/pcie-lat), which offered only IOMem read latency testing and did not take SMT systems or per-core latency to the device into account.
This driver does NOT auto-claim all PCI devices. It claims only a single, explicitly specified device per module instance, and it requires that you unload the driver currently bound to that device.
- Unload the device driver for the device you wish to test. Example:

  echo 0000:1a:00.0 > /sys/bus/pci/devices/0000\:1a\:00.0/driver/unbind

- Load the test driver:

  cd tests/
  ./load_driver.sh 1a:00.0
The driver can be loaded manually with insmod, but the load_driver.sh script will automatically look up the PCI vendor/device ID for you, which is required when loading.
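If you prefer a manual load, the vendor/device ID can be obtained with lspci. Note that the module parameter name used below (ids=) is only an assumption for illustration; load_driver.sh shows the exact argument the module expects.

# Get the numeric vendor:device ID for the target BDF
lspci -n -s 1a:00.0
# Manual load; "ids=8086:10d3" is a hypothetical parameter name --
# check load_driver.sh for the real one it passes to insmod
insmod pcilatdriver.ko ids=8086:10d3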
Once loaded, you should see the detected capabilities in dmesg. If interrupt testing is supported for the device, that is reported in the dmesg output as well.
[ 2064.298554] pcilatdriver: Driver loaded, initializing... CPU: 2
[ 2064.298672] pcilatdriver: Driver initialized with device ID(s) specified: 8086:10d3
[ 2064.298676] pcilatdriver: Adding 8086:10D3 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 2064.298714] pcilatdriver: Probing device 1a:00.0
[ 2064.299777] pcilatdriver 0000:1a:00.0: Device is [Intel e1000e (1GbE)] - Interrupt timing supported!
[ 2064.299814] pcilatdriver 0000:1a:00.0: Scanning for BARs with IORESOURCE_MEM only...
[ 2064.299825] pcilatdriver 0000:1a:00.0: Found usable BAR: 0
[ 2064.299840] pcilatdriver 0000:1a:00.0: Found usable BAR: 3
[ 2064.300057] pcilatdriver 0000:1a:00.0: Requesting IRQ vectors, min: 1 max: 4 flags=0x6
[ 2064.300222] pcilatdriver 0000:1a:00.0: Received 4 vectors
[ 2064.300224] pcilatdriver 0000:1a:00.0: Registering vector 0: IRQ 69
[ 2064.300272] pcilatdriver 0000:1a:00.0: Registering vector 1: IRQ 70
[ 2064.300306] pcilatdriver 0000:1a:00.0: Registering vector 2: IRQ 71
[ 2064.300339] pcilatdriver 0000:1a:00.0: Registering vector 3: IRQ 76
[ 2064.300376] pcilatdriver 0000:1a:00.0: Device now claimed by pcilatdriver
Once the driver is loaded you should now have /sys/class/pcilatdriver available. Change directory to the device under this path. Example:
[root@customer 1a:00.0]# pwd;ls
/sys/class/pcilatdriver/1a:00.0
device  pcilat_devreg  pcilat_interrupt  pcilat_iomem  power  subsystem  uevent
[root@customer 1a:00.0]#
From here you can trigger the tests by writing to the pcilat_* nodes under the device. Reading the nodes will give the available commands/options, as well as a small portion of the sample set for any tests run.
A larger set of the data (100,000 entries) can be read from /dev/pcilatdriver/$device/{iomem,interrupt,devreg}
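If you want to post-process the data elsewhere, a simple snapshot of the character device is enough; the redirect below just captures whatever the node exposes (example BDF shown).

# Capture the full sample buffer for offline analysis
cat /dev/pcilatdriver/1a\:00.0/iomem > iomem_samples.txt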
You can perform IOMem read latency testing against the readable memory regions of PCI devices. The driver shows which BARs are available for the test, both in dmesg and in the output of pcilat_iomem:
root@localhost # cat pcilat_iomem
IOMem Read Latency Data
Last run: bar=-1 iterations=0 ndelay_ns=200 completed_cpus=0 errors=0
Available BARs: 0 3
Sample buffer: count=0 capacity=100000 dropped=0
Commands: run [bar] [iterations] [ndelay_ns] | clear
Summary format: cpu samples avg_us min_us max_us
No samples collected. Run: echo 'run [bar] [iterations] [ndelay_ns]' > pcilat_iomem
root@localhost #
root@localhost # echo "run 3 1000 800" >pcilat_iomem
The results are written to a ring buffer that currently holds 100,000 entries. They can be read from pcilat_iomem, but the output there is truncated to 4000 entries; for the full 100,000 entries, read /dev/pcilatdriver/$device/iomem.
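As a quick workflow sketch using only the documented run and clear commands, you can sweep a few inter-read delays on the same BAR. The BAR number, iteration count, and delay values below are just example inputs, and the sleep is a precaution in case a run completes asynchronously.

# Sweep a few ndelay_ns values on BAR 3 and keep each summary
for d in 200 800 3200; do
    echo clear > pcilat_iomem
    echo "run 3 1000 $d" > pcilat_iomem
    sleep 1   # allow the run to finish if it is not synchronous
    cat pcilat_iomem > iomem_summary_${d}ns.txt
done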
Use the included Python scripts in tools/ to parse or plot the data.
The parsing script displays the data in a clean readable format on the CLI.
$ ./parse_iomem_latency.py /dev/pcilatdriver/1a\:00.0/iomem
Total samples: 4000
Per-CPU IOMem Read Latency
--------------------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
0 1000 2.000 2.229 2.125 2.417 2.542 11.000
1 1000 1.875 1.980 1.958 2.333 2.542 5.958
2 1000 2.250 2.712 2.792 2.877 3.501 12.959
3 1000 2.250 2.498 2.500 2.544 3.083 10.500
Per-BAR IOMem Read Latency
--------------------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
3 4000 1.875 2.355 2.334 2.833 3.125 12.959
Per CPU+BAR IOMem Read Latency
------------------------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
c0/b3 1000 2.000 2.229 2.125 2.417 2.542 11.000
c1/b3 1000 1.875 1.980 1.958 2.333 2.542 5.958
c2/b3 1000 2.250 2.712 2.792 2.877 3.501 12.959
c3/b3 1000 2.250 2.498 2.500 2.544 3.083 10.500
The plotting script generates a .png heatmap image. It can also produce an HTML page containing the data (similar to the CLI output) with the heatmap embedded in the page. For more information, use plot_iomem_latency.py -h.
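A typical invocation would mirror the parse script and take the character device path as input; the flags controlling PNG vs. HTML output aren't documented here, so treat this as a sketch and check -h for the real options.

# Generate the heatmap from the full sample set (output options per -h)
./plot_iomem_latency.py /dev/pcilatdriver/1a\:00.0/iomem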
Timing interrupts can be difficult because enabling interrupts is entirely device-specific. This portion is still under development, but it is implemented for the Intel e1000e. The driver has not yet been tested on physical hardware, as current development is done on VMware with an emulated e1000e device. VMware does not properly raise interrupts unless traffic reaches the RX and TX rings of the NIC, so raising an interrupt requires a specific write directly to the IRQ.
As I gain access to more hardware, I will continue extending driver support for additional cards. If you have hardware you can lend me via remote login, I'm more than happy to write support for additional devices.
# echo 1 >pcilat_interrupt
# cat pcilat_interrupt
Interrupt Latency Data (0 samples)
Enabled: 1
Batch iterations: 1000
Dropped trigger timestamps: 0
Vectors: 4 | IRQs: 69 70 71 76
Set CPU affinity: echo <cpu> > /proc/irq/<IRQ>/smp_affinity_list
Commands: 1 | 0 | trigger | swtrigger [cpu] [irq|vector_index] | swmatrix [iters] [irq_mask_hex] | swloop [iters] [irq_mask_hex] | iterations [n] | clear
Format: handler_entry_ns trigger_to_handler_ns entry_to_ack_ns cpu irq
No samples collected yet
On physical hardware, you should start to see interrupts being raised in /proc/interrupts and data propagating to /dev/pcilatdriver/$device/interrupt. If not, there are two ways to get interrupts raised:
- trigger - Triggers a hardware-level interrupt on the device. This does not work on VMware / emulated hardware.

- swmatrix | swloop - These manually inject into the IRQ to force an interrupt request to be raised. The irq_mask_hex argument is a mask of the IRQ vectors and is used to raise requests on specific IRQs. For example, if there are four IRQs, 0x0 - 0x3 refer to each individual IRQ registered to the device, while 0xf raises interrupts on every IRQ.
Example run:
# echo "swloop 1000 0xf" >pcilat_interrupt
# cat /proc/interrupts | grep -i pcila
69: 1000 1000 1000 1000 MSI 13631488 Edge pcilatdriver
70: 1000 1000 1000 1000 MSI 13631489 Edge pcilatdriver
71: 1000 1000 1000 1000 MSI 13631490 Edge pcilatdriver
76: 1000 1000 1000 1000 MSI 13631491 Edge pcilatdriver
#
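To make the per-CPU comparison meaningful, you can first pin each vector to its own core using the smp_affinity_list hint printed by pcilat_interrupt above. The IRQ numbers below are the ones this example device received in dmesg; substitute your own.

# Pin each registered vector to a dedicated CPU core, then trigger
echo 0 > /proc/irq/69/smp_affinity_list
echo 1 > /proc/irq/70/smp_affinity_list
echo 2 > /proc/irq/71/smp_affinity_list
echo 3 > /proc/irq/76/smp_affinity_list
echo "swloop 1000 0xf" > pcilat_interrupt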
The scripts under tools (parse_irq_latency.py and plot_irq_latency.py) function the same as for the IOMem tests.
$ ./parse_irq_latency.py /dev/pcilatdriver/1a\:00.0/interrupt
Total samples: 16000
Metrics Measured
----------------
1. trigger_to_handler_us
Time from trigger timestamp capture to irq_handler entry.
This includes interrupt delivery/scheduling latency.
2. entry_to_ack_us
Time inside irq_handler from entry until ack completes.
This excludes circular-buffer logging overhead by design.
CPU Performance Comparison (lower is better)
--------------------------------------------
CPU Samples TrigAvg(us) TrigP95(us) vsBest% TrigAvgBar AckAvg(us) AckP95(us) AckBar
0 4000 7.427 16.825 3879.9% ############################ 1.433 2.291 ###########################.
1 4000 6.888 15.159 3591.0% #########################... 1.459 2.332 ###########################.
2 4000 6.904 15.828 3599.2% ##########################.. 1.466 2.291 ############################
3 4000 0.187 0.500 0.0% ............................ 1.309 2.165 ########################....
Fastest CPUs (trigger_to_handler_us avg)
-----------------------------------------
1. CPU 3: avg=0.187us p95=0.500us min=0.041us max=4.206us
2. CPU 1: avg=6.888us p95=15.159us min=1.375us max=60.263us
3. CPU 2: avg=6.904us p95=15.828us min=1.333us max=113.862us
4. CPU 0: avg=7.427us p95=16.825us min=1.541us max=145.514us
Slowest CPUs (trigger_to_handler_us avg)
-----------------------------------------
1. CPU 0: avg=7.427us p95=16.825us min=1.541us max=145.514us
2. CPU 2: avg=6.904us p95=15.828us min=1.333us max=113.862us
3. CPU 1: avg=6.888us p95=15.159us min=1.375us max=60.263us
4. CPU 3: avg=0.187us p95=0.500us min=0.041us max=4.206us
Per-CPU Latency
---------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
0 4000 1.541 7.427 4.373 16.825 25.238 145.514
1 4000 1.375 6.888 4.623 15.159 22.530 60.263
2 4000 1.333 6.904 4.706 15.828 24.571 113.862
3 4000 0.041 0.187 0.125 0.500 0.875 4.206
Per-CPU Entry-to-Ack Latency
----------------------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
0 4000 0.916 1.433 1.250 2.291 2.998 12.535
1 4000 0.958 1.459 1.333 2.332 2.999 13.868
2 4000 0.957 1.466 1.333 2.291 2.957 24.988
3 4000 0.875 1.309 1.166 2.165 2.832 9.620
Per-IRQ Latency
---------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
69 4000 0.041 10.194 10.912 21.284 29.446 145.514
70 4000 0.041 3.957 4.373 7.787 14.368 33.234
71 4000 0.041 3.695 3.957 7.622 14.451 35.150
76 4000 0.041 3.561 3.831 7.538 13.869 127.231
Per CPU+IRQ Latency
-------------------
Key Count Min(us) Avg(us) P50(us) P95(us) P99(us) Max(us)
c0/i69 1000 2.457 15.184 14.077 23.780 32.027 145.514
c0/i70 1000 2.915 4.926 4.290 8.471 14.454 24.947
c0/i71 1000 1.541 4.848 4.081 9.372 18.158 31.027
c0/i76 1000 1.541 4.752 3.915 7.790 14.579 127.231
c1/i69 1000 4.039 12.764 11.536 21.865 31.862 60.263
c1/i70 1000 3.331 5.293 4.623 7.996 14.621 22.198
c1/i71 1000 2.374 4.841 4.248 8.046 14.829 21.698
c1/i76 1000 1.375 4.656 3.998 7.705 13.869 26.196
c2/i69 1000 3.540 12.569 10.620 22.655 31.526 113.862
c2/i70 1000 3.290 5.424 4.664 8.758 15.785 33.234
c2/i71 1000 1.457 4.941 4.248 9.206 14.869 35.150
c2/i76 1000 1.333 4.681 3.999 7.622 14.161 19.991
c3/i69 1000 0.041 0.258 0.166 0.583 1.499 4.206
c3/i70 1000 0.041 0.183 0.125 0.458 0.792 3.582
c3/i71 1000 0.041 0.149 0.084 0.417 0.791 3.582
c3/i76 1000 0.041 0.156 0.083 0.458 0.750 3.915
The driver also allows reading and setting registers on the PCI device. This is particularly useful when adding support for new cards, for example when testing interrupt enabling.
The pcilat_devreg node allows reading a register by simply writing the register address to it:
# echo 0xD0 >pcilat_devreg
# cat pcilat_devreg
DevReg log (1 entries; full history via /dev/pcilatdriver/<bdf>/devreg):
[2078202165814] read 0xd0 = 0x0080e005 rc=0
#
For writing, simply echo the register and the value as two arguments.
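For example, to write back the value read above, something like the following should work; the exact formatting (hex prefixes, separator) is an assumption based on the two-argument description, so adjust it to what the driver accepts.

# Hypothetical write: register address followed by the value, space-separated
echo "0xD0 0x0080e005" > pcilat_devreg
cat pcilat_devreg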