PCIe Latency Testing Driver

This driver allows per-CPU core latency testing against PCIe devices.

It provides extended testing capabilities with easy integration of device-specific operations. Currently, it supports IOMem region read testing on any available BAR of the device, and it can time interrupt dispatch service latency by enabling and generating interrupts on supported cards.

The concept was originally inspired by pcie-lat by Andre Richter (https://github.com/andre-richter/pcie-lat), which offered only IOMem read latency testing and did not account for SMT systems or per-core latency testing against the device.

Loading the driver

This driver does NOT auto-claim all PCI devices. It claims only a single, explicitly specified device per module instance, and it requires that you first unbind the driver currently bound to that device.

  • Unbind the current driver from the device you wish to test. Example: echo 0000:1a:00.0 >/sys/bus/pci/devices/0000\:1a\:00.0/driver/unbind

  • Load the test driver: cd tests/ && ./load_driver.sh 1a:00.0

The driver can also be loaded manually with insmod, but the load_driver.sh script will automatically look up the PCI vendor and device IDs for you, which are required when loading.
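
If you want to script these steps, the sketch below shows roughly what needs to be gathered before loading: it unbinds the current driver and reads the vendor and device IDs from standard sysfs attributes. This is a minimal sketch, not the actual load_driver.sh, and the final insmod invocation is intentionally left out because its module parameters are not shown here.

#!/usr/bin/env python3
# Minimal sketch (not load_driver.sh): unbind the bound driver and collect the
# vendor:device IDs that pcilatdriver needs at load time.
import sys
from pathlib import Path

def prepare(bdf):
    dev = Path("/sys/bus/pci/devices") / bdf            # e.g. "0000:1a:00.0"
    driver = dev / "driver"
    if driver.exists():
        # Release the device from whatever driver currently owns it.
        (driver / "unbind").write_text(bdf)
    vendor = int(dev.joinpath("vendor").read_text(), 16)   # e.g. 0x8086
    device = int(dev.joinpath("device").read_text(), 16)   # e.g. 0x10d3
    return f"{vendor:04x}:{device:04x}"                  # e.g. "8086:10d3"

if __name__ == "__main__":
    print(prepare(sys.argv[1]))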

Once loaded, you should see the detected capabilities in dmesg. If interrupt testing is supported for the device, that will also be indicated in the dmesg output.

[ 2064.298554] pcilatdriver: Driver loaded, initializing... CPU: 2
[ 2064.298672] pcilatdriver: Driver initialized with device ID(s) specified: 8086:10d3
[ 2064.298676] pcilatdriver: Adding 8086:10D3 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[ 2064.298714] pcilatdriver: Probing device 1a:00.0
[ 2064.299777] pcilatdriver 0000:1a:00.0: Device is [Intel e1000e (1GbE)] - Interrupt timing supported!
[ 2064.299814] pcilatdriver 0000:1a:00.0: Scanning for BARs with IORESOURCE_MEM only...
[ 2064.299825] pcilatdriver 0000:1a:00.0: Found usable BAR: 0
[ 2064.299840] pcilatdriver 0000:1a:00.0: Found usable BAR: 3
[ 2064.300057] pcilatdriver 0000:1a:00.0: Requesting IRQ vectors, min: 1 max: 4 flags=0x6
[ 2064.300222] pcilatdriver 0000:1a:00.0: Received 4 vectors
[ 2064.300224] pcilatdriver 0000:1a:00.0: Registering vector 0: IRQ 69
[ 2064.300272] pcilatdriver 0000:1a:00.0: Registering vector 1: IRQ 70
[ 2064.300306] pcilatdriver 0000:1a:00.0: Registering vector 2: IRQ 71
[ 2064.300339] pcilatdriver 0000:1a:00.0: Registering vector 3: IRQ 76
[ 2064.300376] pcilatdriver 0000:1a:00.0: Device now claimed by pcilatdriver

Usage

Once the driver is loaded, /sys/class/pcilatdriver should be available. Change directory to the device under this path. Example:

[root@customer 1a:00.0]# pwd;ls
/sys/class/pcilatdriver/1a:00.0
device  pcilat_devreg  pcilat_interrupt  pcilat_iomem  power  subsystem  uevent
[root@customer 1a:00.0]#

From here you can trigger the tests by writing to the pcilat_* nodes under the device. Reading the nodes will give the available commands/options, as well as a small portion of the sample set for any tests run.

A larger set of the data (up to 100,000 entries) can be read from /dev/pcilatdriver/$device/{iomem,interrupt,devreg}.
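
For scripted runs, a small helper along these lines is enough to drive the sysfs nodes: write a command string to a pcilat_* node, then read the node back for the command list and the truncated summary. This is a minimal sketch based on the node names shown in this README, not a bundled tool.

from pathlib import Path

SYSFS = Path("/sys/class/pcilatdriver")

def pcilat_cmd(bdf, node, command=None):
    """Optionally write a command to a pcilat_* node, then return its contents."""
    path = SYSFS / bdf / node            # e.g. SYSFS / "1a:00.0" / "pcilat_iomem"
    if command is not None:
        path.write_text(command + "\n")
    return path.read_text()

# Example, mirroring the IOMem run shown below:
#   pcilat_cmd("1a:00.0", "pcilat_iomem", "run 3 1000 800")
#   print(pcilat_cmd("1a:00.0", "pcilat_iomem"))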

IOMem Testing

You can perform memory I/O latency testing against the readable memory regions on PCI devices. The driver will show you which BARs are available for the test, both in dmesg and in the output of pcilat_iomem:

root@localhost # cat pcilat_iomem
IOMem Read Latency Data
Last run: bar=-1 iterations=0 ndelay_ns=200 completed_cpus=0 errors=0
Available BARs: 0 3
Sample buffer: count=0 capacity=100000 dropped=0
Commands: run [bar] [iterations] [ndelay_ns] | clear
Summary format: cpu samples avg_us min_us max_us
No samples collected. Run: echo 'run [bar] [iterations] [ndelay_ns]' > pcilat_iomem
root@localhost #

Trigger an IOMem test against BAR 3 with 1000 iterations and a delay of 800 ns per iteration:

root@localhost # echo "run 3 1000 800" >pcilat_iomem

Reading Results

The results are written to a ring buffer that currently holds 100,000 entries. They can be read from pcilat_iomem, but the output there is truncated to 4000 entries. For the full 100,000 entries, read /dev/pcilatdriver/$device/iomem.

Use the included Python scripts in tools/ to parse or plot the data.

Parsing script:

The parsing script displays the data in a clean readable format on the CLI.

$ ./parse_iomem_latency.py /dev/pcilatdriver/1a\:00.0/iomem 
Total samples: 4000

Per-CPU IOMem Read Latency
--------------------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
           0     1000      2.000      2.229      2.125      2.417      2.542     11.000
           1     1000      1.875      1.980      1.958      2.333      2.542      5.958
           2     1000      2.250      2.712      2.792      2.877      3.501     12.959
           3     1000      2.250      2.498      2.500      2.544      3.083     10.500

Per-BAR IOMem Read Latency
--------------------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
           3     4000      1.875      2.355      2.334      2.833      3.125     12.959

Per CPU+BAR IOMem Read Latency
------------------------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
       c0/b3     1000      2.000      2.229      2.125      2.417      2.542     11.000
       c1/b3     1000      1.875      1.980      1.958      2.333      2.542      5.958
       c2/b3     1000      2.250      2.712      2.792      2.877      3.501     12.959
       c3/b3     1000      2.250      2.498      2.500      2.544      3.083     10.500
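
The percentile columns in these tables are ordinary rank percentiles over each key's samples. As a rough illustration (not the logic of the included parse script), a per-key summary can be computed like this:

import statistics

def summarize(samples_us):
    """Return (count, min, avg, p50, p95, p99, max) for latencies in microseconds."""
    ordered = sorted(samples_us)
    n = len(ordered)
    def pct(p):
        # Nearest-rank style percentile; the bundled script may use another method.
        return ordered[min(n - 1, round(p / 100 * (n - 1)))]
    return (n, ordered[0], statistics.fmean(ordered),
            pct(50), pct(95), pct(99), ordered[-1])
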
Plotting Script

The plotting script generates a .png heatmap image. It can also produce an HTML report containing the same data as the CLI output, with the heatmap embedded in the page. For more information, run plot_iomem_latency.py -h.
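
For a sense of what such a heatmap involves, the sketch below bins per-CPU samples into latency buckets and renders them with matplotlib. It is a minimal illustration under its own assumptions (bucket edges, labels), not the included plot_iomem_latency.py.

import numpy as np
import matplotlib.pyplot as plt

def plot_heatmap(per_cpu_samples_us, bucket_edges_us, out_png="iomem_heatmap.png"):
    """per_cpu_samples_us: {cpu: [latency_us, ...]}; bucket_edges_us: histogram edges."""
    cpus = sorted(per_cpu_samples_us)
    counts = np.array([np.histogram(per_cpu_samples_us[c], bins=bucket_edges_us)[0]
                       for c in cpus])
    fig, ax = plt.subplots()
    im = ax.imshow(counts, aspect="auto", cmap="viridis")
    ax.set_yticks(range(len(cpus)))
    ax.set_yticklabels([f"cpu{c}" for c in cpus])
    ax.set_xticks(range(len(bucket_edges_us) - 1))
    ax.set_xticklabels([f"{lo:g}-{hi:g}" for lo, hi in
                        zip(bucket_edges_us, bucket_edges_us[1:])], rotation=45)
    ax.set_xlabel("latency bucket (us)")
    fig.colorbar(im, ax=ax, label="samples")
    fig.tight_layout()
    fig.savefig(out_png)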

Interrupt Dispatch Timing

Timing interrupts can be difficult because enabling them is entirely device-specific. This portion is still under development, but it is implemented for the Intel e1000e. The driver has not yet been tested on physical hardware, as the current work is being done on VMware with an emulated e1000e device. VMware does not properly raise interrupts unless traffic reaches the RX and TX rings of the NIC, so raising an interrupt requires a write directly to the IRQ.

As I gain access to more hardware, I will continue extending driver support to additional cards.

If you have hardware you can lend me via remote login, I'm more than happy to write support for additional devices.

Enable interrupts for the device [if supported]:

# echo 1 >pcilat_interrupt
# cat pcilat_interrupt 
Interrupt Latency Data (0 samples)
Enabled: 1
Batch iterations: 1000
Dropped trigger timestamps: 0
Vectors: 4 | IRQs: 69 70 71 76
Set CPU affinity: echo <cpu> > /proc/irq/<IRQ>/smp_affinity_list
Commands: 1 | 0 | trigger | swtrigger [cpu] [irq|vector_index] | swmatrix [iters] [irq_mask_hex] | swloop [iters] [irq_mask_hex] | iterations [n] | clear
Format: handler_entry_ns trigger_to_handler_ns entry_to_ack_ns cpu irq
No samples collected yet
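
Because the goal is per-core measurements, it often makes sense to pin each vector to a specific CPU before triggering, using the /proc/irq/<IRQ>/smp_affinity_list recipe printed above. A minimal sketch, using the IRQ numbers from this example:

def pin_irqs(irqs, cpus):
    """Pin each IRQ to one CPU via /proc/irq/<IRQ>/smp_affinity_list."""
    for irq, cpu in zip(irqs, cpus):
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(cpu))

# The vectors shown above: IRQs 69, 70, 71, 76 pinned to CPUs 0-3.
pin_irqs([69, 70, 71, 76], [0, 1, 2, 3])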

On physical hardware, you should start to see interrupts being raised in /proc/interrupts and data appearing in /dev/pcilatdriver/$device/interrupt. If not, there are two ways to start raising interrupts:

  • trigger - Triggers a hardware-level interrupt on the device. This does not work on VMware / emulated hardware.

  • swmatrix | swloop - These perform manual injections into the IRQ to force an interrupt request to be raised. The irq_mask_hex argument is a bitmask over the device's IRQ vectors, used to raise requests on specific IRQs. For example, with four IRQs, bits 0 through 3 select the individual IRQs registered to the device, and 0xf raises interrupts on every IRQ (see the sketch after the example run below).

Example run:

# echo "swloop 1000 0xf" >pcilat_interrupt 
# cat /proc/interrupts | grep -i pcila
 69:       1000       1000       1000       1000       MSI 13631488 Edge      pcilatdriver
 70:       1000       1000       1000       1000       MSI 13631489 Edge      pcilatdriver
 71:       1000       1000       1000       1000       MSI 13631490 Edge      pcilatdriver
 76:       1000       1000       1000       1000       MSI 13631491 Edge      pcilatdriver
# 
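
For scripted runs, the mask can be built from the vector indices you want to exercise. The sketch below assumes the bitmask interpretation described above (bit i selects vector i) and simply writes the swloop command to the sysfs node:

def swloop(bdf, iterations, vector_indices):
    """Build an IRQ mask from vector indices (bit i = vector i, assumed) and
    start a swloop run through the pcilat_interrupt sysfs node."""
    mask = 0
    for idx in vector_indices:
        mask |= 1 << idx
    node = f"/sys/class/pcilatdriver/{bdf}/pcilat_interrupt"
    with open(node, "w") as f:
        f.write(f"swloop {iterations} {mask:#x}\n")

# Equivalent to the example run above: all four vectors, 1000 iterations each.
swloop("1a:00.0", 1000, [0, 1, 2, 3])   # writes "swloop 1000 0xf"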
Parsing The Data

The scripts under tools/ (parse_irq_latency.py and plot_irq_latency.py) work the same way as for the IOMem tests.

$ ./parse_irq_latency.py /dev/pcilatdriver/1a\:00.0/interrupt 
Total samples: 16000

Metrics Measured
----------------
1. trigger_to_handler_us
   Time from trigger timestamp capture to irq_handler entry.
   This includes interrupt delivery/scheduling latency.
2. entry_to_ack_us
   Time inside irq_handler from entry until ack completes.
   This excludes circular-buffer logging overhead by design.

CPU Performance Comparison (lower is better)
--------------------------------------------
  CPU  Samples  TrigAvg(us)  TrigP95(us)  vsBest% TrigAvgBar                      AckAvg(us)  AckP95(us) AckBar                        
    0     4000        7.427       16.825  3879.9% ############################         1.433       2.291 ###########################.  
    1     4000        6.888       15.159  3591.0% #########################...         1.459       2.332 ###########################.  
    2     4000        6.904       15.828  3599.2% ##########################..         1.466       2.291 ############################  
    3     4000        0.187        0.500     0.0% ............................         1.309       2.165 ########################....  

Fastest CPUs (trigger_to_handler_us avg)
-----------------------------------------
1. CPU 3: avg=0.187us p95=0.500us min=0.041us max=4.206us
2. CPU 1: avg=6.888us p95=15.159us min=1.375us max=60.263us
3. CPU 2: avg=6.904us p95=15.828us min=1.333us max=113.862us
4. CPU 0: avg=7.427us p95=16.825us min=1.541us max=145.514us

Slowest CPUs (trigger_to_handler_us avg)
-----------------------------------------
1. CPU 0: avg=7.427us p95=16.825us min=1.541us max=145.514us
2. CPU 2: avg=6.904us p95=15.828us min=1.333us max=113.862us
3. CPU 1: avg=6.888us p95=15.159us min=1.375us max=60.263us
4. CPU 3: avg=0.187us p95=0.500us min=0.041us max=4.206us

Per-CPU Latency
---------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
           0     4000      1.541      7.427      4.373     16.825     25.238    145.514
           1     4000      1.375      6.888      4.623     15.159     22.530     60.263
           2     4000      1.333      6.904      4.706     15.828     24.571    113.862
           3     4000      0.041      0.187      0.125      0.500      0.875      4.206

Per-CPU Entry-to-Ack Latency
----------------------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
           0     4000      0.916      1.433      1.250      2.291      2.998     12.535
           1     4000      0.958      1.459      1.333      2.332      2.999     13.868
           2     4000      0.957      1.466      1.333      2.291      2.957     24.988
           3     4000      0.875      1.309      1.166      2.165      2.832      9.620

Per-IRQ Latency
---------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
          69     4000      0.041     10.194     10.912     21.284     29.446    145.514
          70     4000      0.041      3.957      4.373      7.787     14.368     33.234
          71     4000      0.041      3.695      3.957      7.622     14.451     35.150
          76     4000      0.041      3.561      3.831      7.538     13.869    127.231

Per CPU+IRQ Latency
-------------------
         Key    Count    Min(us)    Avg(us)    P50(us)    P95(us)    P99(us)    Max(us)
      c0/i69     1000      2.457     15.184     14.077     23.780     32.027    145.514
      c0/i70     1000      2.915      4.926      4.290      8.471     14.454     24.947
      c0/i71     1000      1.541      4.848      4.081      9.372     18.158     31.027
      c0/i76     1000      1.541      4.752      3.915      7.790     14.579    127.231
      c1/i69     1000      4.039     12.764     11.536     21.865     31.862     60.263
      c1/i70     1000      3.331      5.293      4.623      7.996     14.621     22.198
      c1/i71     1000      2.374      4.841      4.248      8.046     14.829     21.698
      c1/i76     1000      1.375      4.656      3.998      7.705     13.869     26.196
      c2/i69     1000      3.540     12.569     10.620     22.655     31.526    113.862
      c2/i70     1000      3.290      5.424      4.664      8.758     15.785     33.234
      c2/i71     1000      1.457      4.941      4.248      9.206     14.869     35.150
      c2/i76     1000      1.333      4.681      3.999      7.622     14.161     19.991
      c3/i69     1000      0.041      0.258      0.166      0.583      1.499      4.206
      c3/i70     1000      0.041      0.183      0.125      0.458      0.792      3.582
      c3/i71     1000      0.041      0.149      0.084      0.417      0.791      3.582
      c3/i76     1000      0.041      0.156      0.083      0.458      0.750      3.915

Devreg Read/Write

The driver allows reading and setting registers on the PCI device. This is particularly useful when adding support for new cards, for example when you need to test interrupt enabling.

The pcilat_devreg node allows reading a register by simply writing the register address to it:

# echo 0xD0 >pcilat_devreg
# cat pcilat_devreg 
DevReg log (1 entries; full history via /dev/pcilatdriver/<bdf>/devreg):
[2078202165814] read 0xd0 = 0x0080e005 rc=0
#

To write a register, echo the register address and the value as two arguments.
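
Scripted, that read/write flow might look like the sketch below. The exact formatting of the two arguments (hex, space-separated) is assumed from the description above, and the write value here is purely hypothetical:

def devreg(bdf, register, value=None):
    """Read a register, or write it when a value is given, via pcilat_devreg.
    The two-argument write format is assumed from the description above."""
    node = f"/sys/class/pcilatdriver/{bdf}/pcilat_devreg"
    with open(node, "w") as f:
        if value is None:
            f.write(f"{register:#x}\n")
        else:
            f.write(f"{register:#x} {value:#x}\n")
    with open(node) as f:
        return f.read()       # latest entries of the devreg log

# Read register 0xD0, then write a hypothetical value back to it:
print(devreg("1a:00.0", 0xD0))
print(devreg("1a:00.0", 0xD0, 0x0080E005))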
