loaded-latency: add mode-2 multipass random chain (v1) by realmzstevenmiao · Pull Request #2 · ARM-software/infra-microbenchmarks

realmzstevenmiao · 2026-05-25T10:30:38Z

Merge the 8-pass random Hamiltonian chain randomization (mode 2), which defeats the Neoverse V2 L2 prefetchers that learn the single-chain pattern used by mode 1.

Changes:

- args.c: add -R/--lat-randomize-mode <n> to select the randomization mode (1 = pair-swap shuffle, 2 = 8-pass random chain). -r remains shorthand for mode 1.
- memlatency.c:
  * Add make_multipass_chain(), which builds LAT_CHAIN_PASSES (=8) independent Hamiltonian cycles via ptr_t slots per cacheline and concatenates them head-to-tail into one closed loop. Each pass uses an independent Fisher-Yates random visitation order and a different slot offset within each cacheline, defeating stride, next-line, and short-history temporal prefetchers.
  * Refactor make_pairswap_chain() to use an external order[] array (mirroring mode 2). Drop the per-node order/index bookkeeping fields; nodes now carry only ->next.
  * Shrink the local node_t in lat_initialize() accordingly; the union is now { void *next; ptr_t ptrs[LAT_CHAIN_PASSES]; } + cacheline pad.
  * Replace the dead #if 0 debug block with #ifdef LAT_DEBUG_CHAIN, which derives the cacheline index/slot from the pointer and the buffer base.
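The multi-pass construction described in the list above can be sketched roughly as follows. This is a minimal illustration written from the description only, not the code in this PR: `build_multipass_chain`, `shuffle`, `multipass_self_check`, and `CACHELINE_BYTES` are hypothetical names, `ptr_t` is assumed to be a plain pointer-sized slot, cachelines are assumed to be 64 bytes, and `rand()` stands in for whatever generator the benchmark actually uses. The self-check also shows one way a cacheline index and slot can be derived from a pointer and the buffer base, as the LAT_DEBUG_CHAIN item describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LAT_CHAIN_PASSES 8   /* value stated in the PR description */
#define CACHELINE_BYTES  64  /* assumption: 64-byte cachelines */

typedef void *ptr_t;         /* assumption: ptr_t is a pointer-sized slot */

typedef union {
    void *next;                    /* the only field a single-chain mode needs */
    ptr_t ptrs[LAT_CHAIN_PASSES];  /* one slot per pass for the multi-pass chain */
    char  pad[CACHELINE_BYTES];    /* pad the node to a full cacheline */
} node_t;

/* Fisher-Yates shuffle. rand() keeps the sketch short; its limited range
 * and modulo bias make it unsuitable for large buffers. */
static void shuffle(size_t *order, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Link n cachelines into one closed loop of n * LAT_CHAIN_PASSES hops.
 * Pass p visits every line once, in its own random order, through slot p.
 * Returns the head slot, or NULL if the scratch allocation fails. */
static void *build_multipass_chain(node_t *buf, size_t n)
{
    assert(buf != NULL && n > 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;

    ptr_t *head = NULL, *tail = NULL;
    for (size_t pass = 0; pass < LAT_CHAIN_PASSES; pass++) {
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        shuffle(order, n);
        for (size_t i = 0; i < n; i++) {
            ptr_t *slot = &buf[order[i]].ptrs[pass];
            if (tail != NULL)
                *tail = slot;      /* previous slot points at this one */
            else
                head = slot;       /* very first slot of pass 0 */
            tail = slot;
        }
    }
    *tail = head;                  /* close the loop: last pass back to pass 0 */
    free(order);
    return head;
}

/* Walk the chain once and check that every (line, slot) pair is visited
 * exactly once, pass by pass, and that the walk returns to the head.
 * Returns 1 on success, 0 otherwise. */
static int multipass_self_check(size_t n)
{
    node_t *buf = aligned_alloc(CACHELINE_BYTES, n * sizeof *buf);
    unsigned char *seen = calloc(n * LAT_CHAIN_PASSES, 1);
    int ok = 0;

    if (buf != NULL && seen != NULL) {
        void *head = build_multipass_chain(buf, n);
        void *p = head;
        ok = (head != NULL);
        for (size_t hop = 0; ok && hop < n * LAT_CHAIN_PASSES; hop++) {
            /* Derive cacheline index and slot from pointer and buffer base. */
            size_t off  = (size_t)((char *)p - (char *)buf);
            size_t line = off / sizeof(node_t);
            size_t slot = (off % sizeof(node_t)) / sizeof(ptr_t);
            if (line >= n || slot != hop / n ||
                seen[line * LAT_CHAIN_PASSES + slot]) {
                ok = 0;
                break;
            }
            seen[line * LAT_CHAIN_PASSES + slot] = 1;
            p = *(void **)p;       /* dependent load: the pointer chase */
        }
        ok = ok && (p == head);
    }
    free(buf);
    free(seen);
    return ok;
}
```

Calling `multipass_self_check(16)` builds a 16-line buffer and walks all 128 hops. Because each hop is a dependent load and consecutive visits to the same cacheline are a full pass apart, a chain of this shape gives a prefetcher neither a fixed stride nor a short repeating history to learn from.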

./smoke-mode1-vs-mode2.sh
Using CPU 31 on NUMA node 0
size           mode1_ns     mode2_ns      ratio
----           --------     --------      -----
32KiB          1.213600     1.213517      1.00x
1MiB           4.149233     6.914998      1.67x
2MiB           6.859965    23.385340      3.41x
4MiB           7.384905    31.402270      4.25x
8MiB           4.850773    34.441810      7.10x
16MiB          5.888085    35.799870      6.08x
64MiB         32.145800    36.532270      1.14x

Merge the 8-pass random Hamiltonian chain randomization (mode 2), which defeats the Neoverse V2 L2 prefetchers that learn the single-chain pattern used by mode 1.

Changes:
- args.c: add -R/--lat-randomize-mode <n> to select randomization mode (1 = pair-swap shuffle, 2 = 8-pass random chain). -r remains shorthand for mode 1.
- memlatency.c:
  * Add make_multipass_chain() building LAT_CHAIN_PASSES (=8) independent Hamiltonian cycles via ptr_t slots per cacheline, concatenated head-to-tail into one closed loop. Each pass uses an independent Fisher-Yates random visitation order and a different slot offset within each cacheline, defeating stride, next-line, and short-history temporal prefetchers.
  * Refactor make_pairswap_chain() to use an external order[] array (mirroring mode 2). Drop per-node order/index bookkeeping fields; nodes now carry only ->next.
  * Shrink local node_t in lat_initialize() accordingly; union is now { void *next; ptr_t ptrs[LAT_CHAIN_PASSES]; } + cacheline pad.
  * Replace dead #if 0 debug block with #ifdef LAT_DEBUG_CHAIN that derives cacheline index/slot from the pointer and buffer base.

Signed-off-by: Stefan Andersson DAG <stefan.dag.andersson@ericsson.com>
Signed-off-by: Steven Miao <Steven.Miao@arm.com>

realmzstevenmiao · 2026-05-25T10:32:11Z

cat smoke-mode1-vs-mode2.sh


#!/bin/bash

# SPDX-FileCopyrightText: Copyright 2026 Arm Limited and/or its affiliates
# SPDX-License-Identifier: BSD-3-Clause
#
# smoke-mode1-vs-mode2.sh
#
# Quick latency comparison of mode 1 (pair-swap shuffle) vs mode 2
# (multi-pass random chain) across a sweep of working-set sizes.
# Mode 2 defeats HW prefetchers on Neoverse V2 / Sapphire Rapids, so
# mode 2 >> mode 1 once the working set escapes L1/L2; at DRAM scale
# the ratio shrinks because mode 1's random shuffle alone already
# defeats much of the prefetcher.
#
# Usage:
#   ./smoke-mode1-vs-mode2.sh [--cpu N] [--numa-node N]
#                             [--duration SEC] [--hugepages 1G|2MB|none]

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
LOADED_LATENCY="${SCRIPT_DIR}/loaded-latency"

NUMA_NODE=0
CPU=""              # default: last CPU in NUMA_NODE (likely idle)
DURATION=3          # seconds per run, kept short for a smoke test
HUGEPAGE_OPT="1G"   # 1G|2MB|none — matches verify-sweep.sh default
# Probe sizes in bytes: L1, then a sweep across L2/LLC where the HW
# prefetcher gain on mode 1 is largest, then DRAM.
SIZES_BYTES=(32768 1048576 2097152 4194304 8388608 16777216 67108864)
SIZE_LABELS=("32KiB"  "1MiB"   "2MiB"   "4MiB"   "8MiB"   "16MiB"   "64MiB")

while [[ $# -gt 0 ]]; do
    case "$1" in
        --cpu)       CPU="$2";          shift 2 ;;
        --numa-node) NUMA_NODE="$2";    shift 2 ;;
        --duration)  DURATION="$2";     shift 2 ;;
        --hugepages) HUGEPAGE_OPT="$2"; shift 2 ;;
        -h|--help)
            echo "Usage: $0 [--cpu N] [--numa-node N] [--duration SEC] [--hugepages 1G|2MB|none]"
            exit 0 ;;
        *) echo "Unknown option: $1" >&2; exit 1 ;;
    esac
done

[[ -x "${LOADED_LATENCY}" ]] || { echo "Not found: ${LOADED_LATENCY} (run make)" >&2; exit 1; }
command -v numactl >/dev/null || { echo "numactl required" >&2; exit 1; }

# Default CPU = last CPU on NUMA_NODE (OS scheduler tends to leave high
# CPU IDs idle, so we get less interference without explicit isolation).
if [[ -z "${CPU}" ]]; then
    CPU=$(numactl -H | awk -v n="${NUMA_NODE}" '
        $1=="node" && $2==n && $3=="cpus:" { print $NF; exit }')
    [[ -z "${CPU}" ]] && { echo "Could not determine last CPU on NUMA node ${NUMA_NODE}" >&2; exit 1; }
fi
echo "Using CPU ${CPU} on NUMA node ${NUMA_NODE}"

run_once() {
    # Args: size_bytes mode
    local size_bytes=$1
    local mode=$2
    local cacheline_count=$((size_bytes / 64))
    (( cacheline_count < 2 )) && cacheline_count=2

    local hp_flag=""
    [[ "${HUGEPAGE_OPT}" != "none" ]] && hp_flag="--lat-use-hugepages ${HUGEPAGE_OPT}"

    local out
    out=$(numactl -N"${NUMA_NODE}" -m"${NUMA_NODE}" \
            "${LOADED_LATENCY}" -l "${CPU}" -D "${DURATION}" \
            --lat-cacheline-count "${cacheline_count}" \
            --lat-randomize-mode "${mode}" ${hp_flag} 2>&1) || true

    local lat
    lat=$(echo "$out" | awk '/^Average Latency/ {print $4; exit}')
    [[ -z "${lat}" ]] && lat="nan"
    echo "${lat}"
}

printf '%-10s %12s %12s %10s\n' "size" "mode1_ns" "mode2_ns" "ratio"
printf '%-10s %12s %12s %10s\n' "----" "--------" "--------" "-----"

for i in "${!SIZES_BYTES[@]}"; do
    sz=${SIZES_BYTES[$i]}
    label=${SIZE_LABELS[$i]}

    l1=$(run_once "$sz" 1)
    l2=$(run_once "$sz" 2)

    if [[ "$l1" == "nan" || "$l2" == "nan" ]]; then
        ratio="nan"
    else
        ratio=$(awk -v a="$l2" -v b="$l1" 'BEGIN{ if (b>0) printf "%.2f", a/b; else print "inf" }')
    fi

    printf '%-10s %12s %12s %10s\n' "${label}" "${l1}" "${l2}" "${ratio}x"
done

realmzstevenmiao wants to merge 1 commit into ARM-software:main from realmzstevenmiao:random_mode2
