loaded-latency: add mode-2 multipass random chain (v1) #2

Open: realmzstevenmiao wants to merge 1 commit into ARM-software:main from realmzstevenmiao:random_mode2
Merge the 8-pass random Hamiltonian chain randomization (mode 2), which defeats the Neoverse V2 L2 prefetchers that learn the single-chain pattern used by mode 1.

Changes:

  • args.c: add -R/--lat-randomize-mode <n> to select the randomization mode (1 = pair-swap shuffle, 2 = 8-pass random chain). -r remains shorthand for mode 1.

  • memlatency.c:

    • Add make_multipass_chain(), which builds LAT_CHAIN_PASSES (=8) independent Hamiltonian cycles through the ptr_t slots of each cacheline and concatenates them head-to-tail into one closed loop. Each pass uses an independent Fisher-Yates random visitation order and a different slot offset within each cacheline, which defeats stride, next-line, and short-history temporal prefetchers.
    • Refactor make_pairswap_chain() to use an external order[] array (mirroring mode 2). Drop per-node order/index bookkeeping fields; nodes now carry only ->next.
    • Shrink local node_t in lat_initialize() accordingly; union is now { void *next; ptr_t ptrs[LAT_CHAIN_PASSES]; } + cacheline pad.
    • Replace dead #if 0 debug block with #ifdef LAT_DEBUG_CHAIN that derives cacheline index/slot from the pointer and buffer base.
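
The construction described in the memlatency.c bullets can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `build_multipass_chain()`, `count_hops()`, `locate_slot()`, the xorshift PRNG, the 64-byte `CACHELINE_BYTES`, and `ptr_t` being a plain `void *` are hypothetical stand-ins. Only `LAT_CHAIN_PASSES`, the union layout of `node_t`, and the per-pass Fisher-Yates order are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE_BYTES  64   /* assumption: 64-byte cachelines */
#define LAT_CHAIN_PASSES 8    /* pass count named in the PR */

typedef void *ptr_t;          /* assumption: ptr_t is a plain pointer */

/* One node per cacheline: one pointer slot per pass, padded to a line. */
typedef union node {
    void *next;
    ptr_t ptrs[LAT_CHAIN_PASSES];
    char  pad[CACHELINE_BYTES];
} node_t;

/* Hypothetical PRNG (xorshift64); the state must be nonzero. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fisher-Yates shuffle of order[0..n-1]; modulo bias is ignored here. */
static void shuffle(size_t *order, size_t n, uint64_t *state)
{
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random(state) % (i + 1));
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Link n cachelines into one loop of n * LAT_CHAIN_PASSES hops.
 * Returns the entry slot, or NULL if the scratch array cannot be allocated. */
void **build_multipass_chain(node_t *nodes, size_t n, uint64_t seed)
{
    assert(n >= 2 && seed != 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;

    void **entry = NULL;      /* first slot of pass 0 */
    void **prev_tail = NULL;  /* last slot of the previous pass */
    for (int p = 0; p < LAT_CHAIN_PASSES; p++) {
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        shuffle(order, n, &seed);  /* fresh visitation order per pass */

        /* Pass p only ever touches slot p of each line. */
        for (size_t i = 0; i + 1 < n; i++)
            nodes[order[i]].ptrs[p] = &nodes[order[i + 1]].ptrs[p];

        void **head = &nodes[order[0]].ptrs[p];
        if (prev_tail != NULL)
            *prev_tail = head;     /* previous pass tail -> this head */
        else
            entry = head;
        prev_tail = &nodes[order[n - 1]].ptrs[p];
    }
    *prev_tail = entry;            /* close the loop */
    free(order);
    return entry;
}

/* Follow the chain from start until it returns there; count the loads. */
size_t count_hops(void **start)
{
    size_t hops = 0;
    void **p = start;
    do {
        p = (void **)*p;
        hops++;
    } while (p != start);
    return hops;
}

/* Debug helper: recover cacheline index and slot from pointer and base. */
void locate_slot(const node_t *base, const void *p, size_t *line, size_t *slot)
{
    size_t off = (size_t)((const char *)p - (const char *)base);
    *line = off / sizeof(node_t);
    *slot = (off % sizeof(node_t)) / sizeof(ptr_t);
}
```

Walking from the returned entry slot, `count_hops()` should report `n * LAT_CHAIN_PASSES`, because every cacheline is visited once per pass and each pass goes through a different slot. `locate_slot()` mirrors the idea behind the LAT_DEBUG_CHAIN block: with no per-node bookkeeping fields, the line index and slot are recomputed from the pointer's offset into the buffer.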

    ./smoke-mode1-vs-mode2.sh
    Using CPU 31 on NUMA node 0
    size           mode1_ns     mode2_ns      ratio
    ----           --------     --------      -----
    32KiB          1.213600     1.213517      1.00x
    1MiB           4.149233     6.914998      1.67x
    2MiB           6.859965    23.385340      3.41x
    4MiB           7.384905    31.402270      4.25x
    8MiB           4.850773    34.441810      7.10x
    16MiB          5.888085    35.799870      6.08x
    64MiB         32.145800    36.532270      1.14x

Signed-off-by: Stefan Andersson DAG <stefan.dag.andersson@ericsson.com>
Signed-off-by: Steven Miao <Steven.Miao@arm.com>

realmzstevenmiao commented May 25, 2026


cat smoke-mode1-vs-mode2.sh


#!/bin/bash

# SPDX-FileCopyrightText: Copyright 2026 Arm Limited and/or its affiliates
# SPDX-License-Identifier: BSD-3-Clause
#
# smoke-mode1-vs-mode2.sh
#
# Quick latency comparison of mode 1 (pair-swap shuffle) vs mode 2
# (multi-pass random chain) across a sweep of working-set sizes.
# Mode 2 defeats HW prefetchers on Neoverse V2 / Sapphire Rapids, so
# mode 2 >> mode 1 once the working set escapes L1/L2; at DRAM scale
# the ratio shrinks because mode 1's random shuffle alone already
# defeats much of the prefetcher.
#
# Usage:
#   ./smoke-mode1-vs-mode2.sh [--cpu N] [--numa-node N]
#                             [--duration SEC] [--hugepages 1G|2MB|none]

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
LOADED_LATENCY="${SCRIPT_DIR}/loaded-latency"

NUMA_NODE=0
CPU=""              # default: last CPU in NUMA_NODE (likely idle)
DURATION=3          # seconds per run, kept short for a smoke test
HUGEPAGE_OPT="1G"   # 1G|2MB|none — matches verify-sweep.sh default
# Probe sizes in bytes: L1, then a sweep across L2/LLC where the HW
# prefetcher gain on mode 1 is largest, then DRAM.
SIZES_BYTES=(32768 1048576 2097152 4194304 8388608 16777216 67108864)
SIZE_LABELS=("32KiB"  "1MiB"   "2MiB"   "4MiB"   "8MiB"   "16MiB"   "64MiB")

while [[ $# -gt 0 ]]; do
    case "$1" in
        --cpu)       CPU="$2";          shift 2 ;;
        --numa-node) NUMA_NODE="$2";    shift 2 ;;
        --duration)  DURATION="$2";     shift 2 ;;
        --hugepages) HUGEPAGE_OPT="$2"; shift 2 ;;
        -h|--help)
            echo "Usage: $0 [--cpu N] [--numa-node N] [--duration SEC] [--hugepages 1G|2MB|none]"
            exit 0 ;;
        *) echo "Unknown option: $1" >&2; exit 1 ;;
    esac
done

[[ -x "${LOADED_LATENCY}" ]] || { echo "Not found: ${LOADED_LATENCY} (run make)" >&2; exit 1; }
command -v numactl >/dev/null || { echo "numactl required" >&2; exit 1; }

# Default CPU = last CPU on NUMA_NODE (OS scheduler tends to leave high
# CPU IDs idle, so we get less interference without explicit isolation).
if [[ -z "${CPU}" ]]; then
    CPU=$(numactl -H | awk -v n="${NUMA_NODE}" '
        $1=="node" && $2==n && $3=="cpus:" { print $NF; exit }')
    [[ -z "${CPU}" ]] && { echo "Could not determine last CPU on NUMA node ${NUMA_NODE}" >&2; exit 1; }
fi
echo "Using CPU ${CPU} on NUMA node ${NUMA_NODE}"

run_once() {
    # Args: size_bytes mode
    local size_bytes=$1
    local mode=$2
    local cacheline_count=$((size_bytes / 64))
    (( cacheline_count < 2 )) && cacheline_count=2

    local hp_flag=""
    [[ "${HUGEPAGE_OPT}" != "none" ]] && hp_flag="--lat-use-hugepages ${HUGEPAGE_OPT}"

    local out
    out=$(numactl -N"${NUMA_NODE}" -m"${NUMA_NODE}" \
            "${LOADED_LATENCY}" -l "${CPU}" -D "${DURATION}" \
            --lat-cacheline-count "${cacheline_count}" \
            --lat-randomize-mode "${mode}" ${hp_flag} 2>&1) || true

    local lat
    lat=$(echo "$out" | awk '/^Average Latency/ {print $4; exit}')
    [[ -z "${lat}" ]] && lat="nan"
    echo "${lat}"
}

printf '%-10s %12s %12s %10s\n' "size" "mode1_ns" "mode2_ns" "ratio"
printf '%-10s %12s %12s %10s\n' "----" "--------" "--------" "-----"

for i in "${!SIZES_BYTES[@]}"; do
    sz=${SIZES_BYTES[$i]}
    label=${SIZE_LABELS[$i]}

    l1=$(run_once "$sz" 1)
    l2=$(run_once "$sz" 2)

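    if [[ "$l1" == "nan" || "$l2" == "nan" ]]; then
        ratio="nan"
    else
        # mode2 / mode1; the "x" suffix is attached only to a real ratio,
        # so a failed run prints "nan" rather than "nanx".
        ratio=$(awk -v a="$l2" -v b="$l1" 'BEGIN{ if (b>0) printf "%.2fx", a/b; else print "inf" }')
    fi

    printf '%-10s %12s %12s %10s\n' "${label}" "${l1}" "${l2}" "${ratio}"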
done
