Skip to content

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167

Draft
phongn wants to merge 4 commits into
apache:masterfrom
phongn:simd-bulk-tolower
Draft

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
phongn wants to merge 4 commits into
apache:masterfrom
phongn:simd-bulk-tolower

Conversation

@phongn
Copy link
Copy Markdown
Collaborator

@phongn phongn commented May 14, 2026

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.

Implementation notes

  • Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
  • AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.
  • The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

Size scalar SSE2 AVX2 (-mavx2) AVX-512BW (-mavx512bw)
16 B 10.4 2.15 1.75 1.98
32 B 15.4 2.90 2.24 2.31
64 B 28.0 4.43 2.85 2.61
256 B 113 13.87 7.57 6.20
1024 B 425 50.47 24.23 17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.

Test plan

  • Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
    • Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
    • An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped — guards against any future widening of the case-fold range.
  • All paths run correctness clean on:
    • Broadwell (AVX2-capable) with -mavx2
    • Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
  • cmake --build build -t format clean.
  • src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
  • Jenkins CI green.

Notes for reviewers

  • No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
  • The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
  • AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
  • Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

@phongn phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 Compare May 14, 2026 20:30
The bulk ASCII tolower loop used to canonicalize the scheme and host
portions of a URL before hashing into the cache key runs at ~1.5 GB/s
scalar (one byte and one ParseRules table lookup per iteration). The
work is trivially data-parallel and there is no per-byte branching, so
a SIMD kernel that lowercases a whole register at once gives a
straightforward speedup once the input is long enough to amortize the
vector setup.

Add a header-only helper ts::memcpy_tolower under
include/tscore/ink_memcpy_tolower.h with a compile-time-selected
cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2
on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to
narrower drain loops, so the worst-case scalar tail is always <16
bytes. Selection is purely compile-time; runtime ifunc dispatch is
left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional
"+0x20 where upper" into a single op, and a masked load/store handles
1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's
copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole
AVX-512BW block is gated at n >= 64 because the masked load/store has
~7 ns of fixed setup that loses to the narrower paths for short
inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly:
bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF)
pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with
this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds
that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW)
get the wider bodies automatically. Sub-16-byte inputs (e.g. short
HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

  size   scalar   SSE2     AVX2     AVX-512BW
  ----   ------   ----     ----     ---------
  16 B   10.4     2.15     1.75     1.98
  32 B   15.4     2.90     2.24     2.31
  64 B   28.0     4.43     2.85     2.61
  256 B  113      13.87    7.57     6.20
  1024 B 425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across
sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-
value sweep that guards against any future widening of the case-fold
range, alongside scalar-vs-SIMD throughput numbers and a config-print
case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bryancall
Copy link
Copy Markdown
Contributor

@phongn Should we do this for in place too?

@bryancall bryancall added this to the 11.0.0 milestone May 18, 2026
@bryancall bryancall requested a review from masaori335 May 18, 2026 22:16
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.

Changes:

  • Adds ts::memcpy_tolower with scalar, SSE2, AVX2, AVX-512BW, and NEON paths.
  • Replaces URL fast-path scheme/host lowercasing with the shared helper.
  • Adds an optional Catch2 benchmark/correctness harness for the helper.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
include/tscore/ink_memcpy_tolower.h Defines the new SIMD/scalar lowercase-copy helper.
src/proxy/hdrs/URL.cc Uses ts::memcpy_tolower in cache-key fast-path canonicalization.
tools/benchmark/benchmark_memcpy_tolower.cc Adds benchmark and correctness checks for the helper.
tools/benchmark/CMakeLists.txt Builds the new benchmark target when benchmarks are enabled.

Comment thread include/tscore/ink_memcpy_tolower.h Outdated
Comment thread include/tscore/ink_memcpy_tolower.h Outdated
Comment on lines +19 to +24
Implementation note: the bodies are stacked widest-first and each
drains its block size before falling through to the next. A build
with AVX-512BW gets the 64-byte body as the main loop, then at most
one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain
the remainder before the scalar tail handles 0-15 bytes. Builds
without the wider ISAs simply skip those blocks. Selection is purely
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — comment was stale from an earlier iteration that used a cascade for the tail. Updated in faeb167 to describe what the code actually does: AVX-512BW gates at n >= 64, runs a 64-byte main loop, finishes any 1..63-byte tail with a single masked load/store, then early-returns. The AVX2 + SSE2 cascade only runs for n < 64, where the masked-tail setup cost (~7 ns) would otherwise lose to the narrower paths.

Comment thread tools/benchmark/benchmark_ascii_tolower.cc
Comment thread tools/benchmark/benchmark_memcpy_tolower.cc Outdated
@bryancall bryancall self-requested a review May 18, 2026 22:21
Comment thread include/tscore/ink_memcpy_tolower.h Outdated
- Move ts::memcpy_tolower correctness coverage out of the
  ENABLE_BENCHMARKS-gated benchmark and into a new
  src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises
  the scalar and SIMD paths in every build. Covers boundary sizes
  bracketing each SIMD body width, the exhaustive 0..255 byte-value
  sweep, and the in-place (dst == src) form (Copilot).

- Fix the implementation-note comment on ts::memcpy_tolower to
  describe the actual AVX-512BW control flow (gated main loop +
  masked-tail load/store + early return), and document that in-place
  (dst == src) is supported on every path (Copilot).

- Add a Catch::Benchmark::keep_memory barrier in
  benchmark_memcpy_tolower so the compiler can no longer DCE the
  inlined stores past the first observed byte (Copilot).

- Migrate the in-place tolower loop in
  src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower,
  demonstrating the in-place contract (bryancall).

- Add Tony Finch's copytolower64.c attribution to NOTICE
  (masaori335).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@phongn
Copy link
Copy Markdown
Collaborator Author

phongn commented May 21, 2026

Yes — migrated the in-place tolower loop in QPACK::_encode_header to ts::memcpy_tolower in faeb167. The helper's contract now explicitly supports dst == src: every SIMD body loads a full block into a register before storing back at the same offset, and the AVX-512BW masked tail does masked-load/masked-store at the same offset. Partial overlap (where dst != src but the ranges intersect) is still unsupported. Unit test coverage for the in-place form added too.

UrlRewrite.cc and HPACK.cc have similar byte-at-a-time loops with separate buffers — those are easy follow-up migrations if you'd like them rolled in, or I can leave them as a separate PR.

phongn and others added 2 commits May 21, 2026 20:59
memcpy_tolower carried two warts: the "memcpy" prefix implied
non-overlapping by convention with libc memcpy (we explicitly support
the in-place case), and the unqualified name didn't surface the
ASCII-only semantics. Rename the helper to ts::ascii::tolower_copy
and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call
sites that operate on a single buffer read naturally instead of
passing the same pointer twice.

Rename the header to include/tscore/ink_ascii_tolower.h, the unit
test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the
benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match.
Update the two existing call sites (URL.cc fast-path scheme/host and
QPACK::_encode_header in-place name lowercasing) accordingly. No
behavior change: the helper bodies are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate two more byte-at-a-time ASCII tolower loops to
ts::ascii::tolower_copy. Both call sites use a separate destination
buffer, so the copy form is the right fit:

- hpack_encode_header_block(): lower-cases each MIMEField name before
  encoding to match the HTTP/2 lowercase-header-name requirement.

- UrlRewrite::_mappingLookup(): lower-cases the incoming request host
  into a stack buffer before the table lookup, so the lookup is
  case-insensitive against the lower-cased keys built at config-load
  time. The previous code used libc tolower(int) on signed char values,
  which is technically UB for bytes >= 0x80; the new call avoids that.

The existing unit tests in test_URL, test_HpackIndexingTable, and
test_RemapRules executed the tolower paths but only with inputs that
were already lower-case, so they would have missed a "skip the
lowercasing" regression. Add focused behavioral coverage:

- test_URL.cc: four extra get_hash_test_cases that hash a request with
  uppercase/mixed-case scheme or host and require an equal hash to the
  lower-case form. Includes a 49-byte uppercase host that crosses both
  the 16- and 32-byte SIMD bodies.

- test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a
  map for a lower-case host and requires that uppercase, mixed-case,
  and long-uppercase request hosts all match.

- test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long
  mixed-case field name with hpack_encode_header_block and requires
  the encoded byte stream to be identical to encoding the same field
  with an already-lower-case name.

QPACK already exercises the in-place path through its Encoding test
and the helper's own ts::ascii::tolower_inplace unit test covers
in-place semantics exhaustively; an additional focused QPACK test
would need the external .qif fixture infrastructure, which is out of
scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@masaori335
Copy link
Copy Markdown
Contributor

masaori335 commented May 22, 2026

I expect applying this to UrlRewrite and HPACK has more performance impact than cache-key in real traffic 👍 I can spawn benchmark in our tool and share result when tests are passed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants