tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path by phongn · Pull Request #13167 · apache/trafficserver

phongn · 2026-05-14T19:55:30Z

Summary

Add a SIMD-accelerated bulk ASCII tolower helper ts::memcpy_tolower in tscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW); a stock x86_64 build keeps SSE2.

Behavior matches ParseRules::ink_tolower exactly: bytes in A..Z map to a..z, all others (including 0x80..0xFF) pass through unchanged.

Implementation notes

Cascade: wider bodies drain into narrower ones, so the worst-case scalar tail is always <16 bytes regardless of build flags.
AVX-512BW kernel uses _mm512_mask_add_epi8 to fuse the conditional +0x20 into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.
The whole AVX-512BW block is gated at n ≥ 64, because the masked load/store carries ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes the AVX-512BW build falls through to its AVX2 + SSE2 cascade.

src/proxy/hdrs/URL.cc drops its static-inline memcpy_tolower and calls ts::memcpy_tolower instead.

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Mean ns per call from tools/benchmark/benchmark_memcpy_tolower:

Size	scalar	SSE2	AVX2 (`-mavx2`)	AVX-512BW (`-mavx512bw`)
16 B	10.4	2.15	1.75	1.98
32 B	15.4	2.90	2.24	2.31
64 B	28.0	4.43	2.85	2.61
256 B	113	13.87	7.57	6.20
1024 B	425	50.47	24.23	17.49

Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.

URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.

Test plan

Microbench tools/benchmark/benchmark_memcpy_tolower runs 269 correctness assertions covering:
- Sizes 0, 1, 5, 15, 16, 17, 23, 31, 32, 33, 64, 257 (bracketing each SIMD body) against the scalar reference.
- An exhaustive sweep of all 256 byte values verifying that only A..Z are remapped — guards against any future widening of the case-fold range.
All paths run correctness clean on:
- Broadwell (AVX2-capable) with -mavx2
- Ice Lake (AVX-512BW) with -mavx2, -mavx512bw, and the default
cmake --build build -t format clean.
src/proxy/hdrs/libhdrs.a builds clean with the updated URL.cc.
Jenkins CI green.

Notes for reviewers

No new compile flags or dependencies. Just baseline SSE2 (x86_64 ABI) and baseline NEON (ARMv8 ABI); wider paths kick in automatically with -march=x86-64-v3 or -march=x86-64-v4.
The header includes <immintrin.h> / <arm_neon.h> only inside the #if that needs them, so other architectures don't pull them in.
AVX-512BW kernel design (mask_add + masked tail) was adapted from Tony Finch's vectolower, license-compatible (0BSD / MIT-0).
Other call sites with the same byte-at-a-time tolower pattern (HPACK.cc, QPACK.cc, UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.

🤖 Generated with Claude Code

The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup. Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade. Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged. Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change. Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns: size scalar SSE2 AVX2 AVX-512BW ---- ------ ---- ---- --------- 16 B 10.4 2.15 1.75 1.98 32 B 15.4 2.90 2.24 2.31 64 B 28.0 4.43 2.85 2.61 256 B 113 13.87 7.57 6.20 1024 B 425 50.47 24.23 17.49 Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x. A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte- value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bryancall · 2026-05-18T22:15:21Z

@phongn Should we do this for in place too?

Copilot

Pull request overview

This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.

Changes:

Adds ts::memcpy_tolower with scalar, SSE2, AVX2, AVX-512BW, and NEON paths.
Replaces URL fast-path scheme/host lowercasing with the shared helper.
Adds an optional Catch2 benchmark/correctness harness for the helper.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`include/tscore/ink_memcpy_tolower.h`	Defines the new SIMD/scalar lowercase-copy helper.
`src/proxy/hdrs/URL.cc`	Uses `ts::memcpy_tolower` in cache-key fast-path canonicalization.
`tools/benchmark/benchmark_memcpy_tolower.cc`	Adds benchmark and correctness checks for the helper.
`tools/benchmark/CMakeLists.txt`	Builds the new benchmark target when benchmarks are enabled.

phongn · 2026-05-21T20:21:24Z

+  Implementation note: the bodies are stacked widest-first and each
+  drains its block size before falling through to the next. A build
+  with AVX-512BW gets the 64-byte body as the main loop, then at most
+  one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain
+  the remainder before the scalar tail handles 0-15 bytes. Builds
+  without the wider ISAs simply skip those blocks. Selection is purely


Good catch — comment was stale from an earlier iteration that used a cascade for the tail. Updated in faeb167 to describe what the code actually does: AVX-512BW gates at n >= 64, runs a 64-byte main loop, finishes any 1..63-byte tail with a single masked load/store, then early-returns. The AVX2 + SSE2 cascade only runs for n < 64, where the masked-tail setup cost (~7 ns) would otherwise lose to the narrower paths.

- Move ts::memcpy_tolower correctness coverage out of the ENABLE_BENCHMARKS-gated benchmark and into a new src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises the scalar and SIMD paths in every build. Covers boundary sizes bracketing each SIMD body width, the exhaustive 0..255 byte-value sweep, and the in-place (dst == src) form (Copilot). - Fix the implementation-note comment on ts::memcpy_tolower to describe the actual AVX-512BW control flow (gated main loop + masked-tail load/store + early return), and document that in-place (dst == src) is supported on every path (Copilot). - Add a Catch::Benchmark::keep_memory barrier in benchmark_memcpy_tolower so the compiler can no longer DCE the inlined stores past the first observed byte (Copilot). - Migrate the in-place tolower loop in src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower, demonstrating the in-place contract (bryancall). - Add Tony Finch's copytolower64.c attribution to NOTICE (masaori335). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

phongn · 2026-05-21T20:21:44Z

Yes — migrated the in-place tolower loop in QPACK::_encode_header to ts::memcpy_tolower in faeb167. The helper's contract now explicitly supports dst == src: every SIMD body loads a full block into a register before storing back at the same offset, and the AVX-512BW masked tail does masked-load/masked-store at the same offset. Partial overlap (where dst != src but the ranges intersect) is still unsupported. Unit test coverage for the in-place form added too.

UrlRewrite.cc and HPACK.cc have similar byte-at-a-time loops with separate buffers — those are easy follow-up migrations if you'd like them rolled in, or I can leave them as a separate PR.

memcpy_tolower carried two warts: the "memcpy" prefix implied non-overlapping by convention with libc memcpy (we explicitly support the in-place case), and the unqualified name didn't surface the ASCII-only semantics. Rename the helper to ts::ascii::tolower_copy and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call sites that operate on a single buffer read naturally instead of passing the same pointer twice. Rename the header to include/tscore/ink_ascii_tolower.h, the unit test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match. Update the two existing call sites (URL.cc fast-path scheme/host and QPACK::_encode_header in-place name lowercasing) accordingly. No behavior change: the helper bodies are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrate two more byte-at-a-time ASCII tolower loops to ts::ascii::tolower_copy. Both call sites use a separate destination buffer, so the copy form is the right fit: - hpack_encode_header_block(): lower-cases each MIMEField name before encoding to match the HTTP/2 lowercase-header-name requirement. - UrlRewrite::_mappingLookup(): lower-cases the incoming request host into a stack buffer before the table lookup, so the lookup is case-insensitive against the lower-cased keys built at config-load time. The previous code used libc tolower(int) on signed char values, which is technically UB for bytes >= 0x80; the new call avoids that. The existing unit tests in test_URL, test_HpackIndexingTable, and test_RemapRules executed the tolower paths but only with inputs that were already lower-case, so they would have missed a "skip the lowercasing" regression. Add focused behavioral coverage: - test_URL.cc: four extra get_hash_test_cases that hash a request with uppercase/mixed-case scheme or host and require an equal hash to the lower-case form. Includes a 49-byte uppercase host that crosses both the 16- and 32-byte SIMD bodies. - test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a map for a lower-case host and requires that uppercase, mixed-case, and long-uppercase request hosts all match. - test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long mixed-case field name with hpack_encode_header_block and requires the encoded byte stream to be identical to encoding the same field with an already-lower-case name. QPACK already exercises the in-place path through its Encoding test and the helper's own ts::ascii::tolower_inplace unit test covers in-place semantics exhaustively; an additional focused QPACK test would need the external .qif fixture infrastructure, which is out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

masaori335 · 2026-05-22T02:58:04Z

I expect applying this to UrlRewrite and HPACK has more performance impact than cache-key in real traffic 👍 I can spawn benchmark in our tool and share result when tests are passed.

phongn force-pushed the simd-bulk-tolower branch from ed9a596 to d975fb2 Compare May 14, 2026 20:30

phongn force-pushed the simd-bulk-tolower branch from d975fb2 to 32016aa Compare May 14, 2026 20:56

bryancall requested a review from Copilot May 18, 2026 22:14

Copilot started reviewing on behalf of bryancall May 18, 2026 22:14 View session

bryancall assigned phongn May 18, 2026

bryancall added this to the 11.0.0 milestone May 18, 2026

bryancall added the Performance label May 18, 2026

bryancall requested a review from masaori335 May 18, 2026 22:16

Copilot AI reviewed May 18, 2026

View reviewed changes

bryancall self-requested a review May 18, 2026 22:21

masaori335 reviewed May 19, 2026

View reviewed changes

Comment thread include/tscore/ink_memcpy_tolower.h Outdated

phongn and others added 2 commits May 21, 2026 20:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167

tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
phongn wants to merge 4 commits into
apache:masterfrom
phongn:simd-bulk-tolower

phongn commented May 14, 2026 •

edited

Loading

Uh oh!

bryancall commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

phongn May 21, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phongn commented May 21, 2026

Uh oh!

masaori335 commented May 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

phongn commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation notes

Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)

Test plan

Notes for reviewers

Uh oh!

bryancall commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

phongn May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

phongn commented May 21, 2026

Uh oh!

masaori335 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

phongn commented May 14, 2026 •

edited

Loading

masaori335 commented May 22, 2026 •

edited

Loading