tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167
tscore: SIMD ts::memcpy_tolower; use in URL.cc cache-key fast path#13167phongn wants to merge 4 commits into
Conversation
The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup. Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade. Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged. Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change. Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns: size scalar SSE2 AVX2 AVX-512BW ---- ------ ---- ---- --------- 16 B 10.4 2.15 1.75 1.98 32 B 15.4 2.90 2.24 2.31 64 B 28.0 4.43 2.85 2.61 256 B 113 13.87 7.57 6.20 1024 B 425 50.47 24.23 17.49 Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x. A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte- value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@phongn Should we do this for in place too? |
There was a problem hiding this comment.
Pull request overview
This PR adds a header-only SIMD ASCII lowercase-copy helper in tscore and switches the URL cache-key fast path to use it instead of the local scalar loop.
Changes:
- Adds
ts::memcpy_tolowerwith scalar, SSE2, AVX2, AVX-512BW, and NEON paths. - Replaces URL fast-path scheme/host lowercasing with the shared helper.
- Adds an optional Catch2 benchmark/correctness harness for the helper.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
include/tscore/ink_memcpy_tolower.h |
Defines the new SIMD/scalar lowercase-copy helper. |
src/proxy/hdrs/URL.cc |
Uses ts::memcpy_tolower in cache-key fast-path canonicalization. |
tools/benchmark/benchmark_memcpy_tolower.cc |
Adds benchmark and correctness checks for the helper. |
tools/benchmark/CMakeLists.txt |
Builds the new benchmark target when benchmarks are enabled. |
| Implementation note: the bodies are stacked widest-first and each | ||
| drains its block size before falling through to the next. A build | ||
| with AVX-512BW gets the 64-byte body as the main loop, then at most | ||
| one 32-byte AVX2 iteration and one 16-byte SSE2 iteration to drain | ||
| the remainder before the scalar tail handles 0-15 bytes. Builds | ||
| without the wider ISAs simply skip those blocks. Selection is purely |
There was a problem hiding this comment.
Good catch — comment was stale from an earlier iteration that used a cascade for the tail. Updated in faeb167 to describe what the code actually does: AVX-512BW gates at n >= 64, runs a 64-byte main loop, finishes any 1..63-byte tail with a single masked load/store, then early-returns. The AVX2 + SSE2 cascade only runs for n < 64, where the masked-tail setup cost (~7 ns) would otherwise lose to the narrower paths.
- Move ts::memcpy_tolower correctness coverage out of the ENABLE_BENCHMARKS-gated benchmark and into a new src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises the scalar and SIMD paths in every build. Covers boundary sizes bracketing each SIMD body width, the exhaustive 0..255 byte-value sweep, and the in-place (dst == src) form (Copilot). - Fix the implementation-note comment on ts::memcpy_tolower to describe the actual AVX-512BW control flow (gated main loop + masked-tail load/store + early return), and document that in-place (dst == src) is supported on every path (Copilot). - Add a Catch::Benchmark::keep_memory barrier in benchmark_memcpy_tolower so the compiler can no longer DCE the inlined stores past the first observed byte (Copilot). - Migrate the in-place tolower loop in src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower, demonstrating the in-place contract (bryancall). - Add Tony Finch's copytolower64.c attribution to NOTICE (masaori335). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Yes — migrated the in-place tolower loop in
|
memcpy_tolower carried two warts: the "memcpy" prefix implied non-overlapping by convention with libc memcpy (we explicitly support the in-place case), and the unqualified name didn't surface the ASCII-only semantics. Rename the helper to ts::ascii::tolower_copy and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call sites that operate on a single buffer read naturally instead of passing the same pointer twice. Rename the header to include/tscore/ink_ascii_tolower.h, the unit test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match. Update the two existing call sites (URL.cc fast-path scheme/host and QPACK::_encode_header in-place name lowercasing) accordingly. No behavior change: the helper bodies are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate two more byte-at-a-time ASCII tolower loops to ts::ascii::tolower_copy. Both call sites use a separate destination buffer, so the copy form is the right fit: - hpack_encode_header_block(): lower-cases each MIMEField name before encoding to match the HTTP/2 lowercase-header-name requirement. - UrlRewrite::_mappingLookup(): lower-cases the incoming request host into a stack buffer before the table lookup, so the lookup is case-insensitive against the lower-cased keys built at config-load time. The previous code used libc tolower(int) on signed char values, which is technically UB for bytes >= 0x80; the new call avoids that. The existing unit tests in test_URL, test_HpackIndexingTable, and test_RemapRules executed the tolower paths but only with inputs that were already lower-case, so they would have missed a "skip the lowercasing" regression. Add focused behavioral coverage: - test_URL.cc: four extra get_hash_test_cases that hash a request with uppercase/mixed-case scheme or host and require an equal hash to the lower-case form. Includes a 49-byte uppercase host that crosses both the 16- and 32-byte SIMD bodies. - test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a map for a lower-case host and requires that uppercase, mixed-case, and long-uppercase request hosts all match. - test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long mixed-case field name with hpack_encode_header_block and requires the encoded byte stream to be identical to encoding the same field with an already-lower-case name. QPACK already exercises the in-place path through its Encoding test and the helper's own ts::ascii::tolower_inplace unit test covers in-place semantics exhaustively; an additional focused QPACK test would need the external .qif fixture infrastructure, which is out of scope here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I expect applying this to |
Summary
Add a SIMD-accelerated bulk ASCII tolower helper
ts::memcpy_tolowerintscore, and use it in place of the byte-at-a-time loop on the URL canonicalization fast path that produces the cache-key digest. Header-only helper with a compile-time ISA cascade: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up. Operators get the wider path automatically by raising-march(x86-64-v3= AVX2,x86-64-v4= AVX-512BW); a stock x86_64 build keeps SSE2.Behavior matches
ParseRules::ink_tolowerexactly: bytes inA..Zmap toa..z, all others (including0x80..0xFF) pass through unchanged.Implementation notes
_mm512_mask_add_epi8to fuse the conditional+0x20into a single op, and a masked load/store for the 1–63-byte tail in one SIMD pass. Inspired by Tony Finch's copytolower64.c.src/proxy/hdrs/URL.ccdrops its static-inlinememcpy_tolowerand callsts::memcpy_tolowerinstead.Performance — measured on Xeon Gold 6338 (Ice Lake, 2.0 GHz)
Mean ns per call from
tools/benchmark/benchmark_memcpy_tolower:-mavx2)-mavx512bw)Speedup vs scalar at 1024 B: SSE2 8.4×, AVX2 17.5×, AVX-512BW 24.3×.
URL hot path inputs: HTTP schemes ("http"/"https") are 4–5 bytes and stay on the scalar tail with no change. Typical host names (16+ bytes) get the full 4–14× speedup depending on build flags.
Test plan
tools/benchmark/benchmark_memcpy_tolowerruns 269 correctness assertions covering:A..Zare remapped — guards against any future widening of the case-fold range.-mavx2-mavx2,-mavx512bw, and the defaultcmake --build build -t formatclean.src/proxy/hdrs/libhdrs.abuilds clean with the updated URL.cc.Notes for reviewers
-march=x86-64-v3or-march=x86-64-v4.<immintrin.h>/<arm_neon.h>only inside the#ifthat needs them, so other architectures don't pull them in.HPACK.cc,QPACK.cc,UrlRewrite.cc) could also benefit; left untouched here to keep this PR focused.🤖 Generated with Claude Code