ASMap optimizations#174
Draft
l0rinc wants to merge 5 commits into
Draft
Conversation
The ASMap benchmark added in PR 35285 makes GetMappedAS() measurable. perf record on ASMapGetMappedASMulti showed most time under Interpret()/DecodeBits(), but address classification and lookup-key setup were still visible. GetMappedAS() already calls GetNetClass() before building the lookup key, and GetNetClass() classifies IPv4 by using the same HasLinkedIPv4() predicate. Reuse that net_class result for the early IPv4/IPv6 filter and for selecting the IPv4 mapping path, and use a fixed stack array for the 128-bit lookup key instead of allocating a 16-byte vector on every lookup. Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000. GCC 12.2.0, previous commit -> this commit: ASMapGetMappedASMulti: 6071.9 -> 5875.7 saved 196.1 ns/op, +3.3% faster ASMapGetMappedASCloudflarev4: 6730.7 -> 6585.8 saved 144.9 ns/op, +2.2% faster ASMapGetMappedASCloudflarev6: 5328.0 -> 5087.2 saved 240.9 ns/op, +4.7% faster ASMapGetMappedASQuad9v6: 4099.8 -> 3866.8 saved 233.0 ns/op, +6.0% faster ASMapGetMappedASUnmappedv6: 1095.4 -> 854.4 saved 241.0 ns/op, +28.2% faster Clang 17.0.6, previous commit -> this commit: ASMapGetMappedASMulti: 4639.3 -> 4492.5 saved 146.8 ns/op, +3.3% faster ASMapGetMappedASCloudflarev4: 5048.3 -> 4937.3 saved 111.0 ns/op, +2.2% faster ASMapGetMappedASCloudflarev6: 3940.9 -> 3806.9 saved 134.0 ns/op, +3.5% faster ASMapGetMappedASQuad9v6: 3013.9 -> 2881.3 saved 132.5 ns/op, +4.6% faster ASMapGetMappedASUnmappedv6: 693.4 -> 567.8 saved 125.6 ns/op, +22.1% faster
For IPv6 ASMap lookups, GetAddrBytes() builds a temporary vector even though GetMappedAS() only needs the fixed 16-byte IPv6 address. Copy the address through GetIn6Addr() into the existing stack lookup key instead. This keeps the same IPv6-only assertion while avoiding the temporary byte vector. Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000. GCC 12.2.0, previous commit -> this commit: ASMapGetMappedASMulti: 5875.7 -> 5817.7 saved 58.0 ns/op, +1.0% faster ASMapGetMappedASCloudflarev6: 5087.2 -> 5000.6 saved 86.5 ns/op, +1.7% faster ASMapGetMappedASGooglev6: 5251.3 -> 5159.3 saved 92.0 ns/op, +1.8% faster ASMapGetMappedASQuad9v6: 3866.8 -> 3785.4 saved 81.4 ns/op, +2.2% faster ASMapGetMappedASUnmappedv6: 854.4 -> 769.8 saved 84.6 ns/op, +11.0% faster Clang 17.0.6, previous commit -> this commit: ASMapGetMappedASMulti: 4492.5 -> 4425.4 saved 67.1 ns/op, +1.5% faster ASMapGetMappedASCloudflarev6: 3806.9 -> 3746.4 saved 60.5 ns/op, +1.6% faster ASMapGetMappedASGooglev6: 3857.1 -> 3799.2 saved 57.9 ns/op, +1.5% faster ASMapGetMappedASQuad9v6: 2881.3 -> 2810.4 saved 70.9 ns/op, +2.5% faster ASMapGetMappedASUnmappedv6: 567.8 -> 506.6 saved 61.2 ns/op, +12.1% faster
DecodeBits() consumes an operand one bit at a time in big-endian order, but the old mantissa loop used a variable shift for every bit and added into the final value directly. Build the mantissa by shifting the accumulated value left and OR-ing the next consumed bit. This preserves the existing bit order and EOF checks without introducing a separate packed-bit reader. DecodeBits() itself is an anonymous-namespace helper, so it is covered indirectly through Interpret(), SanityCheckAsmap(), CheckStandardAsmap(), GetMappedAS(), and the ASMap fuzz targets rather than by direct unit tests. Keep this as the smallest mechanical change to the existing decoder. Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000. GCC 12.2.0, previous commit -> this commit: ASMapGetMappedASMulti: 5817.7 -> 5622.1 saved 195.6 ns/op, +3.5% faster ASMapGetMappedASCloudflarev4: 6591.4 -> 6487.4 saved 104.0 ns/op, +1.6% faster ASMapGetMappedASCloudflarev6: 5000.6 -> 4912.0 saved 88.6 ns/op, +1.8% faster ASMapGetMappedASGooglev4: 6805.4 -> 6647.2 saved 158.2 ns/op, +2.4% faster ASMapGetMappedASQuad9v4: 4785.5 -> 4644.4 saved 141.2 ns/op, +3.0% faster Clang 17.0.6, previous commit -> this commit: ASMapGetMappedASMulti: 4425.4 -> 4176.2 saved 249.1 ns/op, +6.0% faster ASMapGetMappedASCloudflarev4: 4936.1 -> 4699.0 saved 237.0 ns/op, +5.0% faster ASMapGetMappedASCloudflarev6: 3746.4 -> 3550.5 saved 196.0 ns/op, +5.5% faster ASMapGetMappedASGooglev4: 5093.1 -> 4823.7 saved 269.4 ns/op, +5.6% faster ASMapGetMappedASQuad9v4: 3633.4 -> 3449.4 saved 184.1 ns/op, +5.3% faster
With the mantissa loop simplified, DecodeBits() is small enough that asking for inlining gives GCC better visibility into the fixed operand tables used by DecodeType(), DecodeASN(), DecodeMatch(), and DecodeJump(). Keep the generic DecodeType() path: after the mantissa and inlining changes, it benchmarks faster than the custom direct opcode decoder while preserving the smaller diff. Some IDEs mark this inline keyword redundant, and Clang 17 is indeed effectively neutral here. GCC 12 does not treat it as redundant in this build: the explicit hint substantially improves the ASMap benchmarks. A forced ALWAYS_INLINE version was measured separately and was slower than the plain inline hint, so keep the weaker standard keyword. Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000. GCC 12.2.0, previous commit -> this commit: ASMapGetMappedASMulti: 5622.1 -> 4014.8 saved 1607.4 ns/op, +40.0% faster ASMapGetMappedASCloudflarev4: 6487.4 -> 4564.4 saved 1923.0 ns/op, +42.1% faster ASMapGetMappedASCloudflarev6: 4912.0 -> 3435.1 saved 1477.0 ns/op, +43.0% faster ASMapGetMappedASGooglev4: 6647.2 -> 4742.7 saved 1904.5 ns/op, +40.2% faster ASMapGetMappedASQuad9v6: 3714.9 -> 2606.8 saved 1108.1 ns/op, +42.5% faster ASMapGetMappedASUnmappedv6: 771.1 -> 609.2 saved 161.9 ns/op, +26.6% faster Clang 17.0.6, previous commit -> this commit: ASMapGetMappedASMulti: 4176.2 -> 4180.5 changed -4.3 ns/op, -0.1% slower ASMapGetMappedASCloudflarev4: 4699.0 -> 4695.6 saved 3.5 ns/op, +0.1% faster ASMapGetMappedASGooglev6: 3639.3 -> 3608.7 saved 30.6 ns/op, +0.8% faster ASMapGetMappedASQuad9v6: 2670.5 -> 2704.1 changed -33.6 ns/op, -1.2% slower ASMapGetMappedASUnmappedv6: 482.3 -> 482.4 changed -0.1 ns/op, -0.0% slower
GetMappedAS() first checks whether an ASMap is configured. When it is not, there is no need to classify the address before returning the reserved unmapped ASN value. This keeps the ASMap-enabled path unchanged except for the already-predictable non-empty check, while making the default no-ASMap path cheaper for callers such as AddrMan that repeatedly ask for netgroups. Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000. GCC 12.2.0, previous commit -> this commit: AddrManAdd: 132114160.6 -> 117119345.0 saved 14994815.6 ns/op, +12.8% faster AddrManAddThenGood: 150168118.5 -> 133389867.5 saved 16778251.0 ns/op, +12.6% faster Clang 17.0.6, previous commit -> this commit: AddrManAdd: 102457638.5 -> 96694378.8 saved 5763259.8 ns/op, +6.0% faster AddrManAddThenGood: 110488499.0 -> 103380926.0 saved 7107573.0 ns/op, +6.9% faster
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
ASMap lookups are on hot paths for netgrouping and addrman behavior. This PR keeps the implementation shape close to the existing code, but removes a few avoidable costs in
NetGroupManager::GetMappedAS()andthe ASMap varint decoder.
Changes
GetNetClass()result instead of rechecking IPv4 classification.GetIn6Addr().DecodeBits()path.DecodeBits()inline; this helps GCC substantially and is neutral for Clang.Benchmarks
Median ns/op over 10 runs,
-min-time=10000, pinned withtaskset -c 0.ASMap-enabled lookup, final stack vs
origin/master:ASMapGetMappedASMultiASMapGetMappedASMultiNo-ASMap path, last commit vs previous commit:
AddrManAddAddrManAddThenGoodAddrManAddAddrManAddThenGood