Skip to content

ASMap optimizations#174

Draft
l0rinc wants to merge 5 commits into
masterfrom
detached534
Draft

ASMap optimizations#174
l0rinc wants to merge 5 commits into
masterfrom
detached534

Conversation

@l0rinc
Copy link
Copy Markdown
Owner

@l0rinc l0rinc commented May 17, 2026

Motivation

ASMap lookups are on hot paths for netgrouping and addrman behavior. This PR keeps the implementation shape close to the existing code, but removes a few avoidable costs in NetGroupManager::GetMappedAS() and
the ASMap varint decoder.

Changes

  • Avoid heap allocation when building the 128-bit ASMap lookup key.
  • Reuse the GetNetClass() result instead of rechecking IPv4 classification.
  • Avoid an IPv6 temporary byte vector by copying through GetIn6Addr().
  • Simplify ASMap mantissa decoding while keeping the generic DecodeBits() path.
  • Mark DecodeBits() inline; this helps GCC substantially and is neutral for Clang.
  • Return before address classification when ASMap is disabled.

Benchmarks

Median ns/op over 10 runs, -min-time=10000, pinned with taskset -c 0.

ASMap-enabled lookup, final stack vs origin/master:

Compiler Benchmark Before After Speedup
GCC 12.2.0 ASMapGetMappedASMulti 6071.9 4014.8 +51.2%
Clang 17.0.6 ASMapGetMappedASMulti 4639.3 4180.5 +11.0%

No-ASMap path, last commit vs previous commit:

Compiler Benchmark Before After Speedup
GCC 12.2.0 AddrManAdd 132114160.6 117119345.0 +12.8%
GCC 12.2.0 AddrManAddThenGood 150168118.5 133389867.5 +12.6%
Clang 17.0.6 AddrManAdd 102457638.5 96694378.8 +6.0%
Clang 17.0.6 AddrManAddThenGood 110488499.0 103380926.0 +6.9%

@l0rinc l0rinc closed this May 17, 2026
@l0rinc l0rinc reopened this May 17, 2026
@l0rinc l0rinc closed this May 17, 2026
@l0rinc l0rinc reopened this May 17, 2026
l0rinc added 5 commits May 18, 2026 18:17
The ASMap benchmark added in PR 35285 makes GetMappedAS() measurable. perf record on ASMapGetMappedASMulti showed most time under Interpret()/DecodeBits(), but address classification and lookup-key setup were still visible.

GetMappedAS() already calls GetNetClass() before building the lookup key, and GetNetClass() classifies IPv4 by using the same HasLinkedIPv4() predicate. Reuse that net_class result for the early IPv4/IPv6 filter and for selecting the IPv4 mapping path, and use a fixed stack array for the 128-bit lookup key instead of allocating a 16-byte vector on every lookup.

Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000.

GCC 12.2.0, previous commit -> this commit:
ASMapGetMappedASMulti:       6071.9 -> 5875.7  saved 196.1 ns/op, +3.3% faster
ASMapGetMappedASCloudflarev4: 6730.7 -> 6585.8  saved 144.9 ns/op, +2.2% faster
ASMapGetMappedASCloudflarev6: 5328.0 -> 5087.2  saved 240.9 ns/op, +4.7% faster
ASMapGetMappedASQuad9v6:      4099.8 -> 3866.8  saved 233.0 ns/op, +6.0% faster
ASMapGetMappedASUnmappedv6:   1095.4 -> 854.4   saved 241.0 ns/op, +28.2% faster

Clang 17.0.6, previous commit -> this commit:
ASMapGetMappedASMulti:       4639.3 -> 4492.5  saved 146.8 ns/op, +3.3% faster
ASMapGetMappedASCloudflarev4: 5048.3 -> 4937.3  saved 111.0 ns/op, +2.2% faster
ASMapGetMappedASCloudflarev6: 3940.9 -> 3806.9  saved 134.0 ns/op, +3.5% faster
ASMapGetMappedASQuad9v6:      3013.9 -> 2881.3  saved 132.5 ns/op, +4.6% faster
ASMapGetMappedASUnmappedv6:   693.4 -> 567.8    saved 125.6 ns/op, +22.1% faster
For IPv6 ASMap lookups, GetAddrBytes() builds a temporary vector even though GetMappedAS() only needs the fixed 16-byte IPv6 address.

Copy the address through GetIn6Addr() into the existing stack lookup key instead. This keeps the same IPv6-only assertion while avoiding the temporary byte vector.

Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000.

GCC 12.2.0, previous commit -> this commit:
ASMapGetMappedASMulti:       5875.7 -> 5817.7  saved 58.0 ns/op, +1.0% faster
ASMapGetMappedASCloudflarev6: 5087.2 -> 5000.6  saved 86.5 ns/op, +1.7% faster
ASMapGetMappedASGooglev6:     5251.3 -> 5159.3  saved 92.0 ns/op, +1.8% faster
ASMapGetMappedASQuad9v6:      3866.8 -> 3785.4  saved 81.4 ns/op, +2.2% faster
ASMapGetMappedASUnmappedv6:   854.4 -> 769.8    saved 84.6 ns/op, +11.0% faster

Clang 17.0.6, previous commit -> this commit:
ASMapGetMappedASMulti:       4492.5 -> 4425.4  saved 67.1 ns/op, +1.5% faster
ASMapGetMappedASCloudflarev6: 3806.9 -> 3746.4  saved 60.5 ns/op, +1.6% faster
ASMapGetMappedASGooglev6:     3857.1 -> 3799.2  saved 57.9 ns/op, +1.5% faster
ASMapGetMappedASQuad9v6:      2881.3 -> 2810.4  saved 70.9 ns/op, +2.5% faster
ASMapGetMappedASUnmappedv6:   567.8 -> 506.6    saved 61.2 ns/op, +12.1% faster
DecodeBits() consumes an operand one bit at a time in big-endian order, but the old mantissa loop used a variable shift for every bit and added into the final value directly.

Build the mantissa by shifting the accumulated value left and OR-ing the next consumed bit. This preserves the existing bit order and EOF checks without introducing a separate packed-bit reader.

DecodeBits() itself is an anonymous-namespace helper, so it is covered indirectly through Interpret(), SanityCheckAsmap(), CheckStandardAsmap(), GetMappedAS(), and the ASMap fuzz targets rather than by direct unit tests. Keep this as the smallest mechanical change to the existing decoder.

Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000.

GCC 12.2.0, previous commit -> this commit:
ASMapGetMappedASMulti:       5817.7 -> 5622.1  saved 195.6 ns/op, +3.5% faster
ASMapGetMappedASCloudflarev4: 6591.4 -> 6487.4  saved 104.0 ns/op, +1.6% faster
ASMapGetMappedASCloudflarev6: 5000.6 -> 4912.0  saved 88.6 ns/op, +1.8% faster
ASMapGetMappedASGooglev4:     6805.4 -> 6647.2  saved 158.2 ns/op, +2.4% faster
ASMapGetMappedASQuad9v4:      4785.5 -> 4644.4  saved 141.2 ns/op, +3.0% faster

Clang 17.0.6, previous commit -> this commit:
ASMapGetMappedASMulti:       4425.4 -> 4176.2  saved 249.1 ns/op, +6.0% faster
ASMapGetMappedASCloudflarev4: 4936.1 -> 4699.0  saved 237.0 ns/op, +5.0% faster
ASMapGetMappedASCloudflarev6: 3746.4 -> 3550.5  saved 196.0 ns/op, +5.5% faster
ASMapGetMappedASGooglev4:     5093.1 -> 4823.7  saved 269.4 ns/op, +5.6% faster
ASMapGetMappedASQuad9v4:      3633.4 -> 3449.4  saved 184.1 ns/op, +5.3% faster
With the mantissa loop simplified, DecodeBits() is small enough that asking for inlining gives GCC better visibility into the fixed operand tables used by DecodeType(), DecodeASN(), DecodeMatch(), and DecodeJump(). Keep the generic DecodeType() path: after the mantissa and inlining changes, it benchmarks faster than the custom direct opcode decoder while preserving the smaller diff.

Some IDEs mark this inline keyword redundant, and Clang 17 is indeed effectively neutral here. GCC 12 does not treat it as redundant in this build: the explicit hint substantially improves the ASMap benchmarks. A forced ALWAYS_INLINE version was measured separately and was slower than the plain inline hint, so keep the weaker standard keyword.

Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000.

GCC 12.2.0, previous commit -> this commit:
ASMapGetMappedASMulti:       5622.1 -> 4014.8  saved 1607.4 ns/op, +40.0% faster
ASMapGetMappedASCloudflarev4: 6487.4 -> 4564.4  saved 1923.0 ns/op, +42.1% faster
ASMapGetMappedASCloudflarev6: 4912.0 -> 3435.1  saved 1477.0 ns/op, +43.0% faster
ASMapGetMappedASGooglev4:     6647.2 -> 4742.7  saved 1904.5 ns/op, +40.2% faster
ASMapGetMappedASQuad9v6:      3714.9 -> 2606.8  saved 1108.1 ns/op, +42.5% faster
ASMapGetMappedASUnmappedv6:   771.1 -> 609.2    saved 161.9 ns/op, +26.6% faster

Clang 17.0.6, previous commit -> this commit:
ASMapGetMappedASMulti:       4176.2 -> 4180.5  changed -4.3 ns/op, -0.1% slower
ASMapGetMappedASCloudflarev4: 4699.0 -> 4695.6  saved 3.5 ns/op, +0.1% faster
ASMapGetMappedASGooglev6:     3639.3 -> 3608.7  saved 30.6 ns/op, +0.8% faster
ASMapGetMappedASQuad9v6:      2670.5 -> 2704.1  changed -33.6 ns/op, -1.2% slower
ASMapGetMappedASUnmappedv6:   482.3 -> 482.4    changed -0.1 ns/op, -0.0% slower
GetMappedAS() first checks whether an ASMap is configured. When it is not, there is no need to classify the address before returning the reserved unmapped ASN value.

This keeps the ASMap-enabled path unchanged except for the already-predictable non-empty check, while making the default no-ASMap path cheaper for callers such as AddrMan that repeatedly ask for netgroups.

Benchmarked with GCC 12.2.0 and Clang 17.0.6. Numbers are median ns/op over 10 runs with -min-time=10000.

GCC 12.2.0, previous commit -> this commit:

AddrManAdd:         132114160.6 -> 117119345.0  saved 14994815.6 ns/op, +12.8% faster

AddrManAddThenGood: 150168118.5 -> 133389867.5  saved 16778251.0 ns/op, +12.6% faster

Clang 17.0.6, previous commit -> this commit:

AddrManAdd:         102457638.5 -> 96694378.8   saved 5763259.8 ns/op, +6.0% faster

AddrManAddThenGood: 110488499.0 -> 103380926.0  saved 7107573.0 ns/op, +6.9% faster
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant