[Draft] Benchmarking 8-bit Tag instead of 7 #653

gaujay wants to merge 1 commit into rust-lang:master from

Conversation
```rust
special,
x86::_mm_set1_epi8(Tag::DELETED.0 as i8),
x86::_mm_and_si128(is_special, x86::_mm_set1_epi32(Tag::EMPTY32)), // EMPTY if special
x86::_mm_set1_epi32(Tag::DELETED32), // else DELETED
```
You can simplify this: the compare gives you 0 and -1, to which you can just add 127 to get EMPTY and DELETED.
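A scalar sketch of this suggestion, under the assumed tag values from the diff below (`EMPTY = 127`, `DELETED = 126`; the function name is a stand-in, not the crate's API): each lane of a SIMD compare is 0 or -1, and adding 127 maps those directly to the two control tags, removing the need for an and/select.

```rust
// Assumed 8-bit control tag values (hypothetical, taken from the diff in
// this PR): EMPTY = 127 and DELETED = EMPTY - 1 = 126.
const EMPTY: i8 = 127;
const DELETED: i8 = 126;

// Stand-in for the per-lane operation: `cmp` is a single lane of a SIMD
// compare result, i.e. 0 (special bucket) or -1 (full bucket).
// 0 + 127 = 127 = EMPTY, -1 + 127 = 126 = DELETED.
fn special_to_empty_full_to_deleted(cmp: i8) -> i8 {
    cmp.wrapping_add(127)
}

fn main() {
    assert_eq!(special_to_empty_full_to_deleted(0), EMPTY);
    assert_eq!(special_to_empty_full_to_deleted(-1), DELETED);
    println!("ok");
}
```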
```diff
 /// Control tag value for an empty bucket.
-pub(crate) const EMPTY: Tag = Tag(0b1111_1111);
+pub(crate) const EMPTY: Tag = Tag(0b0111_1111); // 127
+pub(crate) const EMPTY32: i32 = 0x7F7F7F7F;
```
There should probably be a separate ExpandedTag type for this rather than a plain i32.
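A minimal sketch of what such a newtype could look like (the `ExpandedTag` shape and `expand` method are hypothetical, not existing crate API): wrapping the byte-splatted value in its own type keeps it from being confused with an ordinary `i32`.

```rust
// Hypothetical newtype for a tag splatted into all four bytes of an i32,
// as suggested above, instead of passing a bare i32 around.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
#[repr(transparent)]
struct ExpandedTag(i32);

#[derive(Clone, Copy)]
struct Tag(u8);

impl Tag {
    /// Splat the one-byte tag into all four bytes of an i32.
    const fn expand(self) -> ExpandedTag {
        // All four bytes are identical, so endianness does not matter.
        ExpandedTag(i32::from_ne_bytes([self.0; 4]))
    }
}

fn main() {
    const EMPTY: Tag = Tag(0b0111_1111); // 127, from the diff above
    assert_eq!(EMPTY.expand(), ExpandedTag(0x7F7F7F7F));
    println!("ok");
}
```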
Passing the benchmark results through
On my system I don't see the iteration speedups. However, there is still a significant speedup for lookup-fail:
Interesting, thanks for the test run. This is a rough implementation just for benchmarking.
I started an experiment in rust-lang/rust#146909 to use your modification for the compiler's benchmark suite. This measures compilation time, which is hugely affected by hash table performance.
I tested on a Zen 2 running Linux.
Overall this seems like a slight perf loss: rust-lang/rust#146909 (comment). Note that the standard library was previously using 0.15.5, so this comparison also includes the change from #639.
With that said, it might still be worth pursuing. Perhaps reducing the size of the lookup table would help? The current table takes up a full 1KB. But then this suffers from 2 extra cycles on x86 to broadcast a

If you push any changes to this branch I can re-run the rustc perf benchmarks.
Appreciated. I also tested with an i8 table, but the results were slower due to the additional instructions in the hot path. On another note, I noticed that random seeds of the hasher can sometimes lead to bad distribution across buckets when working with serial/high-bit key values. It is quite rare (maybe 3%), but worth mentioning as it could also skew some benchmark results.
Following #635

Here is a test branch to benchmark an "8-bit Tag" implementation (starting with SSE2).

As seen in the results below, for Foldhash:

For Std:

But the more interesting part might be the iteration results. I'm not quite sure of the exact reason for it (without proper profiling), but we see a ~35% speed-up. As the new implementation relies on a static const array to map hash values to tags (which might not be hot in cache), the benchmarks should be taken with a grain of salt compared to in-production scenarios. But I feel the iteration gains deserve to be studied.
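The static-table idea described above could look roughly like this (a hypothetical sketch, not the branch's actual code: the table layout, the reserved-value remapping, and `full_tag` are all assumptions): 256 pre-expanded (byte-splatted) `i32` tags, which is the 1KB mentioned in the review, indexed by the top byte of the hash, with the two reserved control values remapped so a FULL tag never collides with EMPTY or DELETED.

```rust
// Assumed 8-bit control values from the diff in this PR.
const EMPTY_BYTE: u8 = 127;
const DELETED_BYTE: u8 = 126;

// 256 entries * 4 bytes = 1KB of pre-expanded tags.
const TAG_TABLE: [i32; 256] = {
    let mut t = [0i32; 256];
    let mut i = 0;
    while i < 256 {
        let mut b = i as u8;
        // Remap the reserved values onto other tags (arbitrary choice here)
        // so a FULL tag can never equal EMPTY or DELETED.
        if b == EMPTY_BYTE { b = 0; }
        if b == DELETED_BYTE { b = 1; }
        t[i] = i32::from_ne_bytes([b; 4]);
        i += 1;
    }
    t
};

// Hypothetical helper: derive a FULL tag from the high bits of the hash.
fn full_tag(hash: u64) -> i32 {
    TAG_TABLE[(hash >> 56) as usize]
}

fn main() {
    // No FULL tag ever equals the expanded EMPTY/DELETED values.
    for i in 0..256u64 {
        let t = full_tag(i << 56);
        assert_ne!(t, 0x7F7F7F7F);
        assert_ne!(t, 0x7E7E7E7E);
    }
    println!("ok");
}
```

The trade-off discussed in the review comments is visible here: the lookup removes per-probe bit tricks, but the 1KB table may miss in cache, and a smaller table would need extra instructions to re-expand the tag.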