Add virtual-cycles to callgrind measure #316
+ cost(100, "ll-dcache-read-misses")
+ cost(100, "ll-dcache-write-misses")
+ cost(10, "conditional-branch-misses")
+ cost(10, "indirect-branch-misses")
These numbers are not astronomically wrong, and even plausible for some generic middle-of-the-road CPU... I guess the most important thing is that we fix them once and then do comparisons based on them.
On the other hand, I do wonder if we could instead record the raw stats (that shouldn't be astronomically expensive -- 9 integers rather than 1, per run?) so that we could choose to present them differently, or give the user sliders ("perf on a big machine with out-of-order exec that hides latency" vs "perf on little embedded chip" or whatever)?
FWIW, sightglass will still report the underlying events, it will just also report virtual cycles in addition to those events, so we could always go back and recompute virtual cycles based on old data if we wanted.
I didn't want to make virtual cycles live outside of sightglass tho so that if we say "commit BLAH regressed virtual cycles" we can also provide a simple sightglass CLI command to reproduce the regression.
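To illustrate the "recompute from old data" point, here is a minimal sketch of re-weighting recorded raw event counts. It is not sightglass's actual implementation: `VISIBLE_TERMS` and `weighted_penalty` are hypothetical names, and only the four terms visible in the diff excerpt above are included. The rest of the formula is not shown in the excerpt, so it is left out rather than guessed.

```rust
use std::collections::HashMap;

/// Hypothetical weight table: (virtual cycles per event, event name).
/// Only these four terms appear in the diff excerpt; any other terms of the
/// real formula are not shown there and are deliberately omitted.
const VISIBLE_TERMS: [(u64, &str); 4] = [
    (100, "ll-dcache-read-misses"),
    (100, "ll-dcache-write-misses"),
    (10, "conditional-branch-misses"),
    (10, "indirect-branch-misses"),
];

/// Weighted sum of recorded raw event counts under a given weight table.
/// Events that are absent from `counts` contribute zero.
fn weighted_penalty(counts: &HashMap<&str, u64>, terms: &[(u64, &str)]) -> u64 {
    terms
        .iter()
        .map(|&(weight, event)| weight * counts.get(event).copied().unwrap_or(0))
        .sum()
}
```

Because the raw counts are kept, the same recorded data can be passed through a different weight table later (for example, one tuned for a small in-order core) without re-running the benchmark.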
Ah, good point; as long as we're recording and not losing them then I'm happy.
But yeah, any tweaks to the cost factors you think we should make before merging this? They aren't permanent, but it would be nice not to have to tweak them a bunch.
Honestly, they seem fine-ish. In a modern CPU, an L1 miss goes to L2 at a cost of 10-15 cycles, an LLC miss goes to DRAM at a cost of 200-300 cycles, and branch mispredicts are typically discovered 10-15-ish pipeline stages in, so these numbers are pretty close to real. The biggest gap will probably be the cache model itself (modern CPUs have L1 at 64KiB-ish, L2 at 256-512KiB-ish, and LLC at anywhere from 8-32MiB-ish; Callgrind appears to model only L1 and LLC, and I don't know what its default sizes are), plus the fact that an out-of-order CPU will often hide at least the L1 misses completely. But this will give us something better than inst count.
We fix the size, associativity, and line size of callgrind's caches here:
sightglass/crates/cli/src/benchmark.rs
Lines 29 to 31 in 1d56857
So 32KiB, 8-way associative, 64B line size L1 instruction and data caches.
8MiB, 16-way associative, 64B line size LL cache.
Let me know if you think we should change any of that.
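For reference, a rough sketch of how the cache parameters described above map onto Valgrind's `--I1`, `--D1`, and `--LL` options, each of which takes a `size,associativity,line size` triple. This is not the code at the referenced lines of `benchmark.rs`, which is not shown here; `CacheConfig` and `callgrind_cache_args` are hypothetical names used only for illustration.

```rust
/// Hypothetical description of one simulated cache level.
struct CacheConfig {
    size_bytes: u64,
    associativity: u32,
    line_size_bytes: u32,
}

impl CacheConfig {
    /// Format as Valgrind's `size,associativity,line size` triple.
    fn to_triple(&self) -> String {
        format!(
            "{},{},{}",
            self.size_bytes, self.associativity, self.line_size_bytes
        )
    }
}

/// Build Valgrind arguments that enable Callgrind's cache and branch
/// simulation and pin the simulated cache geometry. The same L1 config is
/// used for both the instruction and data caches.
fn callgrind_cache_args(l1: &CacheConfig, ll: &CacheConfig) -> Vec<String> {
    vec![
        "--tool=callgrind".to_string(),
        "--cache-sim=yes".to_string(),
        "--branch-sim=yes".to_string(),
        format!("--I1={}", l1.to_triple()),
        format!("--D1={}", l1.to_triple()),
        format!("--LL={}", ll.to_triple()),
    ]
}
```

With the values from this comment (32KiB, 8-way, 64B lines for L1; 8MiB, 16-way, 64B lines for LL), the sketch yields `--I1=32768,8,64`, `--D1=32768,8,64`, and `--LL=8388608,16,64`. Pinning the geometry explicitly keeps results comparable across machines, since the simulated caches no longer depend on the host CPU.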
Seems OK-ish, yeah; maybe double the L1 to 64KiB for a more typical modern CPU (several of the most recent Intel and AMD generations have that, at least).
@cfallin mind taking a look at this? I'm interested in whether you have opinions on the cost factors for the various cache miss and branch misprediction events.