Add virtual-cycles to callgrind measure #316

Merged
fitzgen merged 3 commits into bytecodealliance:main from fitzgen:callgrind-virtual-cycles
Jun 11, 2026

@fitzgen fitzgen commented Jun 10, 2026

Member

@cfallin mind taking a look at this? I'm interested in whether you have opinions on the cost factors for the various cache-miss and branch-misprediction events.

@fitzgen fitzgen force-pushed the callgrind-virtual-cycles branch from 4a16cf2 to 86ed210 Compare June 10, 2026 22:04
@fitzgen fitzgen force-pushed the callgrind-virtual-cycles branch from 86ed210 to 55a8cf0 Compare June 10, 2026 22:17
@fitzgen fitzgen requested a review from cfallin June 10, 2026 22:17
+ cost(100, "ll-dcache-read-misses")
+ cost(100, "ll-dcache-write-misses")
+ cost(10, "conditional-branch-misses")
+ cost(10, "indirect-branch-misses")

Member

These numbers are not astronomically wrong, and even plausible for some generic middle-of-the-road CPU... I guess the most important thing is that we fix them once and then do comparisons based on them.

On the other hand, I do wonder if we could instead record the raw stats (that shouldn't be astronomically expensive -- 9 integers rather than 1, per run?) so that we could choose to present them differently, or give the user sliders ("perf on a big machine with out-of-order exec that hides latency" vs "perf on little embedded chip" or whatever)?

Member Author

FWIW, sightglass will still report the underlying events, it will just also report virtual cycles in addition to those events, so we could always go back and recompute virtual cycles based on old data if we wanted.

I didn't want virtual cycles to live outside of sightglass, though, so that if we say "commit BLAH regressed virtual cycles" we can also provide a simple sightglass CLI command to reproduce the regression.
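To illustrate the point about recomputing virtual cycles from recorded events, here is a minimal sketch of a weighted sum over raw event counts. It assumes nothing about sightglass's real code: the `Cost` struct, `VISIBLE_COSTS`, and `partial_virtual_cycles` are hypothetical names, and only the four cost factors visible in the diff excerpt are included. The remaining terms of the actual formula are not shown in this conversation, so the result is a partial sum, not the metric sightglass reports.

```rust
/// One weighted term: `weight` virtual cycles per occurrence of `event`.
/// (Hypothetical type, for illustration only.)
struct Cost {
    weight: u64,
    event: &'static str,
}

/// Only the four factors visible in the diff excerpt. The other terms of
/// the real formula are not shown there and are deliberately left out.
const VISIBLE_COSTS: &[Cost] = &[
    Cost { weight: 100, event: "ll-dcache-read-misses" },
    Cost { weight: 100, event: "ll-dcache-write-misses" },
    Cost { weight: 10, event: "conditional-branch-misses" },
    Cost { weight: 10, event: "indirect-branch-misses" },
];

/// Sum `weight * count` over every cost term, treating events that were
/// not recorded as having a count of zero.
fn partial_virtual_cycles(events: &[(&str, u64)], costs: &[Cost]) -> u64 {
    costs
        .iter()
        .map(|cost| {
            let count = events
                .iter()
                .find(|(name, _)| *name == cost.event)
                .map(|(_, n)| *n)
                .unwrap_or(0);
            cost.weight * count
        })
        .sum()
}
```

Because the raw counts are kept, the same function can be called again on old data with a different `costs` table, which is what makes re-weighting after the fact possible.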

Member

Ah, good point; as long as we're recording and not losing them then I'm happy.

Member Author

But yeah, any tweaks to the cost factors you think we should make before merging this? They aren't permanent, but it would be nice not to have to tweak them a bunch.

Member

Honestly, they seem fine-ish. On a modern CPU, an L1 miss goes to L2 at a cost of 10-15 cycles, an LLC miss goes to DRAM at a cost of 200-300 cycles, and branch mispredicts are typically discovered 10-15-ish pipeline stages in, so these numbers are pretty close to real. The biggest gaps will probably be the cache model itself (modern CPUs have L1 at 64KiB-ish, L2 at 256-512KiB-ish, and LLC at anywhere from 8-32MiB-ish; Callgrind appears to model only L1 and LLC, and I don't know what its default sizes are) and the fact that an out-of-order CPU will often hide at least the L1 misses completely. But this will give us something better than inst count.

Member Author

We fix the size, associativity, and line size of callgrind's caches here:

const CACHE_MODEL_I1: &str = "32768,8,64";
const CACHE_MODEL_D1: &str = "32768,8,64";
const CACHE_MODEL_LL: &str = "8388608,16,64";

So 32KiB, 8-way associative, 64B line size L1 instruction and data caches.

8MiB, 16-way associative, 64B line size LL cache.

Let me know if you think we should change any of that.
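As a rough illustration of how the `size,associativity,line size` strings above can be read, the sketch below parses one and derives the number of cache sets. The `CacheGeometry` type and its methods are hypothetical helpers written for this example, not part of sightglass or Callgrind. The three constants are copied from the comment above.

```rust
const CACHE_MODEL_I1: &str = "32768,8,64";
const CACHE_MODEL_D1: &str = "32768,8,64";
const CACHE_MODEL_LL: &str = "8388608,16,64";

/// Parsed form of a "size,associativity,line size" string.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct CacheGeometry {
    size_bytes: u64,
    associativity: u64,
    line_bytes: u64,
}

impl CacheGeometry {
    /// Parse exactly three comma-separated integers; return `None` for
    /// anything else, including zero associativity or line size.
    fn parse(spec: &str) -> Option<Self> {
        let mut parts = spec.split(',').map(|p| p.trim().parse::<u64>().ok());
        let geometry = CacheGeometry {
            size_bytes: parts.next()??,
            associativity: parts.next()??,
            line_bytes: parts.next()??,
        };
        if parts.next().is_some() || geometry.associativity == 0 || geometry.line_bytes == 0 {
            return None;
        }
        Some(geometry)
    }

    /// Number of sets = total size / (ways * bytes per line).
    fn sets(&self) -> u64 {
        self.size_bytes / (self.associativity * self.line_bytes)
    }
}
```

With these values, each L1 cache has 32768 / (8 × 64) = 64 sets and the LL cache has 8388608 / (16 × 64) = 8192 sets. Doubling L1 to 64KiB at the same associativity and line size would give 128 sets.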

Member

Seems OK-ish, yeah; maybe double the L1 to 64KiB for a more typical modern CPU (at least several of the most recent generations of Intel and AMD are like that).

@fitzgen fitzgen merged commit 1027e61 into bytecodealliance:main Jun 11, 2026
16 checks passed
@fitzgen fitzgen deleted the callgrind-virtual-cycles branch June 11, 2026 21:26