As suggested in the last sync meeting, we should understand why some of the benchmarks regressed and why some progressed. There are several possible outcomes for each:
- The benchmark is poorly designed
- There are low-hanging fixes in CPython to reduce the regression
- We are reasonably comfortable with the regression given improvements elsewhere
As a first pass, I think we should just classify along these lines, then prioritize fixing CPython (where possible), with benchmark fixes as a lower priority.
As for the progressions, they may just be a source of WHATSNEW content.
Let's crowdsource this where possible, reporting back to the checklist below.
Using the last weekly as a guide, the statistically significant regressions are below. For longitudinal details, see the plot of benchmark performance over time below.
The most statistically significant progressions are:
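For anyone picking up one of the checklist items, here is a rough, hypothetical sketch of what "statistically significant" can mean for a benchmark delta: a two-sample t-test on raw timing samples. The timing numbers and the function name `t_statistic` are made up for illustration; the actual weekly pipeline may compute significance differently.

```python
# Hypothetical sketch: judging whether a benchmark delta is
# "statistically significant" via Welch's t-statistic on two
# independent sets of timing samples. Not the actual pipeline.
from statistics import mean, stdev
import math

def t_statistic(a, b):
    """Welch's t-statistic for two independent timing samples."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / na + vb / nb)

baseline = [10.1, 10.3, 10.2, 10.4, 10.2]   # seconds, made-up numbers
candidate = [11.0, 11.2, 10.9, 11.1, 11.3]  # made-up "regressed" run

t = t_statistic(candidate, baseline)
# A |t| well above ~2 suggests the delta is unlikely to be noise.
print(f"t = {t:.2f}")
```

The point of a test like this is to separate real regressions from run-to-run jitter before anyone spends time bisecting; a small mean delta with high variance may not be worth investigating at all.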