Drop (leading) windows in MSM where all coefficients zero #479
For scalar field elements modulo p, if all coefficients `a_i` are smaller than some value x < p, their binary representations will have leading zero bits. In this case, we can skip processing the windows corresponding to these leading zeros in the bucket calculation since they would not contribute to the final result. This follows the same idea as in the implementation: privacy-ethereum/halo2curves#168
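As a quick sketch of the idea (illustrative Python, not Constantine's Nim code; the field size and window width are example values): OR-ing all coefficients together exposes the highest set bit, and every Pippenger window above it holds only zero digits and can be dropped.

```python
# Sketch: how leading-zero coefficients shrink the Pippenger window count.
# Illustrative only -- names and parameters are not Constantine's.

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def window_count(coefs: list[int], bits: int, c: int) -> int:
    """Windows needed when digits above the highest set bit are skipped."""
    acc = 0
    for a in coefs:
        acc |= a                      # OR all coefficients together
    msb = acc.bit_length()            # bit length of the OR; 0 if all zero
    if msb == 0:
        return 0                      # all coefficients zero: nothing to do
    return min(ceil_div(msb, c), ceil_div(bits, c))

# Example: a 254-bit field with window size c = 14 needs 19 windows in
# general, but coefficients below 2**100 only need 8.
print(window_count([1 << 99, 3], bits=254, c=14))   # 8
print(window_count([1 << 253], bits=254, c=14))     # 19
```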
**Flowchart**

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["multiScalarMul_vartime / multiScalarMul_reference_vartime\n(public API — useZeroWindows: bool)"] --> B{useZeroWindows?}
    B -- "true\n(original behaviour)" --> C["Use full window count\nnumWindows = ceilDiv(bits, c)\ntop = bits - excess"]
    B -- "false\n(new optimisation)" --> D["determineBitsSet\nOR all coefs → find highest set bit (msb)"]
    D --> E{msb < 0?}
    E -- yes --> F["Return early\n(all coefs zero)"]
    E -- no --> G{msb < bits - excess?}
    G -- "no\n(no savings)" --> C
    G -- "yes\n(leading zeros exist)" --> H["Reduce window count\nnumWindows = ceilDiv(msb, c)\ntop = ceilDiv(msb,c) × c ⚠️\nexcess = 0"]
    C --> I["Run Pippenger bucket accumulation\nover all windows"]
    H --> I
    I --> J["Final Horner accumulation\nr = Σ miniMSM[w] × 2^(w·c)"]
```
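The control flow above can be checked end to end with a toy model (hedged: plain integers stand in for the elliptic-curve group, the bucket phase is collapsed into a direct digit-times-point product, and names like `msm_reference` and `use_zero_windows` are illustrative, not Constantine's):

```python
# Toy model of the flowchart: early return on all-zero coefficients,
# optional window-count reduction, then Horner accumulation of windows.

def msm_reference(coefs, points, bits, c, use_zero_windows=True):
    acc = 0
    for a in coefs:
        acc |= a                       # "determineBitsSet": OR all coefficients
    msb = acc.bit_length() - 1         # highest set bit, -1 if all zero
    if msb < 0:
        return 0                       # early return: all coefficients zero
    num_windows = (bits + c - 1) // c  # full window count: ceilDiv(bits, c)
    if not use_zero_windows:
        # drop windows above the highest set bit (kept windows cover 0..msb)
        num_windows = min(num_windows, (msb + c) // c)
    r = 0
    for w in reversed(range(num_windows)):
        r <<= c                        # Horner: r = r*2^c + miniMSM[w]
        for a, p in zip(coefs, points):
            digit = (a >> (w * c)) & ((1 << c) - 1)
            r += digit * p             # bucket phase collapsed for the toy
    return r

# Both paths agree with the direct sum-of-products.
coefs, points = [5, 1 << 90, 123456], [7, 11, 13]
expect = sum(a * p for a, p in zip(coefs, points))
assert msm_reference(coefs, points, 254, 14, True) == expect
assert msm_reference(coefs, points, 254, 14, False) == expect
```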
```diff
     points: ptr UncheckedArray[ECaff],
-    len: int) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
+    len: int,
+    useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
   ## Multiscalar multiplication:
   ## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
-  msm_dispatch_vartime(r, coefs, points, len)
+  msm_dispatch_vartime(r, coefs, points, len, useZeroWindows)

 func multiScalarMul_vartime*[bits: static int, EC, ECaff](
     r: var EC,
     coefs: openArray[BigInt[bits]],
-    points: openArray[ECaff]) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
+    points: openArray[ECaff],
+    useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
   ## Multiscalar multiplication:
   ## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
   debug: doAssert coefs.len == points.len
   let N = points.len
-  msm_dispatch_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N)
+  msm_dispatch_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N, useZeroWindows)

 func multiScalarMul_vartime*[F, EC, ECaff](
     r: var EC,
     coefs: ptr UncheckedArray[F],
     points: ptr UncheckedArray[ECaff],
-    len: int) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
+    len: int,
+    useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
   ## Multiscalar multiplication:
   ## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
   let n = cast[int](len)
   let coefs_big = allocHeapArrayAligned(F.getBigInt(), n, alignment = 64)
   coefs_big.batchFromField(coefs, n)
-  r.multiScalarMul_vartime(coefs_big, points, n)
+  r.multiScalarMul_vartime(coefs_big, points, n, useZeroWindows)
   freeHeapAligned(coefs_big)

 func multiScalarMul_vartime*[EC, ECaff](
     r: var EC,
     coefs: openArray[Fr],
-    points: openArray[ECaff]) {.tags:[VarTime, Alloca, HeapAlloc], inline.} =
+    points: openArray[ECaff],
+    useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
   ## Multiscalar multiplication:
   ## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
   debug: doAssert coefs.len == points.len
   let N = points.len
-  multiScalarMul_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N)
+  multiScalarMul_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N, useZeroWindows)
```
**Breaking public API change: `useZeroWindows` has no default**

Every public overload of `multiScalarMul_vartime` and `multiScalarMul_reference_vartime` now requires a mandatory `useZeroWindows: bool` argument. This breaks all existing call sites in the codebase (and in downstream consumers). The PR description acknowledges CI is broken for this reason.

The conventional way to introduce this without a breaking change is to give the parameter a sensible default, `true` to preserve the existing behaviour (process all windows):

```nim
func multiScalarMul_vartime*[bits: static int, EC, ECaff](
       r: var EC,
       coefs: openArray[BigInt[bits]],
       points: openArray[ECaff],
       useZeroWindows: bool = true) {.tags:[VarTime, Alloca, HeapAlloc], meter.}
```

Applying the same default to every overload would restore CI without sacrificing the new opt-in optimisation path.
```diff
@@ -363,13 +415,27 @@ func msmAffineImpl_vartime[bits: static int, EC, ECaff](
   # Algorithm
   # ---------
-  const excess = bits mod c
-  const top = bits - excess
+  var excess = bits mod c
```
**Off-by-one: one extra zero window is processed in the optimised Pippenger path**

When `useZeroWindows = false` and `msb < bits - excess`, the code computes:

```nim
msb = ceilDiv_vartime(msb, c) * c
excess = 0
top = msb   # e.g. with msb = 100, c = 14 → top = 112
```

With `top = 112` and `excess = 0`, the prologue is a no-op (because `if excess != 0` is false), so `w` stays at 112. The steady-state loop then processes window 112 (bits 112–125) first, which is entirely zero by definition: `determineBitsSet` returned 100, meaning no bit above 99 is set.

The reference implementation in `multiScalarMulImpl_reference_vartime` handles this correctly by computing `numWindows = ceilDiv(msb, c)` (a window count), which maps to window indices 0…7 covering bits 0–111. The equivalent for `msmImpl_vartime` is:

```nim
msb = (ceilDiv_vartime(msb, c) - 1) * c   # e.g. (8-1)*14 = 98
```

This makes `top = 98`, the prologue is still a no-op, and the steady state begins at the last useful window (bits 98–111). Functionally the current code is still correct (a zero window contributes nothing), but it wastes one full window's worth of bucket accumulation and reduction work.

The same issue exists in `msmAffineImpl_vartime` at the analogous section.
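The arithmetic in the comment above can be checked directly (a sketch mirroring the review comment's numbers, not the actual Nim code):

```python
# Verify the off-by-one: with the example values msb = 100, c = 14,
# the PR's rounding lands one full window above the suggested fix.

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

msb, c = 100, 14                          # highest set bit, window size

top_current = ceil_div(msb, c) * c        # what the PR computes
top_fixed = (ceil_div(msb, c) - 1) * c    # suggested correction

print(top_current)   # 112: first window covers bits 112-125, all zero
print(top_fixed)     # 98: last useful window covers bits 98-111
assert top_current - top_fixed == c       # exactly one wasted window
```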
**NOTE**: CI is broken, because I temporarily added a parameter `useZeroWindows` to the MSM procs.

This implements the same optimization as done in privacy-ethereum/halo2curves#168 for the single-threaded MSM implementations.

Essentially, if all coefficients `a_i` are smaller than some value `x < p` (with `p` the prime field order), their binary representations will have leading zero bits. In this case, we can skip processing the windows corresponding to these leading zeros in the bucket calculation, since they would not contribute to the final result.

After writing a benchmark comparing the old implementation using all windows vs. the optimization (and hence adding the new argument that currently breaks the CI), I noticed that the performance improvements are pretty minor. In fact, even without this optimization the speedup for inputs with only small coefficients is already massive; skipping the leading windows on top of that provides only a marginal additional improvement.

Here is a plot showing MSM timings for different numbers of points and numbers of set bits. It compares the reference implementation (`b` prefix for baseline) with the optimized implementation, each split by using all windows vs. only the non-zero leading windows. What is actually quite interesting is that the improvements for the reference implementation are generally much larger, pushing it in line with the optimized implementation.

(I ran the bench on my laptop with an i7-8750H.)
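The window count gives an upper bound on how many bucket passes the optimisation can skip (a back-of-envelope sketch with illustrative parameters: 254-bit scalars and window size c = 14, as in the review example). The benchmark suggests the runtime gain is much smaller than this bound, presumably because windows whose digits are all zero are already cheap to process.

```python
# Fraction of Pippenger window passes skipped, per number of set bits.
# Upper bound on work saved, not a runtime prediction; parameters are
# illustrative (254-bit scalars, c = 14).

def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

bits, c = 254, 14
full = ceil_div(bits, c)                  # 19 windows in total
for set_bits in (64, 100, 128, 254):
    reduced = ceil_div(set_bits, c)       # windows actually needed
    saved = 100 * (1 - reduced / full)
    print(f"{set_bits:>3} set bits: {reduced}/{full} windows, "
          f"~{saved:.0f}% of passes skipped")
```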