Skip to content

Drop (leading) windows in MSM where all coefficients zero #479

Open
Vindaar wants to merge 4 commits intomasterfrom
msmDropZeroWindows
Open

Drop (leading) windows in MSM where all coefficients zero #479
Vindaar wants to merge 4 commits intomasterfrom
msmDropZeroWindows

Conversation

@Vindaar
Copy link
Collaborator

@Vindaar Vindaar commented Oct 25, 2024

NOTE: CI is broken, because I temporarily added a parameter useZeroWindows to the MSM procs.

This implements the same optimization as done in:

privacy-ethereum/halo2curves#168

for the single threaded MSM implementations.

Essentially, if all coefficients a_i are smaller than some value x < p (with p the prime field order), their binary representations will have leading zero bits. In this case, we can skip processing the windows corresponding to these leading zeros in the bucket calculation since they would not contribute to the final result.

After writing a benchmark comparing the old implementation using all windows vs. the optimization (and hence added a new argument that currently breaks the CI), I noticed that the performance improvements are pretty minor. In fact, even without this optimization our improvements are massive for inputs with only small coefficients. The gains beyond that of skipping windows only provides an additional marginal improvement.

Here is a plot showing the time for an MSM with a different number of points and number of set bits. Compared is the reference implementation (using b prefix for baseline) with the optimized implementation. Both split by either using all windows or only using non zero leading windows. What is actually quite interesting is that for the reference implementation the improvements generally are much larger, pushing it in line with the optimized implementation.

(I ran the bench on my laptop with a i7-8750H)

bench_result

For scalar field elements modulo p, if all coefficients `a_i` are smaller
than some value x < p, their binary representations will have leading zero
bits. In this case, we can skip processing the windows corresponding to
these leading zeros in the bucket calculation since they would not
contribute to the final result.

This follows the same idea as in the implementation:

privacy-ethereum/halo2curves#168
@greptile-apps
Copy link

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR ports the "drop leading zero windows" optimisation from halo2curves into Constantine's single-threaded MSM implementations (multiScalarMul_vartime and multiScalarMul_reference_vartime). A new determineBitsSet helper ORs all coefficients together to find the highest set bit, and when that bit falls below the topmost Pippenger window, the number of processed windows is reduced accordingly. New benchmarks and correctness-comparison tests are included to validate and profile the change.

Issues found:

  • Compile error — missing module: t_ec_template.nim line 783 imports ../../benchmarks/bench_msm_impl_optional_drop_windows which does not exist anywhere in the repository (the bench file has it commented out). This breaks compilation of every test file that imports t_ec_template.
  • Compile error — undefined identifier: t_ec_template.nim line 780 references bEC.getScalarField().bits() where bEC is undefined; should be EC.
  • Breaking API change: All public overloads of multiScalarMul_vartime and multiScalarMul_reference_vartime now require a mandatory useZeroWindows: bool argument with no default. Adding useZeroWindows: bool = true as a default would restore backwards compatibility.
  • Off-by-one in optimised Pippenger top: In msmImpl_vartime and msmAffineImpl_vartime, the reduced top is computed as ceilDiv(msb, c) * c, which is one window position too high. With msb=100, c=14 this gives top=112 but the last useful window is at 98. One entirely zero window is needlessly processed. The reference implementation correctly uses numWindows = ceilDiv(msb, c) (a count), equivalent to top = (ceilDiv(msb,c) - 1) * c.
  • == should be <= in sanity test: The check msb == min(bits, EC.getScalarField().bits()) is probabilistically fragile; random_coefficient only guarantees at-most maxBit bits, not exactly maxBit.
  • determineBitsSet declared as proc: Should be func per codebase convention (no side effects).
  • random_coefficient duplicated: Identical implementations appear in both bench_elliptic_parallel_template.nim and t_ec_template.nim.

Confidence Score: 1/5

  • Not safe to merge — two compile-blocking issues in the test template and a breaking API change across all public MSM entry points.
  • The PR introduces two definite compile errors (missing module import and bEC undefined identifier) that prevent the new tests from building, and a breaking API change that breaks CI as acknowledged by the author. The core optimisation logic in the reference implementation is correct, but the optimised Pippenger variants process one redundant zero window. All four issues must be resolved before this can be merged safely.
  • tests/math_elliptic_curves/t_ec_template.nim (two compile errors), constantine/math/elliptic/ec_multi_scalar_mul.nim (breaking API + off-by-one in top window for msmImpl_vartime and msmAffineImpl_vartime).

Important Files Changed

Filename Overview
constantine/math/elliptic/ec_multi_scalar_mul.nim Core MSM implementation: adds determineBitsSet helper and useZeroWindows parameter to all overloads. Two issues: (1) all public APIs are now breaking changes with no default value for useZeroWindows; (2) in msmImpl_vartime/msmAffineImpl_vartime the optimised top is computed as ceilDiv(msb,c)*c (one window too many) instead of (ceilDiv(msb,c)-1)*c; also when compile-time guards converted to if runtime checks.
tests/math_elliptic_curves/t_ec_template.nim Three compile-blocking issues: (1) bEC typo on line 780 (undefined identifier, should be EC); (2) import of non-existent bench_msm_impl_optional_drop_windows module on line 783; (3) sanity assertion uses == where <= is correct. The random_coefficient helper is also duplicated from the benchmark template.
benchmarks/bench_elliptic_parallel_template.nim Adds maxBit parameter to createBenchMsmContext and a random_coefficient helper to generate sparse coefficients. Fields made public. Logic is sound; random_coefficient is duplicated in the test template.
benchmarks/bench_ec_msm_coeffs_with_zeroes.nim New benchmark comparing all-windows vs zero-window-dropping strategies across baseline and optimised MSM. Requires ggplotnim (hard compile error if absent). Contains a commented-out import of the missing bench_msm_impl_optional_drop_windows module.
tests/math_elliptic_curves/t_ec_msm_zero_windows_sanity.nim New sanity test for determineBitsSet; blocked from compiling due to bEC typo and the missing-module import in the template it imports.
tests/math_elliptic_curves/t_ec_shortw_jac_g1_msm_zero_windows_comparison.nim New correctness comparison test verifying useZeroWindows=true and useZeroWindows=false produce identical results. Blocked from compiling because it imports t_ec_template which in turn imports the missing benchmark module.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["multiScalarMul_vartime / multiScalarMul_reference_vartime\n(public API — useZeroWindows: bool)"] --> B{useZeroWindows?}

    B -- "true\n(original behaviour)" --> C["Use full window count\nnumWindows = ceilDiv(bits, c)\ntop = bits - excess"]

    B -- "false\n(new optimisation)" --> D["determineBitsSet\nOR all coefs → find highest set bit (msb)"]

    D --> E{msb < 0?}
    E -- yes --> F["Return early\n(all coefs zero)"]
    E -- no --> G{msb < bits - excess?}

    G -- "no\n(no savings)" --> C
    G -- "yes\n(leading zeros exist)" --> H["Reduce window count\nnumWindows = ceilDiv(msb, c)\ntop = ceilDiv(msb,c) × c ⚠️\nexcess = 0"]

    C --> I["Run Pippenger bucket accumulation\nover all windows"]
    H --> I

    I --> J["Final Horner accumulation\nr = Σ miniMSM[w] × 2^(w·c)"]
Loading

Comments Outside Diff (5)

  1. tests/math_elliptic_curves/t_ec_template.nim, line 780 (link)

    Undefined identifier bEC — compile error

    bEC is not defined anywhere in scope; this should be EC (the typedesc parameter of the enclosing proc). As written, this will fail to compile.

  2. tests/math_elliptic_curves/t_ec_template.nim, line 783 (link)

    Import of non-existent module — compile error

    bench_msm_impl_optional_drop_windows.nim does not exist in the repository. The bench file (bench_ec_msm_coeffs_with_zeroes.nim) has this same import commented out (# ./bench_msm_impl_optional_drop_windows), which strongly suggests it was an intermediate file that was removed before this PR was submitted. This import will cause a compilation failure for any test file that imports t_ec_template.

  3. tests/math_elliptic_curves/t_ec_template.nim, line 780 (link)

    Equality check == should be <= in sanity assertion

    determineBitsSet returns the highest set bit position + 1 across all N coefficients. random_coefficient guarantees that the result fits in at most maxBit bits, but it does not guarantee that the highest bit is exactly maxBit (e.g., all 4096 random values might land just below 2^(maxBit-1), or — for maxBit = 1 — could even all be zero with negligible but nonzero probability).

    The assertion should be:

  4. constantine/math/elliptic/ec_multi_scalar_mul.nim, line 222-245 (link)

    determineBitsSet should be func, not proc

    This helper has no observable side effects and does not mutate state, yet it is declared as proc. The existing codebase convention is to use func for pure computations. Declaring it as proc silently removes the compile-time guarantee of no side effects and may prevent the compiler from applying certain optimisations.

  5. benchmarks/bench_elliptic_parallel_template.nim, line 144-156 (link)

    random_coefficient is duplicated across benchmark and test files

    Identical implementations of random_coefficient appear both here (in bench_elliptic_parallel_template.nim, lines 144–156) and in tests/math_elliptic_curves/t_ec_template.nim (lines 732–744). Any future bug fix or behavioural change in one copy will not propagate to the other, creating a maintenance burden.

    Consider extracting this helper into a single shared location (e.g., helpers/prng_unsafe alongside the existing test PRNG utilities) and importing it in both files.

Last reviewed commit: 2489912

Comment on lines 597 to +641
points: ptr UncheckedArray[ECaff],
len: int) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
len: int,
useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
## Multiscalar multiplication:
## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁

msm_dispatch_vartime(r, coefs, points, len)
msm_dispatch_vartime(r, coefs, points, len, useZeroWindows)

func multiScalarMul_vartime*[bits: static int, EC, ECaff](
r: var EC,
coefs: openArray[BigInt[bits]],
points: openArray[ECaff]) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
points: openArray[ECaff],
useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
## Multiscalar multiplication:
## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
debug: doAssert coefs.len == points.len
let N = points.len
msm_dispatch_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N)
msm_dispatch_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N, useZeroWindows)

func multiScalarMul_vartime*[F, EC, ECaff](
r: var EC,
coefs: ptr UncheckedArray[F],
points: ptr UncheckedArray[ECaff],
len: int) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
len: int,
useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
## Multiscalar multiplication:
## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁

let n = cast[int](len)
let coefs_big = allocHeapArrayAligned(F.getBigInt(), n, alignment = 64)
coefs_big.batchFromField(coefs, n)
r.multiScalarMul_vartime(coefs_big, points, n)
r.multiScalarMul_vartime(coefs_big, points, n, useZeroWindows)

freeHeapAligned(coefs_big)

func multiScalarMul_vartime*[EC, ECaff](
r: var EC,
coefs: openArray[Fr],
points: openArray[ECaff]) {.tags:[VarTime, Alloca, HeapAlloc], inline.} =
points: openArray[ECaff],
useZeroWindows: bool) {.tags:[VarTime, Alloca, HeapAlloc], meter.} =
## Multiscalar multiplication:
## r <- [a₀]P₀ + [a₁]P₁ + ... + [aₙ₋₁]Pₙ₋₁
debug: doAssert coefs.len == points.len
let N = points.len
multiScalarMul_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N)
multiScalarMul_vartime(r, coefs.asUnchecked(), points.asUnchecked(), N, useZeroWindows)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaking public API change — useZeroWindows has no default

Every public overload of multiScalarMul_vartime and multiScalarMul_reference_vartime now requires a mandatory useZeroWindows: bool argument. This breaks all existing call sites in the codebase (and in downstream consumers). The PR description acknowledges CI is broken for this reason.

The conventional way to introduce this without a breaking change is to give the parameter a sensible default — true to preserve existing behaviour (process all windows):

func multiScalarMul_vartime*[bits: static int, EC, ECaff](
       r: var EC,
       coefs: openArray[BigInt[bits]],
       points: openArray[ECaff],
       useZeroWindows: bool = true) {.tags:[VarTime, Alloca, HeapAlloc], meter.}

Applying the same default to every overload would restore CI without sacrificing the new opt-in optimisation path.

Comment on lines 407 to +418
@@ -363,13 +415,27 @@ func msmAffineImpl_vartime[bits: static int, EC, ECaff](

# Algorithm
# ---------
const excess = bits mod c
const top = bits - excess
var excess = bits mod c
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-by-one: one extra zero window is processed in the optimised Pippenger path

When useZeroWindows = false and msb < bits - excess, the code computes:

msb = ceilDiv_vartime(msb, c) * c
excess = 0
top = msb   # e.g. with msb=100, c=14  →  top=112

With top = 112 and excess = 0, the prologue is a no-op (because if excess != 0 is false), so w stays at 112. The steady-state loop then processes window 112 (bits 112–125) first, which is entirely zero by definition — determineBitsSet returned 100, meaning no bit above 99 is set.

The reference implementation in multiScalarMulImpl_reference_vartime handles this correctly by computing numWindows = ceilDiv(msb, c) (a window count), which maps to window indices 0…7 covering bits 0–111. The equivalent for msmImpl_vartime is:

msb = (ceilDiv_vartime(msb, c) - 1) * c   # e.g. (8-1)*14 = 98

This makes top = 98, the prologue is still a no-op, and the steady state begins at the last useful window (bits 98–111). Functionally the current code is still correct (a zero window contributes nothing), but it wastes one full window's worth of bucket accumulation and reduction work.

The same issue exists in msmAffineImpl_vartime at the analogous section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant