llvm-pi

The fastest known single-threaded GMP-based Chudnovsky pi calculator.

Beats gmp-chudnovsky.c — the gold standard reference implementation by Hanhong Xue that has reigned unchallenged for years — by orchestrating hand-written LLVM IR with surgically optimized C routines.

Benchmark Results

10 million digits on Apple M4 (macOS, single-threaded, user CPU time, PGO build):

Implementation	Avg (5 runs)	vs llvm-pi
llvm-pi	7.96s	---
gmp-chudnovsky (C, -O3)	8.09s	1.5% slower

=== llvm-pi PGO (5 runs, user CPU time) ===
user 7.96  user 7.98  user 7.96  user 7.96  user 7.96

=== gmp-chudnovsky -O3 (5 runs, user CPU time) ===
user 8.09  user 8.08  user 8.10  user 8.09  user 8.08

Even without PGO (./build), llvm-pi averages 7.98s — still 1.3% faster.

1 million digits:

Implementation	Time
llvm-pi	0.45s
gmp-chudnovsky	0.46s

The Journey

Starting from a naive LLVM IR implementation that was 23% slower than gmp-chudnovsky.c, systematic optimization brought it to parity and then past the reference:

Milestone	10M time	vs gmp-chud
Initial implementation	10.08s	23% slower
Binary splitting rewrite	8.52s	3.9% slower
Newton's method sqrt	8.33s	1.6% slower
Precision fix (the big one)	8.18s	1.2% faster
BS optimizations + LTO	7.98s	1.3% faster
With PGO	7.96s	1.5% faster

The single biggest win: discovering that digits * 4 bits of precision wastes 20% of every float operation. The correct value is digits * log2(10) ≈ digits * 3.322. This one fix shaved 240ms off the 10M-digit runtime.

How It Works

The Chudnovsky formula converges at ~14.18 digits per term — the fastest known series for pi:

         Q * (C/D) * sqrt(C)
pi = -------------------------
            T + A*Q

where C=640320, D=12, A=13591409, and Q, T are computed via binary splitting over ~705K terms at 10M digits.

Optimizations That Beat the Reference

Exact precision matching — digits * log2(10) bits, not the naive digits * 4. Avoids 20% wasted work in division, sqrt, and multiply phases.
Prime sieve GCD removal — At each binary splitting merge, common prime factors between Q_right and G_left are removed via factorized-number intersection before the expensive big-integer multiplications.
Dead P accumulator elimination — The reference maintains P = P_left * P_right through every merge. We proved P is never read after the final merge and eliminated the multiply entirely.
Cache-friendly AoS memory layout — Per-level data (Q, T, G, factorized forms) packed in contiguous structs. The reference scatters them across 5 separate heap allocations.
Integer-space constant folding — T += A*Q (addmul_ui) and Q *= C/D (mul_ui) done as integer operations before float conversion, eliminating a float multiply.
Aggressive inlining — All internal functions static, hot splitting function marked __attribute__((flatten)) to inline all callees (fac_set_bp, fac_mul_bp, fac_remove_gcd, etc.)
Link-time optimization — Cross-module inlining and optimization across C translation units.
Optional PGO — Profile-guided optimization tunes branch prediction and code layout for the actual hot paths.

Both implementations share the same foundational techniques (asymmetric 0.5224 split ratio, Newton's method precision-doubling sqrt, pre-allocated recursion stacks, GCD threshold at depth 4).

Architecture

main.c (CLI)
  -> chudnovsky.ll (LLVM IR entry point)
    -> sieve.c (all heavy computation)
       - Prime sieve + factorized number GCD
       - Binary splitting with pre-allocated stacks
       - Newton's method square root
       - Final pi computation + formatting

The LLVM IR serves as the public interface (compute_pi), delegating to optimized C routines for the number crunching.

Building

Requires Nix with flakes enabled:

# Build
./build

# Build with Profile-Guided Optimization (slower build, ~2% faster runtime)
PGO=1 ./build

# Run (compute N digits of pi)
./pi 1000000

# Run tests (6 tests: 50 to 100K digits verified against reference)
./test

# Benchmark against gmp-chudnovsky and mpfr-pi
./bm

Files

File	Description
`chudnovsky.ll`	LLVM IR entry point — declares and wraps `compute_pi_c`
`sieve.c`	All heavy computation: sieve, binary splitting, Newton sqrt, final pi
`main.c`	CLI: parses digit count, allocates buffer, calls `compute_pi`, prints result
`build`	Build script (llc + clang + LTO, optional PGO)
`test`	Test suite (6 tests: 50 to 100K digits)
`bm`	Benchmark script comparing llvm-pi, gmp-chudnovsky, mpfr-pi

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
bench		bench
docs/plans		docs/plans
.gitignore		.gitignore
README.md		README.md
SPEC.md		SPEC.md
bm		bm
build		build
chudnovsky.ll		chudnovsky.ll
flake.lock		flake.lock
flake.nix		flake.nix
main.c		main.c
pi_reference.txt		pi_reference.txt
sieve.c		sieve.c
test		test

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llvm-pi

Benchmark Results

The Journey

How It Works

Optimizations That Beat the Reference

Architecture

Building

Files

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

llvm-pi

Benchmark Results

The Journey

How It Works

Optimizations That Beat the Reference

Architecture

Building

Files

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages