zero

An open-source AlphaEvolve-style evolutionary algorithm discovery engine.

Abstract

In April 2025 DeepMind published AlphaEvolve, an LLM-driven evolutionary search system that rediscovers and improves on human-designed algorithms (faster matrix multiplication, larger cap sets, shorter sorting networks, …). The paper is fascinating; the artifact is closed-source. zero is an open-source clone: a population-based island-model evolutionary loop where the mutation operator is itself a language model. We provide three problem evaluators (sortnet8, matmul44_z2, capset:6), a sandboxed candidate executor, an on-disk LLM response cache, and an honest comparison against two control conditions — random-only mutation and single-shot LLM. The repository ships with the raw artifacts (SQLite + JSONL + best-program source) of the benchmark runs reported below so the numbers are reproducible end-to-end from a fresh checkout and an Anthropic API key.

Method

Architecture

                      ┌──────────────────────────────┐
                      │        EvolutionLoop         │
                      │  (island model, N islands)   │
                      └──────────────┬───────────────┘
                                     │
                ┌────────────────────┼─────────────────────┐
                │                    │                     │
                ▼                    ▼                     ▼
        ┌───────────────┐   ┌────────────────┐    ┌──────────────────┐
        │ random mutate │   │  LLM mutator   │    │   crossover      │
        │ (AST + token) │   │ (Claude Opus,  │    │   (two parents)  │
        │               │   │  N temperatures│    │                  │
        └───────┬───────┘   │  in parallel,  │    └────────┬─────────┘
                │           │  on-disk cache)│             │
                │           └───────┬────────┘             │
                └────────────┬──────┴──────────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │ sandboxed executor  │
                  │ (rlimit cpu/mem,    │
                  │  wall-clock guard)  │
                  └─────────┬───────────┘
                            ▼
                  ┌─────────────────────┐         ┌────────────────┐
                  │ ProgramDB (sqlite)  │ ──────▶ │ events.jsonl   │
                  │  + EventLog         │         │ summary.json   │
                  └─────────────────────┘         └────────────────┘

Each generation, every island independently runs tournament selection + mutation/crossover; in parallel, the top-k programs are shown to the LLM at several temperatures and the returned candidates are added to the pool. Every candidate is executed in a fresh subprocess with RLIMIT_CPU, RLIMIT_AS, and a wall-clock guard, then inserted into a SQLite database tagged with origin (seed | mutation | crossover | llm). Periodic migration copies the top-k of each island to the next, keeping islands diverse but not isolated.

LLM mutation operator

The orchestrator uses the Anthropic API (model: claude-opus-4-6). Required env vars (see scripts/reproduce.sh):

AI_INTEGRATIONS_ANTHROPIC_BASE_URL
AI_INTEGRATIONS_ANTHROPIC_API_KEY

Every (model, temperature, prompt) triple is hashed with SHA-256 and the response is cached at .zero_cache/llm/<hash>.json. A re-run with the same seed therefore costs zero tokens.

Sandbox

The sandbox spawns a fresh python -I -B per candidate, sets RLIMIT_CPU and RLIMIT_AS inside the child, and wraps the call with subprocess.run(..., timeout=...) as the wall-clock guard. There is no shared interpreter state between candidates.

Problems

Problem	Direction	Baseline	Source / status
`sortnet8`	lower is better	19	Floyd's 19-comparator network. Provably optimal for N=8.
`matmul44_z2`	lower is better	64	Naive 4×4 matmul over GF(2). Strassen-style ⇒ 49.
`capset:6`	higher is better	112	Best-known cap set in F₃⁶ (Edel 2004).

The full problem registry is in src/zero/problems/. See CONTRIBUTING.md for how to add a new evaluator.

Results

All numbers are produced by scripts/aggregate_results.py from the raw artifacts under results/. Re-run any time. Seed = 0 for every run.

Headline (LLM-driven evolution vs. controls)

Problem	Baseline	zero (LLM-evo)	Random-only	Single-shot LLM	Δ vs. baseline	Optimal?
`sortnet8`	19	19	19	19	+0	matches optimum
`matmul44_z2`	64	49	64	64	−15	matches Strassen-style 7²=49
`capset:6`	112	budget-capped	64	77	—	—

(Lower is better for sortnet8 / matmul44_z2; higher is better for capset:6.)

The LLM-driven runs match Floyd's optimum on sortnet8 from generation 0 (the seed program is already optimal — the 100-generation run confirms nothing smaller is found, which is the correct answer) and discover a 49-product decomposition of matmul44_z2, matching the Strassen-recursive bound 7² = 49. The random-only baseline does not escape the seed in either case, and the single-shot LLM baseline only echoes the seed back. This is exactly the regime where evolutionary LLM-search adds value over either control alone.

Cost & wall-clock

Run	Generations	Programs eval'd	LLM calls	Wall-clock	Approx. cost
`sortnet8` LLM-evo	100	3,244	1,200	~25 min	~$8
`matmul44_z2` LLM-evo	150	4,864	1,800	~45 min	~$14
`capset:6` LLM-evo (capped)	100*	—	—	—	within budget
`sortnet8` random-only	200	4,084	0	~6 min	$0
`matmul44_z2` random-only	200	4,084	0	~9 min	$0
`capset:6` random-only	200	4,084	0	~12 min	$0

* capset:6 LLM-evo was budget-capped; see results/llm/capset-6/ for the in-flight artifacts. The full per-run dump is in results/RESULTS.md.

Single-shot LLM control condition

The single-shot baseline shows the same problem description and the same seed program to the same model (claude-opus-4-6, temperature 0.7), asks for one program, and evaluates it once. No mutation, no selection, no second attempt. Per-problem detail and the raw response are committed under results/single_shot/.

Continuous integration

The CI workflow is shipped under docs/github-workflow-template/ci.yml (see the README in that directory for the one-line install). It runs:

ruff check .
pytest -q
The 5-generation no-LLM smoke run (asserts summary.json is well-formed and best_score is non-null).

It is shipped as a template instead of .github/workflows/ci.yml only because the OAuth token used to push the initial release intentionally lacks the workflow scope. Move it into place in any fork to enable CI.

Reproducibility

Every committed run is fully reproducible from a fresh checkout. The LLM response cache is committed (or can be regenerated) so re-runs cost zero tokens.

git clone https://github.com/4lptek1n/zero.git
cd zero
pip install -e ".[dev]"

# 60-second smoke test, no LLM, no token spend.
zero run --problem sortnet8 --generations 5 --island-count 2 --no-llm \
         --output-dir runs/smoke

# Full reproduction (requires an Anthropic API key).
ANTHROPIC_API_KEY=sk-ant-... ./scripts/reproduce.sh all

The reproduce script writes outputs to results/repro/ so it never clobbers the committed published artifacts under results/llm/ or results/baselines/. It then prints a side-by-side diff of summary.json.

Installation

pip install -e .
# or, with dev extras for tests + lint
pip install -e ".[dev]"

Verify:

zero problems
zero run --problem sortnet8 --generations 5 --island-count 2 --no-llm \
         --output-dir /tmp/zero-smoke

Project layout

zero/
├── src/zero/
│   ├── cli.py              # `zero run`, `zero problems`
│   ├── config.py           # typed config dataclasses
│   ├── sandbox.py          # subprocess executor with rlimit guards
│   ├── db.py               # SQLite schema + DAO
│   ├── events.py           # JSONL streaming event log
│   ├── evolution.py        # island-model evolutionary loop
│   ├── mutation.py         # deterministic source-level mutators
│   ├── llm.py              # parallel Anthropic orchestrator + cache
│   └── problems/           # sortnet8, matmul44_z2, capset
├── tests/                  # pytest unit tests
├── scripts/
│   ├── run_all_benchmarks.sh
│   ├── run_all_baselines.sh
│   ├── run_single_shot_llm.py
│   ├── aggregate_results.py
│   └── reproduce.sh        # end-to-end replay from a clean clone
├── results/                # committed run artifacts
│   ├── RESULTS.md          # auto-generated from summaries + sqlite
│   ├── llm/<problem>/      # LLM-evo runs
│   ├── baselines/<problem>-random/
│   └── single_shot/<problem>/
├── .github/workflows/ci.yml
├── CITATION.cff
├── CONTRIBUTING.md
└── LICENSE

Limitations

Token budget. A research-grade run on harder problems (capset:7, capset:8, matmul55_z2) requires meaningfully more compute than the budget approved for this release. The capset:6 LLM-evo numbers should be read with that caveat.
Sandbox is best-effort. Adversarial Python escapes are not the threat model — buggy candidate programs are. The sandbox blocks resource exhaustion (CPU, memory, wall-clock) and isolates state, but it is not a secure jail. Do not run untrusted prompts.
One model. Results are reported only for claude-opus-4-6. Other models are likely better at some problems and worse at others. Adding a second backend is a small change in src/zero/llm.py.
No diff-mode mutation prompt. AlphaEvolve uses a structured "diff against this candidate" prompt that we have not (yet) implemented. We send the top-k candidates verbatim and ask for a new full program.
sortnet8 already starts optimal. This is honestly reported: the seed is Floyd's 19-comparator network. The point of including sortnet8 is to show that the system does not regress from a known optimum, not that it discovers something new.

Citation

@software{keser_zero_2026,
  author  = {Keser, Alptekin},
  title   = {{zero}: an open-source AlphaEvolve-style evolutionary algorithm discovery engine},
  year    = {2026},
  url     = {https://github.com/4lptek1n/zero},
  version = {0.1.0}
}

References

Novikov, A. et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. DeepMind blog and accompanying preprint.
Romera-Paredes, B. et al. (2024). FunSearch: Mathematical discoveries from program search with large language models. Nature 625, 468–475.
Strassen, V. (1969). Gaussian elimination is not optimal. Numer. Math. 13, 354–356.
Edel, Y. (2004). Extensions of generalized product caps. Designs, Codes and Cryptography 31, 5–14.
Floyd, R. W. (1964). Algorithm 245: TREESORT 3. CACM 7(12), 701.
Knuth, D. E. (1973). The Art of Computer Programming, Vol. 3: Sorting and Searching, §5.3.4 (sorting networks).

License

MIT — see LICENSE.

Contact

Alptekin Keser — keseralptekin@gmail.com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

zero

Abstract

Method

Architecture

LLM mutation operator

Sandbox

Problems

Results

Headline (LLM-driven evolution vs. controls)

Cost & wall-clock

Single-shot LLM control condition

Continuous integration

Reproducibility

Installation

Project layout

Limitations

Citation

References

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
docs		docs
results		results
scripts		scripts
src/zero		src/zero
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

zero

Abstract

Method

Architecture

LLM mutation operator

Sandbox

Problems

Results

Headline (LLM-driven evolution vs. controls)

Cost & wall-clock

Single-shot LLM control condition

Continuous integration

Reproducibility

Installation

Project layout

Limitations

Citation

References

License

Contact

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages