LLM-guided neural-architecture evolution for medical-image classification. This repository is a self-contained working example of MedEvoNet's MedMNIST track: an LLM proposes mutations to a Python file that defines a PyTorch model, every candidate is trained on five MedMNIST datasets, and the search keeps the architectures with the highest mean validation AUC.
The companion paper describes the method and reports the headline result on this track: starting from a 3-layer CNN baseline (1.70M params, 0.873 mean test AUC), evolution discovered a 115k-parameter architecture with 0.898 mean test AUC (paired-bootstrap p=0.019).
While this repository targets MedMNIST classification, the underlying
principle is general-purpose: any optimization problem for which a scalar
evaluation function can be defined over a candidate program is amenable to
the same LLM-guided evolutionary loop. Swapping train.py for a different
evaluator (and adjusting seed_program.py / the system prompt accordingly)
is sufficient to retarget the pipeline to other architecture searches,
algorithmic tasks, or any code-level optimization with a measurable
objective.
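As a rough illustration of that retargeting, the sketch below swaps the MedMNIST evaluator for a toy one that scores a candidate program on a sorting task. Everything in it is hypothetical: the `sort_numbers` task, the `load_candidate` helper, and the `combined_score` metric key are assumptions chosen for the example, not names taken from this repository. The only part mirrored from the pipeline is the `evaluate(program_path)` entry point.

```python
import importlib.util


def load_candidate(program_path):
    """Import a candidate program from an arbitrary file path (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def evaluate(program_path):
    """Return a scalar score for a candidate that defines sort_numbers(xs).

    The score is the fraction of test cases sorted correctly, so higher is better.
    The metric key "combined_score" is an assumption for this sketch.
    """
    cases = [[3, 1, 2], [5, 4, 3, 2, 1], [], [7, 7, 1]]
    try:
        candidate = load_candidate(program_path)
        passed = sum(candidate.sort_numbers(list(c)) == sorted(c) for c in cases)
    except Exception:
        # A crashing or malformed candidate gets the worst score instead of
        # aborting the whole search.
        return {"combined_score": 0.0}
    return {"combined_score": passed / len(cases)}
```

Any evaluator of this shape reduces a candidate file to one number, which is all the evolutionary loop needs in order to rank candidates.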
MedEvoNet/
├── README.md
├── pyproject.toml # uv-installable; pulls in the vendored openevolve
├── .env.example # copy → .env, fill in your API key(s)
├── .gitignore # ignores .env, runs/, caches
├── config.yaml # evolution + LLM + MAP-Elites configuration
├── seed_program.py # initial architecture (3-layer CNN)
├── dataloader.py # MedMNIST datasets + RGB transform + RAM cache
├── train.py # train_and_evaluate(program_path) — full eval pass
├── evaluator.py # 6-line OpenEvolve evaluator that calls train.py
├── azure_llm.py # Azure OpenAI wrapper (used iff AZURE_OPENAI_API_KEY set)
├── run_evolution.py # entry point — `python run_evolution.py`
├── run_utils.py # run-directory bookkeeping + artifacts env var
├── metrics_utils.py # bootstrap AUC + per-iteration artifact persistence
└── openevolve_lib/ # vendored OpenEvolve (Apache-2.0) with MedEvoNet patches
    ├── pyproject.toml
    └── openevolve/ # ← imported as `openevolve` from your scripts
uv is the recommended installer (pip works too; see the pip commands further down).
# 1. Clone
git clone https://github.com/YOUR_ORG/MedEvoNet.git
cd MedEvoNet
# 2. Configure secrets — copy template, then fill in your API key(s)
cp .env.example .env
$EDITOR .env
# 3. Create the venv + install everything (incl. the patched openevolve)
uv sync
# 4. Smoke test: load the seed architecture and count its parameters
uv run python -c "from seed_program import create_model; \
print(sum(p.numel() for p in create_model().parameters()), 'params')"
# → 1701383 params
The first MedMNIST dataset access will download ~2 GB of .npz files into
~/.medmnist/ (or $MEDMNIST_ROOT if set).
uv run python run_evolution.py
Output is written into a fresh runs/mnist/<timestamp>_<jobid>/ directory:
runs/mnist/20260512_134500_local/
├── run_info.json # config snapshot, git sha, host, time
├── artifacts/ # per-iteration: <uuid>.json + <uuid>.npz
└── openevolve/ # openevolve's own checkpoints/best/logs
    ├── best/best_program.py
    ├── checkpoints/checkpoint_5, _10, ...
    └── logs/openevolve_*.log
A runs/mnist/latest symlink is kept up to date. The *.json artifact per
iteration contains the program source, the per-epoch history per dataset,
the best-val-epoch metrics, and bootstrap AUC stats. The companion *.npz
holds the raw best-val-epoch probabilities + labels per (dataset, split) so
you can re-bootstrap or run paired-bootstrap analyses without retraining.
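A paired-bootstrap comparison on stored probabilities could look roughly like the sketch below. It is a simplified, standard-library-only illustration and not the code in metrics_utils.py: it handles the binary case only, the function names are hypothetical, and loading the *.npz is left out because the key layout inside the file is not documented here. The p-value is a one-sided convention (fraction of resamples in which model B does not beat model A), which may differ from the one used in the paper.

```python
import random


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Resample cases with replacement and score both models on the same resample."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to compute AUC")
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        auc_a = binary_auc(y, [scores_a[i] for i in idx])
        auc_b = binary_auc(y, [scores_b[i] for i in idx])
        if auc_a is None:
            continue  # resample contained a single class; draw again
        deltas.append(auc_b - auc_a)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    # One-sided p-value: fraction of resamples where B fails to beat A.
    p_value = sum(d <= 0 for d in deltas) / n_boot
    return {"mean_delta": sum(deltas) / n_boot, "ci95": (lo, hi), "p": p_value}
```

Because both models are scored on the same resampled cases, per-case difficulty cancels out of the difference, which is what makes the comparison paired.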
config.yaml controls:
- max_iterations, checkpoint_interval, max_code_length
- LLM provider (llm.api_base, llm.primary_model, llm.temperature)
- MAP-Elites archive feature axes (features: params × efficiency)
- the system prompt that instructs the LLM what kinds of architectural mutations to attempt
To switch from Azure OpenAI to standard OpenAI just leave
AZURE_OPENAI_API_KEY empty in .env and set OPENAI_API_KEY. The runner
detects which provider to use at startup.
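The selection rule described above can be sketched as follows. This is a hypothetical reconstruction based only on the behaviour stated here (Azure is used if and only if AZURE_OPENAI_API_KEY is set); the function name `pick_provider` and the error message are assumptions, and the real logic in run_evolution.py may differ.

```python
import os


def pick_provider(env=None):
    """Choose the LLM provider from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # An empty or whitespace-only value counts as "not set".
    if env.get("AZURE_OPENAI_API_KEY", "").strip():
        return "azure"
    if env.get("OPENAI_API_KEY", "").strip():
        return "openai"
    raise RuntimeError("Set AZURE_OPENAI_API_KEY or OPENAI_API_KEY in .env")
```

Passing a plain dict instead of relying on `os.environ` makes the rule easy to check in isolation, for example `pick_provider({"OPENAI_API_KEY": "sk-test"})`.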
To evaluate one candidate program end-to-end (10 epochs/dataset, full metrics + artifacts):
uv run python train.py seed_program.py 10
The OpenEvolve loop calls evaluator.evaluate(program_path), which is a
1-line wrapper around the same function.
To install with pip instead of uv:
python -m venv .venv && source .venv/bin/activate
pip install -e openevolve_lib
pip install -e .

- The MedMNIST images are loaded at 224×224, with grayscale datasets channel-replicated to RGB so a single architecture can be evaluated across all five tasks.
- For each dataset the model's final nn.Linear is auto-swapped to match the class count (2 / 4 / 7 / 14); seed programs just need to end in a Linear layer.
- The selection signal is purely the mean validation AUC. Parameter count is not used in the fitness score itself, only as a behavioural axis of the MAP-Elites archive, so smaller models can occupy different cells but cannot replace a higher-AUC model in the same cell.
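The archive rule in the last note can be illustrated with a minimal MAP-Elites insertion sketch. It is not the vendored OpenEvolve implementation: the binning scheme, the assumption that efficiency lies in [0, 1], and the names `cell_for` and `try_insert` are all hypothetical. What it does reproduce is the stated behaviour: parameter count decides which cell a model lands in, and within a cell only validation AUC decides who stays.

```python
import math


def cell_for(params, efficiency, n_bins=10):
    """Map (parameter count, efficiency) to a grid cell. Bin edges are hypothetical."""
    # Spread log10(params) from 1e3 to 1e8 across n_bins bins.
    p = (math.log10(max(params, 1)) - 3.0) / 5.0
    p_bin = min(n_bins - 1, max(0, int(p * n_bins)))
    # Efficiency is assumed to be a value in [0, 1] for this sketch.
    e_bin = min(n_bins - 1, max(0, int(efficiency * n_bins)))
    return (p_bin, e_bin)


def try_insert(archive, program_id, val_auc, params, efficiency):
    """Insert a candidate; it replaces the incumbent only on strictly higher AUC."""
    cell = cell_for(params, efficiency)
    incumbent = archive.get(cell)
    if incumbent is None or val_auc > incumbent["val_auc"]:
        archive[cell] = {"id": program_id, "val_auc": val_auc, "params": params}
        return True
    return False
```

With this rule a small model with lower AUC can still survive in its own cell, while a smaller model that falls into an occupied cell is rejected unless its AUC is higher than the incumbent's.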
See the companion paper for methodology and results. The OpenEvolve codebase
is from codelion/openevolve
(Apache-2.0); local patches to controller.py, database.py and
process_parallel.py are bundled in openevolve_lib/.
Apache-2.0. Bundled OpenEvolve sources remain under their upstream Apache-2.0 license.