Official implementation of Spectral Evolution Search (SES):
Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation
SES is a training-free inference-time scaling framework for text-to-image generation. It improves reward alignment by searching over the initial noise in a low-frequency wavelet subspace with the Cross-Entropy Method (CEM), without updating any generator or reward-model parameters.
The paper has been accepted to ICML 2026.
Inference-time scaling allocates extra compute at inference time to improve generated outputs. In text-to-image models, a direct option is to search over the initial noise. Full-space noise search is expensive because the latent space is high-dimensional and many perturbation directions have weak visual impact.
SES reduces this search space by decomposing the initial noise with a Discrete Wavelet Transform (DWT). It optimizes only the low-frequency coefficients, which strongly affect global image structure, while keeping high-frequency coefficients fixed. A gradient-free CEM loop then samples candidates, decodes images, scores them with a reward model, and updates the search distribution toward higher-reward regions.
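The loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the repository's code: it uses a one-level 1D Haar transform in place of the 2D wavelet transform applied to image latents, and `toy_reward` is a hypothetical stand-in for "decode an image, then score it with a reward model". All function names and hyperparameters (`iters`, `pop`, `elite`) are assumptions made for this example.

```python
import random

SQRT2 = 2 ** 0.5

def haar_dwt(x):
    """One-level 1D Haar DWT of an even-length list: returns (low, high)."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuilds the signal from (low, high)."""
    x = []
    for a, d in zip(low, high):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x

def toy_reward(noise):
    """Hypothetical reward: negative mean squared distance to a smooth ramp."""
    n = len(noise)
    target = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    return -sum((v - t) ** 2 for v, t in zip(noise, target)) / n

def cem_low_freq_search(noise, reward_fn, iters=10, pop=32, elite=8, seed=0):
    """Gradient-free CEM over the low-frequency coefficients only."""
    rng = random.Random(seed)
    low, high = haar_dwt(noise)          # high-frequency part stays fixed
    mean, std = list(low), [1.0] * len(low)
    best_noise, best_score = noise, reward_fn(noise)
    for _ in range(iters):
        scored = []
        for _ in range(pop):
            # Sample candidate low-frequency coefficients, rebuild the noise, score it.
            cand_low = [rng.gauss(m, s) for m, s in zip(mean, std)]
            cand = haar_idwt(cand_low, high)
            scored.append((reward_fn(cand), cand_low, cand))
        scored.sort(key=lambda t: t[0], reverse=True)
        elites = scored[:elite]
        if elites[0][0] > best_score:
            best_score, best_noise = elites[0][0], elites[0][2]
        # Refit the diagonal Gaussian search distribution to the elite samples.
        for j in range(len(mean)):
            vals = [e[1][j] for e in elites]
            mean[j] = sum(vals) / len(vals)
            var = sum((v - mean[j]) ** 2 for v in vals) / len(vals)
            std[j] = max(var ** 0.5, 1e-3)
    return best_noise, best_score

init_rng = random.Random(1)
init_noise = [init_rng.gauss(0.0, 1.0) for _ in range(16)]
best_noise, best_score = cem_low_freq_search(init_noise, toy_reward)
print("initial reward:", toy_reward(init_noise), "best reward:", best_score)
```

In this sketch only `len(noise) / 2` coefficients are searched; the high-frequency half passes through unchanged, which is what restricting the search to the low-frequency band means in practice.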
Key properties:
- training-free and plug-and-play
- gradient-free reward optimization
- works with diffusion and flow-matching text-to-image models
- supports single-prompt and CSV batch evaluation
- compatible with multiple reward models
Use Python 3.10 or later. A fresh conda environment is recommended.

    conda create -n ses python=3.10
    conda activate ses

Install PyTorch for your CUDA version. For example, with CUDA 12.1:

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Then install the remaining dependencies:

    pip install -r requirements.txt

Run SES on a single prompt with the default backbone, sdxl-turbo, and the default reward model, pick:

    bash run.sh \
      --prompt "An orange colored sandwich." \
      --model_id sdxl-turbo \
      --reward_model pick \
      --save_dir outputs/demo_single \
      --total_eval_budget 200

Run SES on a CSV file:

    bash run.sh \
      --prompt_csv prompts.csv \
      --model_id sdxl \
      --reward_model hps \
      --save_dir outputs/demo_batch \
      --total_eval_budget 50

Use --model_id to choose the image generator.

| model_id | Public model source | Default size |
|---|---|---|
| sdxl-turbo | stabilityai/sdxl-turbo | 512 x 512 |
| sd1-4 | CompVis/stable-diffusion-v1-4 | 512 x 512 |
| sdxl | stabilityai/stable-diffusion-xl-base-1.0 | 1024 x 1024 |
| flux | black-forest-labs/FLUX.1-dev | 1024 x 1024 |
| qwen-image | Qwen/Qwen-Image | 1024 x 1024 |

Use --reward_model to choose the optimization objective.

| reward_model | Default source |
|---|---|
| pick | yuvalkirstain/PickScore_v1 |
| clip | openai/clip-vit-large-patch14 |
| hps | adams-story/HPSv2-hf |
| aes | camenduru/improved-aesthetic-predictor |
| ir | ImageReward package |

Reward evaluation is often the runtime bottleneck because each candidate must be decoded before scoring. SES supports proxy evaluation by using fewer diffusion steps during search and more steps for the final image.
Example:

    bash run.sh \
      --prompt "A beautiful girl." \
      --model_id qwen-image \
      --reward_model aes \
      --num_inference_steps 10 \
      --final_num_inference_steps 30 \
      --total_eval_budget 50 \
      --save_dir outputs/proxy_demo

This evaluates candidate images with 10-step generations, then decodes the final selected noise with 30 steps.

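To get a rough sense of what proxy evaluation saves, the denoising work of the example above can be counted by hand. The sketch below rests on assumptions not confirmed by this README: each unit of `--total_eval_budget` is taken to be one full generation, cost is taken to scale linearly with the number of inference steps, and reward-model and VAE decoding costs are ignored. The function name `denoising_steps` is hypothetical.

```python
def denoising_steps(total_eval_budget, search_steps, final_steps):
    """Total denoiser steps: one generation per candidate, plus one final generation."""
    return total_eval_budget * search_steps + final_steps

# Values taken from the example command: 50 evaluations, 10 search steps, 30 final steps.
proxy_cost = denoising_steps(50, 10, 30)   # 50 * 10 + 30 = 530
# Same budget, but searching at the final step count instead of the proxy.
full_cost = denoising_steps(50, 30, 30)    # 50 * 30 + 30 = 1530

print(proxy_cost, full_cost)
```

Under these assumptions the proxy run needs 530 denoising steps against 1530 for a search at full step count, roughly a third of the work.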
SES has also been integrated into DiffSynth-Studio for inference-time scaling research.
If you find this repository useful, please cite:

    @article{ye2026spectral,
      title={Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation},
      author={Ye, Jinyan and Duan, Zhongjie and Li, Zhiwen and Chen, Cen and Chen, Daoyuan and Li, Yaliang and Chen, Yingda},
      journal={arXiv preprint arXiv:2602.03208},
      year={2026}
    }

