rebase upstream Jul 31 by mycpuorg · Pull Request #2 · TheDuckAI/openevolve

mycpuorg · 2025-07-31T17:56:07Z

No description provided.

fixes

fix

Fix func min example

Update config.yaml

Expanded optimization hints and LLM context in affine_transform_2d/config.yaml, increased max_tokens and program counts, and switched to the Gemini model. Updated evaluator.py to use EvaluationResult for artifact support and improved error reporting. Added optimization notes to initial_program.py. Removed obsolete benchmark results JSON.

…easoning_effort params

Add support for gpt-oss-* models as reasoning models, and allow for reasoning_effort params in standard models

Updated LLM model weights to include 'google/gemini-2.5-pro' with adjusted weights across all tasks. Reduced verbosity and streamlined optimization hints in config prompts for clarity. Adjusted parallel_evaluations to 4 for most tasks (except polynomial_real, set to 1 to avoid JAX conflicts) and increased evaluator timeout for polynomial_real. Updated initial_program.py files to clarify and reorganize optimization opportunities.

Deleted detailed experiment reports for GEMINI_FLASH_2.5 and O4_MINI from the examples/algotune directory. Replaced the README with a comprehensive optimization report summarizing the OpenEvolve AlgoTune benchmark journey, key discoveries, configuration strategies, and best practices for evolutionary code optimization.

Added best_program.py and best_program_info.json files for the following tasks: affine_transform_2d, convolve2d_full_fill, eigenvectors_complex, fft_cmplx_scipy_fftpack, fft_convolution, lu_factorization, polynomial_real, and psd_cone_projection. These files contain the initial evolved implementations and associated metadata for each task.

Update algotune example

as

Previously, compilation errors like 'Unable to build metal library from source' were treated the same as transient GPU errors (memory pressure, command buffer). This caused 16+ unnecessary retry attempts when a kernel had bfloat16-incompatible Metal code.

Changes:
- Detect 'unable to build metal library' error messages
- Return immediately with failure instead of retrying
- Mark error as 'compilation_error: True' for caller awareness
- Increment metal_compilation_errors counter for tracking

This significantly reduces evaluation time for programs with invalid Metal code.
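The fail-fast behavior described above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: `run_with_retries`, `run_once`, and the result-dict layout are hypothetical names, and only the marker string, the `compilation_error` flag, and the `metal_compilation_errors` counter come from the commit message.

```python
import time
from typing import Callable, Dict

# Marker taken from the error message quoted in the commit message.
COMPILE_ERROR_MARKER = "unable to build metal library"


def run_with_retries(
    run_once: Callable[[], float],
    max_retries: int = 3,
    delay_s: float = 0.0,
) -> Dict[str, object]:
    """Retry transient failures, but return at once on a compilation error.

    Hypothetical helper: the real evaluator's function names and result
    layout are not shown in the source.
    """
    compilation_errors = 0
    last_error = ""
    for attempt in range(1, max_retries + 1):
        try:
            value = run_once()
            return {"success": True, "value": value, "attempts": attempt,
                    "metal_compilation_errors": compilation_errors}
        except Exception as exc:
            last_error = str(exc)
            if COMPILE_ERROR_MARKER in last_error.lower():
                # Deterministic failure: retrying cannot help, so stop here.
                compilation_errors += 1
                return {"success": False, "error": last_error,
                        "compilation_error": True, "attempts": attempt,
                        "metal_compilation_errors": compilation_errors}
            if attempt < max_retries:
                time.sleep(delay_s)  # back off before retrying a transient error
    return {"success": False, "error": last_error, "compilation_error": False,
            "attempts": max_retries,
            "metal_compilation_errors": compilation_errors}
```

The key design point is that the error is classified before the retry decision, so a kernel that can never compile costs one attempt rather than the full retry budget.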

… script

Evaluation optimizations:
- Run correctness test BEFORE baseline benchmark (fail fast)
- Skip expensive baseline if kernel doesn't compile (bfloat16 errors)

Experiment runner script (run_evolve_experiment.sh):
- Explicit log file truncation on resume to prevent content mixing
- Fixed checkpoint resume logic to select highest numbered checkpoint
- Added PYTHONUNBUFFERED and stdbuf for reliable log ordering
- Support for --resume, --foreground, --iterations, --dry-run flags

Documentation:
- Added changelog section to README_validity_fix.md
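The "select highest numbered checkpoint" item can be illustrated in Python, although the runner script itself is shell. The sketch assumes checkpoint directories are named `checkpoint_<N>`, which the commit message does not state, and `latest_checkpoint` is a hypothetical helper. The commit message also does not say what the original bug was; a plain string sort is one common way such logic goes wrong, because it ranks `checkpoint_90` above `checkpoint_100`.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: "checkpoint_<iteration number>".
_CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint name with the largest numeric suffix, if any."""
    best_name: Optional[str] = None
    best_num = -1
    for name in names:
        match = _CHECKPOINT_RE.match(name)
        if match is None:
            continue  # ignore unrelated files and directories
        num = int(match.group(1))  # compare as integers, not as strings
        if num > best_num:
            best_name, best_num = name, num
    return best_name
```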

- Add -W ignore::RuntimeWarning to suppress harmless import warnings
- Filter RuntimeWarning from stderr output for cleaner error messages
- Show return code in failure messages for better debugging
- Add 0.5s delay between runs to reduce GPU resource contention
- Show more stderr content (200 chars vs 100) for better diagnostics

- Rewrite README.md as concise usage guide (~100 lines)
- Extract detailed analysis to EVOLUTION_ANALYSIS.md (~250 lines)
- Document validity fixes: subprocess hook, bf16 gate, 16:8 heads
- Add experiment results showing -3.2% regression vs baseline
- Analyze evolution limitations from RL perspective
- Compare with KernelBench metrics for future improvements
- Minor fixes: config.yaml model names, run script unbuffered output

…t-based" early stopping functionality

…mple using this approach

- Fix bash -u background run bug (stdbuf/nohup handling)
- Avoid clobbering OPENAI_API_KEY from GEMINI_API_KEY
- Use non-preview Gemini model names
- Place cascade_evaluation under evaluator and fix 2:1 GQA prompt

- Add curated demo output artifacts (best program + logs + config)
- Document demo location in README

- Commit best_program.py and best_program_info.json at example root
- Git-ignore demo/output dirs; remove demo_output_20260105_180918

fix(logging): fix logging bug in reject sampling

…validity Fixes #372: Reliability issues in `examples/mlx_metal_kernel_opt`

Manual Mode Support: extended visualizer UI

…early_stopping ARC-AGI-2 example + Event-based early stopping

Large codebase support through LLM changes description + TSP example using this approach

#384) When the visualizer imports data from a checkpoint, the data is sent to the JavaScript side via a response object decoded with `resp.json()` in `fetchAndRender()` from "main.js". This crashes if the payload does not fully respect the JSON spec (NaN and Infinity are not valid JSON, even though they are valid JavaScript values).

Supporting these values is useful for evolutions that minimize a positive metric (like a cost). In that case, we want to put -metric in combined_score (which will then be negative). Thus an evolved program that does not work should be given a worse score during evaluation. An easy way to do this is to use -inf, instead of not outputting any metric: a missing metric is replaced by 0 by default when the database is asked for a fitness, and 0 would outrank every working program because their scores are negative. Using -inf works well during evolution (ranking the top programs as expected), but during visualization it raised an error when fetching data.

Co-authored-by: Nolwen <nolwen.huet@imacs.polytechnique.fr>
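A minimal sketch of the kind of sanitization this implies on the server side, before the payload reaches `resp.json()`. The helper name `sanitize_for_json` and the choice to map non-finite floats to `None` (JSON `null`) are assumptions; the commit message does not say which replacement value the actual fix uses.

```python
import json
import math
from typing import Any


def sanitize_for_json(value: Any) -> Any:
    """Recursively replace NaN, +inf and -inf with None so the result is valid JSON.

    Assumption: None (JSON null) is an acceptable stand-in for the visualizer.
    """
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize_for_json(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_for_json(item) for item in value]
    return value


# Hypothetical checkpoint payload with a failed program scored at -inf.
payload = {
    "metrics": {"combined_score": float("-inf"), "cost": 12.5},
    "history": [float("nan"), 1.0],
}

# allow_nan=False makes json.dumps raise ValueError on any leftover
# non-finite float instead of emitting the non-standard NaN/Infinity tokens.
text = json.dumps(sanitize_for_json(payload), allow_nan=False)
```

Python's `json.dumps` emits `NaN` and `-Infinity` by default, which is what a strict parser such as the browser's rejects, so passing `allow_nan=False` is a cheap way to catch any value the sanitizer missed.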

…#385)

* Fix Anthropic models error when both temperature and top_p are passed

When using certain Anthropic models, passing both `temperature` and `top_p` results in an error. This PR makes these parameters optional by:
- Changing type annotations for `temperature` and `top_p` to `float | None`
- Changing default `top_p` from 0.95 to None in LLMConfig
- Adding logic to remove None values before dacite parsing to avoid type errors
- Adding example Anthropic config files for circle_packing
- Updating test to mock ANTHROPIC_API_KEY for config validation

Closes #378

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Bump version to 0.2.26

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
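The "remove None values" idea can be sketched without the real config classes. `SamplingConfig` below is a hypothetical stand-in for `LLMConfig`, its `temperature` default of 0.7 is an assumption, and `drop_none` and `request_params` are illustrative helpers rather than functions from the repository.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SamplingConfig:
    """Hypothetical stand-in for the sampling fields of LLMConfig."""
    temperature: Optional[float] = 0.7  # assumed default, not from the source
    top_p: Optional[float] = None       # default changed from 0.95 to None


def drop_none(data: Any) -> Any:
    """Recursively remove dict entries whose value is None."""
    if isinstance(data, dict):
        return {key: drop_none(val) for key, val in data.items() if val is not None}
    if isinstance(data, list):
        return [drop_none(item) for item in data]
    return data


def request_params(cfg: SamplingConfig) -> Dict[str, float]:
    """Build request keyword arguments, omitting parameters left unset."""
    return drop_none({"temperature": cfg.temperature, "top_p": cfg.top_p})
```

With the defaults above, `request_params(SamplingConfig())` contains only `temperature`, so a model that rejects the combination never sees `top_p` unless the user sets it explicitly.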

* Make max snapshot artifacts limit configurable

Add `database.max_snapshot_artifacts` config option to control how many program artifacts are included in worker process snapshots. Default remains 100 for backward compatibility.
- Set to a higher number to include more artifacts in prompts
- Set to `null` (None) for unlimited artifacts (use with caution for large populations as this can significantly increase memory usage)

Note: This limit only affects artifacts passed to worker processes, not the total artifacts stored. All program code is always available regardless of this setting.

Closes #383

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add tests for recent features

Add comprehensive tests for recently merged PRs:
- test_llm_config_optional_params.py: Tests for optional temperature/top_p parameters (PR #385 - Anthropic model compatibility)
- test_snapshot_artifacts_limit.py: Tests for configurable max_snapshot_artifacts (PR #386)
- test_visualization_sanitization.py: Tests for -inf/+inf/NaN sanitization in visualization (PR #384)
- test_early_stopping_config.py: Tests for event-based early stopping configuration (PR #375)
- test_changes_description.py: Tests for large codebase support via changes description (PR #376)

Total tests increased from 264 to 326.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add integration tests for example validation

Add comprehensive integration tests that verify:
- Example config files load correctly
- Initial programs have EVOLVE-BLOCK markers
- Evaluators exist and have required functions
- Evaluators can run on initial programs
- Cascade evaluation functions are detected
- Database stores and retrieves programs correctly
- Program evolution tracking works

Tests cover function_minimization, circle_packing, and signal_processing examples, plus general structure validation for all examples.

Total tests: 346 (was 326)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
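The semantics of `max_snapshot_artifacts` (an integer cap, or `None` for unlimited) might look roughly like this. `snapshot_artifacts` is a hypothetical helper, and keeping the first N entries in insertion order is an assumption; the commit message does not say which artifacts are kept when the cap applies.

```python
from typing import Dict, Optional


def snapshot_artifacts(
    artifacts: Dict[str, dict],
    max_snapshot_artifacts: Optional[int] = 100,
) -> Dict[str, dict]:
    """Return the subset of artifacts to copy into a worker snapshot.

    None means "no limit". The stored artifacts are never modified; only
    the copy handed to worker processes is capped.
    """
    if max_snapshot_artifacts is None:
        return dict(artifacts)
    # Assumed selection rule: keep the first N entries in insertion order.
    kept = list(artifacts.items())[:max_snapshot_artifacts]
    return dict(kept)
```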

* fix(prompt): warn when custom template directory doesn't exist

* fix(config): resolve template_dir relative to config file location

Previously, template_dir in config was resolved relative to the current working directory (CWD), making configs non-portable and dependent on where the command was run from.

Now template_dir is resolved relative to the config file's location, allowing users to specify relative paths that work regardless of CWD.
- Relative paths are converted to absolute based on config file directory
- Absolute paths remain unchanged
- Path normalization handles .. and . components correctly
- Added comprehensive test suite covering all path resolution scenarios
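The resolution rule described above can be sketched with the standard library. `resolve_template_dir` is a hypothetical name and the paths in the usage note are made up; the repository's own implementation may differ in detail.

```python
import os
from typing import Union

PathLike = Union[str, "os.PathLike[str]"]


def resolve_template_dir(template_dir: PathLike, config_path: PathLike) -> str:
    """Resolve template_dir against the config file's directory, not the CWD."""
    template_dir = os.fspath(template_dir)
    if os.path.isabs(template_dir):
        # Absolute paths are only normalized, never re-anchored.
        return os.path.normpath(template_dir)
    config_dir = os.path.dirname(os.path.abspath(os.fspath(config_path)))
    # normpath collapses "." and ".." components after joining.
    return os.path.normpath(os.path.join(config_dir, template_dir))
```

On a POSIX system, `resolve_template_dir("../templates", "/proj/examples/demo/config.yaml")` yields `/proj/examples/templates` no matter which directory the command is run from.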

## Problem

When diff-based evolution is enabled, the "Previous Attempts" section of prompts shows changes like:

    Change 1: Replace 15 lines with 18 lines

This gives the LLM no visibility into what the actual edits were, making it harder to:
- Learn from successful patterns
- Avoid repeating failed exact matches
- Understand what format produces valid SEARCH blocks

This contributes to the high rate of "apply diff fail" errors (see issue #346) where SEARCH patterns don't exactly match the original code.

## Solution

Update `format_diff_summary()` to show actual content for multi-line blocks:

    Change 1: Replace:
      def old_function():
          return False
    with:
      def new_function():
          return True

Single-line changes remain compact:

    Change 1: 'x = 1' to 'x = 2'

Add `_format_block_lines()` helper with configurable truncation limits.

## Configuration

New options in `prompt:` config section:

```yaml
prompt:
  diff_summary_max_line_len: 100  # Truncate lines longer than this
  diff_summary_max_lines: 30      # Max lines per SEARCH/REPLACE block
```

## Files Changed

- `openevolve/config.py` - Add PromptConfig options
- `openevolve/utils/code_utils.py` - Update format_diff_summary
- `openevolve/iteration.py` - Pass config to format_diff_summary
- `openevolve/process_parallel.py` - Pass config to format_diff_summary
- `tests/test_code_utils.py` - Add tests for new behavior

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

…392)

* Create test_island_child_placement.py

* Pass target_island to database.add

Fix issue #391 by ensuring children are placed in the intended target island instead of inheriting the parent's island. Add target_island to SerializableResult, capture sampling_island in the worker, and pass result.target_island into database.add when inserting child programs. Update tests to reflect the fixed behavior (children go to the target island) and add regression tests that demonstrate the old buggy behavior when target_island is not provided and the correct behavior when it is.
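The placement rule can be illustrated with a toy database. `ToyIslandDatabase` is a hypothetical model written for this sketch; only the `target_island` argument and the "inherit the parent's island when it is missing" behavior come from the commit message.

```python
from typing import Dict, List, Optional, Set


class ToyIslandDatabase:
    """Hypothetical, minimal model of island-based program storage."""

    def __init__(self, num_islands: int = 3) -> None:
        self.islands: List[Set[str]] = [set() for _ in range(num_islands)]
        self.island_of: Dict[str, int] = {}

    def add(
        self,
        program_id: str,
        parent_id: Optional[str] = None,
        target_island: Optional[int] = None,
    ) -> int:
        """Store a program and return the island it was placed in."""
        if target_island is not None:
            island = target_island % len(self.islands)  # explicit target wins
        elif parent_id in self.island_of:
            island = self.island_of[parent_id]  # old behavior: inherit from parent
        else:
            island = 0
        self.islands[island].add(program_id)
        self.island_of[program_id] = island
        return island
```

If the worker sampled from island 2 but the parent lives on island 0, calling `add(..., target_island=2)` places the child on island 2, whereas omitting the argument reproduces the old behavior of landing on the parent's island.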

codelion and others added 30 commits August 20, 2025 15:39

fixes

3ce52ae

Merge pull request #224 from codelion/fix-some-bugs

24ff60a

fixes

fix

7f1f294

Merge pull request #227 from codelion/fix-openai-model-handling

3a0e57b

fix

fixes

d795ac4

optimize config

fe7f5ec

f

78e153b

fix unit tests

0c0e026

Merge pull request #228 from codelion/fix--func-min-example

8941266

Fix func min example

Update config.yaml

38a95d8

Merge pull request #229 from codelion/fix-config-eval

d813696

Update config.yaml

Add support for gpt-oss-* models as reasoning models, and allow for r…

cf6bf04

…easoning_effort params

Merge pull request #231 from theahura/patch-1

8d19744

Add support for gpt-oss-* models as reasoning models, and allow for reasoning_effort params in standard models

Merge branch 'main' into Update-algotune-example

5ecd98f

Update _version.py

bd7e14b

Merge pull request #232 from codelion/Update-algotune-example

5eb3dba

Update algotune example

as

75cb01b

Update README.md

05501c8

Update README.md

d64ef35

Update README.md

8acbe4c

Update README.md

5178fc8

Update README.md

ade305e

as

3716970

Merge pull request #234 from codelion/fix-readme

3606401

as

Update release.yml

3d9d124

Update _version.py

5233292

lanmogu98 and others added 30 commits January 4, 2026 11:17

chore(mlx_metal): preserve baseline artifacts; add validity-fix README

2721643

docs(mlx_metal_kernel_opt): rewrite README and remove invalid report

61fa0e7

fix(mlx_metal_kernel_opt): align prompt examples and harden runner env

2b5aec0

Added manual-mode functionality in visualizer

cf6ab3d

Manual mode, mock issue minor fix

07e5d2f

Add an example for running ARC-AGI-2 tasks using OpenEvolve and "even…

5a583ac

…t-based" early stopping functionality

Large codebase support through LLM changes description + TSP SOTA exa…

9e57065

…mple using this approach

TSP example, Configuration, Minor fixes

9b82d0e

fix(mlx_metal_kernel_opt): stabilize run script and config

5055a68

- Fix bash -u background run bug (stdbuf/nohup handling)
- Avoid clobbering OPENAI_API_KEY from GEMINI_API_KEY
- Use non-preview Gemini model names
- Place cascade_evaluation under evaluator and fix 2:1 GQA prompt

chore(mlx_metal_kernel_opt): remove buggy artifacts

ba45973

docs(mlx_metal_kernel_opt): add demo results output (20260105_180918)

8ed7e9e

- Add curated demo output artifacts (best program + logs + config)
- Document demo location in README

docs(mlx_metal_kernel_opt): keep demo best program snapshot

69d2dab

- Commit best_program.py and best_program_info.json at example root
- Git-ignore demo/output dirs; remove demo_output_20260105_180918

Minor evolved program fixes

81c9039

tsp-example, README.md minor fixes

5c1ad67

fix logging

336e7eb

Merge pull request #382 from zigzagcai/fix-logging

67aca71

fix(logging): fix logging bug in reject sampling

Merge pull request #377 from lanmogu98/fix/mlx_metal_kernel_opt_eval_…

e29c2ba

…validity Fixes #372: Reliability issues in `examples/mlx_metal_kernel_opt`

Merge pull request #373 from strangecreator/manual-mode

980093c

Manual Mode Support: extended visualizer UI

Merge pull request #375 from omkar-rjoglekar/arc_example_event_based_…

22914d2

…early_stopping ARC-AGI-2 example + Event-based early stopping

Merge pull request #376 from strangecreator/tsp-example

75505f6

Large codebase support through LLM changes description + TSP example using this approach


mycpuorg wants to merge 413 commits into TheDuckAI:dev from algorithmicsuperintelligence:main

mycpuorg commented Jul 31, 2025


20 participants
