Open
Fix func min example
Update config.yaml
Expanded optimization hints and LLM context in affine_transform_2d/config.yaml, increased max_tokens and program counts, and switched to the Gemini model. Updated evaluator.py to use EvaluationResult for artifact support and improved error reporting. Added optimization notes to initial_program.py. Removed obsolete benchmark results JSON.
…easoning_effort params
Add support for gpt-oss-* models as reasoning models, and allow for reasoning_effort params in standard models
Updated LLM model weights to include 'google/gemini-2.5-pro' with adjusted weights across all tasks. Reduced verbosity and streamlined optimization hints in config prompts for clarity. Adjusted parallel_evaluations to 4 for most tasks (except polynomial_real, set to 1 to avoid JAX conflicts) and increased evaluator timeout for polynomial_real. Updated initial_program.py files to clarify and reorganize optimization opportunities.
Deleted detailed experiment reports for GEMINI_FLASH_2.5 and O4_MINI from the examples/algotune directory. Replaced the README with a comprehensive optimization report summarizing the OpenEvolve AlgoTune benchmark journey, key discoveries, configuration strategies, and best practices for evolutionary code optimization.
Added best_program.py and best_program_info.json files for the following tasks: affine_transform_2d, convolve2d_full_fill, eigenvectors_complex, fft_cmplx_scipy_fftpack, fft_convolution, lu_factorization, polynomial_real, and psd_cone_projection. These files contain the initial evolved implementations and associated metadata for each task.
Update algotune example
Previously, compilation errors like 'Unable to build metal library from source' were treated the same as transient GPU errors (memory pressure, command buffer). This caused 16+ unnecessary retry attempts when a kernel had bfloat16-incompatible Metal code.
Changes:
- Detect 'unable to build metal library' error messages
- Return immediately with failure instead of retrying
- Mark error as 'compilation_error: True' for caller awareness
- Increment metal_compilation_errors counter for tracking
This significantly reduces evaluation time for programs with invalid Metal code.
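The fail-fast behaviour described in this commit could look roughly like the sketch below. Only the error substring, the `compilation_error` flag and the `metal_compilation_errors` counter come from the commit message; `run_with_retries`, `is_compilation_error`, the `stats` dict and the retry defaults are hypothetical names chosen for illustration, not the evaluator's actual code.

```python
import time
from typing import Any, Callable, Dict

# Substring taken from the commit message; compared case-insensitively.
COMPILATION_MARKER = "unable to build metal library"


def is_compilation_error(exc: Exception) -> bool:
    """Return True when the error text looks like a Metal build failure."""
    return COMPILATION_MARKER in str(exc).lower()


def run_with_retries(
    fn: Callable[[], Any],
    stats: Dict[str, int],
    max_retries: int = 3,
    delay_s: float = 0.0,
) -> Dict[str, Any]:
    """Retry transient failures, but give up at once on compilation errors."""
    last_error = ""
    for attempt in range(1, max_retries + 1):
        try:
            return {"success": True, "result": fn(), "attempts": attempt}
        except Exception as exc:  # sketch only: real code would catch narrower types
            last_error = str(exc)
            if is_compilation_error(exc):
                # Deterministic failure: retrying cannot help, so count it and stop.
                stats["metal_compilation_errors"] = stats.get("metal_compilation_errors", 0) + 1
                return {
                    "success": False,
                    "error": last_error,
                    "compilation_error": True,
                    "attempts": attempt,
                }
            time.sleep(delay_s)  # transient error: wait, then try again
    return {
        "success": False,
        "error": last_error,
        "compilation_error": False,
        "attempts": max_retries,
    }
```

A caller can check `result["compilation_error"]` to decide whether the kernel source itself needs to change rather than the run being repeated.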
… script
Evaluation optimizations:
- Run correctness test BEFORE baseline benchmark (fail fast)
- Skip expensive baseline if kernel doesn't compile (bfloat16 errors)
Experiment runner script (run_evolve_experiment.sh):
- Explicit log file truncation on resume to prevent content mixing
- Fixed checkpoint resume logic to select highest numbered checkpoint
- Added PYTHONUNBUFFERED and stdbuf for reliable log ordering
- Support for --resume, --foreground, --iterations, --dry-run flags
Documentation:
- Added changelog section to README_validity_fix.md
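The "select highest numbered checkpoint" fix is the kind of bug that appears when names are compared as strings instead of numbers. The runner script itself is a shell script; the Python sketch below only illustrates the numeric comparison. The `checkpoint_<N>` naming pattern and the `latest_checkpoint` helper are assumptions made for this example.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme for this sketch: "checkpoint_<integer>".
CHECKPOINT_RE = re.compile(r"checkpoint_(\d+)")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the name with the largest numeric suffix, or None if none match."""
    best_name: Optional[str] = None
    best_number = -1
    for name in names:
        match = CHECKPOINT_RE.fullmatch(name)
        if match is None:
            continue  # ignore unrelated entries such as log directories
        number = int(match.group(1))
        if number > best_number:
            best_number, best_name = number, name
    return best_name
```

A plain lexicographic `max()` over the same names would rank `checkpoint_9` above `checkpoint_100`, which is why the suffix is converted to an integer first.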
- Add -W ignore::RuntimeWarning to suppress harmless import warnings
- Filter RuntimeWarning from stderr output for cleaner error messages
- Show return code in failure messages for better debugging
- Add 0.5s delay between runs to reduce GPU resource contention
- Show more stderr content (200 chars vs 100) for better diagnostics
- Rewrite README.md as concise usage guide (~100 lines)
- Extract detailed analysis to EVOLUTION_ANALYSIS.md (~250 lines)
- Document validity fixes: subprocess hook, bf16 gate, 16:8 heads
- Add experiment results showing -3.2% regression vs baseline
- Analyze evolution limitations from RL perspective
- Compare with KernelBench metrics for future improvements
- Minor fixes: config.yaml model names, run script unbuffered output
…t-based" early stopping functionality
…mple using this approach
- Fix bash -u background run bug (stdbuf/nohup handling)
- Avoid clobbering OPENAI_API_KEY from GEMINI_API_KEY
- Use non-preview Gemini model names
- Place cascade_evaluation under evaluator and fix 2:1 GQA prompt
- Add curated demo output artifacts (best program + logs + config)
- Document demo location in README
- Commit best_program.py and best_program_info.json at example root
- Git-ignore demo/output dirs; remove demo_output_20260105_180918
fix(logging): fix logging bug in reject sampling
…validity Fixes #372: Reliability issues in `examples/mlx_metal_kernel_opt`
Manual Mode Support: extended visualizer UI
…early_stopping ARC-AGI-2 example + Event-based early stopping
Large codebase support through LLM changes description + TSP example using this approach
#384)
When the visualizer imports data from a checkpoint, the data is sent to the JavaScript side via a response object decoded with `resp.json()` in `fetchAndRender()` from "main.js". This crashes if the payload does not fully respect the JSON spec (NaN and Infinity are not valid JSON, even though they are valid JavaScript values).
Accepting such values is useful for evolutions that minimize a positive metric (like a cost). In that case, we want to put -metric in combined_score (which will then be negative). An evolved program that does not work should therefore be given a worse score during evaluation. An easy way to do this is to use -inf (instead of not outputting any metric, which the database replaces with 0 by default when a fitness is requested). Doing so works well during evolution (ranking the top programs as expected), but during visualization it raised an error when fetching data.
Co-authored-by: Nolwen <nolwen.huet@imacs.polytechnique.fr>
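One way to keep such a payload valid JSON is to map non-finite floats to `null` before serializing. The sketch below is an illustration under that assumption: `sanitize_for_json` is a hypothetical helper, and the commit message does not state which replacement value the actual fix uses.

```python
import json
import math
from typing import Any


def sanitize_for_json(value: Any) -> Any:
    """Recursively replace NaN and +/-Infinity with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize_for_json(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize_for_json(item) for item in value]
    return value


# A failed program scored with -inf, as in the scenario described above.
metrics = {"combined_score": float("-inf"), "history": [1.0, float("nan")]}

# allow_nan=False makes json.dumps raise ValueError if a non-finite float slips through.
payload = json.dumps(sanitize_for_json(metrics), allow_nan=False)
```

By default `json.dumps` emits the bare tokens `-Infinity` and `NaN`, which a strict parser such as the browser's `resp.json()` rejects; sanitizing first avoids that.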
…#385)
* Fix Anthropic models error when both temperature and top_p are passed
When using certain Anthropic models, passing both `temperature` and `top_p` results in an error. This PR makes these parameters optional by:
- Changing type annotations for `temperature` and `top_p` to `float | None`
- Changing default `top_p` from 0.95 to None in LLMConfig
- Adding logic to remove None values before dacite parsing to avoid type errors
- Adding example Anthropic config files for circle_packing
- Updating test to mock ANTHROPIC_API_KEY for config validation
Closes #378
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Bump version to 0.2.26
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
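The idea of treating `temperature` and `top_p` as optional and dropping unset values can be sketched as follows. `SamplingParams` and `drop_none` are hypothetical stand-ins for illustration, not the actual `LLMConfig` or its parsing code, and the defaults other than `top_p=None` are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SamplingParams:
    """Hypothetical stand-in for the optional sampling fields of an LLM config."""

    temperature: Optional[float] = 0.7  # assumed default, for illustration only
    top_p: Optional[float] = None       # None means "do not send this parameter"
    max_tokens: Optional[int] = 4096    # assumed default, for illustration only


def drop_none(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Remove keys whose value is None so they are never forwarded."""
    return {key: value for key, value in raw.items() if value is not None}


# Only the parameters that were actually set reach the request.
request_kwargs = drop_none(asdict(SamplingParams()))
```

With `top_p` left at `None`, the resulting keyword arguments contain `temperature` but no `top_p`, so the two are never sent together unless the user sets both explicitly.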
* Make max snapshot artifacts limit configurable
Add `database.max_snapshot_artifacts` config option to control how many program artifacts are included in worker process snapshots. Default remains 100 for backward compatibility.
- Set to a higher number to include more artifacts in prompts
- Set to `null` (None) for unlimited artifacts (use with caution for large populations as this can significantly increase memory usage)
Note: This limit only affects artifacts passed to worker processes, not the total artifacts stored. All program code is always available regardless of this setting.
Closes #383
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add tests for recent features
Add comprehensive tests for recently merged PRs:
- test_llm_config_optional_params.py: Tests for optional temperature/top_p parameters (PR #385 - Anthropic model compatibility)
- test_snapshot_artifacts_limit.py: Tests for configurable max_snapshot_artifacts (PR #386)
- test_visualization_sanitization.py: Tests for -inf/+inf/NaN sanitization in visualization (PR #384)
- test_early_stopping_config.py: Tests for event-based early stopping configuration (PR #375)
- test_changes_description.py: Tests for large codebase support via changes description (PR #376)
Total tests increased from 264 to 326.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add integration tests for example validation
Add comprehensive integration tests that verify:
- Example config files load correctly
- Initial programs have EVOLVE-BLOCK markers
- Evaluators exist and have required functions
- Evaluators can run on initial programs
- Cascade evaluation functions are detected
- Database stores and retrieves programs correctly
- Program evolution tracking works
Tests cover function_minimization, circle_packing, and signal_processing examples, plus general structure validation for all examples.
Total tests: 346 (was 326)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
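The `database.max_snapshot_artifacts` semantics from the first commit in this group (an integer cap, or `None` for unlimited) can be illustrated with a small sketch. `limit_snapshot_artifacts` is a hypothetical helper, and keeping the first N entries in insertion order is an assumption; the commit message does not say which artifacts are retained.

```python
from itertools import islice
from typing import Any, Dict, Optional


def limit_snapshot_artifacts(
    artifacts: Dict[str, Any],
    max_snapshot_artifacts: Optional[int] = 100,
) -> Dict[str, Any]:
    """Return at most max_snapshot_artifacts entries; None means no limit."""
    if max_snapshot_artifacts is None:
        return dict(artifacts)  # unlimited: copy everything into the snapshot
    # Assumption for this sketch: keep the first N entries in insertion order.
    return dict(islice(artifacts.items(), max_snapshot_artifacts))


# 250 fake program artifacts keyed by program id.
all_artifacts = {f"prog_{i}": {"stderr": ""} for i in range(250)}
default_snapshot = limit_snapshot_artifacts(all_artifacts)
full_snapshot = limit_snapshot_artifacts(all_artifacts, max_snapshot_artifacts=None)
```

The original dictionary is left untouched in both cases, which matches the note that the limit affects only what is passed to worker processes, not what is stored.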
* fix(prompt): warn when custom template directory doesn't exist
* fix(config): resolve template_dir relative to config file location
Previously, template_dir in config was resolved relative to the current working directory (CWD), making configs non-portable and dependent on where the command was run from.
Now template_dir is resolved relative to the config file's location, allowing users to specify relative paths that work regardless of CWD.
- Relative paths are converted to absolute based on config file directory
- Absolute paths remain unchanged
- Path normalization handles .. and . components correctly
- Added comprehensive test suite covering all path resolution scenarios
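The resolution rule in this commit can be sketched with the standard library alone. `resolve_template_dir` is a hypothetical function name used for illustration; the real change may be structured differently.

```python
import os
from typing import Optional


def resolve_template_dir(template_dir: Optional[str], config_path: str) -> Optional[str]:
    """Resolve template_dir against the config file's directory, not the CWD."""
    if template_dir is None:
        return None
    if os.path.isabs(template_dir):
        return template_dir  # absolute paths remain unchanged
    config_dir = os.path.dirname(os.path.abspath(config_path))
    # normpath collapses "." and ".." components after joining.
    return os.path.normpath(os.path.join(config_dir, template_dir))


resolved = resolve_template_dir("../templates", "/proj/configs/config.yaml")
```

Because the base directory comes from the config file's own path, the same config yields the same template directory no matter where the command is launched from.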
## Problem
When diff-based evolution is enabled, the "Previous Attempts" section
of prompts shows changes like:
Change 1: Replace 15 lines with 18 lines
This gives the LLM no visibility into what the actual edits were,
making it harder to:
- Learn from successful patterns
- Avoid repeating failed exact matches
- Understand what format produces valid SEARCH blocks
This contributes to the high rate of "apply diff fail" errors
(see issue #346) where SEARCH patterns don't exactly match the
original code.
## Solution
Update `format_diff_summary()` to show actual content for multi-line
blocks:
Change 1: Replace:
  def old_function():
      return False
with:
  def new_function():
      return True
Single-line changes remain compact:
Change 1: 'x = 1' to 'x = 2'
Add `_format_block_lines()` helper with configurable truncation limits.
## Configuration
New options in `prompt:` config section:
```yaml
prompt:
  diff_summary_max_line_len: 100  # Truncate lines longer than this
  diff_summary_max_lines: 30      # Max lines per SEARCH/REPLACE block
```
## Files Changed
- `openevolve/config.py` - Add PromptConfig options
- `openevolve/utils/code_utils.py` - Update format_diff_summary
- `openevolve/iteration.py` - Pass config to format_diff_summary
- `openevolve/process_parallel.py` - Pass config to format_diff_summary
- `tests/test_code_utils.py` - Add tests for new behavior
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…392)
* Create test_island_child_placement.py
* Pass target_island to database.add
Fix issue #391 by ensuring children are placed in the intended target island instead of inheriting the parent's island. Add target_island to SerializableResult, capture sampling_island in the worker, and pass result.target_island into database.add when inserting child programs. Update tests to reflect the fixed behavior (children go to the target island) and add regression tests that demonstrate the old buggy behavior when target_island is not provided and the correct behavior when it is.
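The difference between inheriting the parent's island and honouring an explicit `target_island` can be shown with a toy model. `Program` and `TinyIslandDB` below are hypothetical, heavily simplified classes written for this example; only the idea of passing `target_island` to `add` comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Program:
    """Toy program record; the real class carries far more state."""

    id: str
    parent_id: Optional[str] = None
    metadata: Dict[str, int] = field(default_factory=dict)


class TinyIslandDB:
    """Toy island database used only to illustrate child placement."""

    def __init__(self, num_islands: int) -> None:
        self.programs: Dict[str, Program] = {}
        self.islands: List[Set[str]] = [set() for _ in range(num_islands)]

    def add(self, program: Program, target_island: Optional[int] = None) -> int:
        if target_island is None:
            # Old fallback: inherit the parent's island (island 0 if there is no parent).
            parent = self.programs.get(program.parent_id) if program.parent_id else None
            target_island = parent.metadata.get("island", 0) if parent else 0
        program.metadata["island"] = target_island
        self.programs[program.id] = program
        self.islands[target_island].add(program.id)
        return target_island


db = TinyIslandDB(num_islands=3)
db.add(Program("parent"), target_island=0)
inherited = db.add(Program("child_a", parent_id="parent"))                 # falls back to island 0
placed = db.add(Program("child_b", parent_id="parent"), target_island=2)   # lands on island 2
```

Without the explicit argument every child of `parent` stays on island 0, which is the behaviour the regression tests in this commit are described as demonstrating.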
No description provided.