fix: robustness and eval improvements across core, evolution, and agents #158
lfnothias wants to merge 1 commit into
- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per change review

sources/core/factory.py - MCP startup: infinite loop → bounded retry
Problem:

sources/core/orchestrator.py - error type classification
Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.

sources/core/workflow_factory.py - three hardening changes
Comment: All three okay.

sources/core/workflow_runner.py - temp exec scripts deleted in
fix(core): rework robustness changes per PR #158 review
Context
Changes introduced during an ASB Haffner task_001 evaluation run. All 14 files are general-purpose improvements; none is Haffner-specific. Single cherry-pick of commit `641010e` from local `mimosa_v2` onto `origin/mimosa_v2` (applied cleanly, no conflicts).

Verified against: ASB Haffner task_001, 3 QD iterations, best score 0.8303 (27/34 claims), ~$1.04 cumulative cost. No regressions observed on `metabo_dev.csv` tasks.

Per-change rationale

`sources/core/factory.py` - MCP startup: infinite loop → bounded retry
Problem: `while tool_setup == False:` had no exit condition. If MCP servers never became ready, the process spun forever with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises `RuntimeError` after exhaustion.

`sources/core/orchestrator.py` - error type classification
Problem: all execution failures were logged as generic "execution failed" with no type signal.
Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.
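
The bounded retry described for `factory.py` can be sketched roughly as follows. This is an illustration, not the code from the commit: `wait_until_ready`, `probe`, and the doubling-with-cap schedule are assumptions. The description only states 5 attempts, a 1 s → 30 s backoff, and a `RuntimeError` on exhaustion, and does not say how the schedule reaches 30 s.

```python
import time
from typing import Callable


def backoff_delays(attempts: int = 5, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay to sleep after each failed attempt: base * 2**i, capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` up to `attempts` times; return the 1-based attempt that succeeded.

    Raises RuntimeError when every attempt fails, so the caller gets a
    diagnostic instead of spinning forever.
    """
    delays = backoff_delays(attempts, base, cap)
    for attempt, delay in enumerate(delays, start=1):
        if probe():
            return attempt
        if attempt < attempts:  # no point sleeping after the final failure
            sleep(delay)
    raise RuntimeError(f"MCP servers not ready after {attempts} attempts")
```

Passing `sleep` as a parameter is a design choice made for this sketch so the schedule can be exercised in tests without real waiting.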

`sources/core/workflow_factory.py` - three hardening changes
- Python keyword identifiers (`return`, `import`, etc.) are validated at generation time instead of surfacing as a confusing `SyntaxError` at runtime.
- Atomic `state_result.json` write: write to `.tmp`, then `os.replace()`. Prevents a crash mid-write from leaving a truncated JSON that poisons subsequent reads.
- Post-assemble compile check in `assemble_workflow()`: catches syntax errors before paying full sandbox overhead.

`sources/core/workflow_runner.py` - temp exec scripts deleted in `finally`
Problem: `{exec_id}.py` temp files were only deleted on the happy path; exceptions left them accumulating.
Fix: moved deletion into `finally`.

`sources/core/evolution_engine.py` - three fixes
- Ghost `WorkflowInfo` guard: when `uuid == "generation_failed"`, constructing `WorkflowInfo` silently corrupted `rewards_history` with 0.0 placeholders. Fixed by gating on `_valid_uuid`.
- `_export_astra()` returns bool + `.export_status` sidecar: callers can inspect export outcome without re-running; failures are non-fatal and self-documenting.
- VariationEngine histories cleared on session start: `score_history`, `textual_gradient_history`, and `agent_count_history` persisted across tasks in batch runs, causing the mutator to apply boldness gradients from the previous task.

`sources/core/selection.py` - metrics refresh before eviction
Problem: eviction decisions were made against stale metrics from the previous iteration.
Fix: `_refresh_member_metrics()` moved before the eviction check.

`sources/core/planner.py` - dep output verification in `_can_execute_step()`
Problem: a step was still marked executable when a completed dependency was missing its expected output files, letting dependent steps run on missing inputs.
Fix: verifies expected output files exist on disk.
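
The atomic `state_result.json` write listed under `workflow_factory.py` follows the usual write-then-rename pattern. A minimal sketch, assuming a hypothetical helper `write_json_atomic` and a `.tmp` suffix appended to the target name (the helper and naming in the commit may differ):

```python
import json
import os
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, payload: Any) -> None:
    """Write `payload` as JSON so readers never observe a partial file."""
    tmp_path = path.with_name(path.name + ".tmp")
    try:
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    finally:
        # If anything failed before the rename, drop the leftover temp file.
        if tmp_path.exists():
            tmp_path.unlink()
```

If serialization fails partway, the previous `state_result.json` is left untouched, which is the property the fix is after.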

`sources/modules/smolagent_factory.py` - four fixes
- `fcntl.flock` in `save_memories()`: concurrent agents interleaved writes to shared memory file.
- `except TimeoutError: raise`: broad `except Exception` was swallowing `TimeoutError`.
- `result['exception'] = e`: exception capture had been dropped in a prior refactor, breaking retry visibility.
- `max_steps=35` + context-window guard callback: hard cap on agent steps; graceful early exit before OOM.

`sources/transparency/astra_exporter.py` - fail-fast on empty trace
Problem: empty `agent_trace` produced a malformed YAML exported without error.
Fix: returns `None` early; `.export_status` sidecar records the failure.

`sources/core/llm_provider.py` - Opus 4.x temperature handling
Problem: Anthropic's Opus 4.x rejects the `temperature` parameter entirely; falling back to `temperature=1.0` also failed.
Fix: `_is_no_temperature_model()` detects Opus 4.x and omits the parameter. Error detection broadened to catch provider message string variants.

`sources/core/variation_engine.py` - Claude temperature cap at 1.0
Problem: `variation_temperature=1.2` triggered Anthropic API validation error (temperature must be ≤ 1.0).
Fix: caps at 1.0 for Claude models.
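
Taken together, the two temperature fixes amount to a small normalization step before the provider call. The sketch below is illustrative only: `sampling_kwargs` and the model-name substring checks are hypothetical stand-ins, and the real `_is_no_temperature_model()` may detect models differently.

```python
from typing import Any


def is_no_temperature_model(model: str) -> bool:
    """Hypothetical check: treat any 'opus-4' model name as rejecting temperature."""
    return "opus-4" in model.lower()


def is_claude_model(model: str) -> bool:
    """Hypothetical check based on the model name alone."""
    return "claude" in model.lower()


def sampling_kwargs(model: str, temperature: float) -> dict[str, Any]:
    """Return the sampling kwargs to send for `model`."""
    if is_no_temperature_model(model):
        return {}  # omit the parameter entirely
    if is_claude_model(model):
        temperature = min(temperature, 1.0)  # cap at 1.0 for Claude models
    return {"temperature": temperature}
```

Returning an empty dict for the no-temperature case lets the caller splat the result into the request without special-casing the model.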

`sources/benchmark_evaluation/csv_mode.py` - three eval correctness fixes
- Parse `SUCCESS_LEVEL` from LLM response via regex: was always defaulting to "Medium".
- `scenario_rubric=scenario_rubric_filename`: was `None`, silently ignoring scenario-specific rubric files.
- `runs_capsule_dir` resolved to absolute path: relative paths broke runs started from non-root directories.

`sources/prompts/workflow_v11.md` - four agent guidance rules
Agents repeatedly made the same four mistakes: wrong MCP response schema indexing, relative paths, `subprocess` instead of shell tool, built-in file I/O instead of file MCP tools. Added explicit rules with examples.

`config.py` - three new first-class config fields
`max_steps`, `max_context_tokens`, and `export_astra` promoted from ad-hoc kwargs to `Config` dataclass fields with `jsonify()`/`from_json()` round-trip support.

🤖 Generated with Claude Code