fix: robustness and eval improvements across core, evolution, and agents by lfnothias · Pull Request #157 · HolobiomicsLab/Mimosa-AI

lfnothias · 2026-06-28T10:54:18Z

Context

Changes introduced during an ASB Haffner task_001 evaluation run on mimosa_v2. All 14 files are general-purpose improvements — none is Haffner-specific. The main dev should review each rationale and decide whether to merge as-is or fold into a larger cleanup.

Single cherry-pick of commit 641010e from mimosa_v2 onto dev. Conflicts resolved:

evolution_engine.py: kept new _export_astra() method (absent on dev)
astra_exporter.py: restored (was deleted on dev, re-added here)

Per-change rationale

`sources/core/factory.py` — MCP startup: infinite loop → bounded retry with exponential backoff

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready (port conflict, crash), the process would spin forever consuming CPU with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError with a clear message after exhaustion.
Decision point: retry count and backoff ceiling are hardcoded; could be config fields if needed.
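
A minimal sketch of the bounded retry described above, not the code in `factory.py`. The helper name `wait_for_tools`, the `check_ready` callable, and the doubling factor are assumptions; only the 5 attempts, the 1 s starting delay, the 30 s ceiling, and the `RuntimeError` come from the description.

```python
import time


def wait_for_tools(check_ready, max_attempts=5, base_delay=1.0,
                   max_delay=30.0, sleep=time.sleep):
    """Poll check_ready() a bounded number of times with exponential backoff.

    Returns the attempt number that succeeded, or raises RuntimeError.
    The doubling factor is an assumption; max_delay only acts as a cap.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return attempt
        if attempt < max_attempts:
            # Sleep only between attempts, never after the last one.
            sleep(min(delay, max_delay))
            delay *= 2
    raise RuntimeError(
        f"MCP servers not ready after {max_attempts} attempts; "
        "check for port conflicts or crashed server processes"
    )
```

Passing `sleep` as a parameter keeps the helper testable without real waits. If retry count and ceiling become config fields, they map directly onto `max_attempts` and `max_delay`.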

`sources/core/orchestrator.py` — error type classification (TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR)

Problem: all execution failures were logged as a generic "execution failed" with no type signal, making targeted retry strategies impossible.
Fix: classifies errors into three buckets so the evolution loop can apply different remediation (e.g., syntax errors never benefit from re-running the same code).
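
One way the three buckets could be derived is sketched below. The `ErrorType` enum, the `classify_error` helper, and the stderr substring markers are illustrative assumptions, not the logic in `orchestrator.py`.

```python
from enum import Enum


class ErrorType(Enum):
    TIMEOUT = "TIMEOUT"
    SYNTAX_ERROR = "SYNTAX_ERROR"
    RUNTIME_ERROR = "RUNTIME_ERROR"


def classify_error(stderr: str, timed_out: bool = False) -> ErrorType:
    """Map a failed execution to one of three buckets (hypothetical markers)."""
    if timed_out or "TimeoutError" in stderr:
        return ErrorType.TIMEOUT
    if "SyntaxError" in stderr or "IndentationError" in stderr:
        return ErrorType.SYNTAX_ERROR
    # Anything else that failed is treated as a runtime error.
    return ErrorType.RUNTIME_ERROR


def should_rerun_unchanged(error: ErrorType) -> bool:
    """Re-running identical code cannot fix a syntax error."""
    return error is not ErrorType.SYNTAX_ERROR
```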

`sources/core/workflow_factory.py` — three hardening changes

Python keyword validation for node names — workflows using return, import, etc. as node names produced confusing SyntaxError at runtime. Now caught at generation time with an actionable message.
Atomic state_result.json write — write to .tmp, then os.replace(). Without this, a crash mid-write leaves a truncated JSON that poisons subsequent reads.
Compile check after assemble_workflow() — calling compile() before submitting to the sandbox catches syntax errors without paying the full sandbox overhead.
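
The three checks above can be sketched with the standard library alone. The function names are hypothetical and the real `workflow_factory.py` may structure this differently; `keyword.iskeyword`, `os.replace`, and the built-in `compile()` are the only calls the sketch relies on.

```python
import json
import keyword
import os


def validate_node_name(name: str) -> None:
    """Reject node names that cannot be used as Python identifiers."""
    if keyword.iskeyword(name) or not name.isidentifier():
        raise ValueError(
            f"Invalid node name {name!r}: it is a Python keyword or not a "
            "valid identifier; rename the node"
        )


def write_json_atomic(path: str, payload: dict) -> None:
    """Write to a sibling .tmp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def check_compiles(source: str, filename: str = "<workflow>") -> None:
    """Raise SyntaxError early, before any sandbox is started."""
    compile(source, filename, "exec")
```

Readers of `state_result.json` then see either the previous complete file or the new complete file, never a partial write.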

`sources/core/workflow_runner.py` — temp exec scripts deleted in `finally`

Problem: {exec_id}.py temp files were only deleted on the happy path. A crash or exception left them accumulating on disk across long eval runs.
Fix: moved deletion into finally so it runs unconditionally.
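
A small sketch of the cleanup pattern, assuming a hypothetical `run_temp_script` helper and a caller-supplied `runner`; only the `{exec_id}.py` naming and the `finally` placement come from the description.

```python
import os


def run_temp_script(code: str, exec_id: str, workdir: str, runner):
    """Write {exec_id}.py, hand it to runner, and always delete it afterwards."""
    script_path = os.path.join(workdir, f"{exec_id}.py")
    try:
        with open(script_path, "w", encoding="utf-8") as fh:
            fh.write(code)
        return runner(script_path)
    finally:
        # Runs on success, on exceptions, and on early returns.
        try:
            os.remove(script_path)
        except FileNotFoundError:
            pass
```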

`sources/core/evolution_engine.py` — three independent fixes

Ghost WorkflowInfo guard — when workflow generation failed (uuid = "generation_failed"), the code still constructed a WorkflowInfo with that path, silently corrupting rewards_history with 0.0 placeholders. Fixed by gating construction on _valid_uuid.
_export_astra() returns bool + writes .export_status sidecar — callers can now inspect whether export succeeded without re-running anything; failures are non-fatal and self-documenting.
VariationEngine histories cleared at session start — score_history, textual_gradient_history, and agent_count_history were persisting across tasks in a batch run, causing the mutator to apply boldness gradients calibrated to the previous task. Fixed by clearing in evolve().
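
The bool-plus-sidecar contract of `_export_astra()` might look like the sketch below. The wrapper name `export_with_status`, the `export_fn` callable, and the JSON layout of `.export_status` are assumptions; the description only states that the method returns a bool, writes the sidecar, and treats failures as non-fatal.

```python
import json
from pathlib import Path


def export_with_status(export_fn, out_dir: Path) -> bool:
    """Run an exporter, record the outcome in a sidecar, never raise."""
    try:
        result = export_fn()
        ok = result is not None
        detail = "exported" if ok else "exporter returned None"
    except Exception as exc:  # export failures must not abort the run
        ok = False
        detail = f"{type(exc).__name__}: {exc}"
    # Hypothetical sidecar format: a small JSON object.
    (out_dir / ".export_status").write_text(
        json.dumps({"success": ok, "detail": detail}), encoding="utf-8"
    )
    return ok
```

Callers can branch on the return value during the run, while the sidecar keeps the outcome inspectable afterwards without re-running the export.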

`sources/core/selection.py` — `_refresh_member_metrics()` moved before eviction check

Problem: the eviction decision was being made against stale metrics (scores not yet refreshed for the current iteration). Moving the refresh before the eviction check ensures correct ordering.

`sources/core/planner.py` — `_can_execute_step()` verifies expected outputs of completed deps

Problem: a completed dependency that hadn't actually produced its expected output files was still marking the step as executable, allowing dependent steps to run on missing inputs.
Fix: verifies that expected output files exist on disk before considering a dep "done."
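
The stricter readiness check could be sketched as follows. The `Step` dataclass, its field names, and the `"completed"` status string are assumptions made for illustration; the planner's actual data model is not shown in this PR.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Step:
    name: str
    status: str = "pending"
    depends_on: list = field(default_factory=list)
    expected_outputs: list = field(default_factory=list)


def can_execute_step(step: Step, steps_by_name: dict, workdir: Path) -> bool:
    """A dependency counts as done only if its expected files exist on disk."""
    for dep_name in step.depends_on:
        dep = steps_by_name[dep_name]
        if dep.status != "completed":
            return False
        if not all((workdir / out).exists() for out in dep.expected_outputs):
            return False
    return True
```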

`sources/modules/smolagent_factory.py` — four fixes

fcntl.flock file lock in save_memories() — concurrent agents writing to the shared memory file could interleave writes. Lock scope is tight (write only).
Explicit except TimeoutError: raise before generic handler — the broad except Exception was swallowing TimeoutError, making debugging impossible (timeout looked like a silent success).
Restored result['exception'] = e in _run_agent() inner loop — exception capture had been removed in a prior refactor, breaking retry visibility.
max_steps=35 and context-window guard callback — sets a hard cap on agent steps and triggers graceful early exit when the context approaches the configured max_context_tokens threshold to prevent OOM crashes.
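
The first three fixes can be illustrated with a short sketch. `fcntl.flock` is Unix-only, and the function bodies here are assumptions rather than the code in `smolagent_factory.py`; `run_agent_once` in particular is a hypothetical stand-in for the inner loop of `_run_agent()`.

```python
import fcntl
import json


def save_memories(path: str, memories: list) -> None:
    """Serialize first, then hold an exclusive lock only around the write."""
    payload = json.dumps(memories)
    # "a+" avoids truncating the file before the lock is held.
    with open(path, "a+", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        try:
            fh.truncate(0)
            fh.write(payload)
            fh.flush()
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def run_agent_once(agent_call) -> dict:
    """Let timeouts propagate; capture every other exception for the retry logic."""
    result = {"output": None, "exception": None}
    try:
        result["output"] = agent_call()
    except TimeoutError:
        # Must come before the generic handler, or the timeout is swallowed.
        raise
    except Exception as e:
        result["exception"] = e
    return result
```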

`sources/transparency/astra_exporter.py` — fail-fast on empty trace

Problem: an empty agent_trace list produced a malformed YAML with null fields, silently exported without error.
Fix: returns None early with a clear log message; the caller's .export_status sidecar records the failure.

`sources/core/llm_provider.py` — Opus 4.x temperature handling

Problem: Anthropic's Opus 4.x API rejects the temperature parameter entirely (not just clamps it). The previous code attempted to recover by falling back to temperature=1.0, which also failed.
Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter from the request body. Error detection also broadened to catch provider message string variants.
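
A hedged sketch of omitting the parameter rather than substituting a value. The substring rule in `is_no_temperature_model` and the `build_request_body` helper are assumptions; the detection logic in `llm_provider.py` is not shown in this PR, and the model names below are examples only.

```python
from typing import Optional


def is_no_temperature_model(model: str) -> bool:
    """Hypothetical detection rule: substring match on the model name."""
    return "opus-4" in model.lower()


def build_request_body(model: str, messages: list,
                       temperature: Optional[float] = None) -> dict:
    """Include 'temperature' only for models that accept it."""
    body = {"model": model, "messages": messages}
    if temperature is not None and not is_no_temperature_model(model):
        body["temperature"] = temperature
    return body
```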

`sources/core/variation_engine.py` — Claude temperature cap at 1.0

Problem: variation_temperature was set to 1.2 for all models including Claude, triggering an Anthropic API validation error (temperature must be ≤ 1.0).
Fix: caps at 1.0 for Claude models.
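
The cap amounts to a one-line clamp. In this sketch the helper name and the `"claude"` substring test are assumptions; the 1.2 default and the 1.0 ceiling are taken from the description.

```python
def effective_variation_temperature(model: str, requested: float = 1.2) -> float:
    """Clamp the variation temperature to 1.0 for Claude models only."""
    if "claude" in model.lower():
        return min(requested, 1.0)
    return requested
```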

`sources/benchmark_evaluation/csv_mode.py` — three eval correctness fixes

Parse SUCCESS_LEVEL from LLM response via regex — the field was always defaulting to "Medium" because the parser never searched the response text.
scenario_rubric=scenario_rubric_filename — was None, causing scenario-specific rubric files to be silently ignored.
runs_capsule_dir resolved to absolute path — relative paths broke isolated runs started from directories other than the repo root.
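
A minimal sketch of the `SUCCESS_LEVEL` parsing, assuming the judge emits a line such as `SUCCESS_LEVEL: High`. The exact response format and the set of levels other than "Medium" are assumptions, as is the helper name.

```python
import re

# Assumed format: "SUCCESS_LEVEL: <level>" or "SUCCESS_LEVEL = <level>".
_SUCCESS_LEVEL_RE = re.compile(
    r"SUCCESS_LEVEL\s*[:=]\s*(High|Medium|Low)\b", re.IGNORECASE
)


def parse_success_level(response: str, default: str = "Medium") -> str:
    """Return the level found in the response text, or the default."""
    match = _SUCCESS_LEVEL_RE.search(response)
    return match.group(1).capitalize() if match else default
```

Falling back to the default only when no match is found keeps the old behaviour for malformed responses while letting well-formed ones through.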

`sources/prompts/workflow_v11.md` — four agent guidance rules

Agents repeatedly made the same four mistakes across evaluation runs:

Indexing MCP tool responses with the wrong schema field
Using relative paths (breaking sandbox isolation)
Calling subprocess instead of the shell MCP tool
Using built-in file I/O instead of the designated file MCP tools

Added explicit rules with examples to the workflow prompt so agents self-correct before generating code.

`config.py` — three new first-class config fields

max_steps, max_context_tokens, and export_astra were being passed as ad-hoc kwargs without config-file backing. Promoted to Config dataclass fields with jsonify()/from_json() round-trip support so they survive serialization.
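
The round-trip requirement can be illustrated with a stand-in dataclass. `SketchConfig` is not the project's `Config`; the defaults for `max_context_tokens` and `export_astra` are placeholders, and the serialization format is assumed to be plain JSON.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class SketchConfig:
    max_steps: int = 35                # value mentioned in this PR
    max_context_tokens: int = 100_000  # placeholder default (assumption)
    export_astra: bool = True          # placeholder default (assumption)

    def jsonify(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "SketchConfig":
        data = json.loads(text)
        known = {f.name for f in fields(cls)}
        # Ignore unknown keys so older or newer files still load.
        return cls(**{k: v for k, v in data.items() if k in known})
```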

Test evidence

config.py field changes: `uv run python -c "from config import Config; c = Config(); print(c.max_steps, c.export_astra)"` passes.
Full eval run with these fixes: ASB Haffner task_001, 3 QD iterations, best score 0.8303 (27/34 claims), $1.04 cumulative cost.
No regressions observed on metabo_dev.csv tasks.

🤖 Generated with Claude Code

- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

lfnothias · 2026-06-28T10:57:26Z

Closing in favour of the mimosa_v2_lfx → mimosa_v2 PR: dev is the integration branch, and active development happens on mimosa_v2.

lfnothias closed this Jun 28, 2026

lfnothias deleted the dev_lfx branch June 28, 2026 10:57

lfnothias wants to merge 1 commit into dev from dev_lfx
