fix: robustness and eval improvements across core, evolution, and agents by lfnothias · Pull Request #158 · HolobiomicsLab/Mimosa-AI

lfnothias · 2026-06-28T10:58:30Z

Context

Changes introduced during an ASB Haffner task_001 evaluation run. All 14 files are general-purpose improvements — none is Haffner-specific. Single cherry-pick of commit 641010e from local mimosa_v2 onto origin/mimosa_v2 (applied cleanly, no conflicts).

Verified against: ASB Haffner task_001, 3 QD iterations, best score 0.8303 (27/34 claims), ~$1.04 cumulative cost. No regressions observed on metabo_dev.csv tasks.

Per-change rationale

`sources/core/factory.py` — MCP startup: infinite loop → bounded retry

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready, the process spun forever with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError after exhaustion.
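
For illustration, the bounded retry described above could look roughly like the sketch below. The names `wait_for_tools` and `check_ready` are hypothetical and are not taken from `factory.py`; the 30 s value is treated as a cap on the per-attempt delay.

```python
import time
from typing import Callable


def wait_for_tools(
    check_ready: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check_ready() up to max_attempts times.

    Returns the 1-based attempt number that succeeded, or raises
    RuntimeError once every attempt has been used.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            # Double the delay after each failed attempt, capped at max_delay.
            delay = min(delay * 2, max_delay)
    raise RuntimeError(f"MCP tools not ready after {max_attempts} attempts")
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real waiting.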

`sources/core/orchestrator.py` — error type classification

Problem: all execution failures were logged as generic "execution failed" with no type signal.
Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.
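
A minimal sketch of this kind of classification, assuming the failure is available as stderr text plus a timeout flag. The function name, the enum, and the string markers are illustrative assumptions, not the orchestrator's actual logic.

```python
from enum import Enum


class ErrorType(Enum):
    TIMEOUT = "TIMEOUT"
    SYNTAX_ERROR = "SYNTAX_ERROR"
    RUNTIME_ERROR = "RUNTIME_ERROR"


# Exception names that Python prints for code that fails to parse.
_SYNTAX_MARKERS = ("SyntaxError", "IndentationError", "TabError")


def classify_failure(stderr: str, timed_out: bool = False) -> ErrorType:
    """Map a failed execution to one of three coarse error types."""
    if timed_out or "TimeoutError" in stderr:
        return ErrorType.TIMEOUT
    if any(marker in stderr for marker in _SYNTAX_MARKERS):
        return ErrorType.SYNTAX_ERROR
    # Anything else is treated as a generic runtime failure.
    return ErrorType.RUNTIME_ERROR
```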

`sources/core/workflow_factory.py` — three hardening changes

Python keyword validation for node names — catches return, import, etc. at generation time instead of as a confusing SyntaxError at runtime.
Atomic state_result.json write — write to .tmp, then os.replace(). Prevents a crash mid-write from leaving a truncated JSON that poisons subsequent reads.
Compile check after assemble_workflow() — catches syntax errors before paying full sandbox overhead.
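
The three checks can be sketched with the standard library alone. The helper names below are hypothetical, and the extra `isidentifier()` test is an assumption added for the example rather than something stated in this PR.

```python
import json
import keyword
import os


def validate_node_name(name: str) -> None:
    """Reject names that cannot be used as Python identifiers."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"invalid node name: {name!r}")


def write_json_atomic(path: str, data: dict) -> None:
    """Write JSON to a sibling .tmp file, then swap it into place."""
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace swaps the file in one step when both paths are on the same filesystem.
    os.replace(tmp_path, path)


def check_compiles(source: str, filename: str = "<workflow>") -> None:
    """Raise SyntaxError if the assembled source does not parse."""
    compile(source, filename, "exec")
```

A reader of `state_result.json` then sees either the previous complete file or the new complete file, never a partial write.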

`sources/core/workflow_runner.py` — temp exec scripts deleted in `finally`

Problem: {exec_id}.py temp files were only deleted on the happy path; exceptions left them accumulating.
Fix: moved deletion into finally.
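
As a rough sketch of the pattern, assuming a hypothetical `runner` callable that executes the script at a given path (the real runner interface is not shown in this PR):

```python
import os
from typing import Any, Callable


def run_with_cleanup(
    code: str,
    workdir: str,
    exec_id: str,
    runner: Callable[[str], Any],
) -> Any:
    """Write {exec_id}.py, run it, and remove it on every exit path."""
    script_path = os.path.join(workdir, f"{exec_id}.py")
    try:
        with open(script_path, "w", encoding="utf-8") as fh:
            fh.write(code)
        return runner(script_path)
    finally:
        # Runs on success and on exception alike.
        if os.path.exists(script_path):
            os.remove(script_path)
```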

`sources/core/evolution_engine.py` — three fixes

Ghost WorkflowInfo guard — when uuid == "generation_failed", constructing WorkflowInfo silently corrupted rewards_history with 0.0 placeholders. Fixed by gating on _valid_uuid.
_export_astra() returns bool + .export_status sidecar — callers can inspect export outcome without re-running; failures are non-fatal and self-documenting.
VariationEngine histories cleared at session start — score_history, textual_gradient_history, and agent_count_history persisted across tasks in batch runs, causing the mutator to apply boldness gradients from the previous task.
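
A minimal sketch of the ghost guard, using a stand-in `WorkflowInfo` dataclass and a hypothetical `record_result` helper. The real class has more fields and the real `_valid_uuid` check may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

GENERATION_FAILED = "generation_failed"


@dataclass
class WorkflowInfo:
    # Stand-in for the real class; only the fields needed here.
    uuid: str
    reward: float


def _valid_uuid(uuid: Optional[str]) -> bool:
    """True only for a non-empty uuid that is not the failure sentinel."""
    return bool(uuid) and uuid != GENERATION_FAILED


def record_result(
    rewards_history: List[float], uuid: Optional[str], reward: float
) -> Optional[WorkflowInfo]:
    """Append to the history only when a real workflow was generated."""
    if not _valid_uuid(uuid):
        return None  # no WorkflowInfo, no 0.0 placeholder
    info = WorkflowInfo(uuid=uuid, reward=reward)
    rewards_history.append(info.reward)
    return info
```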

`sources/core/selection.py` — metrics refresh before eviction

Problem: eviction decisions were made against stale metrics from the previous iteration.
Fix: _refresh_member_metrics() moved before the eviction check.

`sources/core/planner.py` — dep output verification in `_can_execute_step()`

Problem: a completed dep with missing expected output files still marked the step executable, letting dependent steps run on missing inputs.
Fix: verifies expected output files exist on disk.
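
One way to express that check, as a sketch under assumed data shapes: `completed` is a set of finished step names and `expected_outputs` maps each step to the file paths it should produce. Neither shape is confirmed by this PR.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


def can_execute_step(
    deps: Iterable[str],
    completed: Set[str],
    expected_outputs: Dict[str, List[str]],
) -> bool:
    """A step is executable only if every dependency finished
    and left all of its expected output files on disk."""
    for dep in deps:
        if dep not in completed:
            return False
        for out in expected_outputs.get(dep, []):
            if not Path(out).is_file():
                return False
    return True
```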

`sources/modules/smolagent_factory.py` — four fixes

fcntl.flock in save_memories() — concurrent agents interleaved writes to shared memory file.
Explicit except TimeoutError: raise — broad except Exception was swallowing TimeoutError.
Restored result['exception'] = e — exception capture had been dropped in a prior refactor, breaking retry visibility.
max_steps=35 + context-window guard callback — hard cap on agent steps; graceful early exit before OOM.
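
The handler ordering behind the second and third items can be sketched as follows. `run_agent_once` and the `result` layout are illustrative assumptions; only the `result['exception'] = e` assignment comes from the description above.

```python
from typing import Any, Callable, Dict


def run_agent_once(task: Callable[[], Any]) -> Dict[str, Any]:
    """Run one attempt, letting timeouts propagate and recording other errors."""
    result: Dict[str, Any] = {"output": None, "exception": None}
    try:
        result["output"] = task()
    except TimeoutError:
        # Must come before the generic handler: TimeoutError is a subclass
        # of Exception and would otherwise be swallowed below.
        raise
    except Exception as e:
        # Keep the exception so the retry loop can inspect it.
        result["exception"] = e
    return result
```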

`sources/transparency/astra_exporter.py` — fail-fast on empty trace

Problem: an empty agent_trace produced malformed YAML that was exported without error.
Fix: returns None early; .export_status sidecar records the failure.

`sources/core/llm_provider.py` — Opus 4.x temperature handling

Problem: Anthropic's Opus 4.x rejects the temperature parameter entirely; falling back to temperature=1.0 also failed.
Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter. Error detection broadened to catch provider message string variants.

`sources/core/variation_engine.py` — Claude temperature cap at 1.0

Problem: variation_temperature=1.2 triggered an Anthropic API validation error (temperature must be ≤ 1.0).
Fix: caps at 1.0 for Claude models.

`sources/benchmark_evaluation/csv_mode.py` — three eval correctness fixes

Parse SUCCESS_LEVEL from LLM response via regex — was always defaulting to "Medium".
scenario_rubric=scenario_rubric_filename — was None, silently ignoring scenario-specific rubric files.
runs_capsule_dir resolved to absolute path — relative paths broke runs started from non-root directories.
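
A sketch of the first and third items, assuming the LLM response carries a line such as `SUCCESS_LEVEL: High`. That response format, the regex, and both helper names are assumptions made for the example.

```python
import re
from pathlib import Path

# Assumed format: "SUCCESS_LEVEL: <word>", optionally quoted, any letter case.
_SUCCESS_LEVEL_RE = re.compile(r"SUCCESS_LEVEL\s*[:=]\s*\"?([A-Za-z]+)", re.IGNORECASE)


def parse_success_level(response: str, default: str = "Medium") -> str:
    """Return the level named in the response, or the default if absent."""
    match = _SUCCESS_LEVEL_RE.search(response)
    return match.group(1).capitalize() if match else default


def resolve_capsule_dir(runs_capsule_dir: str) -> Path:
    """Anchor a possibly relative directory to an absolute path."""
    return Path(runs_capsule_dir).expanduser().resolve()
```

Resolving the directory once, at startup, keeps later path joins independent of the directory the run was launched from.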

`sources/prompts/workflow_v11.md` — four agent guidance rules

Agents repeatedly made the same four mistakes: wrong MCP response schema indexing, relative paths, subprocess instead of shell tool, built-in file I/O instead of file MCP tools. Added explicit rules with examples.

`config.py` — three new first-class config fields

max_steps, max_context_tokens, and export_astra promoted from ad-hoc kwargs to Config dataclass fields with jsonify()/from_json() round-trip support.
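
For illustration, a round-trip of this shape might look like the sketch below. The method signatures and the default values are assumptions; only the field names, the `max_steps=35` value, and the `jsonify()`/`from_json()` method names come from this PR.

```python
import json
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class Config:
    # Defaults are placeholders for the example, not the project's values.
    max_steps: int = 35
    max_context_tokens: Optional[int] = None
    export_astra: bool = False

    def jsonify(self) -> str:
        """Serialize all fields to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "Config":
        """Rebuild a Config, ignoring keys that are not known fields."""
        data = json.loads(text)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```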

🤖 Generated with Claude Code

- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fosowl · 2026-06-29T11:23:47Z

Per change review

sources/core/factory.py — MCP startup: infinite loop → bounded retry

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready, the process spun forever with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError after exhaustion.
Comment: Direction is right, but 5 attempts is far too low. MCP servers can take real time to come up — Docker image builds, cold starts, a user manually bringing up a server, etc. Bound it on total wait time, not attempt count, with a generous ceiling (minutes, not seconds). Keep the diagnostic on exhaustion. Please raise the limit before merge.
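
Something along these lines is what I have in mind: a sketch that bounds total wait time instead of attempt count. The function name and the 600 s default are placeholders, not a proposed final value.

```python
import time
from typing import Callable


def wait_until_ready(
    check_ready: Callable[[], bool],
    timeout_s: float = 600.0,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until check_ready() is true or timeout_s has elapsed.

    Returns the seconds waited; raises RuntimeError with a diagnostic
    once the total wait budget is exhausted.
    """
    start = clock()
    delay = base_delay
    while True:
        if check_ready():
            return clock() - start
        elapsed = clock() - start
        remaining = timeout_s - elapsed
        if remaining <= 0:
            raise RuntimeError(
                f"MCP servers not ready after {elapsed:.0f} s (limit {timeout_s:.0f} s)"
            )
        # Never sleep past the deadline; back off up to max_delay.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

With a 30 s cap on the delay, a slow Docker build or a manually started server is still picked up within half a minute of becoming ready.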

sources/core/orchestrator.py — error type classification

Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.
Comment: Okay.

sources/core/workflow_factory.py — three hardening changes

Comment: All three okay.

(a) Python keyword validation for node names — okay.
(b) Atomic state_result.json write (.tmp → os.replace()) — okay.
(c) Compile check after assemble_workflow() — okay.

sources/core/workflow_runner.py — temp exec scripts deleted in `finally`

Fix: moved deletion into finally.
Comment: Reject. The {exec_id}.py temp scripts should be kept, not deleted — we need them for reproducibility and post-hoc testing/debugging of failed runs. Deleting them on every path throws away exactly the artifact we want when something breaks. Drop this change.

sources/core/evolution_engine.py — three fixes

Comment:

(a) Ghost WorkflowInfo guard (gate on _valid_uuid, don't poison rewards_history with 0.0 placeholders) — okay.
(b) _export_astra() returns bool + .export_status sidecar — reject. Status must not be carried by a file we write. Export status is inferred: the run was started with astra export enabled and either no astra artifact was produced or an error was raised. That's the signal — derive it from run config + actual artifacts, not a sidecar file. Drop.
(c) VariationEngine histories cleared at session start — okay. Cross-task bleed of score/textual-gradient/agent-count history in batch runs is a real bug.
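
To make (b) concrete, here is a sketch of deriving the status from run config and artifacts instead of a sidecar. The function name and the status strings are illustrative only.

```python
from pathlib import Path
from typing import Optional


def astra_export_status(
    export_enabled: bool,
    artifact_path: Path,
    error: Optional[BaseException] = None,
) -> str:
    """Infer export status from what was requested and what exists on disk."""
    if not export_enabled:
        return "disabled"
    # Export was requested: an error or a missing artifact means it failed.
    if error is not None or not artifact_path.is_file():
        return "failed"
    return "ok"
```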

sources/core/selection.py — metrics refresh before eviction

Fix: _refresh_member_metrics() moved before the eviction check.
Comment: Reject. This doesn't actually matter — the "stale metrics drive eviction" problem looks hallucinated. Drop.

sources/core/planner.py — dep output verification in `_can_execute_step()`

Fix: verifies expected output files exist on disk before marking a step executable.
Comment: Okay.

sources/modules/smolagent_factory.py — four fixes

Comment: Reject all four. Drop the whole change.

fcntl.flock in save_memories() — there's no concurrent-agent interleaved write to that shared memory file; the race it guards against doesn't exist here.
Explicit except TimeoutError: raise — TimeoutError should not be caught at this layer at all.
Restored result['exception'] = e — not justified given the above.
max_steps=35 + context-window guard callback — reject hardest. 35 steps is insufficient for most tasks; the context-window guard is model-dependent and doesn't belong here; and hitting context limits wouldn't cause OOM anyway. The whole premise is wrong.

sources/transparency/astra_exporter.py — fail-fast on empty trace

Fix: returns None early on empty agent_trace.
Comment: Okay on the fail-fast/early-return. Note the .export_status sidecar is rejected per the evolution_engine comment above — derive export status from run config + artifacts, not from a written file.

sources/core/llm_provider.py — Opus 4.x temperature handling

Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter.
Comment: The detection only really holds for Opus 4.5. Don't special-case by model version — that pattern is too brittle and will break on the next model. Fall back to no temperature for all Anthropic models instead. Simpler and future-proof.

sources/core/variation_engine.py — Claude temperature cap at 1.0

Fix: caps variation_temperature at 1.0 for Claude models.
Comment: Same as above — fold into the no-temperature-for-Anthropic handling rather than a separate cap. Once Anthropic models don't send temperature, this special case goes away.
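
Roughly, the single path could be sketched like this. It assumes the provider is available as a plain string; `build_completion_kwargs` is a hypothetical helper, not an existing function in `llm_provider.py`.

```python
from typing import Any, Dict, Optional


def build_completion_kwargs(
    provider: str, model: str, temperature: Optional[float]
) -> Dict[str, Any]:
    """Build request kwargs, never sending temperature to Anthropic models."""
    kwargs: Dict[str, Any] = {"model": model}
    if provider.lower() != "anthropic" and temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs
```

With this in place neither the per-version detection nor the 1.0 cap is needed, since the parameter is never sent for that provider.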

sources/benchmark_evaluation/csv_mode.py — three eval correctness fixes

Comment: Drop all three and pull latest instead.

SUCCESS_LEVEL regex parse — obsolete. SUCCESS_LEVEL from the LLM response has been dropped; it was only ever inferred from the response as a short summary and is confusing against ground truth. Don't reintroduce it.
scenario_rubric=scenario_rubric_filename — no, scenario_rubric=None is correct as currently set.
runs_capsule_dir absolute path — drop with the rest; pull the latest commit which already supersedes this.

sources/prompts/workflow_v11.md — four agent guidance rules

Comment: Drop entirely. These are hallucinated and dangerous — they bake in specific claims about MCP response schema, paths, shell vs subprocess, and file I/O that aren't grounded. Do not ship prompt rules on this basis.

config.py — three new first-class config fields

Fix: promotes max_steps, max_context_tokens, export_astra to Config fields.
Comment: Only export_astra is acceptable. max_steps and max_context_tokens tie back to the rejected smolagent_factory changes — drop both. Keep export_astra with jsonify()/from_json() round-trip.

Summary: Accepting factory.py (with a higher retry ceiling), orchestrator.py, workflow_factory.py, the ghost-guard and history-clearing fixes in evolution_engine.py, planner.py, astra_exporter.py's early return, and export_astra in config. Rejecting workflow_runner.py, the astra sidecar pattern, selection.py, all of smolagent_factory.py, csv_mode.py, workflow_v11.md, and the two max_* config fields. llm_provider.py and variation_engine.py need rework into a single "no temperature for all Anthropic models" path.

Fosowl closed this Jun 29, 2026

This was referenced Jun 29, 2026

fix(core): rework robustness changes per PR #158 review #159 (Merged)

Agent context grows unbounded on long runs (no compaction) #161 (Open)

Fosowl added a commit that referenced this pull request Jul 4, 2026

Merge pull request #159 from HolobiomicsLab/fix/robustness-rework (a7a0f2c)

lfnothias wants to merge 1 commit into mimosa_v2 from mimosa_v2_lfx
2 participants