Skip to content

fix: robustness and eval improvements across core, evolution, and agents#158

Closed
lfnothias wants to merge 1 commit into
mimosa_v2from
mimosa_v2_lfx
Closed

fix: robustness and eval improvements across core, evolution, and agents#158
lfnothias wants to merge 1 commit into
mimosa_v2from
mimosa_v2_lfx

Conversation

@lfnothias

Copy link
Copy Markdown
Collaborator

Context

Changes introduced during an ASB Haffner task_001 evaluation run. All 14 files are general-purpose improvements — none is Haffner-specific. Single cherry-pick of commit 641010e from local mimosa_v2 onto origin/mimosa_v2 (applied cleanly, no conflicts).

Verified against: ASB Haffner task_001, 3 QD iterations, best score 0.8303 (27/34 claims), ~$1.04 cumulative cost. No regressions observed on metabo_dev.csv tasks.


Per-change rationale

sources/core/factory.py — MCP startup: infinite loop → bounded retry

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready, the process spun forever with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError after exhaustion.

sources/core/orchestrator.py — error type classification

Problem: all execution failures were logged as generic "execution failed" with no type signal.
Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.

sources/core/workflow_factory.py — three hardening changes

  1. Python keyword validation for node names — catches return, import, etc. at generation time instead of as a confusing SyntaxError at runtime.
  2. Atomic state_result.json write — write to .tmp, then os.replace(). Prevents a crash mid-write from leaving a truncated JSON that poisons subsequent reads.
  3. Compile check after assemble_workflow() — catches syntax errors before paying full sandbox overhead.

sources/core/workflow_runner.py — temp exec scripts deleted in finally

Problem: {exec_id}.py temp files were only deleted on the happy path; exceptions left them accumulating.
Fix: moved deletion into finally.

sources/core/evolution_engine.py — three fixes

  1. Ghost WorkflowInfo guard — when uuid == "generation_failed", constructing WorkflowInfo silently corrupted rewards_history with 0.0 placeholders. Fixed by gating on _valid_uuid.
  2. _export_astra() returns bool + .export_status sidecar — callers can inspect export outcome without re-running; failures are non-fatal and self-documenting.
  3. VariationEngine histories cleared at session startscore_history, textual_gradient_history, and agent_count_history persisted across tasks in batch runs, causing the mutator to apply boldness gradients from the previous task.

sources/core/selection.py — metrics refresh before eviction

Problem: eviction decisions were made against stale metrics from the previous iteration.
Fix: _refresh_member_metrics() moved before the eviction check.

sources/core/planner.py — dep output verification in _can_execute_step()

Problem: a completed dep with missing expected output files still marked the step executable, letting dependent steps run on missing inputs.
Fix: verifies expected output files exist on disk.

sources/modules/smolagent_factory.py — four fixes

  1. fcntl.flock in save_memories() — concurrent agents interleaved writes to shared memory file.
  2. Explicit except TimeoutError: raise — broad except Exception was swallowing TimeoutError.
  3. Restored result['exception'] = e — exception capture had been dropped in a prior refactor, breaking retry visibility.
  4. max_steps=35 + context-window guard callback — hard cap on agent steps; graceful early exit before OOM.

sources/transparency/astra_exporter.py — fail-fast on empty trace

Problem: empty agent_trace produced a malformed YAML exported without error.
Fix: returns None early; .export_status sidecar records the failure.

sources/core/llm_provider.py — Opus 4.x temperature handling

Problem: Anthropic's Opus 4.x rejects the temperature parameter entirely; falling back to temperature=1.0 also failed.
Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter. Error detection broadened to catch provider message string variants.

sources/core/variation_engine.py — Claude temperature cap at 1.0

Problem: variation_temperature=1.2 triggered Anthropic API validation error (temperature must be ≤ 1.0).
Fix: caps at 1.0 for Claude models.

sources/benchmark_evaluation/csv_mode.py — three eval correctness fixes

  1. Parse SUCCESS_LEVEL from LLM response via regex — was always defaulting to "Medium".
  2. scenario_rubric=scenario_rubric_filename — was None, silently ignoring scenario-specific rubric files.
  3. runs_capsule_dir resolved to absolute path — relative paths broke runs started from non-root directories.

sources/prompts/workflow_v11.md — four agent guidance rules

Agents repeatedly made the same four mistakes: wrong MCP response schema indexing, relative paths, subprocess instead of shell tool, built-in file I/O instead of file MCP tools. Added explicit rules with examples.

config.py — three new first-class config fields

max_steps, max_context_tokens, and export_astra promoted from ad-hoc kwargs to Config dataclass fields with jsonify()/from_json() round-trip support.

🤖 Generated with Claude Code

- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Fosowl

Fosowl commented Jun 29, 2026

Copy link
Copy Markdown
Member

Per change review

sources/core/factory.py — MCP startup: infinite loop → bounded retry

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready, the process spun forever with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError after exhaustion.
Comment: Direction is right, but 5 attempts is far too low. MCP servers can take real time to come up — Docker image builds, cold starts, a user manually bringing up a server, etc. Bound it on total wait time, not attempt count, with a generous ceiling (minutes, not seconds). Keep the diagnostic on exhaustion. Please raise the limit before merge.

sources/core/orchestrator.py — error type classification

Fix: classifies into TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR so the evolution loop can apply targeted remediation.
Comment: Okay.

sources/core/workflow_factory.py — three hardening changes

Comment: All three okay.

  • (a) Python keyword validation for node names — okay.
  • (b) Atomic state_result.json write (.tmpos.replace()) — okay.
  • (c) Compile check after assemble_workflow() — okay.

sources/core/workflow_runner.py — temp exec scripts deleted in finally

Fix: moved deletion into finally.
Comment: Reject. The {exec_id}.py temp scripts should be kept, not deleted — we need them for reproducibility and post-hoc testing/debugging of failed runs. Deleting them on every path throws away exactly the artifact we want when something breaks. Drop this change.

sources/core/evolution_engine.py — three fixes

Comment:

  • (a) Ghost WorkflowInfo guard (gate on _valid_uuid, don't poison rewards_history with 0.0 placeholders) — okay.
  • (b) _export_astra() returns bool + .export_status sidecar — reject. Status must not be carried by a file we write. Export status is inferred: the run was started with astra export enabled and either no astra artifact was produced or an error was raised. That's the signal — derive it from run config + actual artifacts, not a sidecar file. Drop.
  • (c) VariationEngine histories cleared at session start — okay. Cross-task bleed of score/textual-gradient/agent-count history in batch runs is a real bug.

sources/core/selection.py — metrics refresh before eviction

Fix: _refresh_member_metrics() moved before the eviction check.
Comment: Reject. This doesn't actually matter — the "stale metrics drive eviction" problem looks hallucinated. Drop.

sources/core/planner.py — dep output verification in _can_execute_step()

Fix: verifies expected output files exist on disk before marking a step executable.
Comment: Okay.

sources/modules/smolagent_factory.py — four fixes

Comment: Reject all four. Drop the whole change.

  • fcntl.flock in save_memories() — there's no concurrent-agent interleaved write to that shared memory file; the race it guards against doesn't exist here.
  • Explicit except TimeoutError: raiseTimeoutError should not be caught at this layer at all.
  • Restored result['exception'] = e — not justified given the above.
  • max_steps=35 + context-window guard callback — reject hardest. 35 steps is insufficient for most tasks; the context-window guard is model-dependent and doesn't belong here; and hitting context limits wouldn't cause OOM anyway. The whole premise is wrong.

sources/transparency/astra_exporter.py — fail-fast on empty trace

Fix: returns None early on empty agent_trace.
Comment: Okay on the fail-fast/early-return. Note the .export_status sidecar is rejected per the evolution_engine comment above — derive export status from run config + artifacts, not from a written file.

sources/core/llm_provider.py — Opus 4.x temperature handling

Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter.
Comment: The detection only really holds for Opus 4.5. Don't special-case by model version — that pattern is too brittle and will break on the next model. Fall back to no temperature for all Anthropic models instead. Simpler and future-proof.

sources/core/variation_engine.py — Claude temperature cap at 1.0

Fix: caps variation_temperature at 1.0 for Claude models.
Comment: Same as above — fold into the no-temperature-for-Anthropic handling rather than a separate cap. Once Anthropic models don't send temperature, this special case goes away.

sources/benchmark_evaluation/csv_mode.py — three eval correctness fixes

Comment: Drop all three and pull latest instead.

  • SUCCESS_LEVEL regex parse — obsolete. SUCCESS_LEVEL from the LLM response has been dropped; it was only ever inferred from the response as a short summary and is confusing against ground truth. Don't reintroduce it.
  • scenario_rubric=scenario_rubric_filename — no, scenario_rubric=None is correct as currently set.
  • runs_capsule_dir absolute path — drop with the rest; pull the latest commit which already supersedes this.

sources/prompts/workflow_v11.md — four agent guidance rules

Comment: Drop entirely. These are hallucinated and dangerous — they bake in specific claims about MCP response schema, paths, shell vs subprocess, and file I/O that aren't grounded. Do not ship prompt rules on this basis.

config.py — three new first-class config fields

Fix: promotes max_steps, max_context_tokens, export_astra to Config fields.
Comment: Only export_astra is acceptable. max_steps and max_context_tokens tie back to the rejected smolagent_factory changes — drop both. Keep export_astra with jsonify()/from_json() round-trip.


Summary: Accepting factory.py (with a higher retry ceiling), orchestrator.py, workflow_factory.py, the ghost-guard and history-clearing fixes in evolution_engine.py, planner.py, astra_exporter.py's early return, and export_astra in config. Rejecting workflow_runner.py, the astra sidecar pattern, selection.py, all of smolagent_factory.py, csv_mode.py, workflow_v11.md, and the two max_* config fields. llm_provider.py and variation_engine.py need rework into a single "no temperature for all Anthropic models" path.

@Fosowl Fosowl closed this Jun 29, 2026
Fosowl added a commit that referenced this pull request Jul 4, 2026
fix(core): rework robustness changes per PR #158 review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants