
fix: robustness and eval improvements across core, evolution, and agents#157

Closed
lfnothias wants to merge 1 commit into dev from dev_lfx


@lfnothias

Collaborator

Context

Changes introduced during an ASB Haffner task_001 evaluation run on mimosa_v2. All 14 files are general-purpose improvements — none is Haffner-specific. The main dev should review each rationale and decide whether to merge as-is or fold into a larger cleanup.

Single cherry-pick of commit 641010e from mimosa_v2 onto dev. Conflicts resolved:

  • evolution_engine.py: kept new _export_astra() method (absent on dev)
  • astra_exporter.py: restored (was deleted on dev, re-added here)

Per-change rationale

sources/core/factory.py — MCP startup: infinite loop → bounded retry with exponential backoff

Problem: while tool_setup == False: had no exit condition. If MCP servers never became ready (port conflict, crash), the process would spin forever consuming CPU with no diagnostic.
Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises RuntimeError with a clear message after exhaustion.
Decision point: retry count and backoff ceiling are hardcoded; could be config fields if needed.
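
A minimal sketch of the bounded retry described above. The function name `wait_for_tools`, the `check_ready` callable, and the doubling factor are assumptions for illustration; only the attempt count, the 1 s start, the 30 s ceiling, and the `RuntimeError` come from the description.

```python
import time
from typing import Callable


def wait_for_tools(
    check_ready: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check_ready up to max_attempts times; return the attempt that succeeded."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if check_ready():
            return attempt
        if attempt < max_attempts:
            # Wait before the next attempt, doubling the delay up to the ceiling.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise RuntimeError(
        f"MCP tools not ready after {max_attempts} attempts; "
        "check for port conflicts or crashed servers"
    )
```

With doubling and five attempts this sketch waits 1, 2, 4, and 8 s, so the 30 s ceiling only takes effect if the attempt count is raised. Injecting `sleep` keeps the loop testable without real waiting.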

sources/core/orchestrator.py — error type classification (TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR)

Problem: all execution failures were logged as a generic "execution failed" with no type signal, making targeted retry strategies impossible.
Fix: classifies errors into three buckets so the evolution loop can apply different remediation (e.g., syntax errors never benefit from re-running the same code).
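
One way the three buckets could be modelled, assuming the failure is available as a Python exception. The `ErrorType` enum, `classify_error`, and `should_rerun_unchanged` are hypothetical names; the actual orchestrator may classify from sandbox output instead.

```python
from enum import Enum


class ErrorType(Enum):
    TIMEOUT = "TIMEOUT"
    SYNTAX_ERROR = "SYNTAX_ERROR"
    RUNTIME_ERROR = "RUNTIME_ERROR"


def classify_error(exc: BaseException) -> ErrorType:
    """Map an exception to one of the three buckets."""
    if isinstance(exc, TimeoutError):
        return ErrorType.TIMEOUT
    # IndentationError and TabError are subclasses of SyntaxError.
    if isinstance(exc, SyntaxError):
        return ErrorType.SYNTAX_ERROR
    return ErrorType.RUNTIME_ERROR


def should_rerun_unchanged(error_type: ErrorType) -> bool:
    """Re-running identical code cannot fix a syntax error."""
    return error_type is not ErrorType.SYNTAX_ERROR
```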

sources/core/workflow_factory.py — three hardening changes

  1. Python keyword validation for node names — workflows using return, import, etc. as node names produced confusing SyntaxError at runtime. Now caught at generation time with an actionable message.
  2. Atomic state_result.json write — write to .tmp, then os.replace(). Without this, a crash mid-write leaves a truncated JSON that poisons subsequent reads.
  3. Compile check after assemble_workflow() — calling compile() before submitting to the sandbox catches syntax errors without paying the full sandbox overhead.
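
The three checks above can be sketched with the standard library alone. The helper names are illustrative, not the ones used in workflow_factory.py, and the `isidentifier()` test is an extra assumption beyond the keyword check.

```python
import json
import keyword
import os


def validate_node_name(name: str) -> None:
    """Reject names that cannot be used as Python identifiers."""
    if keyword.iskeyword(name) or not name.isidentifier():
        raise ValueError(
            f"Invalid node name {name!r}: choose a valid, non-keyword "
            "Python identifier (for example 'load_data')"
        )


def write_json_atomic(path: str, payload: dict) -> None:
    """Write to a sibling .tmp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace overwrites the destination in a single rename step.
    os.replace(tmp_path, path)


def check_compiles(source: str, filename: str = "<workflow>") -> None:
    """Raise SyntaxError if the assembled source does not compile."""
    compile(source, filename, "exec")
```

Readers of the target file see either the old contents or the new ones, never a partial write, because the temporary file lives in the same directory as the target.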

sources/core/workflow_runner.py — temp exec scripts deleted in finally

Problem: {exec_id}.py temp files were only deleted on the happy path. A crash or exception left them accumulating on disk across long eval runs.
Fix: moved deletion into finally so it runs unconditionally.
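
A sketch of the cleanup pattern, assuming the script is written to `{exec_id}.py` inside a working directory and handed to some runner. `run_temp_script` and the `runner` callable are hypothetical stand-ins for the real execution path.

```python
import os
from typing import Callable


def run_temp_script(
    source: str,
    exec_id: str,
    workdir: str,
    runner: Callable[[str], int],
) -> int:
    """Write source to {exec_id}.py, run it, and always remove the file."""
    script_path = os.path.join(workdir, f"{exec_id}.py")
    try:
        with open(script_path, "w", encoding="utf-8") as fh:
            fh.write(source)
        return runner(script_path)
    finally:
        # Executes on normal return and when an exception propagates.
        try:
            os.remove(script_path)
        except FileNotFoundError:
            pass
```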

sources/core/evolution_engine.py — three independent fixes

  1. Ghost WorkflowInfo guard — when workflow generation failed (uuid = "generation_failed"), the code still constructed a WorkflowInfo with that path, silently corrupting rewards_history with 0.0 placeholders. Fixed by gating construction on _valid_uuid.
  2. _export_astra() returns bool + writes .export_status sidecar — callers can now inspect whether export succeeded without re-running anything; failures are non-fatal and self-documenting.
  3. VariationEngine histories cleared at session start: score_history, textual_gradient_history, and agent_count_history were persisting across tasks in a batch run, causing the mutator to apply boldness gradients calibrated to the previous task. Fixed by clearing in evolve().

sources/core/selection.py: _refresh_member_metrics() moved before eviction check

Problem: the eviction decision was being made against stale metrics (scores not yet refreshed for the current iteration). Moving the refresh before the eviction check ensures correct ordering.

sources/core/planner.py: _can_execute_step() verifies expected outputs of completed deps

Problem: a completed dependency that hadn't actually produced its expected output files was still marking the step as executable, allowing dependent steps to run on missing inputs.
Fix: verifies that expected output files exist on disk before considering a dep "done."
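
The dependency check can be illustrated as follows. The `Step` dataclass and its fields are assumptions made for this sketch; the planner's real data model may differ.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Step:
    name: str
    depends_on: List[str] = field(default_factory=list)
    expected_outputs: List[str] = field(default_factory=list)
    completed: bool = False


def can_execute_step(step: Step, steps: Dict[str, Step], workdir: Path) -> bool:
    """A step is executable only if every dependency finished and left its files."""
    for dep_name in step.depends_on:
        dep = steps[dep_name]
        if not dep.completed:
            return False
        # A "completed" flag is not enough: the outputs must exist on disk.
        if any(not (workdir / rel).exists() for rel in dep.expected_outputs):
            return False
    return True
```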

sources/modules/smolagent_factory.py — four fixes

  1. fcntl.flock file lock in save_memories() — concurrent agents writing to the shared memory file could interleave writes. Lock scope is tight (write only).
  2. Explicit except TimeoutError: raise before generic handler — the broad except Exception was swallowing TimeoutError, making debugging impossible (timeout looked like a silent success).
  3. Restored result['exception'] = e in _run_agent() inner loop — exception capture had been removed in a prior refactor, breaking retry visibility.
  4. max_steps=35 and context-window guard callback — sets a hard cap on agent steps and triggers graceful early exit when the context approaches the configured max_context_tokens threshold to prevent OOM crashes.
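
A sketch of the first three fixes, assuming a Unix host (fcntl is not available on Windows) and a memory file that is overwritten as a single JSON document. `run_agent_step` is a hypothetical stand-in for the inner loop of `_run_agent()`.

```python
import fcntl
import json
from typing import Any, Callable


def save_memories(path: str, memories: list) -> None:
    """Serialize first, then hold an exclusive lock only around the write."""
    payload = json.dumps(memories)
    # "a+" creates the file if needed without truncating before the lock is held.
    with open(path, "a+", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)
        try:
            fh.seek(0)
            fh.truncate()
            fh.write(payload)
            fh.flush()
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


def run_agent_step(step: Callable[[], Any], result: dict) -> dict:
    """Let timeouts propagate; record every other failure for the retry logic."""
    try:
        result["output"] = step()
    except TimeoutError:
        raise
    except Exception as e:
        result["exception"] = e
    return result
```

Placing `except TimeoutError` before `except Exception` matters because Python evaluates handlers in order and `TimeoutError` is a subclass of `Exception`.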

sources/transparency/astra_exporter.py — fail-fast on empty trace

Problem: an empty agent_trace list produced a malformed YAML with null fields, silently exported without error.
Fix: returns None early with a clear log message; the caller's .export_status sidecar records the failure.

sources/core/llm_provider.py — Opus 4.x temperature handling

Problem: Anthropic's Opus 4.x API rejects the temperature parameter entirely (not just clamps it). The previous code attempted to recover by falling back to temperature=1.0, which also failed.
Fix: _is_no_temperature_model() detects Opus 4.x and omits the parameter from the request body. Error detection also broadened to catch provider message string variants.
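
A sketch of omitting the parameter rather than clamping it. The substring pattern used for detection and the `build_request_body` helper are assumptions; the PR does not say how `_is_no_temperature_model()` identifies the model family.

```python
import re
from typing import Optional

# Assumption: model identifiers for the affected family contain "opus-4".
_NO_TEMPERATURE_PATTERN = re.compile(r"opus-4")


def is_no_temperature_model(model: str) -> bool:
    return bool(_NO_TEMPERATURE_PATTERN.search(model.lower()))


def build_request_body(
    model: str,
    messages: list,
    temperature: Optional[float] = None,
) -> dict:
    """Build a request dict, leaving the temperature key out when unsupported."""
    body = {"model": model, "messages": messages}
    if temperature is not None and not is_no_temperature_model(model):
        body["temperature"] = temperature
    return body
```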

sources/core/variation_engine.py — Claude temperature cap at 1.0

Problem: variation_temperature was set to 1.2 for all models including Claude, triggering an Anthropic API validation error (temperature must be ≤ 1.0).
Fix: caps at 1.0 for Claude models.
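
The cap amounts to a single `min()` call. In this sketch the model check is a plain substring match and the function name is hypothetical.

```python
CLAUDE_MAX_TEMPERATURE = 1.0


def effective_temperature(model: str, requested: float) -> float:
    """Clamp the requested temperature for Claude models; pass others through."""
    if "claude" in model.lower():
        return min(requested, CLAUDE_MAX_TEMPERATURE)
    return requested
```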

sources/benchmark_evaluation/csv_mode.py — three eval correctness fixes

  1. Parse SUCCESS_LEVEL from LLM response via regex — the field was always defaulting to "Medium" because the parser never searched the response text.
  2. scenario_rubric=scenario_rubric_filename — was None, causing scenario-specific rubric files to be silently ignored.
  3. runs_capsule_dir resolved to absolute path — relative paths broke isolated runs started from directories other than the repo root.
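
Fixes 1 and 3 might look like the sketch below. The response format (`SUCCESS_LEVEL: <word>`, optionally wrapped in Markdown bold) is an assumption, as are the helper names; only the "Medium" default comes from the description.

```python
import re
from pathlib import Path

# Assumption: the evaluator writes a line such as "SUCCESS_LEVEL: High".
_SUCCESS_LEVEL_RE = re.compile(
    r"SUCCESS_LEVEL\s*[:=]\s*\**\s*([A-Za-z]+)", re.IGNORECASE
)


def parse_success_level(response: str, default: str = "Medium") -> str:
    """Return the level named in the response, or the default if none is found."""
    match = _SUCCESS_LEVEL_RE.search(response)
    return match.group(1).capitalize() if match else default


def resolve_runs_dir(runs_capsule_dir: str) -> Path:
    """Anchor a possibly relative path to the current working directory."""
    return Path(runs_capsule_dir).resolve()
```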

sources/prompts/workflow_v11.md — four agent guidance rules

Agents repeatedly made the same four mistakes across evaluation runs:

  1. Indexing MCP tool responses with the wrong schema field
  2. Using relative paths (breaking sandbox isolation)
  3. Calling subprocess instead of the shell MCP tool
  4. Using built-in file I/O instead of the designated file MCP tools

Added explicit rules with examples to the workflow prompt so agents self-correct before generating code.

config.py — three new first-class config fields

max_steps, max_context_tokens, and export_astra were being passed as ad-hoc kwargs without config-file backing. Promoted to Config dataclass fields with jsonify()/from_json() round-trip support so they survive serialization.
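
A sketch of the round trip, using a stand-in `ExampleConfig` rather than the project's `Config`. The default values for max_context_tokens and export_astra are placeholders, and the real `jsonify()`/`from_json()` signatures may differ.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class ExampleConfig:
    max_steps: int = 35
    max_context_tokens: int = 100_000  # placeholder default
    export_astra: bool = True  # placeholder default

    def jsonify(self) -> str:
        """Serialize every dataclass field to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "ExampleConfig":
        """Rebuild a config, ignoring keys this version does not know."""
        data = json.loads(text)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Because the fields are declared on the dataclass, they take part in serialization automatically instead of being lost as ad-hoc kwargs.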


Test evidence

  • config.py field changes: uv run python -c "from config import Config; c = Config(); print(c.max_steps, c.export_astra)" passes.
  • Full eval run with these fixes: ASB Haffner task_001, 3 QD iterations, best score 0.8303 (27/34 claims), $1.04 cumulative cost.
  • No regressions observed on metabo_dev.csv tasks.

🤖 Generated with Claude Code

- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lfnothias

Collaborator Author

Closing in favour of mimosa_v2_lfx → mimosa_v2 PR — dev is the integration branch, active dev happens on mimosa_v2.

@lfnothias lfnothias closed this Jun 28, 2026
@lfnothias lfnothias deleted the dev_lfx branch June 28, 2026 10:57
