- factory.py: replace infinite MCP discovery loop with 5-retry exponential backoff
- orchestrator.py: classify sandbox errors as TIMEOUT/SYNTAX_ERROR/RUNTIME_ERROR
- workflow_factory.py: validate Python keyword identifiers, atomic state_result.json write, post-assemble compile check
- workflow_runner.py: delete temp exec script in finally block
- evolution_engine.py: ghost WorkflowInfo guard (generation_failed uuid); ASTRA export returns bool + .export_status sidecar; clear VariationEngine score/gradient/agent-count histories on session start
- selection.py: refresh member metrics before eviction check (not after)
- planner.py: verify expected outputs of completed dependencies before marking step executable
- smolagent_factory.py: file lock on save_memories; explicit TimeoutError re-raise before generic handler; restore exception capture in _run_agent retry loop; max_steps=35 + context-window guard callback
- astra_exporter.py: fail-fast on empty trace
- llm_provider.py: strip temperature entirely for Opus 4.x (invalid_request_error); broaden temperature error detection to message string
- variation_engine.py: cap variation temperature at 1.0 for Claude models
- csv_mode.py: parse SUCCESS_LEVEL from LLM response; pass scenario_rubric filename; resolve runs_capsule_dir to absolute path
- config.py: add max_steps, max_context_tokens, export_astra fields
- workflow_v11.md: add MCP response schema guidance, absolute-path rule, env-setup via shell tool, no built-in file I/O rule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closing in favour of the mimosa_v2_lfx → mimosa_v2 PR: dev is the integration branch, and active development happens on mimosa_v2.
Context
Changes introduced during an ASB Haffner task_001 evaluation run on `mimosa_v2`. All 14 files are general-purpose improvements; none is Haffner-specific. The main dev should review each rationale and decide whether to merge as-is or fold into a larger cleanup.

Single cherry-pick of commit `641010e` from `mimosa_v2` onto `dev`. Conflicts resolved:

- `evolution_engine.py`: kept new `_export_astra()` method (absent on dev)
- `astra_exporter.py`: restored (was deleted on dev, re-added here)

Per-change rationale

`sources/core/factory.py`: MCP startup, infinite loop → bounded retry with exponential backoff

Problem: `while tool_setup == False:` had no exit condition. If MCP servers never became ready (port conflict, crash), the process would spin forever, consuming CPU with no diagnostic.

Fix: 5-attempt loop with 1 s → 30 s exponential backoff; raises `RuntimeError` with a clear message after exhaustion.

Decision point: retry count and backoff ceiling are hardcoded; could be config fields if needed.
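
For reviewers who want the shape of the bounded retry without opening the diff, here is a minimal sketch. It is not the code in `factory.py`: `wait_for_mcp_tools`, the `discover` callable, and the doubling factor are all assumptions made for illustration; only the 5 attempts, the 1 s start, the 30 s ceiling, and the final `RuntimeError` come from the description above.

```python
import time


def wait_for_mcp_tools(discover, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call discover() until it returns True, backing off between attempts.

    `discover` is a hypothetical zero-argument callable standing in for the
    real MCP tool setup; it returns True once the servers are ready.
    Returns the attempt number that succeeded.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if discover():
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            # Doubling is an assumed growth factor; the delay never exceeds max_delay.
            delay = min(delay * 2, max_delay)
    raise RuntimeError(
        f"MCP tool discovery failed after {max_attempts} attempts; "
        "check that the MCP servers are running and their ports are free."
    )
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. If the retry count and ceiling become config fields, they would map directly onto `max_attempts` and `max_delay`.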
`sources/core/orchestrator.py`: error type classification (TIMEOUT / SYNTAX_ERROR / RUNTIME_ERROR)

Problem: all execution failures were logged as a generic "execution failed" with no type signal, making targeted retry strategies impossible.

Fix: classifies errors into three buckets so the evolution loop can apply different remediation (e.g., syntax errors never benefit from re-running the same code).
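
To make the three buckets concrete, the sketch below classifies a sandbox failure from its stderr text. The function names, the `timed_out` flag, and the string markers it searches for are assumptions for illustration; the actual detection logic in `orchestrator.py` may differ.

```python
from enum import Enum


class ErrorType(Enum):
    TIMEOUT = "TIMEOUT"
    SYNTAX_ERROR = "SYNTAX_ERROR"
    RUNTIME_ERROR = "RUNTIME_ERROR"


def classify_error(stderr: str, timed_out: bool = False) -> ErrorType:
    """Map a failed sandbox run onto one of the three buckets.

    The markers checked here are illustrative, not the orchestrator's real rules.
    """
    if timed_out or "TimeoutError" in stderr or "timed out" in stderr.lower():
        return ErrorType.TIMEOUT
    if any(m in stderr for m in ("SyntaxError", "IndentationError", "TabError")):
        return ErrorType.SYNTAX_ERROR
    # Anything else that failed is treated as a runtime error.
    return ErrorType.RUNTIME_ERROR


def should_rerun_unchanged(error_type: ErrorType) -> bool:
    """Re-running identical code cannot fix a syntax error."""
    return error_type is not ErrorType.SYNTAX_ERROR
```

A caller in the evolution loop could branch on `should_rerun_unchanged()` to skip pointless re-executions and go straight to regeneration.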
`sources/core/workflow_factory.py`: three hardening changes

- Python keyword identifier validation: `return`, `import`, etc. as node names produced a confusing `SyntaxError` at runtime. Now caught at generation time with an actionable message.
- Atomic `state_result.json` write: write to `.tmp`, then `os.replace()`. Without this, a crash mid-write leaves a truncated JSON file that poisons subsequent reads.
- Compile check after `assemble_workflow()`: calling `compile()` before submitting to the sandbox catches syntax errors without paying the full sandbox overhead.

`sources/core/workflow_runner.py`: temp exec scripts deleted in `finally`

Problem: `{exec_id}.py` temp files were only deleted on the happy path. A crash or exception left them accumulating on disk across long eval runs.

Fix: moved deletion into `finally` so it runs unconditionally.

`sources/core/evolution_engine.py`: three independent fixes

- Ghost `WorkflowInfo` guard: when workflow generation failed (uuid `"generation_failed"`), the code still constructed a `WorkflowInfo` with that path, silently corrupting `rewards_history` with 0.0 placeholders. Fixed by gating construction on `_valid_uuid`.
- `_export_astra()` returns bool + writes `.export_status` sidecar: callers can now inspect whether export succeeded without re-running anything; failures are non-fatal and self-documenting.
- `VariationEngine` histories cleared on session start: `score_history`, `textual_gradient_history`, and `agent_count_history` were persisting across tasks in a batch run, causing the mutator to apply boldness gradients calibrated to the previous task. Fixed by clearing them in `evolve()`.

`sources/core/selection.py`: `_refresh_member_metrics()` moved before eviction check

Problem: the eviction decision was being made against stale metrics (scores not yet refreshed for the current iteration). Moving the refresh before the eviction check ensures correct ordering.
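
The three `workflow_factory.py` hardening changes described above can be sketched roughly as follows. The helper names (`validate_node_name`, `write_json_atomically`, `check_compiles`) are hypothetical and chosen for this example; only the techniques (keyword check, write to `.tmp` then `os.replace()`, and a `compile()` pre-check) come from the PR.

```python
import json
import keyword
import os


def validate_node_name(name: str) -> None:
    """Reject node names that cannot be used as Python identifiers."""
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(
            f"Invalid node name {name!r}: it must be a valid Python "
            "identifier and must not be a reserved keyword."
        )


def write_json_atomically(path: str, payload: dict) -> None:
    """Write to a sibling .tmp file, then swap it into place.

    os.replace() is atomic when source and target are on the same
    filesystem, so readers see either the old file or the complete new one.
    """
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp_path, path)


def check_compiles(source: str, filename: str = "<assembled_workflow>"):
    """Return None if the source compiles, else a short error description."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None
```

`compile()` only parses and byte-compiles the source; it does not execute it, which is why it is cheap enough to run before every sandbox submission.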
`sources/core/planner.py`: `_can_execute_step()` verifies expected outputs of completed deps

Problem: a completed dependency that hadn't actually produced its expected output files was still marking the step as executable, allowing dependent steps to run on missing inputs.

Fix: verifies that expected output files exist on disk before considering a dep "done."
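
A minimal sketch of the dependency check, assuming a simplified step model: the `Step` dataclass and its field names are invented for this example and do not mirror the planner's real data structures.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical stand-in for a planner step."""
    name: str
    depends_on: list = field(default_factory=list)
    expected_outputs: list = field(default_factory=list)
    completed: bool = False


def can_execute_step(step: Step, steps_by_name: dict) -> bool:
    """A step is executable only if every dependency is completed
    AND each of that dependency's expected output files exists on disk."""
    for dep_name in step.depends_on:
        dep = steps_by_name[dep_name]
        if not dep.completed:
            return False
        if not all(os.path.isfile(path) for path in dep.expected_outputs):
            return False
    return True
```

The second check is the new part: a dependency flagged as completed but missing its files now blocks the dependent step instead of letting it run on absent inputs.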
`sources/modules/smolagent_factory.py`: four fixes

- `fcntl.flock` file lock in `save_memories()`: concurrent agents writing to the shared memory file could interleave writes. Lock scope is tight (write only).
- `except TimeoutError: raise` before generic handler: the broad `except Exception` was swallowing `TimeoutError`, making debugging impossible (timeout looked like a silent success).
- `result['exception'] = e` in `_run_agent()` inner loop: exception capture had been removed in a prior refactor, breaking retry visibility.
- `max_steps=35` and context-window guard callback: sets a hard cap on agent steps and triggers graceful early exit when the context approaches the configured `max_context_tokens` threshold to prevent OOM crashes.

`sources/transparency/astra_exporter.py`: fail-fast on empty trace

Problem: an empty `agent_trace` list produced a malformed YAML with null fields, silently exported without error.

Fix: returns `None` early with a clear log message; the caller's `.export_status` sidecar records the failure.

`sources/core/llm_provider.py`: Opus 4.x temperature handling

Problem: Anthropic's Opus 4.x API rejects the `temperature` parameter entirely (not just clamps it). The previous code attempted to recover by falling back to `temperature=1.0`, which also failed.

Fix: `_is_no_temperature_model()` detects Opus 4.x and omits the parameter from the request body. Error detection also broadened to catch provider message string variants.

`sources/core/variation_engine.py`: Claude temperature cap at 1.0

Problem: `variation_temperature` was set to 1.2 for all models including Claude, triggering an Anthropic API validation error (temperature must be ≤ 1.0).

Fix: caps at 1.0 for Claude models.
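
The two temperature rules above (omit for Opus 4.x, cap at 1.0 for other Claude models) combine as in the sketch below. The substring patterns used to recognise model names and the `build_request_kwargs` helper are assumptions for illustration; the real `_is_no_temperature_model()` may match differently.

```python
import re

# Assumed naming pattern for Opus 4.x model identifiers (illustrative only).
_NO_TEMPERATURE_PATTERN = re.compile(r"claude-opus-4", re.IGNORECASE)


def is_no_temperature_model(model: str) -> bool:
    """True for models assumed to reject the temperature parameter."""
    return bool(_NO_TEMPERATURE_PATTERN.search(model))


def build_request_kwargs(model: str, temperature: float, messages: list) -> dict:
    """Build request kwargs, applying the two temperature rules."""
    kwargs = {"model": model, "messages": messages}
    if is_no_temperature_model(model):
        # Omit the key entirely rather than sending a fallback value.
        return kwargs
    if "claude" in model.lower():
        # Claude models accept temperatures up to 1.0, so clamp 1.2 down.
        temperature = min(temperature, 1.0)
    kwargs["temperature"] = temperature
    return kwargs
```

Omitting the key, rather than sending `temperature=1.0`, matters because the PR reports that the API rejects the parameter itself, not just out-of-range values.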
`sources/benchmark_evaluation/csv_mode.py`: three eval correctness fixes

- Parse `SUCCESS_LEVEL` from LLM response via regex: the field was always defaulting to "Medium" because the parser never searched the response text.
- Pass `scenario_rubric=scenario_rubric_filename`: was `None`, causing scenario-specific rubric files to be silently ignored.
- `runs_capsule_dir` resolved to absolute path: relative paths broke isolated runs started from directories other than the repo root.

`sources/prompts/workflow_v11.md`: four agent guidance rules

Agents repeatedly made the same four mistakes across evaluation runs:

- `subprocess` instead of the shell MCP tool

Added explicit rules with examples to the workflow prompt so agents self-correct before generating code.
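
For the `SUCCESS_LEVEL` fix in `csv_mode.py`, a regex-based parser could look like the sketch below. The set of levels (`High`, `Medium`, `Low`) and the `SUCCESS_LEVEL: <value>` layout are assumptions; the PR only confirms that the field exists and that the default is "Medium".

```python
import re

# Assumed response layout: "SUCCESS_LEVEL: High", optionally with markdown bold.
_SUCCESS_LEVEL_RE = re.compile(
    r"SUCCESS_LEVEL\**\s*[:=]\s*\**\s*(High|Medium|Low)\b",
    re.IGNORECASE,
)


def parse_success_level(response_text: str, default: str = "Medium") -> str:
    """Extract the success level from the LLM response text.

    Falls back to `default` only when the field is genuinely absent.
    """
    match = _SUCCESS_LEVEL_RE.search(response_text)
    if match is None:
        return default
    return match.group(1).capitalize()
```

Keeping the default as an explicit parameter makes it visible when a result came from the fallback rather than from the model's answer.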
`config.py`: three new first-class config fields

`max_steps`, `max_context_tokens`, and `export_astra` were being passed as ad-hoc kwargs without config-file backing. Promoted to `Config` dataclass fields with `jsonify()` / `from_json()` round-trip support so they survive serialization.

Test evidence

- `config.py` field changes: `uv run python -c "from config import Config; c = Config(); print(c.max_steps, c.export_astra)"` passes.
- $1.04 cumulative cost.
- `metabo_dev.csv` tasks.

🤖 Generated with Claude Code