Latest V2 improvement to dev: verifier hardening, QD evolution improvements, and workflow runner isolation #156

Merged: Fosowl merged 85 commits into dev from mimosa_v2 on Jun 22, 2026

Fosowl (Member) commented on Jun 22, 2026

Summary

This PR merges the mimosa_v2 development branch into dev, bringing ~85 commits of verifier hardening, quality-diversity evolution improvements, workflow runner isolation, and CLI robustness fixes.

What's Changed

Verifier & Grounding

  • Per-claim verification scripts now use AST parsing, file-first analysis, and vacuous-antecedent rejection
  • Source B extracts literal identifiers from goal body; Source C scans image artefacts with PIL; Source E flags dataset-sentinel leakage in selection
  • Bounded retry on verifier-side failures with per-claim failure handling
  • Grounding retriever now plans multi-step targeted literature sub-searches (dropped regex cues)
  • Reduced verifier temperature and shorter prompt gradients for more consistent outputs
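
The bounded retry with per-claim failure handling could look roughly like the sketch below. It is illustrative only and not taken from the mimosa_v2 code: `VerifierError`, `ClaimResult`, `verify_claims`, and the default of three attempts are all hypothetical names and values. The idea is that a verifier-side failure (for example, a crashed check script) is retried a bounded number of times, and a claim that never gets a clean run is recorded as unverified rather than aborting the whole batch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class VerifierError(Exception):
    """Hypothetical error type: the verifier itself failed, not the claim."""


@dataclass
class ClaimResult:
    claim_id: str
    passed: Optional[bool]  # None means the claim could not be verified
    attempts: int
    error: Optional[str] = None


def verify_claims(
    claims: Dict[str, str],
    run_check: Callable[[str], bool],
    max_attempts: int = 3,  # assumed bound, not the project's actual value
) -> Dict[str, ClaimResult]:
    """Verify each claim independently, retrying only verifier-side failures."""
    results: Dict[str, ClaimResult] = {}
    for claim_id, claim in claims.items():
        last_error: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            try:
                passed = run_check(claim)
            except VerifierError as exc:
                # Verifier-side failure: remember it and try again.
                last_error = str(exc)
                continue
            results[claim_id] = ClaimResult(claim_id, passed, attempt)
            break
        else:
            # Retries exhausted: mark this claim unverified, keep the others.
            results[claim_id] = ClaimResult(claim_id, None, max_attempts, last_error)
    return results
```

A claim whose check returns `False` is a genuine verification failure and is not retried; only exceptions raised by the verifier consume the retry budget.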

Evolution & QD

  • Failure-fingerprint descriptor is now centered before QD novelty computation
  • Genotype-embedding novelty + length penalty in selection
  • Plateau-based stagnation replaces embedding stagnation in variation engine
  • Reduced crossover rate for more stable evolution
  • Per-goal evolution trees rendered into workflow folders
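
As a rough illustration of plateau-based stagnation detection, the sketch below flags stagnation when the best fitness over the most recent generations has not improved on the earlier best by a minimum margin. The function name `is_plateaued`, the window size, and the `min_delta` threshold are assumptions made for this example; the variation engine's actual criterion may differ.

```python
from typing import Sequence


def is_plateaued(
    best_fitness_history: Sequence[float],
    window: int = 5,          # assumed number of recent generations to inspect
    min_delta: float = 1e-3,  # assumed minimum improvement that counts as progress
) -> bool:
    """Return True if the last `window` generations brought no real improvement.

    `best_fitness_history` holds the best fitness of each generation, oldest
    first, with higher values being better.
    """
    if len(best_fitness_history) <= window:
        # Not enough history to compare a recent window against earlier runs.
        return False
    recent_best = max(best_fitness_history[-window:])
    earlier_best = max(best_fitness_history[:-window])
    return recent_best - earlier_best < min_delta
```

Comparing fitness rather than embedding distance means a population can keep changing its genotypes without being reported as progressing, which is one plausible reason to prefer a plateau signal.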

Workflow Runner & Orchestrator

  • Managed venv isolation: workflow dependencies installed into isolated virtual environments
  • max_tokens capped at 16384 to prevent token-limit crashes
  • Random workflow temperature sampling for diversity
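
A minimal sketch of the runner-side ideas above, using only the Python standard library. The helper names (`venv_python`, `create_workflow_env`, `cap_max_tokens`, `sample_temperature`) and the temperature range are hypothetical; only the 16384 cap comes from this PR. The actual runner may manage environments differently.

```python
import random
import subprocess
import sys
import venv
from pathlib import Path
from typing import Sequence

MAX_TOKENS_CAP = 16384  # cap stated in this PR


def venv_python(env_dir: Path) -> Path:
    """Path of the interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_workflow_env(env_dir: Path, requirements: Sequence[str]) -> Path:
    """Create an isolated venv and install workflow dependencies into it."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = venv_python(env_dir)
    if requirements:
        # Use the venv's own interpreter so nothing leaks into the host env.
        subprocess.run(
            [str(python), "-m", "pip", "install", *requirements], check=True
        )
    return python


def cap_max_tokens(requested: int) -> int:
    """Clamp a requested max_tokens value to the configured cap."""
    return min(requested, MAX_TOKENS_CAP)


def sample_temperature(rng: random.Random, low: float = 0.2, high: float = 1.0) -> float:
    """Draw a workflow temperature uniformly; the range is an assumed example."""
    return rng.uniform(low, high)
```

Running the workflow with the interpreter returned by `create_workflow_env` keeps its dependencies out of the orchestrator's own environment.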

CLI & UX

  • Hardened JSON parsing across all CLI tools (gibberish-tolerant prompts)
  • Improved error handling in evaluation_cli, memory_chat_cli, and onboard_cli
  • CSV recovery prompts now respect user-typed values
  • Execution sandbox stdout truncation fixed
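
One common way to harden JSON parsing against noisy input is to scan for the first decodable object instead of parsing the whole string. The sketch below shows that approach with the standard library's `json.JSONDecoder.raw_decode`; the function name `extract_first_json_object` is hypothetical and this is not necessarily how the CLI tools in this PR do it.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in `text`, or None if there is none.

    Surrounding prose, code fences, or stray characters are ignored.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # this brace did not start valid JSON; try the next one
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    return None
```

Returning `None` instead of raising lets the caller decide whether to re-prompt the user or fall back to a default.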

Agent & Misc

  • Smolagent additional_authorized_imports updated
  • Agent max steps bumped to 129
  • Python 3.12 runner specification enforced
  • Docs synced with multi-agent evolution claims

Breaking Changes

None expected — all changes are additive or internal refactoring.

Testing

  • Verifier per-claim scripts pass on test goals
  • Workflow runner venv creation tested locally
  • CLI error handling verified with malformed inputs
  • Evolution pipeline runs end-to-end without stagnation false-positives

Contribution checklist

Please confirm the following before requesting review:

  • [x] I have read CONTRIBUTING.md, docs/licensing-notes.md, and the repository license information (LICENSE, NOTICE) for this repository (Apache License 2.0).
  • [x] I have read docs/cla-process.md and understand the CLA workflow for contributions.
  • [x] I confirm that I have the legal right to submit this contribution.
  • [x] I have signed the required short Individual Contributor Agreement, or I will complete it if requested.
  • [x] If this contribution is made in the course of employment or under institutional IP rules, I understand that maintainers may also request the optional employer authorization.
  • [x] I am not knowingly submitting code that is incompatible with this repository’s Apache 2.0 licensing terms and CLA requirements.

Fosowl and others added 30 commits June 9, 2026 19:04
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fosowl and others added 29 commits June 20, 2026 10:44
… set: crossover rate to 0.4 + hard cap at 0.7
…uant floor

When the model creator only self-hosts at int4/fp4 (e.g. moonshotai for
kimi-k2.7-code), the previous floor dropped their endpoint and routed to a
community fp8 reseller. For benchmark reproducibility the authoritative
first-party endpoint is preferred over a third-party requantization; the
JSON-validity + determinism probe still gates them on merit.

Also widen the runtime quantizations filter to admit any quant the precheck
approved, so OpenRouter doesn't silently drop the pinned first-party endpoint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gimosa_v2

Fosowl merged commit 2057cbc into dev on Jun 22, 2026