Curator: evaluate & tune LLM consolidation quality (after outcome telemetry) #132

@onsails

Description

Context

The periodic skill curator (crates/bot/src/learning_curator.rs, prompt CURATOR_SYSTEM_PROMPT in crates/right-codegen/src/agent_def.rs) ships an LLM consolidation pass: umbrella-merge near-duplicate rightx-* skills, demote narrow skills into an umbrella's references/, archive with absorbed_into. It is enabled by default and runs today.

What's missing: we have no measurement of how good those consolidation decisions are. We don't know whether it merges the right skills, over-merges, or rarely fires usefully. The marketing claim ("the curator decides two skills are duplicates and merges one into the other") is currently unproven in any deployment.

Blocked on

Curator outcome telemetry + dashboard observability (track "A"). We need run-level outcome data (what each curator pass merged, archived, or demoted, and why) before we can judge or tune quality. Start this issue only after that lands.
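
To make "run-level outcome data" concrete, here is a minimal sketch of what one curator pass might record. Every type and field name below (CuratorAction, CuratorDecision, CuratorRunOutcome, ActionCounts) is hypothetical and does not come from learning_curator.rs; the actual shape belongs to track "A".

```rust
// Hypothetical types for illustration only; not the telemetry schema from track "A".

/// One consolidation action, mirroring the three behaviours described in Context.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CuratorAction {
    /// Near-duplicate skills merged under one umbrella skill.
    UmbrellaMerge { umbrella: String, merged: Vec<String> },
    /// A narrow skill moved into an umbrella's references/.
    Demote { skill: String, umbrella: String },
    /// A skill archived, recording which skill absorbed it.
    Archive { skill: String, absorbed_into: String },
}

/// A single decision plus the rationale the model gave for it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CuratorDecision {
    pub action: CuratorAction,
    pub reason: String,
}

/// Everything one curator pass decided.
#[derive(Debug, Clone, Default)]
pub struct CuratorRunOutcome {
    pub decisions: Vec<CuratorDecision>,
}

/// Per-run tallies, counted per decision (not per affected skill).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ActionCounts {
    pub merged: usize,
    pub demoted: usize,
    pub archived: usize,
}

impl CuratorRunOutcome {
    /// Count decisions by kind.
    pub fn counts(&self) -> ActionCounts {
        let mut counts = ActionCounts::default();
        for decision in &self.decisions {
            match &decision.action {
                CuratorAction::UmbrellaMerge { .. } => counts.merged += 1,
                CuratorAction::Demote { .. } => counts.demoted += 1,
                CuratorAction::Archive { .. } => counts.archived += 1,
            }
        }
        counts
    }

    /// True when the pass changed nothing, which speaks to "does it act at all?".
    pub fn is_noop(&self) -> bool {
        self.decisions.is_empty()
    }
}
```

Keeping the reason next to each action is what would let a later evaluation separate a wrong decision from a right decision made for the wrong reason.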

Scope (once telemetry exists)

  • Evaluate the consolidation pass against a real skill library: are the umbrella / demote / archive decisions correct? Are there false merges or missed duplicates? Does it act at all?
  • Tune CURATOR_SYSTEM_PROMPT from observed behavior.
  • Consider a lightweight eval harness / golden cases for consolidation decisions.
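
One possible shape for the golden cases, as a rough sketch: each case lists the absorptions a reviewer expects, and a scorer compares them with what the curator actually did. All names here (GoldenCase, CaseScore, score_case, absorption) and the rightx-* skill names in the usage note are hypothetical, and reducing every decision to a "(skill, absorbed into)" pair is a simplifying assumption, not the curator's real output format.

```rust
use std::collections::BTreeSet;

/// One "skill X ends up absorbed into skill Y" fact (hypothetical simplification).
pub type Absorption = (String, String);

/// A hand-labelled expectation for one small skill library.
pub struct GoldenCase {
    pub name: &'static str,
    pub expected: BTreeSet<Absorption>,
}

/// How the curator's output compares with the golden expectation.
#[derive(Debug, PartialEq, Eq)]
pub struct CaseScore {
    /// Absorptions that were both expected and performed.
    pub correct: usize,
    /// Absorptions performed but not expected (over-merging).
    pub false_merges: usize,
    /// Absorptions expected but not performed (missed duplicates).
    pub missed: usize,
}

/// Score one case by set comparison of expected vs. actual absorptions.
pub fn score_case(case: &GoldenCase, actual: &BTreeSet<Absorption>) -> CaseScore {
    CaseScore {
        correct: case.expected.intersection(actual).count(),
        false_merges: actual.difference(&case.expected).count(),
        missed: case.expected.difference(actual).count(),
    }
}

/// Convenience constructor for an absorption pair.
pub fn absorption(skill: &str, into: &str) -> Absorption {
    (skill.to_string(), into.to_string())
}
```

For example, if a case expects rightx-git-rebase and rightx-git-squash to be absorbed into rightx-git, and the curator instead absorbs rightx-git-rebase and an unrelated rightx-docker, score_case reports one correct absorption, one false merge, and one missed duplicate. A run with no absorptions at all scores zero correct and zero false merges, which makes "rarely fires usefully" visible as a high missed count.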

References

  • Deferred Phase-2: docs/superpowers/specs/2026-05-22-prefilter-classifier-and-curator-state-design.md §11 (outcome-driven prompt calibration).
  • Curator design: docs/superpowers/specs/2026-05-22-skill-learning-writer-curator-design.md.

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request
