SkillOpt

SkillOpt optimizes Claude Agent Skills using GEPA's optimize_anything API. It can optimize an existing skill or create one from scratch, grounded in standard benchmark task data.

Installation

uv sync

# For SWE-bench and SearchQA support
uv pip install datasets

API configuration

Copy .env.example to .env and set your credentials:

cp .env.example .env

Provider	`.env` keys needed
OpenAI	`OPENAI_API_KEY`
Google Gemini	`GEMINI_API_KEY` + `OPENAI_API_BASE`
Ollama (local)	`API_KEY=ollama` + `OPENAI_API_BASE=http://localhost:11434/v1`
Any OpenAI-compatible	`API_KEY` + `OPENAI_API_BASE`

Key resolution order: OPENAI_API_KEY → GEMINI_API_KEY → API_KEY. Flags --api-key and --api-base take precedence over .env.
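
The lookup order above can be sketched as a small function. This is only an illustration of the documented precedence, not SkillOpt's actual code: the name `resolve_api_key` and its parameters are hypothetical.

```python
import os
from typing import Mapping, Optional

# Order taken from the paragraph above; the function itself is a
# hypothetical illustration, not part of the SkillOpt API.
ENV_KEY_ORDER = ("OPENAI_API_KEY", "GEMINI_API_KEY", "API_KEY")


def resolve_api_key(cli_key: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the --api-key value if given, else the first non-empty env var."""
    if cli_key:
        return cli_key
    env = os.environ if env is None else env
    for name in ENV_KEY_ORDER:
        value = env.get(name)
        if value:
            return value
    return None
```

With both `GEMINI_API_KEY` and `API_KEY` set and no flag passed, this sketch returns the Gemini key, because it comes earlier in the order.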

Commands

Analyze

Score an existing skill against best practices without modifying it.

uv run python main.py analyze <skill_directory>

Static optimization

Optimize using rule-based scoring only. No LLM judge or eval cases are required.

uv run python main.py optimize <skill_directory> --model <model> --api-key <key>

Eval-based optimization

Optimize with a hybrid evaluator: 40% static rules + 60% LLM-as-judge assertion pass rate.

# Hand-written evals
uv run python main.py optimize-evals <skill_directory> --evals <evals.json>

# Auto-generate evals from the skill content
uv run python main.py optimize-evals <skill_directory> --generate-evals

# Use a standard benchmark as the eval source
uv run python main.py optimize-evals <skill_directory> --benchmark tau-bench --benchmark-split airline
uv run python main.py optimize-evals <skill_directory> --benchmark swe-bench --benchmark-variant swe-bench-verified
uv run python main.py optimize-evals <skill_directory> --benchmark searchqa
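
The 40/60 weighting used by the hybrid evaluator can be written out as below. This is a sketch under assumptions: the function name `hybrid_score` is hypothetical, the static score is assumed to be normalized to the range 0 to 1, and an eval set with no assertions is assumed to contribute a pass rate of 0.

```python
def hybrid_score(static_score: float, passed: int, total: int,
                 static_weight: float = 0.4,
                 judge_weight: float = 0.6) -> float:
    """Blend a static score in [0, 1] with the LLM-judge assertion pass rate.

    Hypothetical helper: only the 40/60 weights come from the README.
    """
    if total < 0 or passed < 0 or passed > total:
        raise ValueError("expected 0 <= passed <= total")
    # Assumption: with no assertions, the judge component contributes nothing.
    pass_rate = passed / total if total else 0.0
    return static_weight * static_score + judge_weight * pass_rate


# A static score of 80 out of 100 with 3 of 4 assertions passing.
combined = hybrid_score(static_score=80 / 100, passed=3, total=4)
```

Under these assumptions a skill cannot score above 0.4 on static rules alone, so assertion failures dominate the combined score.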

Create a skill from scratch

No existing SKILL.md is required: GEPA generates the first candidate from the benchmark task context.

uv run python main.py optimize-evals <new_directory> --from-scratch --benchmark tau-bench --benchmark-split airline
uv run python main.py optimize-evals <new_directory> --from-scratch --benchmark searchqa

Convert any HuggingFace dataset to eval format

Map any HuggingFace dataset fields to the SkillOpt eval schema. Output can be passed directly to optimize-evals.

uv run python main.py convert <dataset_id> \
    --prompt-field <field> \
    [--context-field <field>] \
    [--answer-field <field>] \
    [--split <split>] [--n <n>] [--config <config>] \
    -o evals.json

Dot notation is supported for nested fields: --prompt-field answers.text.
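
As a rough illustration of how a dotted field path can be resolved against a nested record, the sketch below walks one key at a time. The helper `get_nested` and the sample row are hypothetical, and the converter's real behavior (for example, how it treats lists) may differ.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(record: Mapping[str, Any], dotted_path: str) -> Any:
    """Follow a path such as 'answers.text' through nested mappings."""
    current: Any = record
    for key in dotted_path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            raise KeyError(f"{dotted_path!r}: no field {key!r}")
        current = current[key]
    return current


# Hypothetical dataset row with a nested "answers" field.
row = {"question": "Who wrote Dune?", "answers": {"text": ["Frank Herbert"]}}
answer_texts = get_nested(row, "answers.text")
```

A path with no dots, such as `question`, resolves to the top-level field unchanged.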

Dry run

Load or generate evals without running optimization, which is useful for inspecting the eval cases first.

uv run python main.py optimize-evals <skill_directory> --benchmark tau-bench --dry-run

Common flags

Flag	Description
`--model <provider/model>`	LLM in litellm format (default: `openai/gpt-4o`)
`--api-key <key>`	API key (fallback: env vars)
`--api-base <url>`	Base URL for OpenAI-compatible endpoints
`--max-evals <n>`	Max GEPA iterations (default: 10)
`--from-scratch`	Create a new skill instead of optimizing an existing one
`--dry-run`	Phase 1 only — load/generate evals, skip optimization
`-o <dir>`	Output directory (default: `output/<skill-name>-<timestamp>/`)

Benchmark sources

`--benchmark`	`--benchmark-split` / `--benchmark-variant`	Source
`tau-bench`	`airline`, `retail`	GitHub (auto-fetched)
`swe-bench`	`swe-bench`, `swe-bench-verified`, `swe-bench-lite`	HuggingFace
`searchqa`	`train`, `validation`	HuggingFace

Output structure

Unless -o is given, every run writes to output/<skill-name>-<timestamp>/:

output/<skill-name>-<timestamp>/
├── SKILL.md                    optimized or generated skill
├── benchmark.json              scores and assertion verdicts
├── evals_with_assertions.json  eval cases used
└── trajectory/
    ├── trajectory.jsonl        live per-iteration log
    ├── trajectory_summary.json post-run summary
    └── candidates/             every proposed variant

Watch a run live: tail -f output/<run>/trajectory/trajectory.jsonl
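
For inspection after a run, or while one is in progress, the trajectory log can also be loaded in Python. The sketch below assumes only that trajectory.jsonl holds one JSON object per line, as the extension suggests. The helper name `read_trajectory` is hypothetical, and no particular field names are assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Union


def read_trajectory(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file into a list of per-iteration records."""
    records: List[Dict[str, Any]] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip lines that do not parse, such as a final line that
                # a running optimization has only partially written.
                continue
    return records
```

Calling `read_trajectory("output/<run>/trajectory/trajectory.jsonl")` and printing each record's keys is a quick way to find out which fields a given run actually logs.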

Library usage

from skillopt import SkillParser, SkillAnalyzer, load_benchmark

# Analyze
skill = SkillParser().parse_directory("<skill_directory>")
report = SkillAnalyzer().analyze(skill)
print(f"Score: {report.score}/100")

# Load benchmark evals
evals = load_benchmark("tau-bench", split="airline", n=3)
evals = load_benchmark("swe-bench", split="test", variant="swe-bench-verified", n=3)
evals = load_benchmark("searchqa", split="validation", n=3)

Project structure

main.py                  # CLI (analyze, optimize, optimize-evals, convert)
skillopt/                # Core library
  skill_parser.py        # Parses SKILL.md + referenced files
  skill_analyzer.py      # Scores skills against best-practice rules
  trajectory.py          # Per-iteration candidate logging
  benchmarks/            # Benchmark adapters (tau_bench, swe_bench, searchqa)
scripts/                 # optimize_skill.py, optimize_skill_with_evals.py, convert_benchmark.py
output/                  # Generated run outputs (gitignored)

License

Apache 2.0 — see LICENSE.
