Aspect-based evaluation framework - deterministic checks + harness judges. Score anything (agentic outputs, workflows, banana!) with simple YAML check definitions.
The name was inspired by this song (my kids love it)
Eval Banana discovers YAML check definitions from eval_checks/ directories, runs them, and produces a report. Every check scores 0 or 1 with equal weight.
Two check types:
| Type | Purpose | How it works |
|---|---|---|
| `deterministic` | Objective assertions (file existence, content, structure) | Runs a Python script via subprocess; exit 0 = pass |
| `harness_judge` | LLM-as-a-judge (coherence, accuracy, tone) | Invokes the configured AI agent to score target files; expects `{"score": 0\|1}` |
The harness judge uses one of the following agent CLIs: `codex`, `gemini`, `claude`, `openhands`, `opencode`, or `pi`.
Create a directory called `eval_checks/` anywhere in your project. Add YAML files, one per check.
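For example, a project using the two checks shown below might contain the following layout (file names are illustrative; naming each file after its check `id` is just a convention):

```
eval_checks/
  output_file_exists.yaml
  summary_is_accurate.yaml
```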
A `deterministic` check runs an inline Python script; exit code 0 means pass:

```yaml
schema_version: 1
id: output_file_exists
type: deterministic
description: Verify that output.json was generated.
target_paths:
  - output.json
script: |
  import json, sys
  from pathlib import Path
  ctx = json.loads(Path(sys.argv[1]).read_text())
  target = ctx["targets"][0]
  assert target["exists"], f"{target['path']} not found"
```
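With the file in place you can validate the definition and run just this check; a minimal invocation, where the check ID matches the `id` field above:

```bash
eb validate
eb run --check-id output_file_exists
```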
A `harness_judge` check instead provides natural-language `instructions` for the judge:

```yaml
schema_version: 1
id: summary_is_accurate
type: harness_judge
description: The generated summary accurately reflects source data.
target_paths:
  - summary.txt
  - source_data.json
instructions: |
  Compare the summary against the source data.
  Score 1 if accurate, 0 if it contains fabricated claims.
```

Requires a configured harness agent: set `agent` under `[harness]` in the config, or pass `--harness-agent`.
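For example, to run only this check and pick the agent on the command line (the agent name is an example; use whichever supported agent CLI you have installed and authenticated):

```bash
eb run --check-id summary_is_accurate --harness-agent claude
```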
Eval Banana's binary 0/1 scoring philosophy draws directly on two earlier bodies of work:
- Hamel Husain's Creating LLM-as-a-Judge that drives business results — argues that binary pass/fail judgments produce more reliable, actionable evals than Likert-style 1-5 scales.
- RAGAS's Aspect Critic metric — evaluates outputs against a natural-language aspect definition and returns a binary verdict.
The harness_judge check type is essentially an Aspect Critic: you describe what "good" looks like in plain language, and the judge returns {"score": 0|1}.
eval-banana ships agent skills in the skills/ directory of the repository. Install them into your project with the npx skills CLI:
```bash
npx skills add https://github.com/writeitai/eval-banana
```

The CLI auto-detects installed agents and copies skills into their native directories (`.claude/skills/`, `.codex/skills/`, `.agents/skills/`, `.gemini/skills/`, etc.).
```bash
# Install
uv sync

# Initialize project config
eb init

# Run all discovered checks
eb run

# List discovered checks without running
eb list

# Validate YAML definitions without running
eb validate
```

```bash
# Using uv (recommended)
uv add eval-banana

# Using pip
pip install eval-banana

# From source (development)
git clone https://github.com/writeitai/eval-banana.git
cd eval-banana
uv sync --extra dev
```

After installation the CLI is available as `eb`.
harness_judge checks require a configured harness agent. Configure it via TOML or CLI flags.
```toml
# .eval-banana/config.toml
[harness]
agent = "codex"
model = "gpt-5.4"
# reasoning_effort = "high"
```

The harness subprocess inherits the parent shell environment, so provide API keys the same way you would when running the agent locally:
| Agent | Environment variable |
|---|---|
| `claude` | `ANTHROPIC_API_KEY` |
| `codex` | `OPENAI_API_KEY` |
| `gemini` | `GEMINI_API_KEY` or `GOOGLE_API_KEY` (or Application Default Credentials) |
| `openhands` | depends on the configured LLM backend |
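For local runs, export the key in the same shell before invoking `eb`; for example, with the `claude` agent from the table above:

```bash
# The variable is inherited by the harness subprocess
export ANTHROPIC_API_KEY="..."
eb run
```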
Example GitHub Actions step:
```yaml
- name: Run evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: eb run
```

You can also inject extra env vars via `[harness.env]` in your config:
```toml
[harness.env]
MY_CUSTOM_VAR = "value"
```

Add `[agents.<name>]` sections to override built-in templates or define new ones:
```toml
[agents.myagent]
command = ["my-cli", "run"]
shared_flags = ["--headless"]
prompt_flag = "--prompt"
model_flag = "--model"
```

Eval Banana uses a single project-level TOML config at `.eval-banana/config.toml`.
Create it with `eb init`.
Settings are resolved in the following order of precedence:

- CLI arguments (`--output-dir`, `--harness-model`, etc.)
- Environment variables (`EVAL_BANANA_*`)
- Project config (`.eval-banana/config.toml`)
- Built-in defaults
| Setting | Default | Env var |
|---|---|---|
| `output_dir` | `.eval-banana/results` | `EVAL_BANANA_OUTPUT_DIR` |
| `pass_threshold` | `1.0` | `EVAL_BANANA_PASS_THRESHOLD` |
| `llm_max_input_chars` | `0` | `EVAL_BANANA_LLM_MAX_INPUT_CHARS` |
| `harness.agent` | unset | `EVAL_BANANA_HARNESS_AGENT` |
| `harness.model` | unset | `EVAL_BANANA_HARNESS_MODEL` |
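As a quick illustration of the precedence order (values are examples only): an environment variable overrides the config file, and a CLI flag overrides both.

```bash
# Env var beats the config file's pass_threshold (default 1.0)
EVAL_BANANA_PASS_THRESHOLD=0.8 eb run

# CLI flag beats the env var
EVAL_BANANA_PASS_THRESHOLD=0.8 eb run --pass-threshold 0.9
```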
```
eb init [--force]       Create project config
eb run [OPTIONS]        Run all discovered checks
eb list [OPTIONS]       List discovered checks
eb validate [OPTIONS]   Validate YAML without running
```
Options for run/list/validate:
```
--check-dir PATH         Scan only this directory
--check-id TEXT          Run only this check ID
--output-dir TEXT        Override output directory
--pass-threshold FLOAT   Minimum pass ratio (0.0-1.0)
--verbose                Enable debug logging
--cwd TEXT               Working directory
```
Harness options (run only):
```
--harness-agent TEXT              Agent CLI used by harness_judge checks
--harness-model TEXT              Model override for the agent
--harness-reasoning-effort TEXT   Reasoning effort level
```
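A few combined invocations using only the options above (paths and values are illustrative):

```bash
# Run checks from a specific directory with a relaxed pass ratio and debug logging
eb run --check-dir ./eval_checks --pass-threshold 0.9 --verbose

# List discovered checks for another working directory
eb list --cwd ../other-project

# One-off harness override for this run
eb run --harness-agent codex --harness-model gpt-5.4
```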
Each run creates a timestamped directory under the configured output_dir:
```
.eval-banana/results/<run_id>/
  report.json                # Machine-readable full report
  report.md                  # Human-readable Markdown report
  checks/
    <check_id>.json          # Per-check result
    <check_id>.stdout.txt    # Captured stdout (if any)
    <check_id>.stderr.txt    # Captured stderr (if any)
```
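A small shell sketch for pulling up the most recent report (this assumes run IDs sort chronologically by name, which holds for timestamp-based IDs):

```bash
# Print the human-readable report of the latest run
latest=$(ls -1d .eval-banana/results/*/ | sort | tail -n 1)
cat "${latest}report.md"
```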
```bash
uv sync --extra dev

make test        # Run tests
make fix         # Auto-fix lint + format
make pyright     # Type check
make all-check   # Lint + format + types + tests (matches CI)
```

Issues and pull requests are welcome. Please run `make all-check` before opening a PR.
See CHANGELOG.md for release notes.
Apache License 2.0 — see LICENSE for details.
Copyright 2026 WriteIt.ai s.r.o.