Skip to content

rbcorrales/wpvdb-eval

Repository files navigation

wpvdb-eval

Reproducible quality and latency evaluations for the wpvdb WordPress vector-search plugin. The harness runs externally against live WordPress sites: it POSTs to each site's /wp-json/wpvdb/v1/query and shells out to wp-cli for site metadata. Two tracks (quality A/B and dense-query latency) share one config file, one set of site helpers, and one runtime.

Prerequisites

  • Python 3.10+
  • The wpvdb plugin installed and active on every WordPress site you intend to evaluate.
  • For the performance track only: the application-password user must have the manage_options capability, so the _debug block with per-phase timings comes back in the REST response.
  • A way to run wp-cli against each target site (see wp-cli executor in Setup).

Layout

  • common/ — shared infrastructure (config, sites, auth, REST client, cache ops, CSV helpers).
  • quality/ — relevance A/B. 40 hand-curated queries, NDCG/Recall/MRR, win rate. See quality/README.md.
  • performance/ — dense-query latency. 500 stratified queries, percentile metrics, CDF. See performance/README.md.

Setup

1. Install Python deps

pip install -r requirements.txt

2. Configure

cp config.example.yaml config.yaml
$EDITOR config.yaml

Fill in your WordPress sites, optionally the LLM endpoint, and tune the performance.build_queries curation lists to your corpus domain. config.yaml is gitignored.

3. wp-cli executor

A shell script that takes a wp-cli command string as $1 and runs it against the site identified by the SITE_HOSTNAME env var. Contents are operator-specific (SSH host, transport, auth), so it lives outside version control.

cp bin/site-exec.example.sh bin/site-exec.sh
$EDITOR bin/site-exec.sh        # implement the transport
chmod +x bin/site-exec.sh

If you already have an equivalent script elsewhere, make bin/site-exec.sh a thin wrapper: exec /path/to/your/wrapper "$@". Resolution order at runtime: WPVDB_EVAL_SITE_EXEC env var, then the executor: key in config.yaml, then bin/site-exec.sh.

4. Query batteries

cp quality/queries.example.csv quality/queries.csv
cp performance/queries.example.csv performance/queries.csv

Or generate a full 500-query perf battery from your corpus title vocab plus the curation in config.yaml:

python3 performance/build_queries.py

quality/labels.csv is not copied from labels.example.csv — labels are tied to specific WordPress post IDs on your sites, so the example is a schema reference only. Generate your own labels via the capture → label → finalize pipeline (see quality/README.md).

5. Optional: charts

pip install matplotlib to render CDF / distance-distribution PNGs in score.py. Without it, scoring still works and the chart files are skipped. (Already included if you install requirements-dev.txt.)

Running the evals

Quality track

python3 quality/preflight.py
python3 quality/capture.py
python3 quality/regen_worksheet.py
python3 quality/auto_label.py
python3 quality/finalize_labels.py
python3 quality/score.py

auto_label.py reads the llm: block in config.yaml (or the matching WPVDB_EVAL_LLM_* env vars).

Performance track

python3 performance/preflight.py
python3 performance/build_queries.py        # optional: regenerate queries.csv
python3 performance/capture.py
python3 performance/score.py

Per-execution artifacts land under quality/runs/<utc-timestamp>/ and performance/runs/<utc-timestamp>/. Each run dir is self-contained: meta, raw CSVs, scored REPORT.md.

Development

Install lint + type-check tools (Ruff and basedpyright):

pip install -r requirements-dev.txt

Run before committing:

ruff check .
ruff format --check .
basedpyright

Tool configs live in pyproject.toml under [tool.ruff] and [tool.basedpyright]. There is no CI yet — run these locally before pushing.

License

MIT. See LICENSE.

About

Quality and latency evaluation harness for wpvdb

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors