Reproducible quality and latency evaluations for the wpvdb WordPress vector-search plugin. The harness runs externally against live WordPress sites: it POSTs to each site's /wp-json/wpvdb/v1/query and shells out to wp-cli for site metadata. Two tracks (quality A/B and dense-query latency) share one config file, one set of site helpers, and one runtime.
- Python 3.10+
- The
wpvdbplugin installed and active on every WordPress site you intend to evaluate. - For the performance track only: the application-password user must have the
manage_optionscapability, so the_debugblock with per-phase timings comes back in the REST response. - A way to run wp-cli against each target site (see wp-cli executor in Setup).
common/— shared infrastructure (config, sites, auth, REST client, cache ops, CSV helpers).quality/— relevance A/B. 40 hand-curated queries, NDCG/Recall/MRR, win rate. Seequality/README.md.performance/— dense-query latency. 500 stratified queries, percentile metrics, CDF. Seeperformance/README.md.
pip install -r requirements.txt
cp config.example.yaml config.yaml
$EDITOR config.yaml
Fill in your WordPress sites, optionally the LLM endpoint, and tune the performance.build_queries curation lists to your corpus domain. config.yaml is gitignored.
A shell script that takes a wp-cli command string as $1 and runs it against the site identified by the SITE_HOSTNAME env var. Contents are operator-specific (SSH host, transport, auth), so it lives outside version control.
cp bin/site-exec.example.sh bin/site-exec.sh
$EDITOR bin/site-exec.sh # implement the transport
chmod +x bin/site-exec.sh
If you already have an equivalent script elsewhere, make bin/site-exec.sh a thin wrapper: exec /path/to/your/wrapper "$@". Resolution order at runtime: WPVDB_EVAL_SITE_EXEC env var, then the executor: key in config.yaml, then bin/site-exec.sh.
cp quality/queries.example.csv quality/queries.csv
cp performance/queries.example.csv performance/queries.csv
Or generate a full 500-query perf battery from your corpus title vocab plus the curation in config.yaml:
python3 performance/build_queries.py
quality/labels.csv is not copied from labels.example.csv — labels are tied to specific WordPress post IDs on your sites, so the example is a schema reference only. Generate your own labels via the capture → label → finalize pipeline (see quality/README.md).
pip install matplotlib to render CDF / distance-distribution PNGs in score.py. Without it, scoring still works and the chart files are skipped. (Already included if you install requirements-dev.txt.)
python3 quality/preflight.py
python3 quality/capture.py
python3 quality/regen_worksheet.py
python3 quality/auto_label.py
python3 quality/finalize_labels.py
python3 quality/score.py
auto_label.py reads the llm: block in config.yaml (or the matching WPVDB_EVAL_LLM_* env vars).
python3 performance/preflight.py
python3 performance/build_queries.py # optional: regenerate queries.csv
python3 performance/capture.py
python3 performance/score.py
Per-execution artifacts land under quality/runs/<utc-timestamp>/ and performance/runs/<utc-timestamp>/. Each run dir is self-contained: meta, raw CSVs, scored REPORT.md.
Install lint + type-check tools (Ruff and basedpyright):
pip install -r requirements-dev.txt
Run before committing:
ruff check .
ruff format --check .
basedpyright
Tool configs live in pyproject.toml under [tool.ruff] and [tool.basedpyright]. There is no CI yet — run these locally before pushing.
MIT. See LICENSE.