diff --git a/etc/bench/README.md b/etc/bench/README.md
new file mode 100644
index 0000000..c28e595
--- /dev/null
+++ b/etc/bench/README.md
@@ -0,0 +1,258 @@
+# PurlValidator data structure evaluation
+
+This document details the research and evaluation of various efficient data
+structures for compact PURL storage and lookup.
+
+It contains:
+
+- references to the evaluation/bench scripts
+- documentation on the various libraries and data structures under consideration
+- the final choice (spoiler: an FST, aka a finite state transducer)
+
+
+## Context and Problem
+
+PurlValidator needs a local queryable dataset of known PURLs to answer one question:
+
+> Does this PURL exist in the reference dataset?
+
+The lookup index should be built for each release and shipped with the library
+for access without a network connection. We want Go, Rust, and Python
+implementations. The PURLs themselves are collected using PurlDB and FederatedCode.
+
+
+## Solution
+
+### High level design
+
+The lookup key is a PURL, cleaned to keep only type, namespace, and name
+(without version, qualifiers, and subpath).
+
+This keeps validation focused for now. Version validation could come later by
+extending indexed PURLs with versions or by baking in support for VERS version
+parsing.
+
+### Solution elements: Data structures considered
+
+- Built-in set and map
+- FST
+- DAWG
+- Bloom filter
+- SQLite
+
+Considered but not evaluated:
+
+- Minimal perfect hash: no compression
+- Trie or radix tree: a DAWG and an FST are similar, but more compact. Suffix
+  trees are way too big.
+
+#### Built-in set and map
+
+Built-in sets and maps are the simplest baseline in each language. They are as
+fast as can be, but they have no compression and no built-in serialization or
+memory mapping, and memory use grows quickly for large datasets.
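As a baseline illustration, membership lookup with a built-in set is a one-liner; this minimal sketch uses made-up sample base PURLs, not the actual reference dataset:

```python
# Built-in set baseline: exact membership checks, no compression,
# no serialization; the whole dataset lives in RAM.
# These sample base PURLs are illustrative only.
known_purls = {
    "pkg:npm/lodash",
    "pkg:pypi/requests",
    "pkg:maven/org.apache.commons/commons-lang3",
}

def purl_exists(base_purl: str) -> bool:
    """Return True when the base PURL is in the reference set."""
    return base_purl in known_purls

print(purl_exists("pkg:pypi/requests"))       # True
print(purl_exists("pkg:pypi/not-a-package"))  # False
```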
+
+An interesting path could be to use built-in sets in Rust and Go by generating
+the code with all the PURL strings so that there is no specific deserialization
+step. The problem there is the size, as the data is not compressed.
+
+Built-in structures are useful as a reference for benchmarks but are not
+suitable as the main packaged data structure because they are too big.
+
+
+#### FST: finite state transducer
+
+
+
+An FST stores a sorted set of strings in a compact automaton. PURLs share common
+prefixes such as `pkg:npm/`, `pkg:pypi/`, and `pkg:maven/`. This sharing helps
+reduce stored data.
+
+FST lookup is exact for this use case. The Rust and Go implementations already
+ship an FST file. The library opens or embeds that file and performs membership
+checks without rebuilding the index.
+
+The main cost is build complexity. Input must be prepared, sorted, and encoded
+when the package data is refreshed.
+
+
+#### DAWG: directed acyclic word graph
+
+See
+
+This is aka a DAFSA (deterministic acyclic finite state automaton).
+
+
+A DAWG is a compact data structure for a set of strings. It can merge repeated
+prefixes and suffixes like an FST. The DAWG is interesting in that it can
+support prefix lookup, but in general a DAWG is bigger and slower than an FST,
+and has less mature and maintained library support.
+
+
+#### Bloom filter
+
+
+
+A Bloom filter can store a large set in a small space, but it is a probabilistic
+structure: it can answer that a value is surely absent or maybe present. In the
+latter case, you need an extra full dataset to further validate the "maybe";
+this is the problem of false positives with these filters. Hence a Bloom filter
+cannot be used as the only lookup structure, and does not make sense here.
+Instead, a Bloom filter could be used in front of an exact structure to skip
+some exact lookups as a performance optimization, but outside of the validator.
+
+
+#### SQLite
+
+
+
+SQLite can store PURLs in a SQL table with an index for exact lookup.
+
+The tradeoff is operational weight.
Each SQLite language binding adds a
+dependency (though SQLite is built into Python). The validator only needs
+immutable membership checks, not the full power of SQL queries and update
+transactions; on the other hand, the same SQLite DB could be shared across all
+languages.
+
+SQLite could be useful as a benchmark and debugging format. It is not the first
+choice for a small language library because the data is not compressed, but it
+is a likely future enhancement.
+
+
+### Preferred solution: FST
+
+Based on the benchmarks and other criteria, let's use an FST-backed lookup for
+all languages. Do not use a Bloom filter (probabilistic). Do not use native
+structures that use too much memory.
+
+And for the library selection, we have these high level requirements:
+
+- We want exact results without false positives, i.e., no Bloom filter.
+- Offline use with no network is a must: the dataset must be bundled in the
+  releases.
+- With build-time index construction, the construction time is not critical.
+- The bundled index should be small enough to ship below the crates.io and PyPI
+  archive size limits.
+- No rebuild at startup/runtime, and fast enough load time from disk, ideally
+  memory-mapped.
+- Fast enough lookup.
+- Libraries should be maintained, active FOSS for Rust/Go/Python.
+
+The final selected FST libraries are:
+
+- Rust: fst crate with a memory-mapped set
+- Python: ducer with a memory-mapped, dict-like map
+  (ducer uses the Rust fst crate inside)
+- Go: vellum "fst" module (originally from
+  now at
+  ) which is mostly inspired by the
+  Rust fst crate
+
+
+## Appendix: Benchmarks
+
+This directory contains evaluation and benchmark files for PurlValidator.
+
+It compares structures for offline PURL membership checks using these
+implementations:
+
+- Python: memory-mapped `ducer`.
+- Rust: crate `fst`.
+- Go: embedded Vellum FST.
+
+...
as well as the built-in Python set and dict, SQLite, and a Rust DAWG.
+
+### Expected checkout layout
+
+Run the scripts from a directory with these repository checkouts:
+
+- `/purl-validator`
+- `/purl-validator.rs`
+- `/purlvalidator-go`
+
+### benchmarking FST vs. DAWG
+
+There is a good benchmark in Go comparing FST and DAWG data structures (and
+other structures) that highlights why an FST is a better structure for our case
+than a DAWG:
+
+
+
+We also did a simple synthetic benchmark of the Rust fst and dawg crates with
+actual base PURLs, using the data in
+
+
+The `etc/bench/rust-fst-dawg-bench` code compares these fst and dawg crates.
+
+The dataset profile has 2,324,119 unique sorted base PURLs. The benchmark runs
+1M queries, of which 500K are expected to fail.
+
+- The fst crate index was built in 11s, with a 26MB serialized file, and took
+  0.703s for 1M lookups.
+- The dawg crate index was built in 18s, with an 831MB serialized file, and took
+  28s for 1M lookups.
+
+The outcome is that the preferred structure is an FST over a DAWG (at least
+with these implementations).
+
+### benchmarking FST against builtin and SQLite
+
+Since we picked the FST as the winner, additional review has focused on Python,
+comparing the ducer fst library against other approaches. Since ducer is based
+on the Rust fst crate, and Go's vellum also follows the fst design, this
+essentially covers the three languages at once.
+
+The `etc/bench/alternative_benchmark.py` script compares Python lookup
+using a text file with one PURL per line for these candidates:
+
+- Python `set`.
+- Python `dict`.
+- Python sorted list plus `bisect`.
+- In-memory SQLite.
+- FST using a `ducer.Map`.
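As an illustration of the sorted list plus `bisect` candidate, here is a minimal sketch of the binary-search membership test the script uses (the sample base PURLs are made up for illustration):

```python
from bisect import bisect_left

# Sorted-list membership via binary search: O(log n) lookups over a
# plain Python list. The sample base PURLs below are illustrative only.
purls = sorted([
    "pkg:maven/org.apache.commons/commons-lang3",
    "pkg:npm/lodash",
    "pkg:pypi/requests",
])

def contains(value: str) -> bool:
    """Binary-search the sorted list for an exact match."""
    index = bisect_left(purls, value)
    return index != len(purls) and purls[index] == value

print(contains("pkg:npm/lodash"))  # True
print(contains("pkg:npm/loda"))    # False
```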
+
+Data is from `purl-validator.rs/fst_builder/data/`.
+
+Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:
+
+```text
+structure            build (secs) lookup (secs)  storage size
+-------------------- ------------ -------------- ---------------------------
+python set           0.206540     0.275906       304MB in RAM
+python dict          0.449625     0.429034       298MB in RAM
+ducer FST            3.700943     1.805585       26MB on disk
+sorted list+bisect   0.017540     2.783555       236MB in RAM
+sqlite in memory     4.855480     4.220032       207MB on disk (or 65MB with zstd)
+```
+
+### benchmarking FST in Python vs. Go vs. Rust
+
+This benchmark runs each of the three released validator implementations. The
+script is in `etc/bench/go-rust-py_benchmark.py`.
+
+Data is from `purl-validator.rs/fst_builder/data/`.
+
+Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:
+
+```text
+structure              build (secs) lookup (secs)  storage size (on disk)
+---------------------- ------------ -------------- ---------------------------
+Python purl-validator  16.664847    4.926029       25MB
+Rust purl-validator.rs 11.849877    0.348128       25MB
+Go purlvalidator-go    2.325181     0.704749       25MB
+```
+
+### Evaluation
+
+The results are consistent with expectations: Rust is faster than Go and Python.
+
+And the Python on-disk FST is the same size as the Rust FST (since this is the
+same backing code).
+
+Some surprises:
+
+- The Go index build is the fastest, which is surprising and could be an
+  avenue of improvement for the Rust fst crate.
+
+- Leaving aside the 10x larger RAM need, the Python set and dict are competitive
+  speed-wise (faster than the on-disk Rust FST) and super fast to build too.
+
diff --git a/etc/bench/alternative_benchmark.py b/etc/bench/alternative_benchmark.py
new file mode 100644
index 0000000..90b0fcd
--- /dev/null
+++ b/etc/bench/alternative_benchmark.py
@@ -0,0 +1,231 @@
+#!/usr/bin/env python3
+"""
+Benchmark of set data structures for PURLs.
+""" + +from __future__ import annotations + +import argparse +import mmap +import random +import sqlite3 +import sys +import tempfile +import time + +from bisect import bisect_left +from dataclasses import dataclass +from pathlib import Path + +import ducer + + +@dataclass +class Result: + name: str + build_seconds: float + lookup_seconds: float + hits: int + storage: str + + +def iter_input_files(path: Path) -> list[Path]: + if path.is_dir(): + return sorted(path.glob("*.txt")) + return [path] + + +def load_purls(path: Path, limit: int | None) -> tuple[list[str], int, int]: + purls: list[str] = [] + raw_count = 0 + for input_file in iter_input_files(path): + with input_file.open("r", encoding="utf-8", errors="replace") as lines: + for line in lines: + purl = line.strip() + if not purl: + continue + raw_count += 1 + purls.append(purl) + if limit and len(purls) >= limit: + unique = sorted(set(purls)) + return unique, raw_count, len(iter_input_files(path)) + unique = sorted(set(purls)) + return unique, raw_count, len(iter_input_files(path)) + + +def time_calls(name: str, lookup, queries: list[str]) -> tuple[str, int, float]: + start = time.perf_counter() + hits = sum(1 for query in queries if lookup(query)) + elapsed = time.perf_counter() - start + return name, hits, elapsed + + +def benchmark_set(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = set(purls) + build_seconds = time.perf_counter() - start + name, hits, elapsed = time_calls("python_set", values.__contains__, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_dict(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = dict.fromkeys(purls, 1) + build_seconds = time.perf_counter() - start + name, hits, elapsed = time_calls("python_dict", values.__contains__, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_sorted_list(purls: 
list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = list(purls) + build_seconds = time.perf_counter() - start + + def contains(value: str) -> bool: + index = bisect_left(values, value) + return index != len(values) and values[index] == value + + name, hits, elapsed = time_calls("sorted_list_bisect", contains, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_sqlite(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + connection = sqlite3.connect(":memory:") + connection.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY)") + connection.executemany("INSERT INTO purls (purl) VALUES (?)", ((purl,) for purl in purls)) + connection.commit() + build_seconds = time.perf_counter() - start + + def contains(value: str) -> bool: + row = connection.execute("SELECT 1 FROM purls WHERE purl = ?", (value,)).fetchone() + return row is not None + + name, hits, elapsed = time_calls("sqlite_memory", contains, queries) + connection.close() + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_ducer(purls: list[str], queries: list[str]) -> Result | None: + + with tempfile.TemporaryDirectory() as temp_dir: + map_path = Path(temp_dir) / "purls.map" + entries = [(purl.encode("utf-8"), 1) for purl in purls] + start = time.perf_counter() + ducer.Map.build(map_path, entries) + build_seconds = time.perf_counter() - start + with map_path.open("rb") as map_file: + mapped = mmap.mmap(map_file.fileno(), 0, access=mmap.ACCESS_READ) + purl_map = ducer.Map(mapped) + + def contains(value: str) -> bool: + return bool(purl_map.get(value.encode("utf-8"))) + + name, hits, elapsed = time_calls("ducer_map", contains, queries) + return Result(name, build_seconds, elapsed, hits, f"{map_path.stat().st_size} bytes") + + +def make_queries(purls: list[str], count: int) -> list[str]: + if not purls: + return [] + hit_count = count // 2 + miss_count = count - hit_count + 
hits = [random.choice(purls) for _ in range(hit_count)]
+    misses = [f"{random.choice(purls)}-missing-{index}" for index in range(miss_count)]
+    queries = hits + misses
+    random.shuffle(queries)
+    return queries
+
+
+def format_report(
+    input_path: Path,
+    file_count: int,
+    raw_count: int,
+    purls: list[str],
+    queries: list[str],
+    results: list[Result],
+    load_seconds: float,
+    seed: int,
+) -> str:
+    lines = [
+        "PurlValidator lookup structure results",
+        "========================================",
+        "",
+        f"Input path: {input_path}",
+        f"Input files: {file_count}",
+        f"Input load seconds: {load_seconds:.6f}",
+        f"Raw PURL lines: {raw_count}",
+        f"Unique PURLs: {len(purls)}",
+        f"Lookup queries: {len(queries)}",
+        f"Expected hits: {len(queries) // 2}",
+        f"Random seed: {seed}",
+        "",
+        "Results",
+        "-------",
+        "",
+        f"{'structure':<20} {'build_s':>12} {'lookup_s':>12} {'hits':>10} {'storage':>18}",
+        f"{'-' * 20} {'-' * 12} {'-' * 12} {'-' * 10} {'-' * 18}",
+    ]
+    for result in results:
+        lines.append(
+            f"{result.name:<20} "
+            f"{result.build_seconds:>12.6f} "
+            f"{result.lookup_seconds:>12.6f} "
+            f"{result.hits:>10} "
+            f"{result.storage:>18}"
+        )
+    return "\n".join(lines) + "\n"
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--input",
+        required=True,
+        type=Path,
+        help="Directory of text files with one PURL per line.",
+    )
+    parser.add_argument("--limit", type=int, default=100000, help="Maximum PURLs to load.")
+    parser.add_argument("--queries", type=int, default=20000, help="Number of lookup queries.")
+    parser.add_argument("--seed", type=int, default=1, help="Random seed for reproducible queries.")
+    parser.add_argument("--report", type=Path, help="Write report to this file.")
+    args = parser.parse_args()
+
+    random.seed(args.seed)
+    start = time.perf_counter()
+    purls, raw_count, file_count = load_purls(args.input, args.limit)
+    load_seconds = time.perf_counter() - start
+    if not purls:
+        print(f"No PURLs in {args.input}", file=sys.stderr)
+        return 1
+
+    queries =
make_queries(purls, args.queries)
+    results = [
+        benchmark_set(purls, queries),
+        benchmark_dict(purls, queries),
+        benchmark_sorted_list(purls, queries),
+        benchmark_sqlite(purls, queries),
+    ]
+    ducer_result = benchmark_ducer(purls, queries)
+    if ducer_result:
+        results.append(ducer_result)
+
+    report = format_report(
+        input_path=args.input,
+        file_count=file_count,
+        raw_count=raw_count,
+        purls=purls,
+        queries=queries,
+        results=results,
+        load_seconds=load_seconds,
+        seed=args.seed,
+    )
+    print(report, end="")
+    if args.report:
+        args.report.parent.mkdir(parents=True, exist_ok=True)
+        args.report.write_text(report, encoding="utf-8")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/etc/bench/go-rust-py_benchmark.py b/etc/bench/go-rust-py_benchmark.py
new file mode 100644
index 0000000..0400685
--- /dev/null
+++ b/etc/bench/go-rust-py_benchmark.py
@@ -0,0 +1,473 @@
+#!/usr/bin/env python3
+"""
+Benchmark the Python, Rust, and Go PurlValidator implementations.
+
+The benchmark uses the PURL source data from purl-validator.rs/fst_builder/data.
+And checks: + +- time to build each index +- index size on disk +- time to run 1,000,000 lookups, with half known and half unknown PURLs + +""" + +from __future__ import annotations + +import argparse +import importlib.util +import json +import os +from pathlib import Path +import random +import shutil +import subprocess +import sys +import tempfile +import textwrap +import time + + +WORKSPACE = Path("workspace") +PYTHON_REPO = WORKSPACE / "purl-validator" +RUST_REPO = WORKSPACE / "purl-validator.rs" +GO_REPO = WORKSPACE / "purlvalidator-go" +DEFAULT_DATA_DIR = RUST_REPO / "fst_builder/data" +DEFAULT_REPORT = WORKSPACE / "benchmark-report.md" +DEFAULT_WORK_DIR = WORKSPACE / "benchmark-tmp" + + + +class BenchmarkError(Exception): + pass + + +def run_command(command: list[str], cwd: Path, env: dict[str, str] | None = None) -> None: + completed = subprocess.run( + command, + cwd=cwd, + env=env, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + ) + if completed.returncode: + output = completed.stdout[-4000:] + raise BenchmarkError( + f"Command failed with exit code {completed.returncode}: {' '.join(command)}\n{output}" + ) + + +def timed(function): + start = time.perf_counter() + result = function() + return time.perf_counter() - start, result + + +def read_purls(data_dir: Path) -> tuple[list[str], int, int]: + purls: list[str] = [] + files = sorted(data_dir.glob("*.txt")) + raw_count = 0 + for path in files: + with path.open("r", encoding="utf-8", errors="replace") as lines: + for line in lines: + purl = line.strip() + if not purl: + continue + raw_count += 1 + purls.append(purl) + return sorted(set(purls)), raw_count, len(files) + + +def write_queries(purls: list[str], query_count: int, seed: int, path: Path) -> None: + rng = random.Random(seed) + hit_count = query_count // 2 + miss_count = query_count - hit_count + known_lookup_purls = [purl for purl in purls if purl.startswith("pkg:pypi/")] + if not known_lookup_purls: + known_lookup_purls = 
purls
+    queries = [rng.choice(known_lookup_purls) for _ in range(hit_count)]
+    queries.extend(f"pkg:npm/purl-validator-benchmark-unknown-{index:07}" for index in range(miss_count))
+    rng.shuffle(queries)
+    path.write_text("\n".join(queries) + "\n", encoding="utf-8")
+
+
+def copy_data_files(data_dir: Path, target_dir: Path) -> None:
+    if data_dir.resolve() == target_dir.resolve():
+        return
+    if target_dir.exists():
+        shutil.rmtree(target_dir)
+    target_dir.mkdir(parents=True)
+    for source in sorted(data_dir.glob("*.txt")):
+        shutil.copy2(source, target_dir / source.name)
+
+
+def write_rust_lookup_harness(work_dir: Path, fst_path: Path, query_path: Path) -> Path:
+    project = work_dir / "rust_lookup"
+    if project.exists():
+        shutil.rmtree(project)
+    (project / "src").mkdir(parents=True)
+    (project / "Cargo.toml").write_text(
+        textwrap.dedent(
+            f"""
+            [package]
+            name = "purl-validator-rust-lookup-bench"
+            version = "0.1.0"
+            edition = "2024"
+
+            [dependencies]
+            fst = "0.4.7"
+            packageurl = "0.6.0"
+            """
+        ).strip()
+        + "\n",
+        encoding="utf-8",
+    )
+    (project / "src/main.rs").write_text(
+        textwrap.dedent(
+            f"""
+            use fst::Set;
+            use packageurl::PackageUrl;
+            use std::fs;
+            use std::str::FromStr;
+            use std::time::Instant;
+
+            fn main() -> Result<(), Box<dyn std::error::Error>> {{
+                let fst_data = fs::read({json.dumps(str(fst_path))})?;
+                let set = Set::new(fst_data.as_slice())?;
+                let queries = fs::read_to_string({json.dumps(str(query_path))})?;
+                let start = Instant::now();
+                let mut hits = 0usize;
+                for query in queries.lines() {{
+                    let purl = PackageUrl::from_str(query)?;
+                    if purl.version().is_some()
+                        || !purl.qualifiers().is_empty()
+                        || purl.subpath().is_some()
+                    {{
+                        return Err("unsupported PURL".into());
+                    }}
+                    let key = query.trim_end_matches('/');
+                    if set.contains(key) {{
+                        hits += 1;
+                    }}
+                }}
+                println!("hits={{}}", hits);
+                println!("lookup_seconds={{:.6}}", start.elapsed().as_secs_f64());
+                Ok(())
+            }}
+            """
+        ).strip()
+        + "\n",
+        encoding="utf-8",
+    )
+    return project
+ + +def write_go_lookup_harness(work_dir: Path, fst_path: Path, query_path: Path) -> Path: + project = work_dir / "go_lookup" + if project.exists(): + shutil.rmtree(project) + project.mkdir(parents=True) + (project / "go.mod").write_text( + textwrap.dedent( + """ + module purl-validator-go-lookup-bench + + go 1.22.3 + + require ( + github.com/blevesearch/vellum v1.1.0 + github.com/package-url/packageurl-go v0.1.5 + ) + """ + ).strip() + + "\n", + encoding="utf-8", + ) + (project / "main.go").write_text( + textwrap.dedent( + f""" + package main + + import ( + "bufio" + "fmt" + "os" + "time" + + "github.com/blevesearch/vellum" + packageurl "github.com/package-url/packageurl-go" + ) + + func main() {{ + data, err := os.ReadFile({json.dumps(str(fst_path))}) + if err != nil {{ + panic(err) + }} + fstMap, err := vellum.Load(data) + if err != nil {{ + panic(err) + }} + file, err := os.Open({json.dumps(str(query_path))}) + if err != nil {{ + panic(err) + }} + defer file.Close() + + start := time.Now() + hits := 0 + scanner := bufio.NewScanner(file) + scanner.Buffer(make([]byte, 1024), 1024*1024) + for scanner.Scan() {{ + query := scanner.Text() + instance, err := packageurl.FromString(query) + if err != nil {{ + panic(err) + }} + if instance.Version != "" || len(instance.Qualifiers) > 0 || instance.Subpath != "" {{ + panic("unsupported PURL") + }} + ok, err := fstMap.Contains([]byte(query)) + if err != nil {{ + panic(err) + }} + if ok {{ + hits++ + }} + }} + if err := scanner.Err(); err != nil {{ + panic(err) + }} + fmt.Printf("hits=%d\\n", hits) + fmt.Printf("lookup_seconds=%.6f\\n", time.Since(start).Seconds()) + }} + """ + ).strip() + + "\n", + encoding="utf-8", + ) + return project + + +def parse_lookup_output(output_path: Path) -> tuple[int, float]: + text = output_path.read_text(encoding="utf-8") + hits = None + seconds = None + for line in text.splitlines(): + if line.startswith("hits="): + hits = int(line.split("=", 1)[1]) + if line.startswith("lookup_seconds="): 
+ seconds = float(line.split("=", 1)[1]) + if hits is None or seconds is None: + raise BenchmarkError(f"Cannot parse lookup output from {output_path}:\n{text}") + return hits, seconds + + +def benchmark_python(purls: list[str], query_path: Path, work_dir: Path) -> dict[str, object]: + sys.path.insert(0, str(PYTHON_REPO / "src")) + spec = importlib.util.spec_from_file_location( + "purl_validator", PYTHON_REPO / "src/purl_validator/__init__.py" + ) + if not spec or not spec.loader: + raise BenchmarkError("Cannot load purl_validator module") + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + + index = work_dir / "python-purls.map" + + def build(): + generated = Path(module.create_purl_map(purls)) + shutil.copy2(generated, index) + + build_seconds, _result = timed(build) + + validator = module.PurlValidator(index) + queries = query_path.read_text(encoding="utf-8").splitlines() + + def lookup(): + hits = 0 + start = time.perf_counter() + for query in queries: + if validator.validate_purl(query): + hits += 1 + return hits, time.perf_counter() - start + + hits, lookup_seconds = lookup() + return { + "name": "Python purl-validator", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": index.stat().st_size, + "hits": hits, + "index": index, + } + + +def benchmark_rust(data_dir: Path, query_path: Path, work_dir: Path) -> dict[str, object]: + copy_data_files(data_dir, RUST_REPO / "fst_builder/data") + index = RUST_REPO / "purls.fst" + if index.exists(): + index.unlink() + + run_command(["cargo", "build", "--release", "--bin", "fst_builder"], cwd=RUST_REPO) + builder = RUST_REPO / "target/release/fst_builder" + build_seconds, _result = timed(lambda: run_command([str(builder)], cwd=RUST_REPO)) + + copied_index = work_dir / "rust-purls.fst" + shutil.copy2(index, copied_index) + + harness = write_rust_lookup_harness(work_dir, copied_index, query_path) + run_command(["cargo", "build", "--release"], cwd=harness) + 
output_path = work_dir / "rust-lookup.out" + + def run_lookup(): + completed = subprocess.run( + [str(harness / "target/release/purl-validator-rust-lookup-bench")], + cwd=harness, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + check=True, + ) + output_path.write_text(completed.stdout, encoding="utf-8") + + run_lookup() + hits, lookup_seconds = parse_lookup_output(output_path) + return { + "name": "Rust purl-validator.rs", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": copied_index.stat().st_size, + "hits": hits, + "index": copied_index, + } + + +def benchmark_go(data_dir: Path, query_path: Path, work_dir: Path) -> dict[str, object]: + copy_data_files(data_dir, GO_REPO / "cmd/data") + index = GO_REPO / "purls.fst" + if index.exists(): + index.unlink() + + env = os.environ.copy() + env["PATH"] = f"/usr/local/go/bin:{env.get('PATH', '')}" + build_seconds, _result = timed( + lambda: run_command(["go", "run", "./cmd/main.go"], cwd=GO_REPO, env=env) + ) + + copied_index = work_dir / "go-purls.fst" + shutil.copy2(index, copied_index) + + harness = write_go_lookup_harness(work_dir, copied_index, query_path) + run_command(["go", "mod", "tidy"], cwd=harness, env=env) + run_command(["go", "build", "-o", "lookup-bench", "."], cwd=harness, env=env) + output_path = work_dir / "go-lookup.out" + + completed = subprocess.run( + [str(harness / "lookup-bench")], + cwd=harness, + env=env, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + check=True, + ) + output_path.write_text(completed.stdout, encoding="utf-8") + hits, lookup_seconds = parse_lookup_output(output_path) + return { + "name": "Go purlvalidator-go", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": copied_index.stat().st_size, + "hits": hits, + "index": copied_index, + } + + +def mib(size: int) -> str: + return f"{size / 1024 / 1024:.2f} MiB" + + +def write_report( + report_path: Path, + data_dir: Path, + 
raw_count: int,
+    unique_count: int,
+    file_count: int,
+    query_count: int,
+    seed: int,
+    results: list[dict[str, object]],
+) -> None:
+    lines = [
+        "# PurlValidator implementation benchmark",
+        "",
+        "This benchmark uses the data from `purl-validator.rs/fst_builder/data/`.",
+        "",
+        "Input summary:",
+        "",
+        f"- Data directory: `{data_dir}`",
+        f"- Input files: `{file_count}`",
+        f"- Raw PURL lines: `{raw_count}`",
+        f"- Unique PURLs: `{unique_count}`",
+        f"- Lookup queries: `{query_count}`",
+        f"- Expected known PURLs: `{query_count // 2}`",
+        f"- Expected unknown PURLs: `{query_count - (query_count // 2)}`",
+        f"- Query seed: `{seed}`",
+        "",
+        "Results:",
+        "",
+    ]
+    for result in results:
+        size = int(result["index_size"])
+        lines.extend(
+            [
+                f"## {result['name']}",
+                "",
+                f"- Build time: `{float(result['build_seconds']):.6f}` seconds",
+                f"- Lookup time: `{float(result['lookup_seconds']):.6f}` seconds",
+                f"- Lookup hits: `{int(result['hits'])}`",
+                f"- Lookup index size: `{size}` bytes, `{mib(size)}`",
+                f"- Lookup index: `{result['index']}`",
+                "",
+            ]
+        )
+    report_path.parent.mkdir(parents=True, exist_ok=True)
+    report_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
+    parser.add_argument("--work-dir", type=Path, default=DEFAULT_WORK_DIR)
+    parser.add_argument("--report", type=Path, default=DEFAULT_REPORT)
+    parser.add_argument("--queries", type=int, default=1000000)
+    parser.add_argument("--seed", type=int, default=1)
+    args = parser.parse_args()
+
+    args.work_dir.mkdir(parents=True, exist_ok=True)
+    query_path = args.work_dir / "queries.txt"
+
+    purls, raw_count, file_count = read_purls(args.data_dir)
+    write_queries(purls, args.queries, args.seed, query_path)
+
+    results = [
+        benchmark_python(purls, query_path, args.work_dir),
+        benchmark_rust(args.data_dir, query_path, args.work_dir),
+        benchmark_go(args.data_dir, query_path,
args.work_dir),
+    ]
+
+    write_report(
+        report_path=args.report,
+        data_dir=args.data_dir,
+        raw_count=raw_count,
+        unique_count=len(purls),
+        file_count=file_count,
+        query_count=args.queries,
+        seed=args.seed,
+        results=results,
+    )
+    print(args.report)
+    return 0
+
+
+if __name__ == "__main__":
+    main()
diff --git a/etc/bench/rust-fst-dawg-bench/.gitignore b/etc/bench/rust-fst-dawg-bench/.gitignore
new file mode 100644
index 0000000..ea8c4bf
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/.gitignore
@@ -0,0 +1 @@
+/target
diff --git a/etc/bench/rust-fst-dawg-bench/Cargo.lock b/etc/bench/rust-fst-dawg-bench/Cargo.lock
new file mode 100644
index 0000000..6f367d4
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/Cargo.lock
@@ -0,0 +1,108 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 4
+
+[[package]]
+name = "bincode"
+version = "1.3.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b1f45e9417d87227c7a56d22e471c6206462cba514c7590c09aff4cf6d1ddcad"
+dependencies = [
+ "serde",
+]
+
+[[package]]
+name = "dawg"
+version = "0.0.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "81c5a800721e14d9d89e9b5b2b8920e341c19ab78687ca8ab25c393f7eddf036"
+dependencies = [
+ "serde",
+ "unicode-segmentation",
+]
+
+[[package]]
+name = "fst"
+version = "0.4.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7ab85b9b05e3978cc9a9cf8fea7f01b494e1a09ed3037e16ba39edc7a29eb61a"
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "rust-fst-dawg-bench"
+version = "0.1.0"
+dependencies = [
+ "bincode",
+ "dawg",
+ "fst",
+]
+
+[[package]]
+name = "serde"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
+dependencies = [
+ "serde_core",
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_core"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
+dependencies = [
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_derive"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "syn"
+version = "2.0.117"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+
+[[package]]
+name = "unicode-segmentation"
+version = "1.13.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9629274872b2bfaf8d66f5f15725007f635594914870f65218920345aa11aa8c"
diff --git a/etc/bench/rust-fst-dawg-bench/Cargo.toml b/etc/bench/rust-fst-dawg-bench/Cargo.toml
new file mode 100644
index 0000000..859234c
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/Cargo.toml
@@ -0,0 +1,9 @@
+[package]
+name = "rust-fst-dawg-bench"
+version = "0.1.0"
+edition = "2024"
+
+[dependencies]
+bincode = "1.3"
+dawg = "0.0.7"
+fst = "0.4.7"
diff --git a/etc/bench/rust-fst-dawg-bench/src/main.rs b/etc/bench/rust-fst-dawg-bench/src/main.rs
new file mode 100644
index 0000000..301f1a9
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/src/main.rs
@@ -0,0 +1,172 @@
+use std::fs;
+use std::fs::File;
+use std::io::{BufRead, BufReader, BufWriter};
+use std::path::{Path, PathBuf};
+use std::time::{Duration, Instant};
+
+
+use dawg::Dawg;
+use fst::raw::Fst;
+use fst::SetBuilder;
+
+// Simple benchmark to compare PURL lookup using a DAWG or an FST
+
+const N_LOOKUPS: usize = 1_000_000;
+const OUT_DIR: &str = "target/purl-bench";
+const PURL_DATA_DIR: &str = "purl-validator.rs/fst_builder/data";
+
+struct BenchResult {
+    name: &'static str,
+    build_time: Duration,
+    disk_bytes: u64,
+    lookup_time: Duration,
+    hits: usize,
+}
+
+/// Collect all PURL files (each with one PURL per line)
+fn purl_files(path: &Path) -> Result<Vec<PathBuf>, Box<dyn std::error::Error>> {
+    let mut files = fs::read_dir(path)?
+        .map(|entry| entry.map(|entry| entry.path()))
+        .collect::<Result<Vec<_>, _>>()?;
+    files.retain(|path| path.extension().and_then(|ext| ext.to_str()) == Some("txt"));
+    files.sort();
+    Ok(files)
+}
+
+fn load_purls(path: &Path) -> Result<(Vec<String>, usize), Box<dyn std::error::Error>> {
+    let mut keys = Vec::new();
+    let mut raw_count = 0;
+    for file_path in purl_files(path)? {
+        let file = File::open(&file_path)?;
+        let reader = BufReader::new(file);
+        for line in reader.lines() {
+            let line = line?;
+            if line.is_empty() {
+                continue;
+            }
+            raw_count += 1;
+            keys.push(line);
+        }
+    }
+    keys.sort();
+    Ok((keys, raw_count))
+}
+
+fn build_queries(keys: &[String]) -> Vec<String> {
+    let mut queries = Vec::with_capacity(N_LOOKUPS);
+    let half = N_LOOKUPS / 2;
+    let n_keys = keys.len();
+    for i in 0..half {
+        queries.push(keys[(i * 9_973) % n_keys].clone());
+        queries.push(format!("{}-missing-{}", keys[(i * 15_485_863) % n_keys], i));
+    }
+    queries
+}
+
+/// Bench for the fst crate
+fn bench_fst(keys: &[String], queries: &[String]) -> Result<BenchResult, Box<dyn std::error::Error>> {
+    let path = Path::new(OUT_DIR).join("real-purls.fst");
+
+    let build_start = Instant::now();
+    {
+        let file = File::create(&path)?;
+        let mut builder = SetBuilder::new(file)?;
+        for key in keys {
+            builder.insert(key)?;
+        }
+        builder.finish()?;
+    }
+    let build_time = build_start.elapsed();
+    let disk_bytes = fs::metadata(&path)?.len();
+
+    let bytes = fs::read(&path)?;
+    let fst = Fst::new(bytes)?;
+    let lookup_start = Instant::now();
+    let hits = queries
+        .iter()
+        .filter(|query| fst.get(query.as_bytes()).is_some())
+        .count();
+    let lookup_time = lookup_start.elapsed();
+
+    Ok(BenchResult {
+        name: "fst::Set",
+        build_time,
+        disk_bytes,
+        lookup_time,
+        hits,
+    })
+}
+
+/// Bench for the dawg crate
+fn bench_dawg_crate(
+    keys: &[String],
+    queries: &[String],
+) -> Result<BenchResult, Box<dyn std::error::Error>> {
+    let path = Path::new(OUT_DIR).join("real-purls.dawg-bincode");
+
+    let build_start = Instant::now();
+    let mut dawg = Dawg::new();
+    for key in keys {
+        dawg.insert(key.clone());
+    }
+    dawg.finish();
+    let build_time = build_start.elapsed();
+
+    {
+        let file = File::create(&path)?;
+        let mut writer = BufWriter::new(file);
+        bincode::serialize_into(&mut writer, &dawg)?;
+    }
+    let disk_bytes = fs::metadata(&path)?.len();
+
+    let lookup_start = Instant::now();
+    let hits = queries
+        .iter()
+        .filter(|query| dawg.is_word(query.as_str(), true).is_some())
+        .count();
+    let lookup_time = lookup_start.elapsed();
+
+    Ok(BenchResult {
+        name: "dawg::Dawg",
+        build_time,
+        disk_bytes,
+        lookup_time,
+        hits,
+    })
+}
+
+fn print_measurement(measurement: &BenchResult) {
+    println!(
+        "| {} | {:.3} | {} | {:.3} | {} |",
+        measurement.name,
+        measurement.build_time.as_secs_f64(),
+        measurement.disk_bytes,
+        measurement.lookup_time.as_secs_f64(),
+        measurement.hits,
+    );
+}
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    fs::create_dir_all(OUT_DIR)?;
+
+    println!("Loading PURLs from {PURL_DATA_DIR}");
+    let load_start = Instant::now();
+    let (keys, raw_count) = load_purls(Path::new(PURL_DATA_DIR))?;
+    let load_time = load_start.elapsed();
+    let queries = build_queries(&keys);
+    println!("Unique sorted keys: {}", keys.len());
+    println!("Input load/sort seconds: {:.3}", load_time.as_secs_f64());
+    println!("Lookup queries: {N_LOOKUPS}");
+    println!("Expected hits: {}", N_LOOKUPS / 2);
+    println!();
+    println!("| structure | build seconds | disk bytes | lookup seconds | hits |");
+    println!("| --- | ---: | ---: | ---: | ---: |");
+
+    let fst = bench_fst(&keys, &queries)?;
+    print_measurement(&fst);
+
+    let dawg = bench_dawg_crate(&keys, &queries)?;
+    print_measurement(&dawg);
+
+    Ok(())
+}