diff --git a/etc/bench/README.md b/etc/bench/README.md
new file mode 100644
index 0000000..c28e595
--- /dev/null
+++ b/etc/bench/README.md
@@ -0,0 +1,258 @@
+# PurlValidator data structure evaluation
+
+This document details the research and evaluation of various efficient data
+structures for compact PURL storage and lookup.
+
+It contains:
+
+- references to the evaluation/bench scripts
+- documentation on the various libraries and data structures under consideration
+- the final choice (spoiler: an FST, aka a finite state transducer)
+
+
+## Context and Problem
+
+PurlValidator needs a local queryable dataset of known PURLs to answer one question:
+
+> Does this PURL exist in the reference dataset?
+
+The lookup index should be built for each release and shipped with the library
+for access without a network connection. We want Go, Rust, and Python
+implementations. The PURLs themselves are collected using PurlDB and FederatedCode.
+
+
+## Solution
+
+### High level design
+
+The lookup key is a PURL, cleaned to keep only type, namespace, and name
+(without version, qualifiers, and subpath).
+
+This keeps validation focused for now. Version validation could come later by
+extending indexed PURLs with versions or by baking in support for VERS version
+parsing.
+
+### Solution elements: Data structures considered
+
+- Built-in set and map
+- FST
+- DAWG
+- Bloom filter
+- SQLite
+
+Considered but not evaluated:
+
+- Minimal perfect hash: no compression
+- Trie or radix tree: a DAWG and an FST are similar, but more compact. Suffix
+  trees are way too big.
+
+#### Built-in set and map
+
+Built-in sets and maps are the simplest baseline in each language. They are as
+fast as can be, but they have no compression and no built-in serialization or
+memory mapping, and memory use grows quickly for large datasets.
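As a baseline illustration, membership lookup with a built-in set is a one-liner; this minimal sketch uses made-up sample base PURLs, not the actual reference dataset:

```python
# Built-in set baseline: exact membership checks, no compression,
# no serialization; the whole dataset lives in RAM.
# These sample base PURLs are illustrative only.
known_purls = {
    "pkg:npm/lodash",
    "pkg:pypi/requests",
    "pkg:maven/org.apache.commons/commons-lang3",
}

def purl_exists(base_purl: str) -> bool:
    """Return True when the base PURL is in the reference set."""
    return base_purl in known_purls

print(purl_exists("pkg:pypi/requests"))       # True
print(purl_exists("pkg:pypi/not-a-package"))  # False
```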
+
+An interesting path could be to use built-in sets in Rust and Go by generating
+the code with all the PURL strings so that there is no specific deserialization
+step. The problem there is the size, as the data is not compressed.
+
+Built-in structures are useful as a reference for benchmarks but are not
+suitable as the main packaged data structure because they are too big.
+
+
+#### FST: finite state transducer
+
+
+
+An FST stores a sorted set of strings in a compact automaton. PURLs share common
+prefixes such as `pkg:npm/`, `pkg:pypi/`, and `pkg:maven/`. This sharing helps
+reduce stored data.
+
+FST lookup is exact for this use case. The Rust and Go implementations already
+ship an FST file. The library opens or embeds that file and performs membership
+checks without rebuilding the index.
+
+The main cost is build complexity. Input must be prepared, sorted, and encoded
+when the package data is refreshed.
+
+
+#### DAWG: directed acyclic word graph
+
+See
+
+This is aka a DAFSA (deterministic acyclic finite state automaton).
+
+
+A DAWG is a compact data structure for a set of strings. It can merge repeated
+prefixes and suffixes like an FST. The DAWG is interesting in that it can
+support prefix lookup, but in general a DAWG is bigger and slower than an FST,
+and has less mature and maintained library support.
+
+
+#### Bloom filter
+
+
+
+A Bloom filter can store a large set in a small space, but it is a probabilistic
+structure: it can answer that a value is surely absent or maybe present. In the
+latter case, you need an extra full dataset to further validate the "maybe";
+this is the problem of false positives with these filters. Hence a Bloom filter
+cannot be used as the only lookup structure, and does not make sense here.
+Instead, a Bloom filter could be used in front of an exact structure to skip
+some exact lookups as a performance optimization, but outside of the validator.
+
+
+#### SQLite
+
+
+
+SQLite can store PURLs in a SQL table with an index for exact lookup.
+
+The tradeoff is operational weight.
Each SQLite language binding adds a
+dependency (though SQLite is built into Python). The validator only needs
+immutable membership checks, not the full power of SQL queries and update
+transactions; on the other hand, the same SQLite DB could be shared across all
+languages.
+
+SQLite could be useful as a benchmark and debugging format. It is not the first
+choice for a small language library because the data is not compressed, but it
+is a likely future enhancement.
+
+
+### Preferred solution: FST
+
+Based on the benchmarks and other criteria, let's use an FST-backed lookup for
+all languages. Do not use a Bloom filter (probabilistic). Do not use native
+structures that use too much memory.
+
+And for the library selection, we have these high level requirements:
+
+- We want exact results without false positives, i.e., no Bloom filter.
+- Offline use with no network is a must: the dataset must be bundled in the
+  releases.
+- With build-time index construction, the construction time is not critical.
+- The bundled index should be small enough to ship below the crates.io and PyPI
+  archive size limits.
+- No rebuild at startup/runtime, and fast enough load time from disk, ideally
+  memory-mapped.
+- Fast enough lookup.
+- Libraries should be maintained, active FOSS for Rust/Go/Python.
+
+The final selected FST libraries are:
+
+- Rust: fst crate with a memory-mapped set
+- Python: ducer with a memory-mapped, dict-like map
+  (ducer uses the Rust fst crate inside)
+- Go: vellum "fst" module (originally from
+  now at
+  ) which is mostly inspired by the
+  Rust fst crate
+
+
+## Appendix: Benchmarks
+
+This directory contains evaluation and benchmark files for PurlValidator.
+
+It compares structures for offline PURL membership checks using these
+implementations:
+
+- Python: memory-mapped `ducer`.
+- Rust: crate `fst`.
+- Go: embedded Vellum FST.
+
+...
as well as the built-in Python set and dict, SQLite, and a Rust DAWG.
+
+### Expected checkout layout
+
+Run the scripts from a directory with these repository checkouts:
+
+- `/purl-validator`
+- `/purl-validator.rs`
+- `/purlvalidator-go`
+
+### benchmarking FST vs. DAWG
+
+There is a good benchmark in Go comparing FST and DAWG data structures (and
+other structures) that highlights why an FST is a better structure for our case
+than a DAWG:
+
+
+
+We also did a simple synthetic benchmark of the Rust fst and dawg crates with
+actual base PURLs, using the data in
+
+
+The `etc/bench/rust-fst-dawg-bench` code compares these fst and dawg crates.
+
+The dataset profile has 2,324,119 unique sorted base PURLs. The benchmark runs
+1M queries, of which 500K are expected to fail.
+
+- The fst crate index was built in 11s, with a 26MB serialized file, and took
+  0.703s for 1M lookups.
+- The dawg crate index was built in 18s, with an 831MB serialized file, and took
+  28s for 1M lookups.
+
+The outcome is that the preferred structure is an FST over a DAWG (at least
+with these implementations).
+
+### benchmarking FST against builtin and SQLite
+
+Since we picked the FST as the winner, additional review has focused on Python,
+comparing the ducer fst library against other approaches. Since ducer is based
+on the Rust fst crate, and Go's vellum also follows the fst design, this
+essentially covers the three languages at once.
+
+The `etc/bench/alternative_benchmark.py` script compares Python lookup
+using a text file with one PURL per line for these candidates:
+
+- Python `set`.
+- Python `dict`.
+- Python sorted list plus `bisect`.
+- In-memory SQLite.
+- FST using a `ducer.Map`.
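As an illustration of the sorted list plus `bisect` candidate, here is a minimal sketch of the binary-search membership test the script uses (the sample base PURLs are made up for illustration):

```python
from bisect import bisect_left

# Sorted-list membership via binary search: O(log n) lookups over a
# plain Python list. The sample base PURLs below are illustrative only.
purls = sorted([
    "pkg:maven/org.apache.commons/commons-lang3",
    "pkg:npm/lodash",
    "pkg:pypi/requests",
])

def contains(value: str) -> bool:
    """Binary-search the sorted list for an exact match."""
    index = bisect_left(purls, value)
    return index != len(purls) and purls[index] == value

print(contains("pkg:npm/lodash"))  # True
print(contains("pkg:npm/loda"))    # False
```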
+
+Data is from `purl-validator.rs/fst_builder/data/`.
+
+Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:
+
+```text
+structure            build (secs) lookup (secs)  storage size
+-------------------- ------------ -------------- ---------------------------
+python set           0.206540     0.275906       304MB in RAM
+python dict          0.449625     0.429034       298MB in RAM
+ducer FST            3.700943     1.805585       26MB on disk
+sorted list+bisect   0.017540     2.783555       236MB in RAM
+sqlite in memory     4.855480     4.220032       207MB on disk (or 65MB with zstd)
+```
+
+### benchmarking FST in Python vs. Go vs. Rust
+
+This benchmark runs each of the three released validator implementations. The
+script is in `etc/bench/go-rust-py_benchmark.py`.
+
+Data is from `purl-validator.rs/fst_builder/data/`.
+
+Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs:
+
+```text
+structure              build (secs) lookup (secs)  storage size (on disk)
+---------------------- ------------ -------------- ---------------------------
+Python purl-validator  16.664847    4.926029       25MB
+Rust purl-validator.rs 11.849877    0.348128       25MB
+Go purlvalidator-go    2.325181     0.704749       25MB
+```
+
+### Evaluation
+
+The results are consistent with expectations: Rust is faster than Go and Python.
+
+And the Python on-disk FST is the same size as the Rust FST (since this is the
+same backing code).
+
+Some surprises:
+
+- The Go index build is the fastest, which is surprising and could be an
+  avenue of improvement for the Rust fst crate.
+
+- Leaving aside the 10x larger RAM need, the Python set and dict are competitive
+  speed-wise (faster than the on-disk Rust FST) and super fast to build too.
+
diff --git a/etc/bench/alternative_benchmark.py b/etc/bench/alternative_benchmark.py
new file mode 100644
index 0000000..90b0fcd
--- /dev/null
+++ b/etc/bench/alternative_benchmark.py
@@ -0,0 +1,231 @@
+#!/usr/bin/env python3
+"""
+Benchmark of set data structures for PURLs.
+""" + +from __future__ import annotations + +import argparse +import mmap +import random +import sqlite3 +import sys +import tempfile +import time + +from bisect import bisect_left +from dataclasses import dataclass +from pathlib import Path + +import ducer + + +@dataclass +class Result: + name: str + build_seconds: float + lookup_seconds: float + hits: int + storage: str + + +def iter_input_files(path: Path) -> list[Path]: + if path.is_dir(): + return sorted(path.glob("*.txt")) + return [path] + + +def load_purls(path: Path, limit: int | None) -> tuple[list[str], int, int]: + purls: list[str] = [] + raw_count = 0 + for input_file in iter_input_files(path): + with input_file.open("r", encoding="utf-8", errors="replace") as lines: + for line in lines: + purl = line.strip() + if not purl: + continue + raw_count += 1 + purls.append(purl) + if limit and len(purls) >= limit: + unique = sorted(set(purls)) + return unique, raw_count, len(iter_input_files(path)) + unique = sorted(set(purls)) + return unique, raw_count, len(iter_input_files(path)) + + +def time_calls(name: str, lookup, queries: list[str]) -> tuple[str, int, float]: + start = time.perf_counter() + hits = sum(1 for query in queries if lookup(query)) + elapsed = time.perf_counter() - start + return name, hits, elapsed + + +def benchmark_set(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = set(purls) + build_seconds = time.perf_counter() - start + name, hits, elapsed = time_calls("python_set", values.__contains__, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_dict(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = dict.fromkeys(purls, 1) + build_seconds = time.perf_counter() - start + name, hits, elapsed = time_calls("python_dict", values.__contains__, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_sorted_list(purls: 
list[str], queries: list[str]) -> Result: + start = time.perf_counter() + values = list(purls) + build_seconds = time.perf_counter() - start + + def contains(value: str) -> bool: + index = bisect_left(values, value) + return index != len(values) and values[index] == value + + name, hits, elapsed = time_calls("sorted_list_bisect", contains, queries) + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_sqlite(purls: list[str], queries: list[str]) -> Result: + start = time.perf_counter() + connection = sqlite3.connect(":memory:") + connection.execute("CREATE TABLE purls (purl TEXT PRIMARY KEY)") + connection.executemany("INSERT INTO purls (purl) VALUES (?)", ((purl,) for purl in purls)) + connection.commit() + build_seconds = time.perf_counter() - start + + def contains(value: str) -> bool: + row = connection.execute("SELECT 1 FROM purls WHERE purl = ?", (value,)).fetchone() + return row is not None + + name, hits, elapsed = time_calls("sqlite_memory", contains, queries) + connection.close() + return Result(name, build_seconds, elapsed, hits, "no disk artifact") + + +def benchmark_ducer(purls: list[str], queries: list[str]) -> Result | None: + + with tempfile.TemporaryDirectory() as temp_dir: + map_path = Path(temp_dir) / "purls.map" + entries = [(purl.encode("utf-8"), 1) for purl in purls] + start = time.perf_counter() + ducer.Map.build(map_path, entries) + build_seconds = time.perf_counter() - start + with map_path.open("rb") as map_file: + mapped = mmap.mmap(map_file.fileno(), 0, access=mmap.ACCESS_READ) + purl_map = ducer.Map(mapped) + + def contains(value: str) -> bool: + return bool(purl_map.get(value.encode("utf-8"))) + + name, hits, elapsed = time_calls("ducer_map", contains, queries) + return Result(name, build_seconds, elapsed, hits, f"{map_path.stat().st_size} bytes") + + +def make_queries(purls: list[str], count: int) -> list[str]: + if not purls: + return [] + hit_count = count // 2 + miss_count = count - hit_count + 
hits = [random.choice(purls) for _ in range(hit_count)]
+    misses = [f"{random.choice(purls)}-missing-{index}" for index in range(miss_count)]
+    queries = hits + misses
+    random.shuffle(queries)
+    return queries
+
+
+def format_report(
+    input_path: Path,
+    file_count: int,
+    raw_count: int,
+    purls: list[str],
+    queries: list[str],
+    results: list[Result],
+    load_seconds: float,
+    seed: int,
+) -> str:
+    lines = [
+        "PurlValidator lookup structure results",
+        "========================================",
+        "",
+        f"Input path: {input_path}",
+        f"Input files: {file_count}",
+        f"Input load seconds: {load_seconds:.6f}",
+        f"Raw PURL lines: {raw_count}",
+        f"Unique PURLs: {len(purls)}",
+        f"Lookup queries: {len(queries)}",
+        f"Expected hits: {len(queries) // 2}",
+        f"Random seed: {seed}",
+        "",
+        "Results",
+        "-------",
+        "",
+        f"{'structure':<20} {'build_s':>12} {'lookup_s':>12} {'hits':>10} {'storage':>18}",
+        f"{'-' * 20} {'-' * 12} {'-' * 12} {'-' * 10} {'-' * 18}",
+    ]
+    for result in results:
+        lines.append(
+            f"{result.name:<20} "
+            f"{result.build_seconds:>12.6f} "
+            f"{result.lookup_seconds:>12.6f} "
+            f"{result.hits:>10} "
+            f"{result.storage:>18}"
+        )
+    return "\n".join(lines) + "\n"
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--input",
+        required=True,
+        type=Path,
+        help="Directory of text files with one PURL per line.",
+    )
+    parser.add_argument("--limit", type=int, default=100000, help="Maximum PURLs to load.")
+    parser.add_argument("--queries", type=int, default=20000, help="Number of lookup queries.")
+    parser.add_argument("--seed", type=int, default=1, help="Random seed for reproducible queries.")
+    parser.add_argument("--report", type=Path, help="Write report to this file.")
+    args = parser.parse_args()
+
+    random.seed(args.seed)
+    start = time.perf_counter()
+    purls, raw_count, file_count = load_purls(args.input, args.limit)
+    load_seconds = time.perf_counter() - start
+    if not purls:
+        print(f"No PURLs in {args.input}", file=sys.stderr)
+        return 1
+
+    queries =
make_queries(purls, args.queries)
+    results = [
+        benchmark_set(purls, queries),
+        benchmark_dict(purls, queries),
+        benchmark_sorted_list(purls, queries),
+        benchmark_sqlite(purls, queries),
+    ]
+    ducer_result = benchmark_ducer(purls, queries)
+    if ducer_result:
+        results.append(ducer_result)
+
+    report = format_report(
+        input_path=args.input,
+        file_count=file_count,
+        raw_count=raw_count,
+        purls=purls,
+        queries=queries,
+        results=results,
+        load_seconds=load_seconds,
+        seed=args.seed,
+    )
+    print(report, end="")
+    if args.report:
+        args.report.parent.mkdir(parents=True, exist_ok=True)
+        args.report.write_text(report, encoding="utf-8")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/etc/bench/go-rust-py_benchmark.py b/etc/bench/go-rust-py_benchmark.py
new file mode 100644
index 0000000..0400685
--- /dev/null
+++ b/etc/bench/go-rust-py_benchmark.py
@@ -0,0 +1,473 @@
+#!/usr/bin/env python3
+"""
+Benchmark the Python, Rust, and Go PurlValidator implementations.
+
+The benchmark uses the PURL source data from purl-validator.rs/fst_builder/data.
+And checks: + +- time to build each index +- index size on disk +- time to run 1,000,000 lookups, with half known and half unknown PURLs + +""" + +from __future__ import annotations + +import argparse +import importlib.util +import json +import os +from pathlib import Path +import random +import shutil +import subprocess +import sys +import tempfile +import textwrap +import time + + +WORKSPACE = Path("workspace") +PYTHON_REPO = WORKSPACE / "purl-validator" +RUST_REPO = WORKSPACE / "purl-validator.rs" +GO_REPO = WORKSPACE / "purlvalidator-go" +DEFAULT_DATA_DIR = RUST_REPO / "fst_builder/data" +DEFAULT_REPORT = WORKSPACE / "benchmark-report.md" +DEFAULT_WORK_DIR = WORKSPACE / "benchmark-tmp" + + + +class BenchmarkError(Exception): + pass + + +def run_command(command: list[str], cwd: Path, env: dict[str, str] | None = None) -> None: + completed = subprocess.run( + command, + cwd=cwd, + env=env, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + ) + if completed.returncode: + output = completed.stdout[-4000:] + raise BenchmarkError( + f"Command failed with exit code {completed.returncode}: {' '.join(command)}\n{output}" + ) + + +def timed(function): + start = time.perf_counter() + result = function() + return time.perf_counter() - start, result + + +def read_purls(data_dir: Path) -> tuple[list[str], int, int]: + purls: list[str] = [] + files = sorted(data_dir.glob("*.txt")) + raw_count = 0 + for path in files: + with path.open("r", encoding="utf-8", errors="replace") as lines: + for line in lines: + purl = line.strip() + if not purl: + continue + raw_count += 1 + purls.append(purl) + return sorted(set(purls)), raw_count, len(files) + + +def write_queries(purls: list[str], query_count: int, seed: int, path: Path) -> None: + rng = random.Random(seed) + hit_count = query_count // 2 + miss_count = query_count - hit_count + known_lookup_purls = [purl for purl in purls if purl.startswith("pkg:pypi/")] + if not known_lookup_purls: + known_lookup_purls = 
purls
+    queries = [rng.choice(known_lookup_purls) for _ in range(hit_count)]
+    queries.extend(f"pkg:npm/purl-validator-benchmark-unknown-{index:07}" for index in range(miss_count))
+    rng.shuffle(queries)
+    path.write_text("\n".join(queries) + "\n", encoding="utf-8")
+
+
+def copy_data_files(data_dir: Path, target_dir: Path) -> None:
+    if data_dir.resolve() == target_dir.resolve():
+        return
+    if target_dir.exists():
+        shutil.rmtree(target_dir)
+    target_dir.mkdir(parents=True)
+    for source in sorted(data_dir.glob("*.txt")):
+        shutil.copy2(source, target_dir / source.name)
+
+
+def write_rust_lookup_harness(work_dir: Path, fst_path: Path, query_path: Path) -> Path:
+    project = work_dir / "rust_lookup"
+    if project.exists():
+        shutil.rmtree(project)
+    (project / "src").mkdir(parents=True)
+    (project / "Cargo.toml").write_text(
+        textwrap.dedent(
+            f"""
+            [package]
+            name = "purl-validator-rust-lookup-bench"
+            version = "0.1.0"
+            edition = "2024"
+
+            [dependencies]
+            fst = "0.4.7"
+            packageurl = "0.6.0"
+            """
+        ).strip()
+        + "\n",
+        encoding="utf-8",
+    )
+    (project / "src/main.rs").write_text(
+        textwrap.dedent(
+            f"""
+            use fst::Set;
+            use packageurl::PackageUrl;
+            use std::fs;
+            use std::str::FromStr;
+            use std::time::Instant;
+
+            fn main() -> Result<(), Box<dyn std::error::Error>> {{
+                let fst_data = fs::read({json.dumps(str(fst_path))})?;
+                let set = Set::new(fst_data.as_slice())?;
+                let queries = fs::read_to_string({json.dumps(str(query_path))})?;
+                let start = Instant::now();
+                let mut hits = 0usize;
+                for query in queries.lines() {{
+                    let purl = PackageUrl::from_str(query)?;
+                    if purl.version().is_some()
+                        || !purl.qualifiers().is_empty()
+                        || purl.subpath().is_some()
+                    {{
+                        return Err("unsupported PURL".into());
+                    }}
+                    let key = query.trim_end_matches('/');
+                    if set.contains(key) {{
+                        hits += 1;
+                    }}
+                }}
+                println!("hits={{}}", hits);
+                println!("lookup_seconds={{:.6}}", start.elapsed().as_secs_f64());
+                Ok(())
+            }}
+            """
+        ).strip()
+        + "\n",
+        encoding="utf-8",
+    )
+    return project
+ + +def write_go_lookup_harness(work_dir: Path, fst_path: Path, query_path: Path) -> Path: + project = work_dir / "go_lookup" + if project.exists(): + shutil.rmtree(project) + project.mkdir(parents=True) + (project / "go.mod").write_text( + textwrap.dedent( + """ + module purl-validator-go-lookup-bench + + go 1.22.3 + + require ( + github.com/blevesearch/vellum v1.1.0 + github.com/package-url/packageurl-go v0.1.5 + ) + """ + ).strip() + + "\n", + encoding="utf-8", + ) + (project / "main.go").write_text( + textwrap.dedent( + f""" + package main + + import ( + "bufio" + "fmt" + "os" + "time" + + "github.com/blevesearch/vellum" + packageurl "github.com/package-url/packageurl-go" + ) + + func main() {{ + data, err := os.ReadFile({json.dumps(str(fst_path))}) + if err != nil {{ + panic(err) + }} + fstMap, err := vellum.Load(data) + if err != nil {{ + panic(err) + }} + file, err := os.Open({json.dumps(str(query_path))}) + if err != nil {{ + panic(err) + }} + defer file.Close() + + start := time.Now() + hits := 0 + scanner := bufio.NewScanner(file) + scanner.Buffer(make([]byte, 1024), 1024*1024) + for scanner.Scan() {{ + query := scanner.Text() + instance, err := packageurl.FromString(query) + if err != nil {{ + panic(err) + }} + if instance.Version != "" || len(instance.Qualifiers) > 0 || instance.Subpath != "" {{ + panic("unsupported PURL") + }} + ok, err := fstMap.Contains([]byte(query)) + if err != nil {{ + panic(err) + }} + if ok {{ + hits++ + }} + }} + if err := scanner.Err(); err != nil {{ + panic(err) + }} + fmt.Printf("hits=%d\\n", hits) + fmt.Printf("lookup_seconds=%.6f\\n", time.Since(start).Seconds()) + }} + """ + ).strip() + + "\n", + encoding="utf-8", + ) + return project + + +def parse_lookup_output(output_path: Path) -> tuple[int, float]: + text = output_path.read_text(encoding="utf-8") + hits = None + seconds = None + for line in text.splitlines(): + if line.startswith("hits="): + hits = int(line.split("=", 1)[1]) + if line.startswith("lookup_seconds="): 
+ seconds = float(line.split("=", 1)[1]) + if hits is None or seconds is None: + raise BenchmarkError(f"Cannot parse lookup output from {output_path}:\n{text}") + return hits, seconds + + +def benchmark_python(purls: list[str], query_path: Path, work_dir: Path) -> dict[str, object]: + sys.path.insert(0, str(PYTHON_REPO / "src")) + spec = importlib.util.spec_from_file_location( + "purl_validator", PYTHON_REPO / "src/purl_validator/__init__.py" + ) + if not spec or not spec.loader: + raise BenchmarkError("Cannot load purl_validator module") + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + + index = work_dir / "python-purls.map" + + def build(): + generated = Path(module.create_purl_map(purls)) + shutil.copy2(generated, index) + + build_seconds, _result = timed(build) + + validator = module.PurlValidator(index) + queries = query_path.read_text(encoding="utf-8").splitlines() + + def lookup(): + hits = 0 + start = time.perf_counter() + for query in queries: + if validator.validate_purl(query): + hits += 1 + return hits, time.perf_counter() - start + + hits, lookup_seconds = lookup() + return { + "name": "Python purl-validator", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": index.stat().st_size, + "hits": hits, + "index": index, + } + + +def benchmark_rust(data_dir: Path, query_path: Path, work_dir: Path) -> dict[str, object]: + copy_data_files(data_dir, RUST_REPO / "fst_builder/data") + index = RUST_REPO / "purls.fst" + if index.exists(): + index.unlink() + + run_command(["cargo", "build", "--release", "--bin", "fst_builder"], cwd=RUST_REPO) + builder = RUST_REPO / "target/release/fst_builder" + build_seconds, _result = timed(lambda: run_command([str(builder)], cwd=RUST_REPO)) + + copied_index = work_dir / "rust-purls.fst" + shutil.copy2(index, copied_index) + + harness = write_rust_lookup_harness(work_dir, copied_index, query_path) + run_command(["cargo", "build", "--release"], cwd=harness) + 
output_path = work_dir / "rust-lookup.out" + + def run_lookup(): + completed = subprocess.run( + [str(harness / "target/release/purl-validator-rust-lookup-bench")], + cwd=harness, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + check=True, + ) + output_path.write_text(completed.stdout, encoding="utf-8") + + run_lookup() + hits, lookup_seconds = parse_lookup_output(output_path) + return { + "name": "Rust purl-validator.rs", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": copied_index.stat().st_size, + "hits": hits, + "index": copied_index, + } + + +def benchmark_go(data_dir: Path, query_path: Path, work_dir: Path) -> dict[str, object]: + copy_data_files(data_dir, GO_REPO / "cmd/data") + index = GO_REPO / "purls.fst" + if index.exists(): + index.unlink() + + env = os.environ.copy() + env["PATH"] = f"/usr/local/go/bin:{env.get('PATH', '')}" + build_seconds, _result = timed( + lambda: run_command(["go", "run", "./cmd/main.go"], cwd=GO_REPO, env=env) + ) + + copied_index = work_dir / "go-purls.fst" + shutil.copy2(index, copied_index) + + harness = write_go_lookup_harness(work_dir, copied_index, query_path) + run_command(["go", "mod", "tidy"], cwd=harness, env=env) + run_command(["go", "build", "-o", "lookup-bench", "."], cwd=harness, env=env) + output_path = work_dir / "go-lookup.out" + + completed = subprocess.run( + [str(harness / "lookup-bench")], + cwd=harness, + env=env, + text=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + check=True, + ) + output_path.write_text(completed.stdout, encoding="utf-8") + hits, lookup_seconds = parse_lookup_output(output_path) + return { + "name": "Go purlvalidator-go", + "build_seconds": build_seconds, + "lookup_seconds": lookup_seconds, + "index_size": copied_index.stat().st_size, + "hits": hits, + "index": copied_index, + } + + +def mib(size: int) -> str: + return f"{size / 1024 / 1024:.2f} MiB" + + +def write_report( + report_path: Path, + data_dir: Path, + 
raw_count: int,
+    unique_count: int,
+    file_count: int,
+    query_count: int,
+    seed: int,
+    results: list[dict[str, object]],
+) -> None:
+    lines = [
+        "# PurlValidator implementation benchmark",
+        "",
+        "This benchmark uses the data from `purl-validator.rs/fst_builder/data/`.",
+        "",
+        "Input summary:",
+        "",
+        f"- Data directory: `{data_dir}`",
+        f"- Input files: `{file_count}`",
+        f"- Raw PURL lines: `{raw_count}`",
+        f"- Unique PURLs: `{unique_count}`",
+        f"- Lookup queries: `{query_count}`",
+        f"- Expected known PURLs: `{query_count // 2}`",
+        f"- Expected unknown PURLs: `{query_count - (query_count // 2)}`",
+        f"- Query seed: `{seed}`",
+        "",
+        "Results:",
+        "",
+    ]
+    for result in results:
+        size = int(result["index_size"])
+        lines.extend(
+            [
+                f"## {result['name']}",
+                "",
+                f"- Build time: `{float(result['build_seconds']):.6f}` seconds",
+                f"- Lookup time: `{float(result['lookup_seconds']):.6f}` seconds",
+                f"- Lookup hits: `{int(result['hits'])}`",
+                f"- Lookup index size: `{size}` bytes, `{mib(size)}`",
+                f"- Lookup index: `{result['index']}`",
+                "",
+            ]
+        )
+    report_path.parent.mkdir(parents=True, exist_ok=True)
+    report_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data-dir", type=Path, default=DEFAULT_DATA_DIR)
+    parser.add_argument("--work-dir", type=Path, default=DEFAULT_WORK_DIR)
+    parser.add_argument("--report", type=Path, default=DEFAULT_REPORT)
+    parser.add_argument("--queries", type=int, default=1000000)
+    parser.add_argument("--seed", type=int, default=1)
+    args = parser.parse_args()
+
+    args.work_dir.mkdir(parents=True, exist_ok=True)
+    query_path = args.work_dir / "queries.txt"
+
+    purls, raw_count, file_count = read_purls(args.data_dir)
+    write_queries(purls, args.queries, args.seed, query_path)
+
+    results = [
+        benchmark_python(purls, query_path, args.work_dir),
+        benchmark_rust(args.data_dir, query_path, args.work_dir),
+        benchmark_go(args.data_dir, query_path,
args.work_dir),
+    ]
+
+    write_report(
+        report_path=args.report,
+        data_dir=args.data_dir,
+        raw_count=raw_count,
+        unique_count=len(purls),
+        file_count=file_count,
+        query_count=args.queries,
+        seed=args.seed,
+        results=results,
+    )
+    print(args.report)
+    return 0
+
+
+if __name__ == "__main__":
+    main()
diff --git a/etc/bench/rust-fst-dawg-bench/.gitignore b/etc/bench/rust-fst-dawg-bench/.gitignore
new file mode 100644
index 0000000..ea8c4bf
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/.gitignore
@@ -0,0 +1 @@
+/target
diff --git a/etc/bench/rust-fst-dawg-bench/Cargo.lock b/etc/bench/rust-fst-dawg-bench/Cargo.lock
new file mode 100644
index 0000000..6f367d4
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/Cargo.lock
@@ -0,0 +1,108 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 4
+
+[[package]]
+name = "bincode"
+version = "1.3.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "b1f45e9417d87227c7a56d22e471c6206462cba514c7590c09aff4cf6d1ddcad"
+dependencies = [
+ "serde",
+]
+
+[[package]]
+name = "dawg"
+version = "0.0.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "81c5a800721e14d9d89e9b5b2b8920e341c19ab78687ca8ab25c393f7eddf036"
+dependencies = [
+ "serde",
+ "unicode-segmentation",
+]
+
+[[package]]
+name = "fst"
+version = "0.4.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "7ab85b9b05e3978cc9a9cf8fea7f01b494e1a09ed3037e16ba39edc7a29eb61a"
+
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+
+[[package]]
+name = "quote"
+version = "1.0.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
+dependencies = [
+ "proc-macro2",
+]
+
+[[package]]
+name = "rust-fst-dawg-bench"
+version = "0.1.0"
+dependencies = [
+ "bincode",
+ "dawg",
+ "fst",
+]
+
+[[package]]
+name = "serde"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
+dependencies = [
+ "serde_core",
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_core"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
+dependencies = [
+ "serde_derive",
+]
+
+[[package]]
+name = "serde_derive"
+version = "1.0.228"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]
+
+[[package]]
+name = "syn"
+version = "2.0.117"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+
+[[package]]
+name = "unicode-segmentation"
+version = "1.13.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9629274872b2bfaf8d66f5f15725007f635594914870f65218920345aa11aa8c"
diff --git a/etc/bench/rust-fst-dawg-bench/Cargo.toml b/etc/bench/rust-fst-dawg-bench/Cargo.toml
new file mode 100644
index 0000000..859234c
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/Cargo.toml
@@ -0,0 +1,9 @@
+[package]
+name = "rust-fst-dawg-bench"
+version = "0.1.0"
+edition = "2024"
+
+[dependencies]
+bincode = "1.3"
+dawg = "0.0.7"
+fst = "0.4.7"
diff --git a/etc/bench/rust-fst-dawg-bench/src/main.rs b/etc/bench/rust-fst-dawg-bench/src/main.rs
new file mode 100644
index 0000000..301f1a9
--- /dev/null
+++ b/etc/bench/rust-fst-dawg-bench/src/main.rs
@@ -0,0 +1,172 @@
+use std::fs;
+use std::fs::File;
+use std::io::{BufRead, BufReader, BufWriter};
+use std::path::{Path, PathBuf};
+use std::time::{Duration, Instant};
+
+
+use dawg::Dawg;
+use fst::raw::Fst;
+use fst::SetBuilder;
+
+// Simple benchmark to compare PURL lookup using a DAWG or an FST
+
+const N_LOOKUPS: usize = 1_000_000;
+const OUT_DIR: &str = "target/purl-bench";
+const PURL_DATA_DIR: &str = "purl-validator.rs/fst_builder/data";
+
+struct BenchResult {
+    name: &'static str,
+    build_time: Duration,
+    disk_bytes: u64,
+    lookup_time: Duration,
+    hits: usize,
+}
+
+/// Collect all PURL files (each with one PURL per line)
+fn purl_files(path: &Path) -> Result<Vec<PathBuf>, Box<dyn std::error::Error>> {
+    let mut files = fs::read_dir(path)?
+        .map(|entry| entry.map(|entry| entry.path()))
+        .collect::<Result<Vec<_>, _>>()?;
+    files.retain(|path| path.extension().and_then(|ext| ext.to_str()) == Some("txt"));
+    files.sort();
+    Ok(files)
+}
+
+fn load_purls(path: &Path) -> Result<(Vec<String>, usize), Box<dyn std::error::Error>> {
+    let mut keys = Vec::new();
+    let mut raw_count = 0;
+    for file_path in purl_files(path)? {
+        let file = File::open(&file_path)?;
+        let reader = BufReader::new(file);
+        for line in reader.lines() {
+            let line = line?;
+            if line.is_empty() {
+                continue;
+            }
+            raw_count += 1;
+            keys.push(line);
+        }
+    }
+    keys.sort();
+    Ok((keys, raw_count))
+}
+
+fn build_queries(keys: &[String]) -> Vec<String> {
+    let mut queries = Vec::with_capacity(N_LOOKUPS);
+    let half = N_LOOKUPS / 2;
+    let n_keys = keys.len();
+    for i in 0..half {
+        queries.push(keys[(i * 9_973) % n_keys].clone());
+        queries.push(format!("{}-missing-{}", keys[(i * 15_485_863) % n_keys], i));
+    }
+    queries
+}
+
+/// Bench for the fst crate
+fn bench_fst(keys: &[String], queries: &[String]) -> Result<BenchResult, Box<dyn std::error::Error>> {
+    let path = Path::new(OUT_DIR).join("real-purls.fst");
+
+    let build_start = Instant::now();
+    {
+        let file = File::create(&path)?;
+        let mut builder = SetBuilder::new(file)?;
+        for key in keys {
+            builder.insert(key)?;
+        }
+        builder.finish()?;
+    }
+    let build_time = build_start.elapsed();
+    let disk_bytes = fs::metadata(&path)?.len();
+
+    let bytes = fs::read(&path)?;
+    let fst = Fst::new(bytes)?;
+    let lookup_start = Instant::now();
+    let hits = queries
+        .iter()
+        .filter(|query| fst.get(query.as_bytes()).is_some())
+        .count();
+    let lookup_time = lookup_start.elapsed();
+
+    Ok(BenchResult {
+        name: "fst::Set",
+        build_time,
+        disk_bytes,
+        lookup_time,
+        hits,
+    })
+}
+
+/// Bench for the dawg crate
+fn bench_dawg_crate(
+    keys: &[String],
+    queries: &[String],
+) -> Result<BenchResult, Box<dyn std::error::Error>> {
+    let path = Path::new(OUT_DIR).join("real-purls.dawg-bincode");
+
+    let build_start = Instant::now();
+    let mut dawg = Dawg::new();
+    for key in keys {
+        dawg.insert(key.clone());
+    }
+    dawg.finish();
+    let build_time = build_start.elapsed();
+
+    {
+        let file = File::create(&path)?;
+        let mut writer = BufWriter::new(file);
+        bincode::serialize_into(&mut writer, &dawg)?;
+    }
+    let disk_bytes = fs::metadata(&path)?.len();
+
+    let lookup_start = Instant::now();
+    let hits = queries
+        .iter()
+        .filter(|query| dawg.is_word(query.as_str(), true).is_some())
+        .count();
+    let lookup_time = lookup_start.elapsed();
+
+    Ok(BenchResult {
+        name: "dawg::Dawg",
+        build_time,
+        disk_bytes,
+        lookup_time,
+        hits,
+    })
+}
+
+fn print_measurement(measurement: &BenchResult) {
+    println!(
+        "| {} | {:.3} | {} | {:.3} | {} |",
+        measurement.name,
+        measurement.build_time.as_secs_f64(),
+        measurement.disk_bytes,
+        measurement.lookup_time.as_secs_f64(),
+        measurement.hits,
+    );
+}
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    fs::create_dir_all(OUT_DIR)?;
+
+    println!("Loading PURLs from {PURL_DATA_DIR}");
+    let load_start = Instant::now();
+    let (keys, raw_count) = load_purls(Path::new(PURL_DATA_DIR))?;
+    let load_time = load_start.elapsed();
+    let queries = build_queries(&keys);
+    println!("Unique sorted keys: {}", keys.len());
+    println!("Input load/sort seconds: {:.3}", load_time.as_secs_f64());
+    println!("Lookup queries: {N_LOOKUPS}");
+    println!("Expected hits: {}", N_LOOKUPS / 2);
+    println!();
+    println!("| structure | build seconds | disk bytes | lookup seconds | hits |");
+    println!("| --- | ---: | ---: | ---: | ---: |");
+
+    let fst = bench_fst(&keys, &queries)?;
+    print_measurement(&fst);
+
+    let dawg = bench_dawg_crate(&keys, &queries)?;
+    print_measurement(&dawg);
+
+    Ok(())
+}