From cb78d4e05215b5888e6cb0a1777efe9115186bd7 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Fri, 15 May 2026 18:51:24 +0200 Subject: [PATCH 1/4] Rename README to rst Signed-off-by: Philippe Ombredanne --- etc/bench/README.md | 258 --------------------------------------- etc/bench/README.rst | 284 +++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 284 insertions(+), 258 deletions(-) delete mode 100644 etc/bench/README.md create mode 100644 etc/bench/README.rst diff --git a/etc/bench/README.md b/etc/bench/README.md deleted file mode 100644 index c28e595..0000000 --- a/etc/bench/README.md +++ /dev/null @@ -1,258 +0,0 @@ -# PurlValidator data structure evaluation - -This document details the research and evaluation of various efficient data -structures for compact PURLs storage and lookup. - -It contains: - -- reference to evaluation/bench scripts -- documentation on the various libraries and data structures under consideration -- the final choice (spoiler an FST, aka. finite state transducer) - - -## Context and Problem - -PurlValidator needs a local queryable dataset of known PURLs to answer one question: - -> Does this PURL exist in the reference dataset? - -The lookup index should be built for each release, and shipped with the library -for access without a network connection. And we want a Go, Rust and Python -implementation. The PURls themselves are collected using PurlDB and FederatedCode. - - -## Solution - -### High level design - -The lookup key is a PURL, cleaned to only keep type, namespace, and name, -(without version, qualifiers and subpath) - -This keeps validation focused for now. 
Version validation could come later by -extending indexed PURLs with version or baking in support VERS version parsing -for validation - -### Solution elements: Data structures considered - -- Built-in set and map -- FST -- DAWG -- Bloom filter -- SQLite - -Considered but not evaluated: - -- Minimal perfect hash: no compression -- Trie or radix tree: DAWG and FST are similar, but are more compact. Suffix - trees are way too big. - -#### Built-in set and map - -Built-in sets and maps are the simplest baseline in each language, they are as -fast as can be, but they have no compression and no built-in serialization or -memory mapping, and memory use grows quickly for large datasets. - -An interesting path could be to use built-in sets in Rust and Go generating the -code with all the PURL strings so that there is no specific deserialization. The -porblem there is the size as the data is not compressed. - -Built-ins structures are useful for benchmarks as reference but are not suitable -as the main packaged data structure because they are too big. - - -#### FST: finite state transducer - - - -An FST stores a sorted set of strings in a compact automaton. PURLs share common -prefixes such as `pkg:npm/`, `pkg:pypi/`, and `pkg:maven/`. This sharing helps -reduce stored data. - -FST lookup is exact for this use case. The Rust and Go implementations already -ship an FST file. The library opens or embeds that file and performs membership -checks without rebuilding the index. - -The main cost is build complexity. Input must be prepared, sorted, and encoded -when the package data is refreshed. - - -#### DAWG: directed acyclic word graph - -See - -this is aka. DAFSA - - -A DAWG is a compact data structure for a set of strings. It can merge repeated -prefixes and suffixes like an FST. The DAWG is interesting in that it can -support prefix lookup, but in general the DAWG is bigger and slower than an FST, -and has fewer mature/maintained library support. 
- - -#### Bloom filter - - - -A Bloom filter can store a large set in a small space, but it is a probalistic -structure and can answer that a value is surely absent or maybe present. In that -later case, you need an extra full dataset to validate further the "maybe": this -is the problem of false positives with these filters, hence a Bloom filter -cannot not be used as the only lookup structure, and does not make sense here. -Instead, a Bloom filter could be used before an exact structure to skip some -exact lookups as performance optimization, but outside of the validator. - - -#### SQLite - - - -SQLite can store PURLs in a SQL table with an index for exact lookup. - -The tradeoff is operational weight. Each SQLite language binding adds a -dependency (though this is built in Python). The validator only needs immutable -membership checks, not SQL full power with queries, and update transactions; but -on the other hand the SQLite DB could be the same across all languages. - -SQLite could useful as a benchmark and debugging format. It is not the first -choice for a small language library because this is not compressed. But it will -be a future enhancement for sure. - - -### Preferred solution: FST - -Based on the benchmark and otrher criteria, let's use an FST-backed lookup for -every languages. Do not use a Bloom filter (probalistic). Do not use native -structures that use too much memory. - -And for the library selection, we have these high level requirements: - -- We want exact result without false positives, e.g., no bloom filter. -- Offline use, with no network is a must: the dataset must be bundled in the - releases. -- With build time index construction, the construction time is not critical. -- The bundled index should be small enough to ship below crates, and Pypi - archive size limits. -- No rebuild at startup/runtime, and fast enough load time from disk, ideally - memory-mapped. -- Fast enough lookup. 
-- Libraries should be maintained, active FOSS for Rust/Go/Python. - -The final selected FST libraries are: - -- Rust: fst crate with a memory-mapped set -- Python: ducer with a memory-mapped map, dict-like - (ducer uses the Rust fst crate inside) -- Go: vellum "fst" module (originally from - now at - ) which is mostly inspired from the - Rust fst crate - - -## Appendix: Benchmarks - -This directory contains evaluation and benchmark files for PurlValidator. - -It compares structures for offline PURL membership checks with these -implementations use: - -- Python: memory-mapped `ducer`. -- Rust: crate `fst`. -- Go: embedded Vellum FST. - -... as well as the builtin Python set and dict, SQLite and a Rust DAWG - -### Expected checkout layout - -Run the scripts from a directory with these repositories checkouts: - -- `/purl-validator` -- `/purl-validator.rs` -- `/purlvalidator-go` - -### benchmarking FST vs. DAWG - -There is a good benchmarch in Go comparing FST and DAWG data structures (and -other structures) that highlights why an FST is a better structure for our cases -than a DAWG: - - - -We also did a simple synthetic benchmark of the Rust fst and dawg crates using -actual base PURLs using the data in - - -The `etc/bench/rust-fst-dawg-bench` code compare these fst and dawg crates. - -The dataset profile has 2,324,119 unique sorted base PURL. The benchmark is to -run 1M queries, where 500K are expected to fail. - -- The fst crate index was built in 11s, with a 26MB serialized file, and took - 0.703s for 1M lookups. -- The dawg crate index was built in 18s, with a 831MB serialized file, and took - 28s for 1M lookups. - -The outcome is that the preferred structure is an FST over a DAWG (at least -with these implementations). - -### benchmarking FST against builtin and SQLite - -Since we picked the FST as the winner, additional review has been focused on -Python by comparing the ducer fst library against other approaches. 
Since it is -based on the Rust fst and Go's vellum is also based on the fst design, we cover -essentially the three languages at once. - -The `etc/scripts/bench/alternative_benchmark.py` script compares Python lookup -using a text file with one PURL per line for these candidates: - -- Python `set`. -- Python `dict`. -- Python Sorted list plus `bisect`. -- In-memory SQLite. -- FST using a `ducer.Map`. - -Data is from `purl-validator.rs/fst_builder/data/` - -Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs: - -```text -structure build (secs) lookup (secs) storage size --------------------- ------------ -------------- --------------------------- -python set 0.206540 0.275906 304MB in RAM -python dict 0.449625 0.429034 298MB in RAM -ducer FST 3.700943 1.805585 26MB on disk -sorted list+bisect 0.017540 2.783555 236MB in RAM -sqlite in memory 4.855480 4.220032 207MB on disk (or 65MB with zstd) -``` - -### benchmarking FST in Python vs. Go vs. Rust - -This benchmark runs each of the three validator released implementations. The -script is in `etc/scripts/bench/go-rust-py_benchmark.py` - -Data is from `purl-validator.rs/fst_builder/data/` - -Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing PURLs: - -```text -structure build (secs) lookup (secs) storage size (ondisk) --------------------- ------------ -------------- --------------------------- -Python purl-validator 16.664847 4.926029 25MB -Rust purl-validator.rs 11.849877 0.348128 25MB -Go purlvalidator-go 2.325181 0.704749 25MB -``` - -### Evaluation - -The results are consistent with expectations: Rust is faster than Go and Python. - -And the Python on disk fst is the same size as the Rust fst (since this is the -same backing code). - -Some surprises: - -- The build of the Go index is the fastest which is surprising and could be an - avenue of improvement for the Rust fst crate. 
- -- Leaving aside the 10x larger RAM need, the Python set and dict are competitive - speed wise (faster than the on-disk Rust FST) ans super fast to build too. - diff --git a/etc/bench/README.rst b/etc/bench/README.rst new file mode 100644 index 0000000..677c3ac --- /dev/null +++ b/etc/bench/README.rst @@ -0,0 +1,284 @@ +PurlValidator data structure evaluation +======================================= + +This document details the research and evaluation of various efficient +data structures for compact PURLs storage and lookup. + +It contains: + +- reference to evaluation/bench scripts +- documentation on the various libraries and data structures under + consideration +- the final choice (spoiler an FST, aka. finite state transducer) + +Context and Problem +------------------- + +PurlValidator needs a local queryable dataset of known PURLs to answer +one question: + + Does this PURL exist in the reference dataset? + +The lookup index should be built for each release, and shipped with the +library for access without a network connection. And we want a Go, Rust +and Python implementation. The PURls themselves are collected using +PurlDB and FederatedCode. + +Solution +-------- + +High level design +~~~~~~~~~~~~~~~~~ + +The lookup key is a PURL, cleaned to only keep type, namespace, and +name, (without version, qualifiers and subpath) + +This keeps validation focused for now. Version validation could come +later by extending indexed PURLs with version or baking in support VERS +version parsing for validation + +Solution elements: Data structures considered +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- Built-in set and map +- FST +- DAWG +- Bloom filter +- SQLite + +Considered but not evaluated: + +- Minimal perfect hash: no compression +- Trie or radix tree: DAWG and FST are similar, but are more compact. + Suffix trees are way too big. 
+ +Built-in set and map +^^^^^^^^^^^^^^^^^^^^ + +Built-in sets and maps are the simplest baseline in each language. They +are as fast as can be, but they have no compression and no built-in +serialization or memory mapping, and memory use grows quickly for large +datasets. + +An interesting path could be to use built-in sets in Rust and Go, +generating the code with all the PURL strings so that there is no +specific deserialization. The problem there is the size, as the data is +not compressed. + +Built-in structures are useful for benchmarks as reference but are not +suitable as the main packaged data structure because they are too big. + +FST: finite state transducer +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +https://en.wikipedia.org/wiki/Finite-state_transducer + +An FST stores a sorted set of strings in a compact automaton. PURLs +share common prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and +``pkg:maven/``. This sharing helps reduce stored data. + +FST lookup is exact for this use case. The Rust and Go implementations +already ship an FST file. The library opens or embeds that file and +performs membership checks without rebuilding the index. + +The main cost is build complexity. Input must be prepared, sorted, and +encoded when the package data is refreshed. + +DAWG: directed acyclic word graph +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +See https://stevehanov.ca/blog/compressing-dictionaries-with-a-dawg + +This is also known as a DAFSA: +https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton + +A DAWG is a compact data structure for a set of strings. It can merge +repeated prefixes and suffixes like an FST. The DAWG is interesting in +that it can support prefix lookup, but in general the DAWG is bigger and +slower than an FST, and has fewer mature, maintained libraries.
+ +Bloom filter +^^^^^^^^^^^^ + +https://en.wikipedia.org/wiki/Bloom_filter + +A Bloom filter can store a large set in a small space, but it is a +probabilistic structure and can only answer that a value is surely absent or +maybe present. In the latter case, you need an extra full dataset to +further validate the “maybe”: this is the problem of false positives +with these filters, hence a Bloom filter cannot be used as the only +lookup structure, and does not make sense here. Instead, a Bloom filter +could be used before an exact structure to skip some exact lookups as a +performance optimization, but outside of the validator. + +SQLite +^^^^^^ + +https://sqlite.org/ + +SQLite can store PURLs in a SQL table with an index for exact lookup. + +The tradeoff is operational weight. Each SQLite language binding adds a +dependency (though this is built in Python). The validator only needs +immutable membership checks, not the full power of SQL with queries and update +transactions; but on the other hand the SQLite DB could be the same +across all languages. + +SQLite could be useful as a benchmark and debugging format. It is not the +first choice for a small language library because this is not +compressed. But it will be a future enhancement for sure. + +Preferred solution: FST +~~~~~~~~~~~~~~~~~~~~~~~ + +Based on the benchmark and other criteria, let’s use an FST-backed +lookup for every language. Do not use a Bloom filter (probabilistic). Do +not use native structures that use too much memory. + +And for the library selection, we have these high-level requirements: + +- We want exact results without false positives, e.g., no Bloom filter. +- Offline use, with no network, is a must: the dataset must be bundled + in the releases. +- With build-time index construction, the construction time is not + critical. +- The bundled index should be small enough to ship below the crates.io and + PyPI archive size limits.
+- No rebuild at startup/runtime, and fast enough load time from disk, + ideally memory-mapped. +- Fast enough lookup. +- Libraries should be maintained, active FOSS for Rust/Go/Python. + +The final selected FST libraries are: + +- Rust: fst crate with a memory-mapped set + https://github.com/BurntSushi/fst/ +- Python: ducer with a memory-mapped map, dict-like + https://github.com/jfolz/ducer (ducer uses the Rust fst crate inside) +- Go: vellum “fst” module (originally from + https://github.com/couchbase/vellum now at + https://github.com/blevesearch/vellum) which is mostly inspired by + the Rust fst crate + +Appendix: Benchmarks +-------------------- + +This directory contains evaluation and benchmark files for +PurlValidator. + +It compares structures for offline PURL membership checks using these +implementations: + +- Python: memory-mapped ``ducer``. +- Rust: crate ``fst``. +- Go: embedded Vellum FST. + +… as well as the builtin Python set and dict, SQLite and a Rust DAWG + +Expected checkout layout +~~~~~~~~~~~~~~~~~~~~~~~~ + +Run the scripts from a directory with these repository checkouts: + +- ``/purl-validator`` +- ``/purl-validator.rs`` +- ``/purlvalidator-go`` + +Benchmarking FST vs. DAWG +~~~~~~~~~~~~~~~~~~~~~~~~~ + +There is a good benchmark in Go comparing FST and DAWG data structures +(and other structures) that highlights why an FST is a better structure +for our cases than a DAWG: + +https://github.com/timurgarif/go-fsa-trie-bench + +We also did a simple synthetic benchmark of the Rust fst and dawg crates +with actual base PURLs, using the data in +https://github.com/aboutcode-org/purl-validator.rs/tree/main/fst_builder/data + +The ``etc/bench/rust-fst-dawg-bench`` code compares these fst and dawg +crates. + +The dataset profile has 2,324,119 unique sorted base PURLs. The benchmark +is to run 1M queries, where 500K are expected to fail. + +- The fst crate index was built in 11s, with a 26MB serialized file, + and took 0.703s for 1M lookups.
+- The dawg crate index was built in 18s, with a 831MB serialized file, + and took 28s for 1M lookups. + +The outcome is that the preferred structure is an FST over a DAWG (at +least with these implementations). + +benchmarking FST against builtin and SQLite +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since we picked the FST as the winner, additional review has been +focused on Python by comparing the ducer fst library against other +approaches. Since it is based on the Rust fst and Go’s vellum is also +based on the fst design, we cover essentially the three languages at +once. + +The ``etc/scripts/bench/alternative_benchmark.py`` script compares +Python lookup using a text file with one PURL per line for these +candidates: + +- Python ``set``. +- Python ``dict``. +- Python Sorted list plus ``bisect``. +- In-memory SQLite. +- FST using a ``ducer.Map``. + +Data is from ``purl-validator.rs/fst_builder/data/`` + +Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing +PURLs: + +.. code:: text + + structure build (secs) lookup (secs) storage size + -------------------- ------------ -------------- --------------------------- + python set 0.206540 0.275906 304MB in RAM + python dict 0.449625 0.429034 298MB in RAM + ducer FST 3.700943 1.805585 26MB on disk + sorted list+bisect 0.017540 2.783555 236MB in RAM + sqlite in memory 4.855480 4.220032 207MB on disk (or 65MB with zstd) + +benchmarking FST in Python vs. Go vs. Rust +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This benchmark runs each of the three validator released +implementations. The script is in +``etc/scripts/bench/go-rust-py_benchmark.py`` + +Data is from ``purl-validator.rs/fst_builder/data/`` + +Results with 2,324,119 unique PURLs and 1M lookup queries, 500K existing +PURLs: + +.. 
code:: text + + structure build (secs) lookup (secs) storage size (ondisk) + -------------------- ------------ -------------- --------------------------- + Python purl-validator 16.664847 4.926029 25MB + Rust purl-validator.rs 11.849877 0.348128 25MB + Go purlvalidator-go 2.325181 0.704749 25MB + +Evaluation +~~~~~~~~~~ + +The results are consistent with expectations: for lookups, Rust is faster than Go and +Python. + +And the Python on-disk fst is the same size as the Rust fst (since this +is the same backing code). + +Some surprises: + +- The build of the Go index is the fastest, which is surprising and + could be an avenue of improvement for the Rust fst crate. + +- Leaving aside the 10x larger RAM need, the Python set and dict are + competitive speed-wise (faster than the on-disk Rust FST) and are super + fast to build too. From 7fe3516b7292dd039d8343983c0b5ea7687742d6 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Fri, 15 May 2026 18:57:41 +0200 Subject: [PATCH 2/4] Update README Signed-off-by: Philippe Ombredanne --- README.md | 70 ++++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 56 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 39036fb..f84aaf6 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,9 @@ [![Version](https://img.shields.io/github/v/release/aboutcode-org/purl-validator?style=for-the-badge)](https://github.com/aboutcode-org/purl-validator/releases) [![Test](https://img.shields.io/github/actions/workflow/status/aboutcode-org/purl-validator/ci.yml?style=for-the-badge&logo=github)](https://github.com/aboutcode-org/purl-validator/actions) -**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec).
It works fully offline, including in **air-gapped** or **restricted environments**, and answers one key question: **Does the package this PURL represents actually exist?** +**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec). +It works fully offline, including in **air-gapped** or **restricted environments**, +and answers one key question: **Does the package this PURL represents actually exist?** ## How Does It Work? @@ -12,18 +14,18 @@ ## Currently Supported Ecosystems -- **apk** -- **cargo** -- **composer** -- **conan** -- **cpan** -- **cran** -- **debian** -- **maven** -- **npm** -- **nuget** -- **pypi** -- **swift** +- apk +- cargo +- composer +- conan +- cpan +- cran +- debian +- maven +- npm +- nuget +- pypi +- swift ## Usage @@ -47,6 +49,46 @@ PurlValidator.validate_purl("pkg:nuget/FluentValidation") PurlValidator.validate_purl("pkg:nuget/non-existent-foo-bar") >>> False ``` +The validator accepts a PURL string or a `packageurl.PackageURL` object: + +```python +from packageurl import PackageURL +from purl_validator import PurlValidator + +validator = PurlValidator() +purl = PackageURL(type="npm", namespace="@angular", name="core") + +exists = validator.validate_purl(purl) +print(exists) +``` + +Only the base PURL is used for queries (i.e., only the package type, namespace, and name).
+Version, qualifiers, and subpath are not part of the query: + +```python +from purl_validator import create_purl_map_entry + +assert create_purl_map_entry("pkg:pypi/django@5.0.0") == b"pypi/django" +``` + +You can also build and load a custom index for tests or experiments: + +```python +from purl_validator import PurlValidator +from purl_validator import create_purl_map + +purl_map_location = create_purl_map([ + "pkg:pypi/django", + "pkg:npm/%40angular/core", +]) + +validator = PurlValidator(purl_map_location) +assert validator.validate_purl("pkg:pypi/django") is True +assert validator.validate_purl("pkg:pypi/not-a-real-package-name") is False +``` + +Use one `PurlValidator` instance for many lookups. Creating the instance loads +the packaged map, while each validation is an exact membership check. ## Contribution @@ -91,4 +133,4 @@ limitations under the License. ``` [^1]: MineCode continuously collects package metadata from various package ecosystems to maintain an up-to-date catalog of known packages. -[^2]: A Base Package URL is a Package URL without a version, qualifiers or subpath. +[^2]: A Base Package URL is a Package URL without a version, qualifiers, or subpath. From 64931fd4dade6c2d8d724ccbdc78ac581d14de78 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Fri, 15 May 2026 19:14:30 +0200 Subject: [PATCH 3/4] Update CI for docs and releases. 
Signed-off-by: Philippe Ombredanne --- .github/workflows/docs-ci.yml | 7 +++++-- .github/workflows/pypi-release.yml | 24 +++++++++++++++--------- .github/workflows/zizmor.yml | 24 ++++++++++++++++++++++++ 3 files changed, 44 insertions(+), 11 deletions(-) create mode 100644 .github/workflows/zizmor.yml diff --git a/.github/workflows/docs-ci.yml b/.github/workflows/docs-ci.yml index 8d8aa55..fbc267f 100644 --- a/.github/workflows/docs-ci.yml +++ b/.github/workflows/docs-ci.yml @@ -2,6 +2,7 @@ name: CI Documentation on: [push, pull_request] +permissions: {} jobs: build: runs-on: ubuntu-24.04 @@ -13,10 +14,12 @@ jobs: steps: - name: Checkout code - uses: actions/checkout@v4 + uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd + with: + persist-credentials: false - name: Set up Python ${{ matrix.python-version }} - uses: actions/setup-python@v5 + uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 with: python-version: ${{ matrix.python-version }} diff --git a/.github/workflows/pypi-release.yml b/.github/workflows/pypi-release.yml index a461f63..7b5a13a 100644 --- a/.github/workflows/pypi-release.yml +++ b/.github/workflows/pypi-release.yml @@ -18,17 +18,20 @@ on: tags: - "v*.*.*" +permissions: {} jobs: build-pypi-distribs: name: Build and publish library to PyPI runs-on: ubuntu-24.04 steps: - - uses: actions/checkout@v4 + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd + with: + persist-credentials: false - name: Set up Python - uses: actions/setup-python@v5 + uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 with: - python-version: 3.12 + python-version: 3.13 - name: Install pypa/build and twine run: python -m pip install --user --upgrade build twine pkginfo @@ -36,17 +39,20 @@ jobs: - name: Build a binary wheel and a source tarball run: python -m build --wheel --sdist --outdir dist/ - - name: Validate wheel and sdis for Pypi + - name: Validate wheels and sdists for Pypi run: python -m twine check 
dist/* - name: Upload built archives - uses: actions/upload-artifact@v4 + uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f with: name: pypi_archives path: dist/* create-gh-release: + # Sets permissions of the GITHUB_TOKEN to allow release upload + permissions: + contents: write name: Create GH release needs: - build-pypi-distribs @@ -54,13 +60,13 @@ jobs: steps: - name: Download built archives - uses: actions/download-artifact@v4 + uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131 with: name: pypi_archives path: dist - name: Create GH release - uses: softprops/action-gh-release@v2 + uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda with: draft: true files: dist/* @@ -77,11 +83,11 @@ jobs: steps: - name: Download built archives - uses: actions/download-artifact@v4 + uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131 with: name: pypi_archives path: dist - name: Publish to PyPI if: startsWith(github.ref, 'refs/tags') - uses: pypa/gh-action-pypi-publish@release/v1 + uses: pypa/gh-action-pypi-publish@cef221092ed1bacb1cc03d23a2d87d1d172e277b \ No newline at end of file diff --git a/.github/workflows/zizmor.yml b/.github/workflows/zizmor.yml new file mode 100644 index 0000000..aa8259d --- /dev/null +++ b/.github/workflows/zizmor.yml @@ -0,0 +1,24 @@ +name: GitHub Actions Security Analysis with zizmor 🌈 + +on: + push: + branches: ["main"] + pull_request: + branches: ["**"] + +permissions: {} + +jobs: + zizmor: + name: Run zizmor 🌈 + runs-on: ubuntu-latest + permissions: + security-events: write # Required for upload-sarif (used by zizmor-action) to upload SARIF files. 
+ steps: + - name: Checkout repository + uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 + with: + persist-credentials: false + + - name: Run zizmor 🌈 + uses: zizmorcore/zizmor-action@b1d7e1fb5de872772f31590499237e7cce841e8e # v0.5.3 From caf5703fa32305019226cc02f12b8edabe821aa1 Mon Sep 17 00:00:00 2001 From: Philippe Ombredanne Date: Fri, 15 May 2026 19:19:27 +0200 Subject: [PATCH 4/4] Add combined documentation Signed-off-by: Philippe Ombredanne --- docs/source/conf.py | 8 +- docs/source/contribute/contrib_doc.rst | 2 +- docs/source/data-structure-rationale.rst | 83 ++++++++++++ docs/source/explanations.rst | 31 +++++ docs/source/how-to-guides.rst | 36 +++++ docs/source/index.rst | 59 +++++++-- docs/source/introduction.rst | 47 +++++++ docs/source/quickstart.rst | 84 ++++++++++++ docs/source/reference.rst | 75 +++++++++++ docs/source/skeleton-usage.rst | 160 ----------------------- docs/source/tutorials.rst | 45 +++++++ 11 files changed, 456 insertions(+), 174 deletions(-) create mode 100644 docs/source/data-structure-rationale.rst create mode 100644 docs/source/explanations.rst create mode 100644 docs/source/how-to-guides.rst create mode 100644 docs/source/introduction.rst create mode 100644 docs/source/quickstart.rst create mode 100644 docs/source/reference.rst delete mode 100644 docs/source/skeleton-usage.rst create mode 100644 docs/source/tutorials.rst diff --git a/docs/source/conf.py b/docs/source/conf.py index 056ca6e..410c3bd 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -17,7 +17,7 @@ # -- Project information ----------------------------------------------------- -project = "nexb-skeleton" +project = "purl-validator" copyright = "nexB Inc., AboutCode and others." 
author = "AboutCode.org authors and contributors" @@ -79,9 +79,9 @@ html_context = { "display_github": True, - "github_user": "nexB", - "github_repo": "nexb-skeleton", - "github_version": "develop", # branch + "github_user": "aboutcode-org", + "github_repo": "purl-validator", + "github_version": "main", # branch "conf_py_path": "/docs/source/", # path in the checkout to the docs root } diff --git a/docs/source/contribute/contrib_doc.rst b/docs/source/contribute/contrib_doc.rst index 2a719a5..b160bc5 100644 --- a/docs/source/contribute/contrib_doc.rst +++ b/docs/source/contribute/contrib_doc.rst @@ -187,7 +187,7 @@ Style Conventions for the Documentaion (`Refer `_) Normally, there are no heading levels assigned to certain characters as the structure is - determined from the succession of headings. However, this convention is used in Python’s Style + determined from the succession of headings. However, this convention is used in Python's Style Guide for documenting which you may follow: # with overline, for parts diff --git a/docs/source/data-structure-rationale.rst b/docs/source/data-structure-rationale.rst new file mode 100644 index 0000000..c2e0f46 --- /dev/null +++ b/docs/source/data-structure-rationale.rst @@ -0,0 +1,83 @@ +.. _data_structure_rationale: + +FST Data Structure Rationale +============================= + +PurlValidator needs exact membership lookup for a large list of base PURLs. The +lookup data index is built before release and bundled with each library. + + +See https://github.com/aboutcode-org/purl-validator/tree/main/etc/bench for +actual detailed rationale and bench for the choice of an FST. + + +Why FSTs are used? +------------------ + +Finite state transducers store sorted strings in a compact form. PURLs share +prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and ``pkg:maven/``. This makes an +FST useful for exact package identity queries. + +FST can be memory-mapped and are super compact. 
They are not as fast as native +sets, but the memory consumption is so much lower that this makes them the most +attractive solution, even if it takes more time to build. + + +Requirements +--------------- + +The index structure should provide exact, compact, offline lookup. + +For the library selection, we have these high-level requirements: + +- We want exact results without false positives, e.g., no Bloom filter. +- Offline use, with no network, is a must: the dataset must be bundled + in the releases. +- With build-time index construction, the construction time is not + critical. +- The bundled index should be small enough to ship below the crates.io and + PyPI archive size limits. +- No rebuild at startup/runtime, and fast enough load time from disk, + ideally memory-mapped. +- Fast enough lookup. +- Libraries should be maintained, active FOSS for Rust/Go/Python. + + + + +Selected FST libraries +-------------------------- + +Python uses ``ducer.Map`` with ``mmap``. The map is stored on disk and opened +without loading the full catalog into Python objects. + +Rust uses ``fst::Set``. The generated FST is embedded into the crate. + +Go uses Vellum FST. The generated FST is embedded into the module. + +Alternatives +------------ + +We also considered built-in sets and maps as a baseline: + +- Python: ``set`` and ``dict``. +- Rust: ``HashSet`` and ``HashMap``. +- Go: ``map[string]struct{}`` and ``map[string]bool``. + +These structures are simple and fast. They require loading all keys into +runtime memory, so they are less useful as the packaged lookup format. + +Sorted arrays or slices can use binary search. They are simple and exact, but +lookup takes repeated string comparisons and the strings still need to be +loaded. + +SQLite can store the PURLs in an indexed table. It gives exact results, but it +adds a database dependency for a read-only membership check. It has far more +features than needed and is overkill for our use case.
+ +Bloom filters are small and fast, but they can return false positives. They +cannot be used as a validation index. + +A DAWG can store a set of strings by sharing prefixes and suffixes. It may be a +valid alternative to an FST (the two are very similar), but there are few maintained +libraries in the target languages. diff --git a/docs/source/explanations.rst b/docs/source/explanations.rst new file mode 100644 index 0000000..d5b6185 --- /dev/null +++ b/docs/source/explanations.rst @@ -0,0 +1,31 @@ +.. _explanations: + +Explanations +============ + +Syntax validation and identity validation +----------------------------------------- + +The Package-URL spec defines the PURL format. A PURL can follow the spec +format and still name a package that is not known in the package ecosystems. + +PurlValidator checks the package PURL against reference data of known PURLs. This +helps find misspelled names, wrong package types, and PURLs that +do not appear in the reference upstream ecosystem package repositories. + + +Offline validation +------------------ + +SBOM and compliance workflows may run in CI systems, private networks, or +air-gapped environments. PurlValidator packages lookup data with each released +library so validation does not need network registry access at runtime. + + +Base PURL validation +-------------------- + +PURL existence is checked before version existence. + +The current libraries validate base PURLs only, no versions. Version support +can be a future enhancement. diff --git a/docs/source/how-to-guides.rst b/docs/source/how-to-guides.rst new file mode 100644 index 0000000..7fc0ec4 --- /dev/null +++ b/docs/source/how-to-guides.rst @@ -0,0 +1,36 @@ +.. _how_to_guides: + +How-to Guides +============= + +Choose an implementation +------------------------ + +Use the implementation that matches the application: + +- Use Python for Python scripts, data pipelines, etc. +- Use Rust for Rust apps. +- Use Go for Go apps and command-line tools.
+ +All implementations package PURL index data with the released library. + + +Update validation data +---------------------- + +PurlValidator index data is released with each package. To update the +data used by an application, update the PurlValidator package version. + + +Validation results +-------------------------- + +Group validation results as follows: + +- Known: the PURL is valid and exists in the reference data. +- Unknown: the PURL is syntactically valid but not present in the reference data. +- Invalid or unsupported: the input is not a supported or known PURL. + +For SBOM checks, you should report unknown and invalid PURLs separately. +Invalid PURLs are usually an error of the SBOM or SCA producer tool. +Unknown PURLs could be new packages, typos, or inventions by SCA tools. diff --git a/docs/source/index.rst b/docs/source/index.rst index eb63717..2cfeeef 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,16 +1,57 @@ -Welcome to nexb-skeleton's documentation! -========================================= +PurlValidator Documentation +=========================== -.. toctree:: - :maxdepth: 2 - :caption: Contents: +PurlValidator checks whether a base Package-URL (PURL) is present in a known +package catalog. It works without a network connection after installation. - skeleton-usage - contribute/contrib_doc +A valid PURL string can still name a package that is not known. PurlValidator +adds this package identity check for SBOM, VEX, SCA, and compliance workflows.
+ +Documentation overview +---------------------- + +Getting started +~~~~~~~~~~~~~~~ + +- :ref:`quickstart` +- :ref:`introduction` + +Tutorials +~~~~~~~~~ + +- :ref:`tutorials` + +How-to guides +~~~~~~~~~~~~~ + +- :ref:`how_to_guides` + +Reference +~~~~~~~~~ + +- :ref:`reference` + +Explanations +~~~~~~~~~~~~ + +- :ref:`explanations` +- :ref:`data_structure_rationale` Indices and tables -================== +------------------ * :ref:`genindex` -* :ref:`modindex` * :ref:`search` + +.. toctree:: + :maxdepth: 2 + :hidden: + + quickstart + introduction + tutorials + how-to-guides + reference + explanations + data-structure-rationale + contribute/contrib_doc diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst new file mode 100644 index 0000000..cffc6df --- /dev/null +++ b/docs/source/introduction.rst @@ -0,0 +1,47 @@ +.. _introduction: + +Introduction +============ + +PurlValidator checks package identity for Package-URLs (PURLs). It does +not replace syntax validation. It adds a lookup against an index of packaged +reference data. + +Why this exists? +----------------- + +PURL is used in SBOMs, VEX documents, SCA tools, and vulnerability databases. +The PURL spec tells tools how to write a package identifier, but does +not prove that the package exists. + +Common PURL data problems include: + +- Misspelled package names. +- Wrong or made-up package types. +- Packages that are not present in an ecosystem. + +PurlValidator answers this question: + +Does this PURL exist for a known package? + +Repositories +------------ + +We have three implementations in Rust, Go and Python. +Each repository has language-specific usage notes in its README. + +- Python: https://github.com/aboutcode-org/purl-validator +- Rust: https://github.com/aboutcode-org/purl-validator.rs +- Go: https://github.com/aboutcode-org/purlvalidator-go + + +Validation scope +---------------- + +PurlValidator validates base PURLs, ignoring the version.
A base PURL contains: + +- Type, such as ``npm`` or ``pypi``. +- Optional namespace, such as an npm scope or Maven groupid. +- Name. + +Versions, qualifiers, and subpaths are not part of the lookup query. diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst new file mode 100644 index 0000000..80f7198 --- /dev/null +++ b/docs/source/quickstart.rst @@ -0,0 +1,84 @@ +.. _quickstart: + +Quickstart +========== + +Python +------ + +Install the Python package: + +.. code-block:: bash + + pip install purl-validator + +Validate a PURL: + +.. code-block:: python + + from purl_validator import PurlValidator + + validator = PurlValidator() + + print(validator.validate_purl("pkg:nuget/FluentValidation")) + print(validator.validate_purl("pkg:nuget/non-existent-foo-bar")) + +Rust +---- + +Install the Rust crate: + +.. code-block:: bash + + cargo add purl_validator + +Validate a PURL: + +.. code-block:: rust + + use purl_validator::validate; + + fn main() { + let exists = validate("pkg:nuget/FluentValidation") + .expect("input must be a supported base PURL"); + + println!("{exists}"); + } + +Go +-- + +Install the Go module: + +.. code-block:: bash + + go get github.com/aboutcode-org/purlvalidator-go + +Validate a PURL: + +.. 
code-block:: go + + package main + + import ( + "fmt" + "log" + + purlvalidator "github.com/aboutcode-org/purlvalidator-go" + ) + + func main() { + exists, err := purlvalidator.Validate("pkg:nuget/FluentValidation") + if err != nil { + log.Fatal(err) + } + + fmt.Println(exists) + } + +Next steps +---------- + +- Use the Python README for Python-specific helper APIs: https://github.com/aboutcode-org/purl-validator +- Use the Rust README for error handling with ``ValidateError``: https://github.com/aboutcode-org/purl-validator.rs +- Use the Go README for ``Validate`` return values and integration examples: https://github.com/aboutcode-org/purlvalidator-go diff --git a/docs/source/reference.rst b/docs/source/reference.rst new file mode 100644 index 0000000..96a8ed9 --- /dev/null +++ b/docs/source/reference.rst @@ -0,0 +1,75 @@ +.. _reference: + +Reference +========= + +Supported ecosystems +-------------------- + +The current validators package indexed reference data for these package types/ecosystems: + +- ``apk`` +- ``cargo`` +- ``composer`` +- ``conan`` +- ``cpan`` +- ``cran`` +- ``debian`` +- ``maven`` +- ``npm`` +- ``nuget`` +- ``pypi`` +- ``swift`` + +Base PURLs +---------- + +A base PURL is a Package-URL without a version, qualifiers, or subpath. + +Examples: + +.. code-block:: text + + pkg:pypi/django + pkg:npm/%40angular/core + pkg:maven/org.apache.commons/commons-lang3 + +Unsupported examples: + +.. code-block:: text + + pkg:pypi/django@5.0.0 + pkg:npm/%40angular/core?repository_url=https://registry.npmjs.org + pkg:maven/org.apache.commons/commons-lang3#src/main + +Implementation summary +---------------------- + +- Python uses a memory-mapped compact map through ``ducer.Map``. +- Rust uses an embedded ``fst::Set`` generated from sorted PURL strings. +- Go uses an embedded Vellum FST generated from sorted PURL strings. + + +Language APIs +------------- + +Python: + +..
code-block:: python + + from purl_validator import PurlValidator + + validator = PurlValidator() + exists = validator.validate_purl("pkg:pypi/django") + +Rust: + +.. code-block:: rust + + let exists = purl_validator::validate("pkg:pypi/django")?; + +Go: + +.. code-block:: go + + exists, err := purlvalidator.Validate("pkg:pypi/django") diff --git a/docs/source/skeleton-usage.rst b/docs/source/skeleton-usage.rst deleted file mode 100644 index 6cb4cc5..0000000 --- a/docs/source/skeleton-usage.rst +++ /dev/null @@ -1,160 +0,0 @@ -Usage -===== -A brand new project -------------------- -.. code-block:: bash - - git init my-new-repo - cd my-new-repo - git pull git@github.com:nexB/skeleton - - # Create the new repo on GitHub, then update your remote - git remote set-url origin git@github.com:nexB/your-new-repo.git - -From here, you can make the appropriate changes to the files for your specific project. - -Update an existing project ---------------------------- -.. code-block:: bash - - cd my-existing-project - git remote add skeleton git@github.com:nexB/skeleton - git fetch skeleton - git merge skeleton/main --allow-unrelated-histories - -This is also the workflow to use when updating the skeleton files in any given repository. - -Customizing ------------ - -You typically want to perform these customizations: - -- remove or update the src/README.rst and tests/README.rst files -- set project info and dependencies in setup.cfg -- check the configure and configure.bat defaults - -Initializing a project ----------------------- - -All projects using the skeleton will be expected to pull all of it dependencies -from thirdparty.aboutcode.org/pypi or the local thirdparty directory, using -requirements.txt and/or requirements-dev.txt to determine what version of a -package to collect. By default, PyPI will not be used to find and collect -packages from. 
- -In the case where we are starting a new project where we do not have -requirements.txt and requirements-dev.txt and whose dependencies are not yet on -thirdparty.aboutcode.org/pypi, we run the following command after adding and -customizing the skeleton files to your project: - -.. code-block:: bash - - ./configure - -This will initialize the virtual environment for the project, pull in the -dependencies from PyPI and add them to the virtual environment. - - -Generating requirements.txt and requirements-dev.txt ----------------------------------------------------- - -After the project has been initialized, we can generate the requirements.txt and -requirements-dev.txt files. - -Ensure the virtual environment is enabled. - -.. code-block:: bash - - source venv/bin/activate - -To generate requirements.txt: - -.. code-block:: bash - - python etc/scripts/gen_requirements.py -s venv/lib/python/site-packages/ - -Replace \ with the version number of the Python being used, for example: -``venv/lib/python3.6/site-packages/`` - -To generate requirements-dev.txt after requirements.txt has been generated: - -.. code-block:: bash - - ./configure --dev - python etc/scripts/gen_requirements_dev.py -s venv/lib/python/site-packages/ - -Note: on Windows, the ``site-packages`` directory is located at ``venv\Lib\site-packages\`` - -.. code-block:: bash - - python .\\etc\\scripts\\gen_requirements.py -s .\\venv\\Lib\\site-packages\\ - .\configure --dev - python .\\etc\\scripts\\gen_requirements_dev.py -s .\\venv\\Lib\\site-packages\\ - - -Collecting and generating ABOUT files for dependencies ------------------------------------------------------- - -Ensure that the dependencies used by ``etc/scripts/fetch_thirdparty.py`` are installed: - -.. code-block:: bash - - pip install -r etc/scripts/requirements.txt - -Once we have requirements.txt and requirements-dev.txt, we can fetch the project -dependencies as wheels and generate ABOUT files for them: - -.. 
code-block:: bash - - python etc/scripts/fetch_thirdparty.py -r requirements.txt -r requirements-dev.txt - -There may be issues with the generated ABOUT files, which will have to be -corrected. You can check to see if your corrections are valid by running: - -.. code-block:: bash - - python etc/scripts/check_thirdparty.py -d thirdparty - -Once the wheels are collected and the ABOUT files are generated and correct, -upload them to thirdparty.aboutcode.org/pypi by placing the wheels and ABOUT -files from the thirdparty directory to the pypi directory at -https://github.com/aboutcode-org/thirdparty-packages - - -Usage after project initialization ----------------------------------- - -Once the ``requirements.txt`` and ``requirements-dev.txt`` have been generated -and the project dependencies and their ABOUT files have been uploaded to -thirdparty.aboutcode.org/pypi, you can configure the project as needed, typically -when you update dependencies or use a new checkout. - -If the virtual env for the project becomes polluted, or you would like to remove -it, use the ``--clean`` option: - -.. code-block:: bash - - ./configure --clean - -Then you can run ``./configure`` again to set up the project virtual environment. - -To set up the project for development use: - -.. code-block:: bash - - ./configure --dev - -To update the project dependencies (adding, removing, updating packages, etc.), -update the dependencies in ``setup.cfg``, then run: - -.. 
code-block:: bash - - ./configure --clean # Remove existing virtual environment - source venv/bin/activate # Ensure virtual environment is activated - python etc/scripts/gen_requirements.py -s venv/lib/python/site-packages/ # Regenerate requirements.txt - python etc/scripts/gen_requirements_dev.py -s venv/lib/python/site-packages/ # Regenerate requirements-dev.txt - pip install -r etc/scripts/requirements.txt # Install dependencies needed by etc/scripts/bootstrap.py - python etc/scripts/fetch_thirdparty.py -r requirements.txt -r requirements-dev.txt # Collect dependency wheels and their ABOUT files - -Ensure that the generated ABOUT files are valid, then take the dependency wheels -and ABOUT files and upload them to thirdparty.aboutcode.org/pypi. diff --git a/docs/source/tutorials.rst b/docs/source/tutorials.rst new file mode 100644 index 0000000..fd67722 --- /dev/null +++ b/docs/source/tutorials.rst @@ -0,0 +1,45 @@ +.. _tutorials: + +Tutorials +========= + +Validate a list of PURLs with Python +------------------------------------ + +Create a file named ``purls.txt``: + +.. code-block:: text + + pkg:nuget/FluentValidation + pkg:nuget/non-existent-foo-bar + pkg:pypi/django + +Run this script: + +.. code-block:: python + + from pathlib import Path + from purl_validator import PurlValidator + + validator = PurlValidator() + + for line in Path("purls.txt").read_text().splitlines(): + purl = line.strip() + if not purl: + continue + print(purl, validator.validate_purl(purl)) + + +Use PurlValidator in an SBOM check +---------------------------------- + +The basic workflow is: + +1. Extract PURLs from an SBOM. +2. Convert each PURL to its base identity. +3. Validate each base PURL with one PurlValidator library. +4. Report unknown PURLs for review. + +An unknown PURL may be a typo, a wrong package type, or a package missing from +the packaged reference data. Handle unknown PURLs according to the policy for +your project.
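The four workflow steps in the SBOM check tutorial could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the SBOM is taken to be a CycloneDX-style JSON document whose top-level ``components`` entries carry a ``purl`` field, the helper names (``extract_purls``, ``to_base_purl``, ``find_unknown``) are hypothetical and not part of PurlValidator, and the base PURL conversion uses simplified string handling where a real PURL parser would be more robust. The lookup is passed in as a callable, so a stand-in set is used here; ``PurlValidator().validate_purl`` could presumably be passed instead, assuming it returns a truthy value for known PURLs as the quickstart suggests.

```python
from typing import Callable, Iterable


def extract_purls(sbom: dict) -> list[str]:
    """Step 1: collect the ``purl`` of each top-level component (nested ones are ignored)."""
    return [c["purl"] for c in sbom.get("components", []) if c.get("purl")]


def to_base_purl(purl: str) -> str:
    """Step 2: drop subpath, qualifiers, and version (simplified, not a full parser)."""
    base = purl.split("#", 1)[0]  # remove the subpath
    base = base.split("?", 1)[0]  # remove the qualifiers
    # Only treat "@" as the version separator when it follows the last "/",
    # so a literal "@scope" namespace segment is left untouched.
    at = base.find("@", base.rfind("/") + 1)
    return base if at == -1 else base[:at]


def find_unknown(purls: Iterable[str], is_known: Callable[[str], bool]) -> list[str]:
    """Steps 3 and 4: look up each base PURL and collect the unknown ones for review."""
    unknown = []
    for purl in purls:
        base = to_base_purl(purl)
        if not is_known(base):
            unknown.append(base)
    return unknown


if __name__ == "__main__":
    # Hypothetical SBOM fragment and stand-in reference data for the demo.
    sbom = {
        "components": [
            {"name": "django", "purl": "pkg:pypi/django@5.0.0"},
            {"name": "foo", "purl": "pkg:nuget/non-existent-foo-bar"},
        ]
    }
    known = {"pkg:pypi/django"}
    for purl in find_unknown(extract_purls(sbom), known.__contains__):
        print("unknown:", purl)
```

Keeping the lookup behind a callable means the same reporting code can be exercised in tests without the packaged index, and invalid PURLs can be handled in a separate pass before this one.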