Merged
7 changes: 5 additions & 2 deletions .github/workflows/docs-ci.yml
@@ -2,6 +2,7 @@ name: CI Documentation

on: [push, pull_request]

permissions: {}
jobs:
build:
runs-on: ubuntu-24.04
@@ -13,10 +14,12 @@ jobs:

steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
with:
persist-credentials: false

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405
with:
python-version: ${{ matrix.python-version }}

24 changes: 15 additions & 9 deletions .github/workflows/pypi-release.yml
@@ -18,49 +18,55 @@
tags:
- "v*.*.*"

permissions: {}
jobs:
build-pypi-distribs:
name: Build and publish library to PyPI
runs-on: ubuntu-24.04

steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
with:
persist-credentials: false
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405
with:
python-version: 3.12
python-version: 3.13

- name: Install pypa/build and twine
run: python -m pip install --user --upgrade build twine pkginfo

- name: Build a binary wheel and a source tarball
run: python -m build --wheel --sdist --outdir dist/

- name: Validate wheel and sdis for Pypi
- name: Validate wheels and sdists for Pypi
run: python -m twine check dist/*

- name: Upload built archives
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f
with:
name: pypi_archives
path: dist/*


create-gh-release:
# Sets permissions of the GITHUB_TOKEN to allow release upload
permissions:
contents: write
name: Create GH release
needs:
- build-pypi-distribs
runs-on: ubuntu-24.04

steps:
- name: Download built archives
uses: actions/download-artifact@v4
uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131
with:
name: pypi_archives
path: dist

- name: Create GH release
uses: softprops/action-gh-release@v2
uses: softprops/action-gh-release@b4309332981a82ec1c5618f44dd2e27cc8bfbfda

[Check notice, Code scanning / zizmor: action functionality is already included by the runner]
with:
draft: true
files: dist/*
@@ -77,11 +83,11 @@

steps:
- name: Download built archives
uses: actions/download-artifact@v4
uses: actions/download-artifact@37930b1c2abaa49bbe596cd826c3c89aef350131
with:
name: pypi_archives
path: dist

- name: Publish to PyPI
if: startsWith(github.ref, 'refs/tags')
uses: pypa/gh-action-pypi-publish@release/v1
uses: pypa/gh-action-pypi-publish@cef221092ed1bacb1cc03d23a2d87d1d172e277b
24 changes: 24 additions & 0 deletions .github/workflows/zizmor.yml
@@ -0,0 +1,24 @@
name: GitHub Actions Security Analysis with zizmor 🌈

on:
push:
branches: ["main"]
pull_request:
branches: ["**"]

permissions: {}

jobs:
zizmor:
name: Run zizmor 🌈
runs-on: ubuntu-latest
permissions:
security-events: write # Required for upload-sarif (used by zizmor-action) to upload SARIF files.
steps:
- name: Checkout repository
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false

- name: Run zizmor 🌈
uses: zizmorcore/zizmor-action@b1d7e1fb5de872772f31590499237e7cce841e8e # v0.5.3
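
The workflow changes above replace tag references such as `@v4` with full commit SHAs. As a rough illustration of that convention, here is a minimal Python sketch that flags `uses:` references not pinned to a 40-character SHA. The function name `find_unpinned_actions` and the regular expressions are assumptions made for this example; this is not how zizmor itself performs its analysis.

```python
import re

# A full commit SHA is exactly 40 lowercase hex characters.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")
# Matches "uses: owner/repo@ref" with an optional leading "- " and trailing comment.
USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*(?P<action>[^@\s]+)@(?P<ref>[^\s#]+)")


def find_unpinned_actions(workflow_text):
    """Return (action, ref) pairs whose ref is not a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_LINE.match(line)
        if match and not FULL_SHA.match(match.group("ref")):
            unpinned.append((match.group("action"), match.group("ref")))
    return unpinned


# Hypothetical workflow fragment mixing pinned and unpinned references.
example = """
steps:
  - uses: actions/checkout@v4
  - name: Checkout repository
    uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
  - uses: pypa/gh-action-pypi-publish@release/v1
"""
print(find_unpinned_actions(example))
```

A line-based regex is enough for a quick local check, but it does not parse YAML, so unusual formatting may be missed.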
70 changes: 56 additions & 14 deletions README.md
@@ -4,26 +4,28 @@
[![Version](https://img.shields.io/github/v/release/aboutcode-org/purl-validator?style=for-the-badge)](https://github.com/aboutcode-org/purl-validator/releases)
[![Test](https://img.shields.io/github/actions/workflow/status/aboutcode-org/purl-validator/ci.yml?style=for-the-badge&logo=github)](https://github.com/aboutcode-org/purl-validator/actions)

**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec). It works fully offline, including in **air-gapped** or **restricted environments**, and answers one key question: **Does the package this PURL represents actually exist?**
**purl-validator** is a Python library for validating [Package URLs (PURLs)](https://github.com/package-url/purl-spec).
It works fully offline, including in **air-gapped** or **restricted environments**,
and answers one key question: **Does the package this PURL represents actually exist?**

## How Does It Work?

**purl-validator** ships with a pre-built FST (Finite State Transducer), a set of compact automata containing the latest Package URLs mined by MineCode[^1]. The library uses this FST to perform lookups and confirm whether the **base PURL**[^2] exists.

## Currently Supported Ecosystems

- **apk**
- **cargo**
- **composer**
- **conan**
- **cpan**
- **cran**
- **debian**
- **maven**
- **npm**
- **nuget**
- **pypi**
- **swift**
- apk
- cargo
- composer
- conan
- cpan
- cran
- debian
- maven
- npm
- nuget
- pypi
- swift

## Usage

@@ -47,6 +49,46 @@ PurlValidator.validate_purl("pkg:nuget/FluentValidation")
PurlValidator.validate_purl("pkg:nuget/non-existent-foo-bar")
>>> False
```
The validator accepts a PURL string or a `packageurl.PackageURL` object:

```python
from packageurl import PackageURL
from purl_validator import PurlValidator

validator = PurlValidator()
purl = PackageURL(type="npm", namespace="@angular", name="core")

exists = validator.validate_purl(purl)
print(exists)
```

Only the base PURL is used for queries (i.e., only the package type, namespace, and name).
Version, qualifiers, and subpath are not part of the query:

```python
from purl_validator import create_purl_map_entry

assert create_purl_map_entry("pkg:pypi/django@5.0.0") == b"pypi/django"
```

You can also build and load a custom index for tests or experiments:

```python
from purl_validator import PurlValidator
from purl_validator import create_purl_map

purl_map_location = create_purl_map([
"pkg:pypi/django",
"pkg:npm/%40angular/core",
])

validator = PurlValidator(purl_map_location)
assert validator.validate_purl("pkg:pypi/django") is True
assert validator.validate_purl("pkg:pypi/not-a-real-package-name") is False
```

Use one `PurlValidator` instance for many lookups. Creating the instance loads
the packaged map, while each validation is an exact membership check.

## Contribution

@@ -91,4 +133,4 @@ limitations under the License.
```

[^1]: MineCode continuously collects package metadata from various package ecosystems to maintain an up-to-date catalog of known packages.
[^2]: A Base Package URL is a Package URL without a version, qualifiers or subpath.
[^2]: A Base Package URL is a Package URL without a version, qualifiers, or subpath.
8 changes: 4 additions & 4 deletions docs/source/conf.py
@@ -17,7 +17,7 @@

# -- Project information -----------------------------------------------------

project = "nexb-skeleton"
project = "purl-validator"
copyright = "nexB Inc., AboutCode and others."
author = "AboutCode.org authors and contributors"

@@ -79,9 +79,9 @@

html_context = {
"display_github": True,
"github_user": "nexB",
"github_repo": "nexb-skeleton",
"github_version": "develop", # branch
"github_user": "aboutcode-org",
"github_repo": "purl-validator",
"github_version": "main", # branch
"conf_py_path": "/docs/source/", # path in the checkout to the docs root
}

2 changes: 1 addition & 1 deletion docs/source/contribute/contrib_doc.rst
@@ -187,7 +187,7 @@ Style Conventions for the Documentaion

(`Refer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#sections>`_)
Normally, there are no heading levels assigned to certain characters as the structure is
determined from the succession of headings. However, this convention is used in Pythons Style
determined from the succession of headings. However, this convention is used in Python's Style
Guide for documenting which you may follow:

# with overline, for parts
83 changes: 83 additions & 0 deletions docs/source/data-structure-rationale.rst
@@ -0,0 +1,83 @@
.. _data_structure_rationale:

FST Data Structure Rationale
=============================

PurlValidator needs exact membership lookup for a large list of base PURLs. The
lookup data index is built before release and bundled with each library.


See https://github.com/aboutcode-org/purl-validator/tree/main/etc/bench for
the detailed rationale and benchmarks behind the choice of an FST.


Why are FSTs used?
------------------

Finite state transducers store sorted strings in a compact form. PURLs share
prefixes such as ``pkg:npm/``, ``pkg:pypi/``, and ``pkg:maven/``. This makes an
FST useful for exact package identity queries.

FSTs can be memory-mapped and are very compact. They are not as fast as a native
set, but their memory consumption is so much lower that this makes them the most
attractive solution, even if they take more time to build.
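
To make the prefix sharing concrete, here is a minimal sketch that compares the
number of characters in a few raw PURL strings with the number of nodes in a
plain character trie built from them. A trie is only a stand-in here: a real FST
also shares suffixes and is stored far more compactly. The helper name
``count_trie_nodes`` and the sample PURLs are assumptions made for this example.

```python
def count_trie_nodes(keys):
    """Count nodes in a character trie built from keys (root excluded)."""
    root = {}
    nodes = 0
    for key in keys:
        current = root
        for char in key:
            if char not in current:
                # A new node is needed only where this key diverges
                # from the keys already inserted.
                current[char] = {}
                nodes += 1
            current = current[char]
    return nodes


# Hypothetical sample; the shipped index holds far more entries.
purls = sorted([
    "pkg:npm/express",
    "pkg:npm/lodash",
    "pkg:pypi/django",
    "pkg:pypi/djangorestframework",
    "pkg:pypi/flask",
])

raw_chars = sum(len(p) for p in purls)
trie_nodes = count_trie_nodes(purls)
print(f"raw characters: {raw_chars}, trie nodes: {trie_nodes}")
```

The gap between the two numbers grows with the number of keys that share a
package type and namespace prefix.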


Requirements
---------------

The index structure and the library selection should meet these high-level
requirements:

- We want exact results without false positives, i.e., no Bloom filter.
- Offline use with no network access is a must: the dataset must be bundled
  in the releases.
- The index is constructed at build time, so construction time is not
  critical.
- The bundled index should be small enough to stay below the crates.io and
  PyPI archive size limits.
- No rebuild at startup or runtime, and a fast enough load time from disk,
  ideally memory-mapped.
- Fast enough lookup.
- Libraries should be actively maintained FOSS for Rust, Go, and Python.




Selected FST libraries
--------------------------

Python uses ``ducer.Map`` with ``mmap``. The map is stored on disk and opened
without loading the full catalog into Python objects.
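
As a rough illustration of the memory-mapping idea only, the sketch below writes
a newline-delimited key file and queries it through the standard library
``mmap`` module, without turning the keys into Python objects. It does not use
``ducer`` and does not build an FST; the file layout and the helper names
``write_index`` and ``contains`` are assumptions made for this example, and the
lookup is a linear scan rather than an automaton traversal.

```python
import mmap
import os
import tempfile


def write_index(path, keys):
    """Write sorted keys to disk, one per line, with a leading newline sentinel."""
    with open(path, "wb") as f:
        f.write(b"\n")
        for key in sorted(keys):
            f.write(key.encode("utf-8") + b"\n")


def contains(mapped, key):
    """Exact membership: the key must be delimited by newlines on both sides."""
    return mapped.find(b"\n" + key.encode("utf-8") + b"\n") != -1


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "purls.idx")
    write_index(index_path, ["pypi/django", "npm/%40angular/core"])
    with open(index_path, "rb") as f:
        # The OS pages the file in on demand; nothing is copied into a set.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            found = contains(mapped, "pypi/django")
            missing = contains(mapped, "pypi/djang")

print(found, missing)
```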

Rust uses ``fst::Set``. The generated FST is embedded into the crate.

Go uses Vellum FST. The generated FST is embedded into the module.

Alternatives
------------

We also considered built-in sets and maps as a baseline:

- Python: ``set`` and ``dict``.
- Rust: ``HashSet`` and ``HashMap``.
- Go: ``map[string]struct{}`` and ``map[string]bool``.

These structures are simple and fast. They require loading all keys into
runtime memory, so they are less useful as the packaged lookup format.

Sorted arrays or slices can use binary search. They are simple and exact, but
lookup takes repeated string comparisons and the strings still need to be
loaded.
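
A minimal sketch of the sorted-array alternative, using the standard library
``bisect`` module. The helper name ``sorted_contains`` and the sample keys are
assumptions made for this example.

```python
from bisect import bisect_left


def sorted_contains(sorted_keys, key):
    """Exact membership in a sorted list using binary search."""
    index = bisect_left(sorted_keys, key)
    # bisect_left returns the insertion point, so the key is present
    # only if the element at that position is equal to it.
    return index < len(sorted_keys) and sorted_keys[index] == key


# All keys must be loaded and sorted before any lookup can happen.
keys = sorted(["pypi/django", "npm/lodash", "cargo/serde"])

print(sorted_contains(keys, "npm/lodash"))
print(sorted_contains(keys, "npm/lodas"))
```

Each lookup performs a logarithmic number of string comparisons, and the whole
list has to live in memory, which is the drawback noted above.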

SQLite can store the PURLs in an indexed table. It gives exact results, but it
adds a database dependency for a read-only membership check. It has way more
features than needed and is overkill for our use case.
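
For comparison, a minimal sketch of the SQLite alternative using the standard
library ``sqlite3`` module with an in-memory database. The table name ``purls``,
the column name ``base_purl``, and the helper ``sqlite_contains`` are
assumptions made for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# The primary key gives an index that supports exact lookups.
connection.execute("CREATE TABLE purls (base_purl TEXT PRIMARY KEY)")
connection.executemany(
    "INSERT INTO purls (base_purl) VALUES (?)",
    [("pypi/django",), ("npm/lodash",)],
)
connection.commit()


def sqlite_contains(conn, base_purl):
    """Exact membership check through an indexed SQL query."""
    row = conn.execute(
        "SELECT 1 FROM purls WHERE base_purl = ? LIMIT 1",
        (base_purl,),
    ).fetchone()
    return row is not None


print(sqlite_contains(connection, "pypi/django"))
print(sqlite_contains(connection, "pypi/not-a-real-package-name"))
```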

Bloom filters are small and fast, but they can return false positives. They
cannot be used as a validation index.
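
The false-positive problem can be seen with a deliberately undersized filter.
The sketch below is a toy Bloom filter written for this illustration; the class
name ``TinyBloomFilter``, its parameters, and the generated package names are
all assumptions, and real Bloom filter libraries are sized to keep the error
rate low rather than to expose it.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter, intentionally small so false positives show up."""

    def __init__(self, size_bits=32, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> position) & 1 for position in self._positions(key))


known = [f"pypi/package-{i}" for i in range(10)]
bloom = TinyBloomFilter()
for key in known:
    bloom.add(key)

exact = set(known)
candidates = [f"pypi/unknown-{i}" for i in range(1000)]
# Names the filter accepts even though they were never added.
false_positives = [c for c in candidates if bloom.might_contain(c) and c not in exact]
print(f"{len(false_positives)} of {len(candidates)} unknown names were accepted")
```

Every added key is always reported as present, but some unknown names are
accepted too, which would make a validator report non-existent packages as
valid.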

A DAWG can store a set of strings by sharing prefixes and suffixes. It may be a
valid alternative to an FST (the two are very similar), but there are few
maintained libraries in the target languages.
31 changes: 31 additions & 0 deletions docs/source/explanations.rst
@@ -0,0 +1,31 @@
.. _explanations:

Explanations
============

Syntax validation and identity validation
-----------------------------------------

The Package-URL spec defines the PURL format. A PURL can follow the spec
format and still name a package that is not known in the package ecosystems.

PurlValidator checks the package PURL against reference data of known PURLs. This
helps find misspelled names, wrong package types, and PURLs that
do not appear in the reference upstream ecosystem package repositories.
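
The difference between the two checks can be sketched as follows. The syntax
check here is a heavy simplification of the Package-URL spec, and the names
``looks_like_purl``, ``is_known``, and ``KNOWN_BASE_PURLS`` are assumptions made
for this example; they are not part of the purl-validator API.

```python
# Hypothetical reference data; the real library ships a much larger index.
KNOWN_BASE_PURLS = {"pkg:pypi/django", "pkg:npm/lodash"}


def looks_like_purl(purl):
    """Very rough syntax check: 'pkg:' scheme, a type, and a non-empty name."""
    if not purl.startswith("pkg:"):
        return False
    remainder = purl[len("pkg:"):]
    package_type, _, path = remainder.partition("/")
    return bool(package_type) and bool(path.strip("/"))


def is_known(purl):
    """Identity check: well-formed and present in the reference data."""
    return looks_like_purl(purl) and purl in KNOWN_BASE_PURLS


misspelled = "pkg:pypi/djnago"
print(looks_like_purl(misspelled))  # well-formed
print(is_known(misspelled))         # but not a known package
```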


Offline validation
------------------

SBOM and compliance workflows may run in CI systems, private networks, or
air-gapped environments. PurlValidator packages lookup data with each released
library so validation does not need network access to a registry at runtime.


Base PURL validation
--------------------

PURL existence is checked before version existence.

The current libraries validate base PURLs only, not versions. Version support
may be added as a future enhancement.
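
As an illustration of what "base PURL" means in practice, the sketch below
strips the version, qualifiers, and subpath from a PURL string with plain string
operations. It assumes a canonical PURL in which reserved characters inside
components are percent-encoded (for example ``%40angular``). The function name
``base_purl_key`` is hypothetical, and the libraries may normalize their lookup
keys differently.

```python
def base_purl_key(purl):
    """Return 'type/namespace/name' for a canonical PURL string."""
    if not purl.startswith("pkg:"):
        raise ValueError(f"not a PURL: {purl!r}")
    remainder = purl[len("pkg:"):]
    # Qualifiers start at '?', and the subpath starts at '#'.
    remainder = remainder.partition("?")[0]
    remainder = remainder.partition("#")[0]
    # The version follows the last '@'; an '@' inside a namespace is
    # expected to be percent-encoded in a canonical PURL.
    remainder = remainder.rsplit("@", 1)[0]
    return remainder.strip("/")


print(base_purl_key("pkg:pypi/django@5.0.0"))
print(base_purl_key("pkg:npm/%40angular/core@17.0.0?arch=x#src"))
```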