Image-mode default generator cdxgen -t oci loses ecosystem metadata vs Syft — propose priority swap #232

@nissessenap

Summary

For Docker image scans, sbomify-action's generator registry today prefers CdxgenImageGenerator (priority 20) over SyftImageGenerator (priority 35) — sbomify_action/_generation/generator.py:74-88, registry.py:97-98 (lower priority number wins). This is the default path for any image scan that doesn't match the Chainguard or Docker Hub auto-detection branches.
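The selection rule described above can be illustrated with a minimal sketch. The names `Generator` and `pick_generator` are hypothetical and do not mirror the actual code in registry.py; only the two priorities (20 and 35) and the lower-number-wins rule come from this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generator:
    """Hypothetical stand-in for a registered SBOM generator."""
    name: str
    priority: int
    available: bool = True


def pick_generator(generators: list[Generator]) -> Generator:
    """Return the available generator with the lowest priority number."""
    candidates = [g for g in generators if g.available]
    if not candidates:
        raise LookupError("no generator available")
    return min(candidates, key=lambda g: g.priority)


# Priorities as reported in this issue.
current = [
    Generator("CdxgenImageGenerator", priority=20),
    Generator("SyftImageGenerator", priority=35),
]
print(pick_generator(current).name)
```

With these numbers the cdxgen image generator is chosen whenever it is available, which is why Syft only ever runs as a fallback on the default path.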

The result: cdxgen -t oci <image> produces strictly worse SBOMs than Syft for interpreted-language ecosystems, because cdxgen's OCI walker emits pkg:generic/<filename> for venv bin shims and shared libraries instead of resolving Python .dist-info, Node node_modules, gem indexes, etc.

Concrete evidence — same image, two scanners

Test image: a FastAPI app built on cgr.dev/chainguard/python:latest with 17 transitive PyPI dependencies (fastapi, httpx, uvicorn, pydantic, starlette, …).

| Scanner | Total components | pkg:pypi/* | pkg:apk/* | pkg:generic/* |
|---|---|---|---|---|
| Lockfile (cdxgen-fs, pyproject.toml/uv.lock) | 17 | 17 | 0 | 0 |
| Image: cdxgen -t oci (current default) | 78 | 2 | 0 | 76 |
| Image: Syft (priority 35, currently fallback) | 1809* | 17 | 25 | 0 |

*The 1809 from Syft includes 42 type=library packages + 1766 type=file filesystem entries (separately reported, not conflated with packages) + 1 OS component.

cdxgen-image emits fastapi, httpx, uvicorn as type=file + pkg:generic/fastapi (resolved from /opt/venv/bin/fastapi entry-point scripts), and bundles 73 shared-library entries (libcrypto.so.3, libc.so.6, …) as pkg:generic/*. The actual PyPI package metadata in the venv .dist-info directories is not surfaced.

Syft walks the venv .dist-info directories correctly and produces the same 17-component PyPI closure as the lockfile mode, plus the Chainguard base image's 25 apk packages — without the false-positive pkg:generic/* noise.
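The per-purl-type counts in the comparison above can be reproduced from any CycloneDX JSON file with a short script. This is a sketch, not part of sbomify-action: `purl_type_counts` and `load_bom` are hypothetical helper names, and the sample document is illustrative rather than real scanner output.

```python
import json
from collections import Counter


def load_bom(path: str) -> dict:
    """Load a CycloneDX JSON document from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def purl_type_counts(bom: dict) -> Counter:
    """Count components by purl type (the text between 'pkg:' and the first '/')."""
    counts: Counter = Counter()
    for component in bom.get("components", []):
        purl = component.get("purl")
        if not purl or not purl.startswith("pkg:"):
            # File entries and other components may carry no purl at all.
            counts["(no purl)"] += 1
            continue
        counts[purl[len("pkg:"):].split("/", 1)[0]] += 1
    return counts


# Tiny illustrative document; names and versions are made up.
sample = {
    "components": [
        {"type": "library", "name": "fastapi", "purl": "pkg:pypi/fastapi@0.0.0"},
        {"type": "library", "name": "httpx", "purl": "pkg:pypi/httpx@0.0.0"},
        {"type": "library", "name": "busybox", "purl": "pkg:apk/wolfi/busybox@0.0.0"},
        {"type": "file", "name": "libc.so.6", "purl": "pkg:generic/libc.so.6"},
        {"type": "file", "name": "/etc/os-release"},
    ]
}
print(dict(purl_type_counts(sample)))
```

Running `purl_type_counts(load_bom("image.cdx.json"))` against the output of each scanner gives a quick side-by-side check; the file name here is a placeholder.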

Why this isn't Python-specific

The pattern affects every interpreted ecosystem where dependencies live in app-readable form (venv, node_modules, gem caches, vendored directories). In the existing test fixture tests/test-data/cgr.dev_chainguard_wolfi-base_latest_cdxgen.cdx.json, all 27 components are pkg:generic/* with cdx:bom:componentTypes: "generic", suggesting that cdxgen's OCI walker produces real purls, if at all, only for OS packages (apk/deb/rpm).

| Ecosystem in image | cdxgen -t oci | Syft |
|---|---|---|
| OS packages (apk/deb/rpm) | Good purls | Good purls |
| Python (venv .dist-info) | Generic shims | Proper pypi purls |
| Node.js (node_modules) | Generic shims | Proper npm purls |
| Ruby (gem index) | Likely generic | Proper gem purls |
| Java (JARs) | JAR manifest parsing | Worth testing both |
| Go / Rust (static binaries) | n/a (no in-image data) | n/a (no in-image data) |

For Java specifically, cdxgen has explicit JAR-manifest parsing and may genuinely beat Syft on JAR-heavy images — that case warrants benchmarking before any wholesale removal of CdxgenImageGenerator.

Proposed fix

Three options, in increasing aggressiveness:

Option A — Swap priorities

CdxgenImageGenerator.priority = 40 (or higher), SyftImageGenerator.priority = 20. Syft becomes the default; cdxgen remains a fallback if Syft is unavailable. This is a one-line change in each of _generation/generators/syft.py and _generation/generators/cdxgen.py.

Option B — Per-ecosystem priority

If JAR images deserve cdxgen, gate via supports(): cdxgen-image returns True only when the image looks JVM-flavored (detect via docker inspect env vars, or via a --prefer-cdxgen-jvm env flag). Syft becomes the default for everything else.
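One way the gating in Option B might look is sketched below. Everything here is an assumption for illustration: the function names, the `prefer_cdxgen_jvm` parameter (standing in for whatever form the opt-in flag takes), and the choice of `JAVA_HOME` and `JAVA_VERSION` as JVM markers. The input is the list of `KEY=value` strings that `docker inspect` reports under `.Config.Env`.

```python
# Hypothetical marker list; real JVM images vary, so treat this as a starting guess.
JVM_ENV_MARKERS = ("JAVA_HOME", "JAVA_VERSION")


def looks_jvm_flavored(image_env: list[str]) -> bool:
    """Return True if any 'KEY=value' entry from the image config names a JVM marker."""
    keys = {entry.split("=", 1)[0] for entry in image_env}
    return any(marker in keys for marker in JVM_ENV_MARKERS)


def cdxgen_image_supports(image_env: list[str], prefer_cdxgen_jvm: bool = False) -> bool:
    """Hypothetical supports() check: opt in explicitly or via detected JVM markers."""
    return prefer_cdxgen_jvm or looks_jvm_flavored(image_env)


python_image_env = ["PATH=/usr/bin", "PYTHONUNBUFFERED=1"]
java_image_env = ["PATH=/usr/bin", "JAVA_HOME=/usr/lib/jvm/default"]

print(cdxgen_image_supports(python_image_env))
print(cdxgen_image_supports(java_image_env))
```

Environment-variable detection is a heuristic and will miss images that ship a JRE without setting these variables, which is one reason an explicit opt-in is worth keeping alongside it.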

Option C — Remove CdxgenImageGenerator entirely

If JAR images turn out not to be a real cdxgen advantage in practice (benchmark TBD), retire the image-mode cdxgen path. Lockfile-mode cdxgen (CdxgenFsGenerator) is unaffected and still the default for filesystem/lockfile scans — that path remains valuable for multi-ecosystem lockfile resolution.

Recommended starting point: Option A (priority swap). It is reversible, changes one line per generator, and gives an immediately measurable improvement for the vast majority of users.

Adjacent context

PR #225 already moves toward Syft as the image scanner inside its docker-hub-upstream + syft-overlay merge path (cli/main.py:1258-1267). This issue generalizes that direction to the default image-scan path — every image scan benefits from Syft, not just library/* and dhi.io/* ones.

A complementary issue (filed separately) proposes extending the Chainguard auto-detection branch to merge with Syft as well. Today that branch bypasses Syft entirely, dropping app content added via COPY; fixing it is a structural change in cli/main.py:1153-1198. This issue covers only the non-detected path.

Reproducer

git clone https://github.com/nissessenap/sbom-generation-example
cd sbom-generation-example
make image-python                  # build the Chainguard-based Python image
make sbomify-python                # lockfile mode → 17 pypi components
make sbomify-image-python          # image mode → 2 pypi + 76 generic

# Then run syft directly for comparison:
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  --entrypoint syft sbomifyhub/sbomify-action:26.2.0 \
  sbom-generation-example/python:dev -o cyclonedx-json \
  | jq '[.components[] | select((.purl // "") | startswith("pkg:pypi"))] | length'
# → 17
