bakery ci publish: intermittent digest 'not found' from GHCR (eventual consistency, no retry in publish path) #591

@ianpittwood

Summary

bakery ci publish (formerly bakery ci merge) intermittently fails with a digest-addressed manifest reported as not found by GHCR. The failures are ephemeral — re-running the merge job alone almost always succeeds — which points squarely at GHCR read-after-write (eventual consistency) on manifests that we push by digest, combined with the fact that there is no retry/backoff anywhere in the publish path.

Example failures

  • SOCI convert: SOCI convert failed for 'ImageTarget(image='workbench', version='2026.05.0+218.pro1', ...)': Error from source registry for "ghcr.io/posit-dev/workbench/tmp:workbench-...ba39112170": sha256:7e8001fdfc...: not found (run 27354249233)
  • oras fetch: Error response from registry: could not find the manifest ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:9a5a07a1...: not found (run 27024890914)
  • oras fetch: ... ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:a734c599...: not found (run 27024930267)

Root cause analysis

The mechanism

  1. Per-platform builds push by digest (untagged). In posit_bakery/image/image_target.py (build()), when a temp registry is set, the build output is overridden to:

    output = {"type": "image", "push-by-digest": True, "name-canonical": True, "push": True}

    Each platform's manifest lands in ghcr.io/posit-dev/<image>/tmp, addressable only by digest (no tag). The digest is recorded in the per-platform metadata artifact (image_metadata.py: <repo>/tmp@sha256:...).

  2. The platforms build on separate runners. Per .github/workflows/bakery-build-native.yml, build-test is a matrix with linux/amd64 and linux/arm64 on different runners (different machine class/region). Each pushes its manifest + blobs to GHCR independently.

  3. The merge job references those digests immediately. bakery ci publish (posit_bakery/cli/ci.py) then runs the following phases, all addressing the sources by digest:

    • Phase 1 oras manifest index create — builds the index from get_merge_sources() (the per-platform ...@sha256: refs).
    • Phase 2 SOCI convert — oras cp <index> --to-oci-layout pulls the index and every child manifest/blob down.
    • Phase 3/4 index-copy + verify.

  4. There is zero retry/backoff anywhere in this path. Every subprocess.run in oras.py / soci.py raises hard on the first non-zero exit.

Why it's intermittent — and why a re-run fixes it

All observed errors are the same shape: a digest-addressed manifest reported not found by GHCR. The decisive clue is that re-running the merge job alone succeeds — it re-references the exact same digests. If those manifests had been garbage-collected, the re-run would fail identically. It doesn't, so the manifests are durably present; they were merely transiently unreadable when the merge runner first asked for them.

That is the signature of GHCR eventual consistency: a manifest/blob pushed by digest from runner A isn't guaranteed to be immediately readable by digest from runner B. Pushing by digest with no tag is the worst case — there's no tag write to act as a consistency checkpoint, and oras does HEAD/GET-by-digest lookups that 404 until the backend replicates. By the time the job is re-run, propagation has completed.

(This is diagnosis-by-evidence, not a reproduced failure — GHCR's internal replication can't be reproduced on demand. But the evidence chain — always digest-not-found, always self-heals on plain re-run, separate-runner pushes, no retry — is consistent and rules out permanent GC/deletion.)
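
The reasoning above can be illustrated with a toy model. This is only a sketch: LaggingRegistry is a hypothetical stand-in, not GHCR's real behavior, and it models propagation lag as a number of missed reads rather than elapsed time. It shows why a lagging manifest and a deleted manifest look identical on the first attempt but differ on a plain re-run.

```python
class LaggingRegistry:
    """Toy registry with read-after-write lag (hypothetical, for illustration)."""

    def __init__(self, lag_reads: int):
        self._lag_reads = lag_reads  # reads that miss before a push becomes visible
        self._manifests = {}         # digest -> remaining missed reads

    def push(self, digest: str) -> None:
        self._manifests[digest] = self._lag_reads

    def fetch(self, digest: str) -> str:
        if digest not in self._manifests:
            # Never pushed, or garbage-collected: this miss is permanent.
            raise LookupError(f"{digest}: not found")
        if self._manifests[digest] > 0:
            # Pushed but not yet visible to this reader: this miss is transient.
            self._manifests[digest] -= 1
            raise LookupError(f"{digest}: not found")
        return digest


def attempt(registry: LaggingRegistry, digest: str) -> str:
    """Return 'ok' or 'not found', mimicking a single merge-job attempt."""
    try:
        registry.fetch(digest)
        return "ok"
    except LookupError:
        return "not found"


registry = LaggingRegistry(lag_reads=1)
registry.push("sha256:aaa")

first_run = attempt(registry, "sha256:aaa")  # merge job's first attempt
rerun = attempt(registry, "sha256:aaa")      # plain re-run, same digest
deleted_first = attempt(registry, "sha256:bbb")  # digest that does not exist
deleted_rerun = attempt(registry, "sha256:bbb")  # re-run fails identically
```

In this model the lagging digest fails once and then succeeds, while the missing digest fails on every attempt, which matches the observation that re-runs against the same digests succeed.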

Proposed solutions

Ranked. The clean fix lives in bakery, not the workflow.

  1. Retry-with-backoff around the registry-touching oras/soci calls (recommended core fix). Wrap .run() of OrasManifestIndexCreate, the SOCI oras cp --to-oci-layout pull, OrasCopy, and OrasManifestFetch so that transient registry errors (not found / manifest unknown / 5xx / timeouts) retry a handful of times with exponential backoff (e.g. 5 attempts, 2s→32s). Non-transient errors still fail fast. This directly absorbs propagation lag. Retry count/backoff should be configurable.

  2. Explicit pre-flight wait in publish (complements #1, better diagnostics). Before Phase 1, poll each source digest with oras manifest fetch --descriptor until all resolve (or timeout). Turns "hope it's propagated" into condition-based waiting and logs which digest was lagging.

  3. Workflow-level retry (stopgap only). Wrap the Publish step with nick-fields/retry. Cheap, but re-runs the whole publish and masks the real fix.

  4. Push per-platform images to real tags (tmp:<uid>-<platform>) instead of by-digest. Would also improve consistency, but a bigger change with its own GC/cleanup implications. Not the lead recommendation.
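
As a rough illustration of option 1, a retry wrapper might look like the sketch below. It is not bakery's actual code: run_with_retry, is_transient, and TRANSIENT_MARKERS are hypothetical names, the marker strings are guesses that would need tuning against real oras/soci stderr, and the runner and sleep parameters exist only so the behavior can be tested without a registry.

```python
import subprocess
import time

# Hypothetical substrings that mark an error as retryable. The real list would
# need to be derived from actual oras/soci error output.
TRANSIENT_MARKERS = ("not found", "manifest unknown", "timed out", "service unavailable")


def is_transient(stderr: str) -> bool:
    """Return True if the error text looks like a transient registry failure."""
    text = stderr.lower()
    return any(marker in text for marker in TRANSIENT_MARKERS)


def run_with_retry(cmd, attempts=5, base_delay=2.0, max_delay=32.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying transient failures with exponential backoff.

    Non-transient failures, and the final failed attempt, raise
    subprocess.CalledProcessError immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt == attempts or not is_transient(result.stderr):
            raise subprocess.CalledProcessError(
                result.returncode, cmd, output=result.stdout, stderr=result.stderr
            )
        sleep(delay)                       # wait before the next attempt
        delay = min(delay * 2, max_delay)  # double the wait, up to the cap
```

One caveat with this design: "not found" can also be a genuine, permanent error (a wrong digest, for example), so the retry budget only delays that failure rather than hiding it. Keeping attempts and delays as parameters leaves room to make them configurable.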

Recommendation: implement #1 as the core fix (with retry/backoff configurable), optionally #2 for diagnostics. Both should be developed with tests that simulate a transient not found followed by success.
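
For option 2, the pre-flight wait could be sketched roughly as follows. The function names, default timeout, and polling interval are all assumptions made for illustration; only the oras manifest fetch --descriptor probe comes from the proposal above. The probe, clock, and sleep parameters are injectable so that a test can simulate a transient not found followed by success.

```python
import subprocess
import time


def digest_is_readable(ref: str) -> bool:
    """Return True if `oras manifest fetch --descriptor <ref>` exits 0."""
    result = subprocess.run(
        ["oras", "manifest", "fetch", "--descriptor", ref],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def wait_for_digests(refs, timeout=120.0, interval=5.0,
                     probe=digest_is_readable,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll until every ref resolves.

    Returns the sorted list of refs that were unreadable at least once, so the
    caller can log which digests lagged. Raises TimeoutError if any ref is
    still unreadable once the deadline has passed.
    """
    pending = list(refs)
    lagged = set()
    deadline = clock() + timeout
    while True:
        # Re-probe only the refs that have not resolved yet.
        pending = [ref for ref in pending if not probe(ref)]
        lagged.update(pending)
        if not pending:
            return sorted(lagged)
        if clock() >= deadline:
            raise TimeoutError(f"digests still unreadable: {pending}")
        sleep(interval)
```

A caller in publish() could run this on the merge sources before Phase 1 and log the returned list, which would make the lagging digest visible in the job output.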

Affected files

  • posit-bakery/posit_bakery/plugins/builtin/oras/oras.py: OrasCommand.run, index-create/copy/fetch workflows
  • posit-bakery/posit_bakery/plugins/builtin/soci/soci.py: SociCommand.run, SociConvertWorkflow
  • posit-bakery/posit_bakery/cli/ci.py: publish() orchestration (pre-flight wait would go here)

Metadata

    Labels

    bug (Something isn't working)
    cvp:0 (Necessary projects we are undertaking that don’t directly deliver value to the customer)
    docker (Related to container images we produce)
    tdp:3 (Other teams or customers indirectly notice.)
