Summary
bakery ci publish (formerly bakery ci merge) intermittently fails with a digest-addressed manifest reported as not found by GHCR. The failures are ephemeral — re-running the merge job alone almost always succeeds — which points squarely at GHCR read-after-write (eventual consistency) on manifests that we push by digest, combined with the fact that there is no retry/backoff anywhere in the publish path.
Example failures
SOCI convert: SOCI convert failed for 'ImageTarget(image='workbench', version='2026.05.0+218.pro1', ...)': Error from source registry for "ghcr.io/posit-dev/workbench/tmp:workbench-...ba39112170": sha256:7e8001fdfc...: not found (run 27354249233)
oras fetch: Error response from registry: could not find the manifest ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:9a5a07a1...: not found (run 27024890914)
oras fetch: ... ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:a734c599...: not found (run 27024930267)
Root cause analysis
The mechanism
Per-platform builds push by digest (untagged). In posit_bakery/image/image_target.py (build()), when a temp registry is set, the build output is overridden so that the image is pushed to that registry by digest only, without a tag.
Each platform's manifest lands in ghcr.io/posit-dev/<image>/tmp addressable only by digest (no tag). The digest is recorded into the per-platform metadata artifact (image_metadata.py → <repo>/tmp@sha256:...).
The platforms build on separate runners. Per .github/workflows/bakery-build-native.yml, build-test is a matrix with linux/amd64 and linux/arm64 on different runners (different machine class/region). Each pushes its manifest + blobs to GHCR independently.
The merge job references those digests immediately. bakery ci publish (posit_bakery/cli/ci.py) then runs the following phases, addressing every source by digest:
Phase 1, oras manifest index create: builds the index from get_merge_sources() (the per-platform ...@sha256: refs).
Phase 2, SOCI convert: oras cp <index> --to-oci-layout pulls the index and every child manifest/blob down.
Phase 3/4, index-copy + verify.
There is zero retry/backoff anywhere in this path. Every subprocess.run in oras.py / soci.py raises hard on the first non-zero exit.
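The fail-fast behaviour described above can be illustrated with a minimal sketch. This is not the code in oras.py or soci.py; `run_fail_fast` and the `flaky` command are hypothetical stand-ins, assuming only that the real wrappers call subprocess.run once and raise on a non-zero exit.

```python
import subprocess
import sys


def run_fail_fast(cmd: list[str]) -> str:
    """Run a command exactly once; any non-zero exit raises CalledProcessError."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical stand-in for a registry call whose first attempt reports "not found".
flaky = [sys.executable, "-c", "import sys; sys.stderr.write('not found'); sys.exit(1)"]

try:
    run_fail_fast(flaky)
except subprocess.CalledProcessError as exc:
    # With no retry, a single transient error ends the whole job here.
    print(f"failed on first attempt: exit={exc.returncode} stderr={exc.stderr!r}")
```

With this shape, one transient registry error is enough to fail the publish, even if the same command would succeed a few seconds later.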
Why it's intermittent — and why a re-run fixes it
All observed errors are the same shape: a digest-addressed manifest reported not found by GHCR. The decisive clue is that re-running the merge job alone succeeds — it re-references the exact same digests. If those manifests had been garbage-collected, the re-run would fail identically. It doesn't, so the manifests are durably present; they were merely transiently unreadable when the merge runner first asked for them.
That is the signature of GHCR eventual consistency: a manifest/blob pushed by digest from runner A isn't guaranteed to be immediately readable by digest from runner B. Pushing by digest with no tag is the worst case — there's no tag write to act as a consistency checkpoint, and oras does HEAD/GET-by-digest lookups that 404 until the backend replicates. By the time the job is re-run, propagation has completed.
(This is diagnosis-by-evidence, not a reproduced failure — GHCR's internal replication can't be reproduced on demand. But the evidence chain — always digest-not-found, always self-heals on plain re-run, separate-runner pushes, no retry — is consistent and rules out permanent GC/deletion.)
Proposed solutions
Ranked. The clean fix lives in bakery, not the workflow.
1. Retry-with-backoff around the registry-touching oras/soci calls (recommended core fix). Wrap .run() of OrasManifestIndexCreate, the SOCI oras cp --to-oci-layout pull, OrasCopy, and OrasManifestFetch so that transient registry errors (not found / manifest unknown / 5xx / timeouts) retry a handful of times with exponential backoff (e.g. 5 attempts, 2s→32s). Non-transient errors still fail fast. This directly absorbs propagation lag. Retry count/backoff should be configurable.
2. Explicit pre-flight wait in publish (complements #1, better diagnostics). Before Phase 1, poll each source digest with oras manifest fetch --descriptor until all resolve (or timeout). Turns "hope it's propagated" into condition-based waiting and logs which digest was lagging.
3. Workflow-level retry (stopgap only). Wrap the Publish step with nick-fields/retry. Cheap, but re-runs the whole publish and masks the real fix.
4. Push per-platform images to real tags (tmp:<uid>-<platform>) instead of by-digest. Would also improve consistency, but a bigger change with its own GC/cleanup implications. Not the lead recommendation.
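As a rough sketch of the retry-with-backoff option, assuming the oras/soci wrappers can be routed through a single helper: `run_with_retry`, `is_transient`, and `TRANSIENT_MARKERS` are hypothetical names, and the marker strings are guesses that would need checking against real oras/soci stderr output. The `runner` and `sleep` parameters exist only so the behaviour can be tested without a registry.

```python
import subprocess
import time
from typing import Callable, Sequence

# Assumed substrings that mark a transient registry error (not verified
# against actual oras/soci output).
TRANSIENT_MARKERS = ("not found", "manifest unknown", "timeout", "502", "503", "504")


def is_transient(stderr: str) -> bool:
    """Return True if stderr looks like a retryable registry error."""
    lowered = stderr.lower()
    return any(marker in lowered for marker in TRANSIENT_MARKERS)


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 5,
    base_delay: float = 2.0,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = runner(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Fail fast on non-transient errors, or once the retry budget is spent.
        if attempt == attempts or not is_transient(result.stderr):
            raise subprocess.CalledProcessError(
                result.returncode, list(cmd), result.stdout, result.stderr
            )
        # Delay doubles each time: base_delay, 2x, 4x, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the defaults shown, the waits between the five attempts are 2, 4, 8, and 16 seconds; reaching a 32 second wait would need a sixth attempt. Because "not found" is treated as transient, a manifest that is genuinely missing is also retried and only fails once all attempts are used up.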
Recommendation: implement #1 as the core fix (with retry/backoff configurable), optionally #2 for diagnostics. Both should be developed with tests that simulate a transient not found followed by success.
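One possible shape for the pre-flight wait, together with the kind of test double the recommendation calls for, is sketched below. `wait_for_digests` and `FlakyRegistry` are hypothetical names, and the refs are placeholders. In a real implementation the `resolves` callback would presumably shell out to oras manifest fetch --descriptor for each source; here it is injected so a transient not found followed by success can be simulated without a registry.

```python
import time
from typing import Callable, Iterable


def wait_for_digests(
    refs: Iterable[str],
    resolves: Callable[[str], bool],
    timeout: float = 120.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, int]:
    """Poll until every ref resolves; return how many polls each ref needed."""
    pending = list(refs)
    polls = {ref: 0 for ref in pending}
    deadline = clock() + timeout
    while pending:
        still_pending = []
        for ref in pending:
            polls[ref] += 1
            if not resolves(ref):
                still_pending.append(ref)
        pending = still_pending
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"digests still unresolved: {pending}")
        sleep(interval)
    return polls


class FlakyRegistry:
    """Test double: each ref reports 'not found' a set number of times, then resolves."""

    def __init__(self, misses: dict[str, int]):
        self.misses = dict(misses)

    def resolves(self, ref: str) -> bool:
        if self.misses.get(ref, 0) > 0:
            self.misses[ref] -= 1
            return False
        return True


# Placeholder refs: the first one lags for two polls, the second is ready at once.
refs = ["ghcr.io/example/tmp@sha256:aaa", "ghcr.io/example/tmp@sha256:bbb"]
registry = FlakyRegistry({refs[0]: 2, refs[1]: 0})
polls = wait_for_digests(refs, registry.resolves, sleep=lambda _seconds: None)
print(polls)
```

Any ref with a poll count above one was lagging, which is the diagnostic the pre-flight wait is meant to surface in the logs.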
Affected files
posit-bakery/posit_bakery/plugins/builtin/oras/oras.py: OrasCommand.run, index-create/copy/fetch workflows
posit-bakery/posit_bakery/plugins/builtin/soci/soci.py: SociCommand.run, SociConvertWorkflow
posit-bakery/posit_bakery/cli/ci.py: publish() orchestration (pre-flight wait would go here)