bakery ci publish: intermittent digest 'not found' from GHCR (eventual consistency, no retry in publish path) #591

## Summary

`bakery ci publish` (formerly `bakery ci merge`) intermittently fails when GHCR reports a digest-addressed manifest as `not found`. The failures are transient: **re-running the merge job alone almost always succeeds**. That points to a GHCR read-after-write gap (eventual consistency) on manifests that we push *by digest*, compounded by the fact that there is **no retry/backoff anywhere in the publish path**.

### Example failures

- SOCI convert: `SOCI convert failed for 'ImageTarget(image='workbench', version='2026.05.0+218.pro1', ...)': Error from source registry for "ghcr.io/posit-dev/workbench/tmp:workbench-...ba39112170": sha256:7e8001fdfc...: not found` ([run 27354249233](https://github.com/posit-dev/images-workbench/actions/runs/27354249233/job/80829099160))
- oras fetch: `Error response from registry: could not find the manifest ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:9a5a07a1...: not found` ([run 27024890914](https://github.com/posit-dev/images-workbench/actions/runs/27024890914/job/79764464152))
- oras fetch: `... ghcr.io/posit-dev/workbench-session-init/tmp:latest@sha256:a734c599...: not found` ([run 27024930267](https://github.com/posit-dev/images-workbench/actions/runs/27024930267/job/79765679957))

## Root cause analysis

### The mechanism

1. **Per-platform builds push *by digest* (untagged).** In `posit_bakery/image/image_target.py` (`build()`), when a temp registry is set, the build output is overridden to:
   ```python
   output = {"type": "image", "push-by-digest": True, "name-canonical": True, "push": True}
   ```
   Each platform's manifest lands in `ghcr.io/posit-dev/<image>/tmp` addressable **only by digest** (no tag). The digest is recorded into the per-platform metadata artifact (`image_metadata.py` → `<repo>/tmp@sha256:...`).

2. **The platforms build on separate runners.** Per `.github/workflows/bakery-build-native.yml`, `build-test` is a matrix with `linux/amd64` and `linux/arm64` on different runners (different machine class/region). Each pushes its manifest + blobs to GHCR independently.

3. **The `merge` job references those digests immediately.** `bakery ci publish` (`posit_bakery/cli/ci.py`) then, *by digest*:
   - **Phase 1** `oras manifest index create` — builds the index from `get_merge_sources()` (the per-platform `...@sha256:` refs).
   - **Phase 2** SOCI convert — `oras cp <index> --to-oci-layout` pulls the index *and every child manifest/blob* down.
   - **Phase 3/4** index-copy + verify.

4. **There is zero retry/backoff** anywhere in this path. Every `subprocess.run` in `oras.py` / `soci.py` raises hard on the first non-zero exit.

### Why it's intermittent — and why a re-run fixes it

All observed errors are the same shape: a **digest-addressed** manifest reported `not found` by GHCR. The decisive clue is that **re-running the merge job alone succeeds** — it re-references the *exact same digests*. If those manifests had been garbage-collected, the re-run would fail identically. It doesn't, so the manifests **are durably present**; they were merely transiently unreadable when the merge runner first asked for them.

That is the signature of **GHCR eventual consistency**: a manifest/blob pushed by digest from runner A isn't guaranteed to be immediately readable by digest from runner B. Pushing *by digest with no tag* is the worst case — there's no tag write to act as a consistency checkpoint, and `oras` does HEAD/GET-by-digest lookups that 404 until the backend replicates. By the time the job is re-run, propagation has completed.

(This is diagnosis-by-evidence, not a reproduced failure — GHCR's internal replication can't be reproduced on demand. But the evidence chain — always digest-not-found, always self-heals on plain re-run, separate-runner pushes, no retry — is consistent and rules out permanent GC/deletion.)
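The re-run argument can be made concrete with a toy model. This is a minimal sketch, not GHCR's actual behavior: `ToyRegistry`, `lag_reads`, and `try_fetch` are hypothetical names, and propagation lag is modeled as a fixed number of missed reads so the example stays deterministic.

```python
import hashlib


class ToyRegistry:
    """Toy registry whose reads lag behind writes (an assumption, not GHCR internals)."""

    def __init__(self, lag_reads: int) -> None:
        self.lag_reads = lag_reads          # reads that miss before a push becomes visible
        self._store: dict[str, bytes] = {}  # durable storage, keyed by digest
        self._misses: dict[str, int] = {}   # misses served so far, per digest

    def push_by_digest(self, manifest: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(manifest).hexdigest()
        self._store[digest] = manifest
        self._misses[digest] = 0
        return digest

    def delete(self, digest: str) -> None:
        # Models garbage collection: the manifest is permanently gone.
        self._store.pop(digest, None)

    def fetch(self, digest: str) -> bytes:
        if digest not in self._store:
            raise LookupError(f"{digest}: not found")
        if self._misses[digest] < self.lag_reads:
            # Stored durably, but not yet visible to this reader.
            self._misses[digest] += 1
            raise LookupError(f"{digest}: not found")
        return self._store[digest]


def try_fetch(registry: ToyRegistry, digest: str) -> bool:
    """Return True if the digest resolves, False on 'not found'."""
    try:
        registry.fetch(digest)
        return True
    except LookupError:
        return False
```

In this model a lagging manifest and a deleted manifest produce the same error on the first read, but only the lagging one resolves when the same digest is requested again. That is the distinction the re-run evidence relies on.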

## Proposed solutions

Ranked. The clean fix lives in bakery, not the workflow.

1. **Retry-with-backoff around the registry-touching oras/soci calls** (recommended core fix). Wrap `.run()` of `OrasManifestIndexCreate`, the SOCI `oras cp --to-oci-layout` pull, `OrasCopy`, and `OrasManifestFetch` so that *transient* registry errors (`not found` / `manifest unknown` / 5xx / timeouts) are retried a bounded number of times with exponential backoff (e.g. 5 retries, with delays doubling 2s→32s). Non-transient errors still fail fast, and a `not found` that persists past the last retry is surfaced as a real failure. This directly absorbs propagation lag. Retry count/backoff should be configurable.

2. **Explicit pre-flight wait in `publish`** (complements #1, better diagnostics). Before Phase 1, poll each source digest with `oras manifest fetch --descriptor` until all resolve (or timeout). Turns "hope it's propagated" into condition-based waiting and logs *which* digest was lagging.

3. **Workflow-level retry** (stopgap only). Wrap the Publish step with `nick-fields/retry`. Cheap, but re-runs the whole publish and masks the real fix.

4. **Push per-platform images to real tags** (`tmp:<uid>-<platform>`) instead of by-digest. Would also improve consistency, but a bigger change with its own GC/cleanup implications. Not the lead recommendation.
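Option 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not bakery's existing code: `run_with_retry`, `is_transient`, and `TRANSIENT_PATTERN` are hypothetical names, and the error-matching pattern is a guess at what the oras/soci stderr contains. The `run` and `sleep` parameters are injectable so the behavior can be tested without a registry.

```python
import re
import subprocess
import time
from typing import Callable, Sequence

# Assumed markers of a possibly transient registry error; tune against real stderr.
TRANSIENT_PATTERN = re.compile(
    r"not found|manifest unknown|timed? ?out|\b5\d\d\b", re.IGNORECASE
)


def is_transient(stderr: str) -> bool:
    """Heuristic: does this error output look like propagation lag or a 5xx?"""
    return TRANSIENT_PATTERN.search(stderr) is not None


def run_with_retry(
    cmd: Sequence[str],
    retries: int = 5,
    base_delay: float = 2.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying transient failures with exponential backoff."""
    for attempt in range(retries + 1):
        result = run(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Fail fast on non-transient errors, or once retries are exhausted.
        if attempt == retries or not is_transient(result.stderr):
            raise subprocess.CalledProcessError(
                result.returncode, list(cmd), result.stdout, result.stderr
            )
        sleep(base_delay * 2**attempt)  # 2s, 4s, 8s, 16s, 32s with the defaults
    raise AssertionError("unreachable when retries >= 0")
```

A call site might then read `run_with_retry(["oras", "manifest", "fetch", "--descriptor", ref])`. Because `not found` is also what a genuinely missing manifest returns, the retry budget is bounded and the final failure is raised unchanged.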

Recommendation: implement **#1** as the core fix (with retry/backoff configurable), optionally **#2** for diagnostics. Both should be developed with tests that simulate a transient `not found` followed by success.
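The pre-flight wait from #2, and the kind of test described above, might be sketched as follows. `wait_for_digests` and `oras_probe` are hypothetical names, and the timeout and interval defaults are placeholders. The probe is passed in as a callable, so a test can substitute one that reports a transient `not found` followed by success.

```python
import subprocess
import time
from typing import Callable, Iterable


def oras_probe(ref: str) -> bool:
    """Assumed probe: True if `oras manifest fetch --descriptor` resolves the ref."""
    result = subprocess.run(
        ["oras", "manifest", "fetch", "--descriptor", ref],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def wait_for_digests(
    refs: Iterable[str],
    probe: Callable[[str], bool] = oras_probe,
    timeout: float = 120.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, int]:
    """Poll until every ref resolves; return how many probes each ref needed."""
    pending = list(refs)
    probes = {ref: 0 for ref in pending}
    deadline = clock() + timeout
    while pending:
        still_pending = []
        for ref in pending:
            probes[ref] += 1
            if not probe(ref):
                still_pending.append(ref)
        pending = still_pending
        if not pending:
            break
        if clock() >= deadline:
            # Name the lagging digests so the failure is diagnosable.
            raise TimeoutError(f"digests still unreadable after {timeout}s: {pending}")
        sleep(interval)
    return probes
```

The returned probe counts show which digest lagged and by how many polls, which is the diagnostic that #2 is meant to add. In a unit test, a fake `probe` that returns `False` for the first few calls and a no-op `sleep` are enough to exercise both the success and timeout paths.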

## Affected files

- `posit-bakery/posit_bakery/plugins/builtin/oras/oras.py` — `OrasCommand.run`, index-create/copy/fetch workflows
- `posit-bakery/posit_bakery/plugins/builtin/soci/soci.py` — `SociCommand.run`, `SociConvertWorkflow`
- `posit-bakery/posit_bakery/cli/ci.py` — `publish()` orchestration (pre-flight wait would go here)

