
feat: parallelize ci publish with retry via a unified imagetools plugin#599

Draft
ianpittwood wants to merge 2 commits into main from feat/parallel-calls-publish

ianpittwood commented Jun 11, 2026

Supersedes #598

What & why

bakery ci publish ran its four-phase pipeline (oras index-create → soci convert → oras index-copy → verify) serially for each target, with no retry. As a result, the transient GHCR eventual-consistency failures in #591 ("not found" / "manifest unknown" on freshly pushed digests) failed the whole job and only self-healed on a manual re-run.

Building on the parallel module from feat/parallel-calls (#588), this PR:

  • Parallelizes publish. Each target's create → soci → copy → verify sequence runs as one job on the parallel executor; independent targets publish concurrently (bounded workers, live progress table, Ctrl-C-safe process-group termination).
  • Adds retry-with-backoff. Every registry command retries on transient errors (5 attempts, exponential 2s→32s + jitter); permanent auth/reference errors fail fast. The capability lives in the parallel module so future flaky commands can reuse it.
  • Unifies oras + soci into one imagetools plugin. A single plugin owns the whole pipeline and registers oras merge, soci convert, and imagetools publish (command names are preserved); ci publish and ci merge route through it. tool: soci options still parse, because the tool-options registry is now keyed by each options class's tool discriminator.
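To make the retry bullet concrete, here is a minimal Python sketch of a retry-with-backoff loop. It assumes "2s→32s" means a 2s base delay that doubles on each retry and is capped at 32s, with up to 1s of uniform jitter; the RetryPolicy fields and run_with_retry shown here are hypothetical, not bakery's actual API. The transient/permanent decision is injected as a predicate, so a permanent error raises on the first attempt.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy: 5 attempts, delay 2s doubling per retry, capped at 32s, plus jitter."""
    max_attempts: int = 5
    base_delay: float = 2.0
    max_delay: float = 32.0
    jitter: float = 1.0  # upper bound (seconds) of the random jitter added to each delay

    def delay(self, retry_index: int, rng: random.Random) -> float:
        """Delay before retry `retry_index` (0-based): base * 2**i, capped, plus jitter."""
        backoff = min(self.base_delay * (2 ** retry_index), self.max_delay)
        return backoff + rng.uniform(0, self.jitter)


def run_with_retry(
    command: Callable[[], str],
    is_transient: Callable[[Exception], bool],
    policy: RetryPolicy = RetryPolicy(),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    """Call `command` until it succeeds, retrying only errors the predicate deems transient."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    rng = rng or random.Random()
    for attempt in range(policy.max_attempts):
        try:
            return command()
        except Exception as exc:  # KeyboardInterrupt is not an Exception, so Ctrl-C still aborts
            last_attempt = attempt == policy.max_attempts - 1
            if last_attempt or not is_transient(exc):
                raise  # permanent errors fail fast; transient ones give up after the last attempt
            sleep(policy.delay(attempt, rng))
    raise AssertionError("unreachable: every attempt returns or raises")
```

With 5 attempts there are at most 4 waits (about 2, 4, 8 and 16s plus jitter under these assumptions). Injecting sleep and rng keeps the loop testable without real delays.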

Parallel module additions

Adds RetryPolicy, CommandResult, CommandRunner, ShellJob, JobResult, and run_jobs() alongside the existing ShellTask/run() path used by dgoss. Both paths share one _spawn_and_communicate primitive, so timeout, termination, and interrupt behavior is identical.
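A rough sketch of how ShellJob, JobResult and run_jobs() could fit together: each job runs its steps serially and stops at the first failing step, while independent jobs share a bounded thread pool. The field names and signatures here are assumptions for illustration only; the real module also wraps commands in retries and adds a live progress table and process-group termination, all omitted here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class ShellJob:
    """Hypothetical: one target's ordered steps (e.g. create -> soci -> copy -> verify)."""
    name: str
    steps: Sequence[Sequence[str]]  # each step is an argv list


@dataclass
class JobResult:
    """Hypothetical per-job outcome."""
    name: str
    ok: bool
    failed_step: Optional[int] = None  # index of the first failing step, if any
    outputs: List[str] = field(default_factory=list)


def _run_job(job: ShellJob) -> JobResult:
    result = JobResult(name=job.name, ok=True)
    for index, argv in enumerate(job.steps):
        proc = subprocess.run(list(argv), capture_output=True, text=True)
        result.outputs.append(proc.stdout)
        if proc.returncode != 0:
            # Steps within one job are serial: later steps depend on earlier ones.
            result.ok = False
            result.failed_step = index
            break
    return result


def run_jobs(jobs: Sequence[ShellJob], max_workers: int = 4) -> List[JobResult]:
    """Run independent jobs concurrently with bounded workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_run_job, jobs))
```

Keeping a whole per-target sequence inside one job means a failure in one target stops only that target's later steps, while other targets keep publishing.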

Verification

  • just test: 1769 passed; ruff check and ruff format clean.
  • Discovery: plugins = {dgoss, hadolint, imagetools, wizcli}; tool options include soci.
  • CLI smoke test: oras merge, soci convert, ci publish, ci merge, and imagetools publish all register.
  • A live bakery ci publish --dry-run could not be run because local command policy blocks it; the BDD ci merge scenario and the no-mock dry-run integration tests cover the end-to-end path instead.

🤖 Generated with Claude Code

Apply the parallel execution module to `bakery ci publish`: each target's
oras index-create -> soci convert -> index-copy -> verify sequence now runs as
one job on the parallel executor, so independent targets publish concurrently,
and every registry command is wrapped in retry-with-backoff to absorb the
transient GHCR eventual-consistency failures described in #591 (not found /
manifest unknown / 5xx / timeouts; permanent auth/reference errors fail fast).

Consolidate the oras and soci plugins into a single `imagetools` plugin that
owns the full pipeline. `bakery oras merge`, `soci convert`, `ci publish`, and
`ci merge` all route through it (command names preserved); the soci tool
options still parse via `tool: soci`.

Parallel module gains RetryPolicy, CommandResult, CommandRunner, ShellJob,
JobResult, and run_jobs() alongside the existing one-command ShellTask path,
sharing one tracked-spawn primitive so timeout + process-group termination +
Ctrl-C safety are identical.
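The shared tracked-spawn primitive can be sketched with the standard library alone: start the child in a new session so it leads its own process group, then kill the whole group on timeout or Ctrl-C. This is a POSIX-only sketch; the name spawn_and_communicate and this CommandResult shape are assumptions, and it sends SIGKILL immediately where a real implementation might try SIGTERM first.

```python
import os
import signal
import subprocess
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CommandResult:
    """Hypothetical result shape; bakery's real CommandResult may differ."""
    returncode: int
    stdout: str
    stderr: str
    timed_out: bool = False


def _kill_group(proc: subprocess.Popen) -> None:
    try:
        # With start_new_session=True the child's pid is also its process-group id.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already gone


def spawn_and_communicate(argv: Sequence[str], timeout: float) -> CommandResult:
    """Run argv in its own process group; kill the whole group on timeout or Ctrl-C."""
    proc = subprocess.Popen(
        list(argv),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, err = proc.communicate(timeout=timeout)
        return CommandResult(proc.returncode, out, err)
    except subprocess.TimeoutExpired:
        _kill_group(proc)  # also reaches grandchildren the command spawned
        out, err = proc.communicate()  # collect remaining output and reap the child
        return CommandResult(proc.returncode, out, err, timed_out=True)
    except KeyboardInterrupt:
        _kill_group(proc)
        proc.wait()
        raise
```

Killing the group rather than just the direct child is what prevents orphaned helper processes from outliving a timed-out or interrupted job.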

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions Bot commented Jun 11, 2026

Test Results

1 801 tests (+73): 1 801 passed ✅ (+73), 0 skipped 💤 (±0), 0 failed ❌ (±0); run time 8m 29s ⏱️ (-13s)
1 suite (±0), 1 file (±0)

Results for commit 14736dc. ± Comparison against base commit c5a2700.

♻️ This comment has been updated with latest results.

Absorb the read-after-write wait from #598: OrasWaitForSourcesWorkflow polls every
per-platform source digest with `oras manifest fetch --descriptor` until all are
readable (10 min timeout), naming any laggard. ImageToolsPlugin.execute runs it once
up front on create-bearing flows (publish / oras merge), aborting the publish if a
digest never propagates — so GHCR eventual-consistency lag becomes condition-based
waiting instead of an opaque downstream #591 failure. The soci-only path skips it.
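The wait-for-sources step amounts to condition-based polling with a deadline. In the sketch below, wait_for_sources, SourcesNotReady, and the 5s poll interval are hypothetical; only the 10-minute timeout comes from the commit message. The readiness check is injected; in practice it could run oras manifest fetch --descriptor and test the exit code, e.g. `lambda ref: subprocess.run(["oras", "manifest", "fetch", "--descriptor", ref], capture_output=True).returncode == 0`.

```python
import time
from typing import Callable, Iterable, Set


class SourcesNotReady(Exception):
    """Raised when some source digests never became readable; names the laggards."""

    def __init__(self, missing: Set[str]):
        self.missing = missing
        super().__init__(f"digests not readable before timeout: {sorted(missing)}")


def wait_for_sources(
    digests: Iterable[str],
    is_readable: Callable[[str], bool],
    timeout: float = 600.0,  # 10 minutes, as in the commit message
    interval: float = 5.0,   # hypothetical poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until every digest is readable, or raise SourcesNotReady at the deadline."""
    pending = set(digests)
    deadline = clock() + timeout
    while True:
        # Only re-check digests that were not readable on the previous pass.
        pending = {d for d in pending if not is_readable(d)}
        if not pending:
            return
        if clock() >= deadline:
            raise SourcesNotReady(pending)
        sleep(interval)
```

Injecting clock and sleep lets the loop be exercised with a fake clock instead of waiting in real time.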

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ianpittwood ianpittwood marked this pull request as ready for review June 11, 2026 20:27
@ianpittwood ianpittwood requested a review from bschwedler as a code owner June 11, 2026 20:27
Base automatically changed from feat/parallel-calls to main June 11, 2026 20:40
@ianpittwood ianpittwood marked this pull request as draft June 12, 2026 13:28