fix(install): 1h TTL on the agent tarball cache so in-archive manifest changes self-heal (#270) #272

Merged
pawellisowski merged 5 commits into main from fix/270-tarball-cache-ttl on Jun 26, 2026

@pawellisowski

Contributor

What & why

Refs #270. The agent-install tarball cache keyed by sha256(url + index snapshot_fingerprint) busts the moment registry-index.json changes, but is blind to a manifest edit made inside the rolling main.tar.gz (a status: flip, a keyword/description fix) that leaves the index byte-identical. The cache had no TTL (unlike the index/catalog, which carry a 1h CACHE_TTL in fetch.rs), so a warm cache served the stale manifest forever — the live file agent broke file/write apps with a confusing E_APP_AGENT_UNAVAILABLE even though the runtime supports the verb.
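
To make the blind spot concrete, here is a minimal sketch of a key derived only from the URL and the index fingerprint. The function name `cache_key` and the placeholder URL are hypothetical, and `DefaultHasher` stands in for sha256 purely so the sketch needs no external crate; the real derivation lives in the CLI and may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the real key derivation. The PR describes
/// sha256(url + snapshot_fingerprint); DefaultHasher is an assumption made
/// here only to keep the example free of external crates.
fn cache_key(url: &str, snapshot_fingerprint: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    snapshot_fingerprint.hash(&mut hasher);
    hasher.finish()
}

/// Placeholder URL, not the project's real archive location.
const ARCHIVE_URL: &str = "https://example.invalid/main.tar.gz";
```

Because the archive bytes are not an input, `cache_key(ARCHIVE_URL, "index-v1")` is identical before and after a manifest edit inside the archive, while moving to `"index-v2"` yields a different key. That is why a time-based backstop is needed for the in-archive case.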

Fix (suggested option 2 from the issue — reuse the existing pattern)

Give the tarball cache the same 1h CACHE_TTL the index/catalog already use:

The bulk of the diff is resilience hardening driven by the Codex review below — the staging path now degrades gracefully instead of failing installs a cache could satisfy.

Codex review trail (all addressed)

Codex was the first reviewer; each finding was fixed and re-reviewed:

  1. [P2] offline regression — a TTL that skips a present-but-stale cache turned an offline/timeout install into a hard failure. → Fall back to the stale cache on any refresh failure (mirrors fetch_index); only a cold cache propagates the error.
  2. [P2] index/cache consistency — installs resolve against a cached (1h-TTL) index, so a TTL refresh could pull a rolling archive that advanced past it (subdir moved), failing the install and poisoning the shared cache under the stale fingerprint. → Commit a download to the cache only after a successful extraction; on a refreshed archive that lacks the subdir, fall back to the prior (index-consistent) cache.
  3. [P2] corrupt refresh — a truncated/garbage refreshed body errored before the missing-subdir fallback could run. → Unify both unusable-refresh cases (corrupt or moved-subdir) behind extract_agent_subdir; both fall back to the prior cache.

Open finding #4 — justified, not patched

Codex's 4th note: a TTL refresh validated for agent A commits the shared archive, which could lack a different agent B that the cached index still lists, so a later B install can't fall back to the prior archive.

Why this is left as-is (every alternative is net-negative):

Verification

  • Local gates (CI doesn't run these): cargo fmt --all --check ✓, cargo clippy --all-targets -D warnings ✓, full cargo test ✓ (0 failures).
  • Four new unit tests in install::registry: TTL self-refresh on an unchanged index; stale fallback when refresh fails offline; fallback when the archive outran the index; fallback on a corrupt refresh.
  • Real end-to-end on the built aware.exe (temp AWARE_HOME + file:// registry): an archive shipping status: planned installs planned; flipping it to available inside the same archive (index untouched) keeps serving planned while the cache is warm, then self-heals to available once the cache ages past the TTL.

Scope

Pure CLI-internal change — no agent/command set change, so no stat/registry-count shift.

…changes self-heal (#270)

The agent-install tarball cache key is sha256(url + index snapshot fingerprint),
which busts the moment registry-index.json changes but is blind to a manifest edit
made INSIDE the rolling main.tar.gz (a status: flip, a keyword fix) that leaves the
index byte-identical — so a warm cache served the stale archive forever (the cache
had no TTL, unlike the index/catalog).

Add the same 1h CACHE_TTL the index/catalog already use (fetch::CACHE_TTL) to the
tarball cache: a warm cache self-refreshes within an hour, bounded to one re-download
per snapshot (not per agent). The fingerprint still busts instantly on index changes;
the TTL is the backstop for the residual in-archive case.
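
A freshness check of this kind could look roughly like the sketch below. The helper `is_fresh` and its signature are assumptions for illustration; only the 1h value and the name `CACHE_TTL` come from the description above, and the real check in the CLI may read the cache file's mtime differently.

```rust
use std::time::{Duration, SystemTime};

/// Assumed to mirror the 1h value of fetch::CACHE_TTL.
const CACHE_TTL: Duration = Duration::from_secs(60 * 60);

/// Hypothetical helper: true while a cache file last written at `modified`
/// is younger than `ttl` as of `now`.
fn is_fresh(modified: SystemTime, now: SystemTime, ttl: Duration) -> bool {
    match now.duration_since(modified) {
        Ok(age) => age < ttl,
        // An mtime in the future (clock skew) is treated as stale so that a
        // refresh is attempted rather than trusting the file indefinitely.
        Err(_) => false,
    }
}
```

Passing `now` explicitly keeps the check deterministic in unit tests; a caller would typically supply `SystemTime::now()` and the `modified()` time from the cache file's metadata.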

Verified end-to-end on the real binary: an archive shipping status: planned installs
planned; flipping it to available inside the same archive (index untouched) keeps
serving planned while the cache is warm, then self-heals to available once the cache
ages past the TTL.
…ffline (#270)

Codex review: the new TTL skipped a present-but-stale cache and fell straight to the
network branch, so an offline/timeout install the warm cache could still satisfy began
failing after an hour — a regression from the prior unconditional cache use.

Extract the refresh (file:// copy or HTTP download) into `refresh_tarball` and, on any
failure, fall back to a stale-but-present cache instead of erroring — mirroring
`fetch_index`'s stale-index fallback. A TTL means "prefer fresh", not "refuse stale when
fresh is unreachable". Only a cold cache (nothing to fall back to) propagates the error.
Covered by `stale_tarball_cache_is_reused_when_refresh_fails_offline`.
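
The "prefer fresh, tolerate stale" rule can be sketched as a small decision function. Everything here is illustrative: `stage_tarball`, the `Source` enum, and the in-memory `Vec<u8>` cache are assumptions, and the `refresh` closure merely stands in for `refresh_tarball`.

```rust
use std::io;

/// Where the archive bytes used for an install came from.
#[derive(Debug, PartialEq)]
enum Source {
    FreshCache,
    Refreshed,
    StaleCache,
}

/// Hypothetical decision function. `cached` is the cache content if a file
/// exists, `cache_is_fresh` says whether it is within the TTL, and `refresh`
/// stands in for refresh_tarball (file:// copy or HTTP download).
fn stage_tarball<F>(
    cached: Option<Vec<u8>>,
    cache_is_fresh: bool,
    refresh: F,
) -> io::Result<(Source, Vec<u8>)>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    match (cached, cache_is_fresh) {
        // Warm and within the TTL: no network access at all.
        (Some(bytes), true) => Ok((Source::FreshCache, bytes)),
        // Stale or cold: try to refresh first.
        (cached, _) => match refresh() {
            Ok(bytes) => Ok((Source::Refreshed, bytes)),
            // Refresh failed: reuse a stale cache if one exists; only a cold
            // cache propagates the original error.
            Err(err) => cached.map(|stale| (Source::StaleCache, stale)).ok_or(err),
        },
    }
}
```

With a stale cache and a failing `refresh`, the function returns `Source::StaleCache`; with no cache at all, the refresh error surfaces unchanged.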
…ndex (#270)

Codex review: installs resolve against a cached (1h-TTL) index, so a TTL-triggered tarball
refresh can pull the rolling main.tar.gz at a snapshot that advanced PAST that index — to one
where the agent's subdir moved — making the fresh archive lack entry.subdir. The previous code
would then fail the install ("subdir not in tarball") and had already overwritten the good
cache with the raced archive (poisoning the snapshot's cache file).

Two changes keep the cache consistent with the index it is keyed by:
- Commit a downloaded archive to the shared cache ONLY after it has successfully served the
  agent (post-extraction) — a download that raced past the index can't poison the cache.
- When a refreshed archive lacks the requested subdir, fall back to the prior cache (which was
  consistent with the cached index) instead of failing; the index self-corrects on its own TTL.

Covered by `ttl_refresh_falls_back_to_prior_cache_when_archive_outran_the_index`.
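
The commit-after-extraction ordering can be illustrated with a toy model in which an archive is just the list of agent subdirs it contains. The names `Archive` and `install_from_refresh` and the string results are hypothetical; the real code works on files on disk rather than in-memory values.

```rust
/// Toy model (an assumption for this sketch): an archive is the list of
/// agent subdirs it contains.
type Archive = Vec<String>;

/// Hypothetical install step. Reports which archive served the agent and
/// updates `cache` only when the refreshed archive actually contained `subdir`.
fn install_from_refresh(
    cache: &mut Option<Archive>,
    refreshed: Archive,
    subdir: &str,
) -> Result<&'static str, String> {
    if refreshed.iter().any(|d| d == subdir) {
        // The refreshed archive served the agent, so committing it is safe.
        *cache = Some(refreshed);
        return Ok("refreshed");
    }
    // The refreshed archive outran the cached index: leave the cache alone
    // and serve from the prior, index-consistent archive if it has the subdir.
    match cache {
        Some(prior) if prior.iter().any(|d| d == subdir) => Ok("prior-cache"),
        _ => Err(format!("subdir {} not in tarball", subdir)),
    }
}
```

The key property is that the assignment to `cache` happens only on the success path, so a raced download can never replace a cache entry that still matches the index.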
…oo (#270)

Codex review: the moved-subdir fallback ran only after a SUCCESSFUL extraction, so a
TTL-refreshed archive that was truncated/corrupt (a transient bad body, or a local archive
caught mid-write) propagated the gzip/tar error before the fallback could run — failing an
install a prior cache could still satisfy.

Unify the two unusable-refresh cases behind `extract_agent_subdir`, which fails on either a
corrupt archive OR a missing subdir. The caller treats both the same: fall back to the prior,
index-consistent cache before failing. Covered by
`ttl_refresh_falls_back_to_prior_cache_when_refresh_is_corrupt`.
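
One way to picture the unification is a single error type with two variants that the caller handles identically. This is only a sketch under stated assumptions: the `ExtractError` enum, the `extract_with_fallback` wrapper, and the modelling of a corrupt body as `None` are all hypothetical, and the real `extract_agent_subdir` has a different signature that operates on an actual gzip/tar stream.

```rust
#[derive(Debug, PartialEq)]
enum ExtractError {
    Corrupt,
    MissingSubdir,
}

/// Toy stand-in for extract_agent_subdir. `None` models a body that fails
/// gzip/tar decoding; `Some(entries)` models a readable archive.
fn extract_agent_subdir(archive: Option<&[&str]>, subdir: &str) -> Result<(), ExtractError> {
    let entries = archive.ok_or(ExtractError::Corrupt)?;
    if entries.contains(&subdir) {
        Ok(())
    } else {
        Err(ExtractError::MissingSubdir)
    }
}

/// Hypothetical caller: any extraction failure on the refreshed archive,
/// corrupt or missing subdir, triggers the same fallback to the prior cache.
fn extract_with_fallback(
    refreshed: Option<&[&str]>,
    prior: Option<&[&str]>,
    subdir: &str,
) -> Result<&'static str, ExtractError> {
    match extract_agent_subdir(refreshed, subdir) {
        Ok(()) => Ok("refreshed"),
        Err(refresh_err) => match prior {
            Some(_) if extract_agent_subdir(prior, subdir).is_ok() => Ok("prior-cache"),
            // No usable prior cache: report why the refresh was unusable.
            _ => Err(refresh_err),
        },
    }
}
```

Since the caller never branches on the variant, adding a third kind of unusable refresh later would not require touching the fallback path.
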
The v0.81.0 release bumped cli/Cargo.toml but did not run sync_stats, leaving
README.md/CLAUDE.md at 0.80.0 — pre-existing drift that fails the Stats CI check.
Pure mechanical sync; no behavior change.
pawellisowski merged commit fae8925 into main on Jun 26, 2026
1 check passed
pawellisowski deleted the fix/270-tarball-cache-ttl branch on June 26, 2026 14:15
pawellisowski added a commit that referenced this pull request on Jun 26, 2026
…backs (#270)

Ships the residual tarball-cache staleness fixes merged since v0.81.0:
- #272 (#270): 1h TTL on the agent-install tarball cache so an in-archive manifest
  change that leaves registry-index.json byte-identical self-heals within an hour,
  with graceful fallback to a prior index-consistent cache on offline/corrupt/raced
  refreshes.
- #271: bump registry-index.json updated-at to bust warm caches for the #269 file agent.