fix(install): 1h TTL on the agent tarball cache so in-archive manifest changes self-heal (#270) #272
Merged
…changes self-heal (#270)

The agent-install tarball cache key is sha256(url + index snapshot fingerprint), which busts the moment registry-index.json changes but is blind to a manifest edit made INSIDE the rolling main.tar.gz (a status: flip, a keyword fix) that leaves the index byte-identical. A warm cache therefore served the stale archive forever (the cache had no TTL, unlike the index/catalog).

Add the same 1h CACHE_TTL the index/catalog already use (fetch::CACHE_TTL) to the tarball cache: a warm cache self-refreshes within an hour, bounded to one re-download per snapshot (not per agent). The fingerprint still busts instantly on index changes; the TTL is the backstop for the residual in-archive case.

Verified end-to-end on the real binary: an archive shipping status:planned installs planned; flipping it to available inside the same archive (index untouched) keeps serving planned while the cache is warm, then self-heals to available once the cache ages past the TTL.
…ffline (#270)

Codex review: the new TTL skipped a present-but-stale cache and fell straight to the network branch, so an offline/timeout install the warm cache could still satisfy began failing after an hour, a regression from the prior unconditional cache use.

Extract the refresh (file:// copy or HTTP download) into `refresh_tarball` and, on any failure, fall back to a stale-but-present cache instead of erroring, mirroring `fetch_index`'s stale-index fallback. A TTL means "prefer fresh", not "refuse stale when fresh is unreachable". Only a cold cache (nothing to fall back to) propagates the error.

Covered by `stale_tarball_cache_is_reused_when_refresh_fails_offline`.
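The "prefer fresh, but do not refuse stale" rule from this commit can be sketched as a small decision function. This is a minimal illustration, not the PR's code: `resolve_archive` and `Source` are hypothetical names, the refresh step is passed in as a closure standing in for `refresh_tarball`, and archives are plain byte vectors.

```rust
use std::io;

/// Where the archive bytes came from (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Source {
    Refreshed,
    StaleCache,
}

/// Prefer a fresh download, but reuse a stale-but-present cache when the
/// refresh fails. Only a cold cache (nothing to fall back to) propagates
/// the refresh error.
fn resolve_archive(
    refresh: impl FnOnce() -> io::Result<Vec<u8>>,
    stale_cache: Option<Vec<u8>>,
) -> io::Result<(Source, Vec<u8>)> {
    match refresh() {
        Ok(bytes) => Ok((Source::Refreshed, bytes)),
        Err(err) => match stale_cache {
            // Refresh failed, but an older copy exists: serve it.
            Some(bytes) => Ok((Source::StaleCache, bytes)),
            // Cold cache: there is nothing to serve, so surface the error.
            None => Err(err),
        },
    }
}
```

Passing the refresh as a closure keeps the decision testable without a network: a closure that returns a timeout error exercises the offline path directly.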
…ndex (#270)

Codex review: installs resolve against a cached (1h-TTL) index, so a TTL-triggered tarball refresh can pull the rolling main.tar.gz at a snapshot that advanced PAST that index (to one where the agent's subdir moved), making the fresh archive lack entry.subdir. The previous code would then fail the install ("subdir not in tarball") and had already overwritten the good cache with the raced archive (poisoning the snapshot's cache file).

Two changes keep the cache consistent with the index it is keyed by:
- Commit a downloaded archive to the shared cache ONLY after it has successfully served the agent (post-extraction), so a download that raced past the index can't poison the cache.
- When a refreshed archive lacks the requested subdir, fall back to the prior cache (which was consistent with the cached index) instead of failing; the index self-corrects on its own TTL.

Covered by `ttl_refresh_falls_back_to_prior_cache_when_archive_outran_the_index`.
…oo (#270)

Codex review: the moved-subdir fallback ran only after a SUCCESSFUL extraction, so a TTL-refreshed archive that was truncated/corrupt (a transient bad body, or a local archive caught mid-write) propagated the gzip/tar error before the fallback could run, failing an install a prior cache could still satisfy.

Unify the two unusable-refresh cases behind `extract_agent_subdir`, which fails on either a corrupt archive OR a missing subdir. The caller treats both the same: fall back to the prior, index-consistent cache before failing.

Covered by `ttl_refresh_falls_back_to_prior_cache_when_refresh_is_corrupt`.
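The combined behavior of the last two commits (commit to the cache only after a successful extraction, and treat a corrupt archive and a missing subdir the same way) might look roughly like the sketch below. Only the name `extract_agent_subdir` comes from the commit message; its signature here, the in-memory `Archive` stand-in for a gzip/tar file, `ExtractError`, and `stage_agent` are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an archive: either unreadable, or subdir -> payload.
#[derive(Clone, Debug, PartialEq)]
enum Archive {
    Corrupt,
    Tree(BTreeMap<String, String>),
}

/// The two "unusable refresh" cases, reported through one error type.
#[derive(Debug, PartialEq)]
enum ExtractError {
    Corrupt,
    MissingSubdir(String),
}

/// Fails on either a corrupt archive or a missing subdir.
fn extract_agent_subdir(archive: &Archive, subdir: &str) -> Result<String, ExtractError> {
    match archive {
        Archive::Corrupt => Err(ExtractError::Corrupt),
        Archive::Tree(entries) => entries
            .get(subdir)
            .cloned()
            .ok_or_else(|| ExtractError::MissingSubdir(subdir.to_string())),
    }
}

/// Try the refreshed archive first. Commit it to the cache only after it has
/// served the agent; otherwise leave the cache untouched and fall back to it.
fn stage_agent(
    refreshed: Archive,
    cache: &mut Option<Archive>,
    subdir: &str,
) -> Result<String, ExtractError> {
    match extract_agent_subdir(&refreshed, subdir) {
        Ok(files) => {
            *cache = Some(refreshed); // commit post-extraction only
            Ok(files)
        }
        Err(refresh_err) => match cache.as_ref() {
            // Both failure kinds take the same path: use the prior cache.
            Some(prior) => extract_agent_subdir(prior, subdir),
            // Cold cache: report why the refresh was unusable.
            None => Err(refresh_err),
        },
    }
}
```

Because the cache is written only in the success arm, a refresh that raced past the index or arrived truncated cannot replace an archive that was still consistent with the cached index.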
The v0.81.0 release bumped cli/Cargo.toml but did not run sync_stats, leaving README.md/CLAUDE.md at 0.80.0 — pre-existing drift that fails the Stats CI check. Pure mechanical sync; no behavior change.
pawellisowski added a commit that referenced this pull request Jun 26, 2026
…backs (#270)

Ships the residual tarball-cache staleness fixes merged since v0.81.0:
- #272 (#270): 1h TTL on the agent-install tarball cache so an in-archive manifest change that leaves registry-index.json byte-identical self-heals within an hour, with graceful fallback to a prior index-consistent cache on offline/corrupt/raced refreshes.
- #271: bump registry-index.json updated-at to bust warm caches for the #269 file agent.
What & why
Refs #270. The agent-install tarball cache keyed by `sha256(url + index snapshot_fingerprint)` busts the moment `registry-index.json` changes, but is blind to a manifest edit made inside the rolling `main.tar.gz` (a `status:` flip, a keyword/description fix) that leaves the index byte-identical. The cache had no TTL (unlike the index/catalog, which carry a 1h `CACHE_TTL` in `fetch.rs`), so a warm cache served the stale manifest forever: the live `file` agent broke `file/write` apps with a confusing `E_APP_AGENT_UNAVAILABLE` even though the runtime supports the verb.
Fix (suggested option 2 from the issue: reuse the existing pattern)
Give the tarball cache the same 1h `CACHE_TTL` the index/catalog already use.
The bulk of the diff is resilience hardening driven by the Codex review below: the staging path now degrades gracefully instead of failing installs a cache could satisfy.
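As a rough illustration of the 1h TTL rule, a freshness check based on the cache file's modification time could look like the following. This is a sketch under assumptions, not the code in the diff: `is_fresh` and `cache_is_fresh` are hypothetical helpers, the constant is redeclared locally rather than imported from `fetch.rs`, and using the file's mtime as the age source is a guess.

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// One hour, mirroring the TTL described in the PR (redeclared for this sketch).
const CACHE_TTL: Duration = Duration::from_secs(60 * 60);

/// Pure freshness rule: an entry is fresh while its age is below the TTL.
/// A modification time in the future (clock skew) counts as stale here,
/// which is a choice made for this sketch.
fn is_fresh(modified: SystemTime, now: SystemTime, ttl: Duration) -> bool {
    match now.duration_since(modified) {
        Ok(age) => age < ttl,
        Err(_) => false,
    }
}

/// True if `path` exists and was modified within `ttl`.
/// A missing or unreadable file is treated as not fresh.
fn cache_is_fresh(path: &Path, ttl: Duration) -> bool {
    fs::metadata(path)
        .and_then(|meta| meta.modified())
        .map(|modified| is_fresh(modified, SystemTime::now(), ttl))
        .unwrap_or(false)
}
```

Keeping the time comparison in a pure function lets a test pin `now` instead of sleeping or rewriting file timestamps.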
Codex review trail (all addressed)
Codex was the first reviewer; each finding was fixed and re-reviewed:
- … `fetch_index`); only a cold cache propagates the error.
- … `extract_agent_subdir`; both fall back to the prior cache.
Open finding #4: justified, not patched
Codex's 4th note: a TTL refresh validated for agent A commits the shared archive, which could lack a different agent B that the cached index still lists, so a later B install can't fall back to the prior archive.
Why this is left as-is (every alternative is net-negative):
- … `FP` lists B, but the same-`FP` live archive lacks B, which requires the published `registry-index.json` to be inconsistent with the published tree. `registry-index.json` is generated from the tree by `aware agent reindex`, and `snapshot_fingerprint` hashes the whole serialized index, so removing/moving any agent rotates the cache key. The inconsistent state can't arise from a normal release; it's a registry-authoring bug.
- … `main` anymore, so a clear, self-healing (≤1h, until the index TTL rotates `FP`) `subdir not in tarball` error is more correct than silently serving a stale copy of files that no longer exist upstream.
Verification
- `cargo fmt --all --check` ✓, `cargo clippy --all-targets -D warnings` ✓, full `cargo test` ✓ (0 failures).
- … `install::registry`: TTL self-refresh on an unchanged index; stale fallback when refresh fails offline; fallback when the archive outran the index; fallback on a corrupt refresh.
- … `aware.exe` (temp `AWARE_HOME` + `file://` registry): an archive shipping `status: planned` installs `planned`; flipping it to `available` inside the same archive (index untouched) keeps serving `planned` while the cache is warm, then self-heals to `available` once the cache ages past the TTL.
Scope
Pure CLI-internal change — no agent/command set change, so no stat/registry-count shift.