Repair corrupt HF layer package artifacts #775
Validation
* Validation tier: Tier 3 - HF layer-package materialization now preflights selected cached/downloaded artifacts by manifest size and SHA, quarantines corrupt final files before retry, and keeps resolver fallback hermetic without changing mesh protocol, Skippy ABI, or package manifest format.
* git fetch --no-tags origin main:refs/remotes/origin/main: PASS, origin/main at fb7108d.
* git diff --check origin/main...HEAD: PASS, no output.
* git diff --check: PASS, no output.
* git diff --cached --check: PASS, no output.
* cargo fmt --all -- --check: PASS.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime package_artifact_repair --lib -- --test-threads=1: PASS, 2 passed.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime cached_hf_verification_returns_none_after_quarantining_corrupt_artifact --lib -- --test-threads=1: PASS, 1 passed.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime inference::skippy::materialization --lib -- --test-threads=1: PASS, 29 passed, 2 ignored.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm: PASS.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal /opt/homebrew/bin/cargo-clippy clippy -p mesh-llm-host-runtime --all-targets -- -D warnings: PASS.
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - no release/version sync required for this non-release package-cache repair change.
* Not run: live HF corrupt-cache redownload smoke - no controlled remote HF package/cache corruption target was available; hermetic materialization tests cover size mismatch, checksum mismatch, quarantine, cached-snapshot fallback, and existing HF cache selection paths.
* Not run: just build - not required for selected validation tier; no UI assets or release bundle changed.
Rollback
* git revert HEAD
@IvGolovach this is a good change but it has me thinking: do we need to do this generally for any artefacts in the hf cache?
let source = package_dir.join(relative_path);
let file_name = relative_path
    .file_name()
    .and_then(|value| value.to_str())
    .context("package artifact path has no file name")?;
let stamp = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .unwrap_or_default()
    .as_nanos();
let quarantine_dir = package_dir.join(".mesh-llm-quarantine");
fs::create_dir_all(&quarantine_dir).with_context(|| {
    format!(
        "create package artifact quarantine {}",
        quarantine_dir.display()
    )
})?;
let destination = quarantine_dir.join(format!("{file_name}.{reason}.{stamp}.bad"));
fs::rename(&source, &destination).with_context(|| {
This renames the snapshot path, but Rust hf_hub cache snapshots on Unix/macOS are symlinks into blobs/. This quarantines only the symlink; the corrupt blob remains, and a later download_file() can recreate the symlink from the existing blob without redownloading the file body.
The repair loop can keep reusing the same corrupt artifact. Fix by removing/quarantining the resolved blob target, or otherwise forcing a real redownload.
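A minimal sketch of the fix suggested above, assuming the snapshot entry may be either a regular file or a symlink into blobs/. The helper name `quarantine_snapshot_entry` and its parameters are hypothetical and are not part of this PR; it also assumes the blob and the quarantine directory live on the same filesystem so that `fs::rename` succeeds.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not from the PR): quarantine the file body behind a
/// snapshot entry. If the entry is a symlink, the resolved blob is moved and
/// the now-dangling link is removed, so a later download cannot relink the
/// same corrupt blob. Returns the quarantined path.
fn quarantine_snapshot_entry(
    snapshot_path: &Path,
    quarantine_dir: &Path,
    reason: &str,
) -> io::Result<PathBuf> {
    fs::create_dir_all(quarantine_dir)?;
    let file_name = snapshot_path
        .file_name()
        .and_then(|value| value.to_str())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    let destination = quarantine_dir.join(format!("{file_name}.{reason}.bad"));

    // symlink_metadata does not follow links, so it reports on the entry itself.
    if fs::symlink_metadata(snapshot_path)?.file_type().is_symlink() {
        // Resolve the blob before touching the link.
        let blob = fs::canonicalize(snapshot_path)?;
        // Assumption: same filesystem, otherwise rename fails and a copy is needed.
        fs::rename(&blob, &destination)?;
        fs::remove_file(snapshot_path)?;
    } else {
        fs::rename(snapshot_path, &destination)?;
    }
    Ok(destination)
}
```

Moving the blob rather than deleting it keeps the PR's "not deleted outright" property while still forcing the next download to fetch the file body again.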
        }
    }
}
if layer_start == 0
verify_layer_package_metadata_integrity() only checks shared metadata, so a stale/corrupt projector can now force a quarantine + HF access during package inspection even though this path doesn’t need the projector yet. For metadata-only resolution, can we avoid including projectors in the repair/download set?
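One way to express the request above is to make the repair/download set depend on why the package is being resolved. This is only a sketch: `ArtifactKind`, `ResolvePurpose`, `Artifact`, and `repair_set` are hypothetical names, not types from materialization.rs, and the real selection logic (including the `layer_start == 0` condition) may differ.

```rust
/// Hypothetical classification of package artifacts (not from the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArtifactKind {
    SharedMetadata,
    Layer,
    Projector,
}

/// Hypothetical reason the resolver is looking at the package.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResolvePurpose {
    /// Package inspection that only reads shared metadata.
    MetadataOnly,
    /// Full materialization of a stage.
    Materialize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Artifact {
    path: String,
    kind: ArtifactKind,
}

/// Select the artifacts that may be verified, quarantined, and redownloaded.
/// For metadata-only resolution, projectors are left alone so a stale
/// projector cannot trigger quarantine or HF access.
fn repair_set(artifacts: &[Artifact], purpose: ResolvePurpose) -> Vec<&Artifact> {
    artifacts
        .iter()
        .filter(|artifact| match purpose {
            ResolvePurpose::MetadataOnly => artifact.kind != ArtifactKind::Projector,
            ResolvePurpose::Materialize => true,
        })
        .collect()
}
```

With this shape, a corrupt projector would still be repaired, but only on the path that actually loads it.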
Summary
This makes corrupt Hugging Face layer-package cache entries recoverable instead of turning one bad local artifact into a repeated split-startup failure.
When resolving an HF layer package, mesh-llm now preflights the selected cached/downloaded artifacts against the package manifest size and SHA before accepting the snapshot. If a final artifact is present but corrupt, it is moved into a local .mesh-llm-quarantine directory and the resolver treats the snapshot as incomplete so the normal download/retry path can fetch a clean copy.
Why
Layer-package materialization already had integrity verification, but a corrupt final file in the HF cache could fail deterministically before the resolver had a chance to repair the local snapshot. That is especially painful for split serving: the node appears to have the package, but startup keeps failing on the same damaged artifact.
This PR converts that state into a safe local cache repair path. Corrupt artifacts are not silently trusted, not deleted outright, and not allowed to poison topology/stage resolution.
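The repair path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in materialization.rs: `ManifestEntry`, `Integrity`, `check_artifact`, and `verify_cached_snapshot` are hypothetical names, and the checksum function is passed in as a `digest` callback because the actual hashing routine is not shown in this PR text.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical manifest entry: expected size and checksum for one artifact.
struct ManifestEntry {
    relative_path: PathBuf,
    size: u64,
    sha: String,
}

#[derive(Debug, PartialEq, Eq)]
enum Integrity {
    Valid,
    Missing,
    SizeMismatch,
    ChecksumMismatch,
}

/// Check one artifact: cheap size comparison first, checksum only if size matches.
fn check_artifact(
    package_dir: &Path,
    entry: &ManifestEntry,
    digest: &dyn Fn(&Path) -> io::Result<String>,
) -> io::Result<Integrity> {
    let path = package_dir.join(&entry.relative_path);
    let metadata = match fs::metadata(&path) {
        Ok(metadata) => metadata,
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(Integrity::Missing),
        Err(err) => return Err(err),
    };
    if metadata.len() != entry.size {
        return Ok(Integrity::SizeMismatch);
    }
    if digest(&path)? != entry.sha {
        return Ok(Integrity::ChecksumMismatch);
    }
    Ok(Integrity::Valid)
}

/// Returns Some(package_dir) only when every entry verifies. Corrupt files are
/// moved into .mesh-llm-quarantine and the snapshot is reported as incomplete
/// (None) so the caller can redownload or pick a fallback snapshot.
fn verify_cached_snapshot(
    package_dir: &Path,
    entries: &[ManifestEntry],
    digest: &dyn Fn(&Path) -> io::Result<String>,
) -> io::Result<Option<PathBuf>> {
    let mut complete = true;
    for entry in entries {
        let reason = match check_artifact(package_dir, entry, digest)? {
            Integrity::Valid => continue,
            Integrity::Missing => {
                complete = false;
                continue;
            }
            Integrity::SizeMismatch => "size",
            Integrity::ChecksumMismatch => "sha",
        };
        let quarantine_dir = package_dir.join(".mesh-llm-quarantine");
        fs::create_dir_all(&quarantine_dir)?;
        let name = entry
            .relative_path
            .file_name()
            .and_then(|value| value.to_str())
            .unwrap_or("artifact");
        let destination = quarantine_dir.join(format!("{name}.{reason}.bad"));
        fs::rename(package_dir.join(&entry.relative_path), destination)?;
        complete = false;
    }
    Ok(complete.then(|| package_dir.to_path_buf()))
}
```

Returning `None` instead of an error is what lets the existing resolver treat a repaired snapshot the same way as one that was never fully downloaded.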
Diff Scope
* Quarantine corrupt final artifacts under .mesh-llm-quarantine with a reason-tagged filename.
* Return None from cached HF verification after quarantine so the existing resolver path can redownload or choose a valid fallback snapshot.
Compatibility
No mesh protobuf, QUIC ALPN, gossip schema, Skippy ABI, package manifest format, or layer-package identity format changes.
The only behavioral change is local cache handling for artifacts that already fail manifest integrity. Valid cached packages continue to resolve without redownload.
Branch Integrity
* Base branch: main
* Base SHA: fb7108d803321bbcf1f6192ac62d43f827d37e8e
* Head SHA: 1def90b51a495e2f36c9486ae1d25d2b41c9519d
* Divergence: 0 behind / 1 ahead relative to fetched origin/main fb7108d803321bbcf1f6192ac62d43f827d37e8e
Commit Integrity
* 1def90b51a495e2f36c9486ae1d25d2b41c9519d Repair corrupt HF layer package artifacts
This is one logical change. The PR diff contains only the intended HF layer-package materialization/cache repair path.
Diff Hygiene
Changed files:
* crates/mesh-llm-host-runtime/src/inference/skippy/materialization.rs
git diff --check origin/main...HEAD: PASS, no output.
Validation
* git fetch --no-tags origin main:refs/remotes/origin/main: PASS, origin/main at fb7108d803321bbcf1f6192ac62d43f827d37e8e.
* git diff --check origin/main...HEAD: PASS, no output.
* git diff --check: PASS, no output.
* git diff --cached --check: PASS, no output.
* cargo fmt --all -- --check: PASS.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime package_artifact_repair --lib -- --test-threads=1: PASS, 2 passed.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime cached_hf_verification_returns_none_after_quarantining_corrupt_artifact --lib -- --test-threads=1: PASS, 1 passed.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime inference::skippy::materialization --lib -- --test-threads=1: PASS, 29 passed, 2 ignored.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm: PASS.
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal /opt/homebrew/bin/cargo-clippy clippy -p mesh-llm-host-runtime --all-targets -- -D warnings: PASS.
* just build - not required for selected validation tier; no UI assets or release bundle changed.
Required Remote Gates
Pending - PR has not been opened yet, so mandatory GitHub checks have not run for the final PR SHA.
Runtime Safety
The change is limited to local HF layer-package cache resolution and materialization preflight.
No mesh membership, gossip, transport, model serving, package manifest schema, or Skippy execution invariant is changed. Manifest paths continue to pass through the existing safe relative path guard before filesystem access.
No new blocking locks, unbounded queues, protocol fields, or runtime execution paths are introduced. No invariant regression introduced.
Documentation Integrity
No docs or runbooks changed. The behavior is a local self-repair path for corrupt cache artifacts and does not change operator commands or package publishing procedure.
Rollback Plan
Rollback: revert this PR.
DB downgrade: not applicable.
Data repair: not applicable.
Operational caveats: none known.
Known Residual Risks
The remaining proof is a live corrupt-cache redownload smoke against an HF package repository. That was not run locally because no controlled remote HF package/cache corruption target was available. The deterministic materialization tests cover the changed repair and fallback branches.