Repair corrupt HF layer package artifacts #775

Open
IvGolovach wants to merge 1 commit into main from codex/layer-package-integrity-repair

@IvGolovach
Collaborator

Summary

This makes corrupt Hugging Face layer-package cache entries recoverable instead of turning one bad local artifact into a repeated split-startup failure.

When resolving an HF layer package, mesh-llm now preflights the selected cached/downloaded artifacts against the package manifest size and SHA before accepting the snapshot. If a final artifact is present but corrupt, it is moved into a local .mesh-llm-quarantine directory and the resolver treats the snapshot as incomplete so the normal download/retry path can fetch a clean copy.
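
The preflight described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ArtifactExpectation`, `ArtifactStatus`, and `preflight_artifact` are hypothetical names, and the digest is passed in as a closure because the summary does not say which SHA variant or hashing crate the manifest uses. The size check runs first so that a truncated file is rejected without hashing it.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical manifest entry for one package artifact.
pub struct ArtifactExpectation {
    pub size: u64,
    pub sha_hex: String,
}

/// Outcome of checking one cached file against its manifest entry.
#[derive(Debug, PartialEq, Eq)]
pub enum ArtifactStatus {
    Valid,
    Missing,
    WrongSize { actual: u64 },
    ChecksumMismatch { actual: String },
}

/// Checks a cached file by size first, then by digest.
/// `digest_hex` stands in for whichever SHA implementation the real code uses.
pub fn preflight_artifact<F>(
    path: &Path,
    expected: &ArtifactExpectation,
    digest_hex: F,
) -> io::Result<ArtifactStatus>
where
    F: FnOnce(&Path) -> io::Result<String>,
{
    // A missing file is an ordinary cache miss, not an error.
    let metadata = match fs::metadata(path) {
        Ok(metadata) => metadata,
        Err(err) if err.kind() == io::ErrorKind::NotFound => return Ok(ArtifactStatus::Missing),
        Err(err) => return Err(err),
    };
    if metadata.len() != expected.size {
        return Ok(ArtifactStatus::WrongSize { actual: metadata.len() });
    }
    // Only hash files whose size already matches the manifest.
    let actual = digest_hex(path)?;
    if actual.eq_ignore_ascii_case(&expected.sha_hex) {
        Ok(ArtifactStatus::Valid)
    } else {
        Ok(ArtifactStatus::ChecksumMismatch { actual })
    }
}
```

In this sketch a caller would treat `WrongSize` and `ChecksumMismatch` as quarantine triggers and `Missing` as a plain incomplete snapshot, so all three end up on the existing download/retry path.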

Why

Layer-package materialization already had integrity verification, but a corrupt final file in the HF cache could fail deterministically before the resolver had a chance to repair the local snapshot. That is especially painful for split serving: the node appears to have the package, but startup keeps failing on the same damaged artifact.

This PR converts that state into a safe local cache repair path. Corrupt artifacts are not silently trusted, not deleted outright, and not allowed to poison topology/stage resolution.

Diff Scope

  • Add manifest-driven artifact expectations for the exact HF package files needed by a requested stage:
    • shared metadata
    • optional embeddings/output
    • requested layer range
    • layer-zero projectors when applicable
  • Preflight cached HF snapshot artifacts by expected byte size and SHA before accepting the local package.
  • Quarantine corrupt final artifacts into .mesh-llm-quarantine with a reason-tagged filename.
  • Return None from cached HF verification after quarantine so the existing resolver path can redownload or choose a valid fallback snapshot.
  • Reuse the same artifact-selection logic for download planning instead of maintaining a second ad hoc list.
  • Add regression coverage for wrong-size artifacts, same-size checksum mismatches, cached verification returning repairable-miss after quarantine, and existing HF materialization behavior.
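
To illustrate the shared artifact-selection idea from the list above, here is a rough sketch of one function that both verification and download planning could call. The `PackageManifest` and `StageRequest` shapes are assumptions made for this example and do not reflect the real manifest format; only the four selection rules come from the list.

```rust
use std::ops::Range;
use std::path::PathBuf;

/// Hypothetical, simplified view of a layer-package manifest.
pub struct PackageManifest {
    pub shared_metadata: Vec<PathBuf>,
    pub embeddings: Option<PathBuf>,
    pub output: Option<PathBuf>,
    /// One artifact per layer, indexed by layer number.
    pub layers: Vec<PathBuf>,
    pub projectors: Vec<PathBuf>,
}

/// Hypothetical description of what one stage asks for.
pub struct StageRequest {
    pub layers: Range<usize>,
    pub include_embeddings: bool,
    pub include_output: bool,
}

/// Returns the manifest paths a stage needs, in a stable order.
pub fn required_artifacts(manifest: &PackageManifest, request: &StageRequest) -> Vec<PathBuf> {
    // Shared metadata is always required.
    let mut paths = manifest.shared_metadata.clone();
    if request.include_embeddings {
        paths.extend(manifest.embeddings.iter().cloned());
    }
    if request.include_output {
        paths.extend(manifest.output.iter().cloned());
    }
    // Clamp the requested range so an oversized request cannot panic.
    let end = request.layers.end.min(manifest.layers.len());
    let start = request.layers.start.min(end);
    paths.extend(manifest.layers[start..end].iter().cloned());
    // Projectors only matter to the stage that owns layer zero.
    if request.layers.contains(&0) {
        paths.extend(manifest.projectors.iter().cloned());
    }
    paths
}
```

Deriving the verify set and the download set from the same function means the two lists cannot drift apart, which is the point of the "reuse the same artifact-selection logic" item.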

Compatibility

No mesh protobuf, QUIC ALPN, gossip schema, Skippy ABI, package manifest format, or layer-package identity format changes.

The only behavioral change is local cache handling for artifacts that already fail manifest integrity. Valid cached packages continue to resolve without redownload.

Branch Integrity

  • Base branch: main
  • Validated base SHA: fb7108d803321bbcf1f6192ac62d43f827d37e8e
  • Head SHA: 1def90b51a495e2f36c9486ae1d25d2b41c9519d
  • Ahead/behind: 0 behind / 1 ahead relative to fetched origin/main
  • Merge-base: fb7108d803321bbcf1f6192ac62d43f827d37e8e

Commit Integrity

  • 1def90b51a495e2f36c9486ae1d25d2b41c9519d Repair corrupt HF layer package artifacts

This is one logical change. The PR diff contains only the intended HF layer-package materialization/cache repair path.

Diff Hygiene

Changed files:

  • crates/mesh-llm-host-runtime/src/inference/skippy/materialization.rs

git diff --check origin/main...HEAD: PASS, no output.

Validation

  • Validation tier: Tier 3 - HF layer-package materialization now preflights selected cached/downloaded artifacts by manifest size and SHA, quarantines corrupt final files before retry, and keeps resolver fallback hermetic without changing mesh protocol, Skippy ABI, or package manifest format.
  • git fetch --no-tags origin main:refs/remotes/origin/main: PASS, origin/main at fb7108d803321bbcf1f6192ac62d43f827d37e8e.
  • git diff --check origin/main...HEAD: PASS, no output.
  • git diff --check: PASS, no output.
  • git diff --cached --check: PASS, no output.
  • cargo fmt --all -- --check: PASS.
  • LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime package_artifact_repair --lib -- --test-threads=1: PASS, 2 passed.
  • LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime cached_hf_verification_returns_none_after_quarantining_corrupt_artifact --lib -- --test-threads=1: PASS, 1 passed.
  • LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime inference::skippy::materialization --lib -- --test-threads=1: PASS, 29 passed, 2 ignored.
  • LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm: PASS.
  • LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal /opt/homebrew/bin/cargo-clippy clippy -p mesh-llm-host-runtime --all-targets -- -D warnings: PASS.
  • Ledger: not applicable - not required for selected validation tier/change family.
  • Version: not applicable - no release/version sync required for this non-release package-cache repair change.
  • Not run: live HF corrupt-cache redownload smoke - no controlled remote HF package/cache corruption target was available; hermetic materialization tests cover size mismatch, checksum mismatch, quarantine, cached-snapshot fallback, and existing HF cache selection paths.
  • Not run: just build - not required for selected validation tier; no UI assets or release bundle changed.

Required Remote Gates

Pending - PR has not been opened yet, so mandatory GitHub checks have not run for the final PR SHA.

Runtime Safety

The change is limited to local HF layer-package cache resolution and materialization preflight.

No mesh membership, gossip, transport, model serving, package manifest schema, or Skippy execution invariant is changed. Manifest paths continue to pass through the existing safe relative path guard before filesystem access.

No new blocking locks, unbounded queues, protocol fields, or runtime execution paths are introduced. No invariant regression introduced.

Documentation Integrity

No docs or runbooks changed. The behavior is a local self-repair path for corrupt cache artifacts and does not change operator commands or package publishing procedure.

Rollback Plan

Rollback: revert this PR.

DB downgrade: not applicable.

Data repair: not applicable.

Operational caveats: none known.

Known Residual Risks

The remaining proof is a live corrupt-cache redownload smoke against an HF package repository. That was not run locally because no controlled remote HF package/cache corruption target was available. The deterministic materialization tests cover the changed repair and fallback branches.

Rollback
* git revert HEAD
@i386
Collaborator

i386 commented Jun 2, 2026

@IvGolovach this is a good change but it has me thinking: do we need to do this generally for any artefacts in the hf cache?

Comment on lines +893 to +910
let source = package_dir.join(relative_path);
let file_name = relative_path
    .file_name()
    .and_then(|value| value.to_str())
    .context("package artifact path has no file name")?;
let stamp = SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .unwrap_or_default()
    .as_nanos();
let quarantine_dir = package_dir.join(".mesh-llm-quarantine");
fs::create_dir_all(&quarantine_dir).with_context(|| {
    format!(
        "create package artifact quarantine {}",
        quarantine_dir.display()
    )
})?;
let destination = quarantine_dir.join(format!("{file_name}.{reason}.{stamp}.bad"));
fs::rename(&source, &destination).with_context(|| {
Collaborator

This renames the snapshot path, but Rust hf_hub cache snapshots on Unix/macOS are symlinks into blobs/. This quarantines only the symlink; the corrupt blob remains, and a later download_file() can recreate the symlink from the existing blob without redownloading the file body.

The repair loop can keep reusing the same corrupt artifact. Fix by removing/quarantining the resolved blob target, or otherwise forcing a real redownload.
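
One possible shape for that fix, as a sketch only: resolve the snapshot entry to the file that actually holds the bytes, move that file into quarantine, then remove the now-dangling link. The function name `quarantine_resolved_artifact` and its parameters are hypothetical, and the sketch uses plain `std::io` errors rather than the `with_context` style used in the PR.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Moves the file that actually holds the bytes into `quarantine_dir`.
/// If `snapshot_path` is a symlink, the resolved target is moved and the
/// dangling link is removed so a later download cannot relink the bad blob.
pub fn quarantine_resolved_artifact(
    snapshot_path: &Path,
    quarantine_dir: &Path,
    quarantine_name: &str,
) -> io::Result<PathBuf> {
    // symlink_metadata inspects the entry itself instead of following it.
    let is_symlink = fs::symlink_metadata(snapshot_path)?.file_type().is_symlink();
    // canonicalize follows every link in the chain to the real file.
    let body = if is_symlink {
        fs::canonicalize(snapshot_path)?
    } else {
        snapshot_path.to_path_buf()
    };
    fs::create_dir_all(quarantine_dir)?;
    let destination = quarantine_dir.join(quarantine_name);
    // rename only works within one filesystem, so the quarantine directory
    // is assumed to live on the same volume as the cache.
    fs::rename(&body, &destination)?;
    if is_symlink {
        fs::remove_file(snapshot_path)?;
    }
    Ok(destination)
}
```

Two caveats apply to this approach. If other snapshot revisions link to the same blob, they may be left with dangling links and would need the same repair. A link that is already dangling makes `canonicalize` fail, so a caller would have to handle that case separately.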

}
}
}
if layer_start == 0
Collaborator

verify_layer_package_metadata_integrity() only checks shared metadata, so a stale/corrupt projector can now force a quarantine + HF access during package inspection even though this path doesn’t need the projector yet. For metadata-only resolution, can we avoid including projectors in the repair/download set?
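
A small sketch of how that distinction might be expressed, assuming the caller can say why it is resolving the package. `ResolutionScope` and `needs_projectors` are hypothetical names for illustration and are not part of the PR.

```rust
/// Hypothetical description of why a package is being resolved.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResolutionScope {
    /// Package inspection: only shared metadata is read.
    MetadataOnly,
    /// Stage startup for the half-open layer range `layer_start..layer_end`.
    Stage { layer_start: usize, layer_end: usize },
}

/// Decides whether projectors belong in the repair/download set.
pub fn needs_projectors(scope: ResolutionScope) -> bool {
    match scope {
        ResolutionScope::MetadataOnly => false,
        // Only a non-empty range that starts at layer zero needs projectors.
        ResolutionScope::Stage { layer_start, layer_end } => layer_start == 0 && layer_end > 0,
    }
}
```

With a check like this in front of the `layer_start == 0` branch, metadata-only inspection would never verify, quarantine, or fetch a projector.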
