feat: dedup byte-identical extracted firmware children at unpack (F08) #249
A firmware re-packs the same binary at several paths (the FIT inner image is byte-identical to the top-level cpio, busybox is a hard-link farm, a package ships in two layers), so unpack_firmware used to mint a separate hidden target + contains edge for each copy, doubling (or more) the graph for no added information. Now it hashes each extracted ELF (sha256) and registers each unique-bytes binary ONCE, pointing every later byte-identical path at that same target via a `dedup_of` ref in the filesystem manifest. The firmware's filesystem tree still lists every path (browsable, addable); it just resolves the duplicates to one target. merge_duplicates remains the backstop for anything that slips through (e.g. dupes across separately-ingested targets).
More relevant after G01 (full-firmware extraction now reaches the duplicate-heavy inner layers).
engine/targets/unpack.py (dedup loop) + engine/targets/filesystem.py (persist dedup_of). New tests/test_unpack_dedup.py: two byte-identical paths collapse to one child (not two), the duplicate path carries dedup_of to the keeper, a distinct binary is untouched.
No model/schema change (the manifest is metadata_json) → no migration; no UI behavior change. Full fast tier: 1367 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merge-gate review: PR #249, VERDICT: APPROVE
Independent reviewer (did not author). Ran
Correctness: verified
Security: no findings
This only hashes/reads target bytes already written to disk by the sandbox unpack. No new hostile-byte execution or parse, no subprocess/eval/deserialization, no path handling beyond what pre-F08 already did, no secret logging, and no loopback/egress/policy/gate change. The static-only and sandbox invariants are untouched.
Non-blocking findings (posted inline)
Neither is blocking. Nothing fixed in the worktree; both are design/awareness notes, handed back for your call. Verdict: APPROVE; do not merge on my behalf.
Review finding (medium): F08 dedup gives a byte-identical binary a single keeper target named after its FIRST path; an alternate path gets only a `dedup_of` manifest ref, no row. reveal_dir matched on live Target.name, so revealing a directory that a binary lives in ONLY via its deduped path silently missed it (pre-F08 that path had its own target).
Build the path->target set from the unpack manifest (every entry under the prefix, incl. deduped paths) and reveal any hidden child it references, in addition to the name match. New test reveals the directory that only the deduped path occupies.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
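The manifest-driven reveal described in this commit could look roughly like the sketch below. It is an illustration only: `Child`, the in-memory `manifest` list, and the `path` key are hypothetical stand-ins for the real `Target` rows and stored manifest, and it assumes `dedup_of` holds the keeper's path (its first path), which the PR text implies but does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Hypothetical stand-in for a hidden child target row."""
    name: str          # assumed to be the keeper's first extracted path
    hidden: bool = True


def reveal_dir(prefix: str, children: list[Child], manifest: list[dict]) -> list[str]:
    """Reveal hidden children that live under `prefix`, by name or via the manifest."""
    prefix = prefix.rstrip("/") + "/"

    # Resolve every manifest path under the prefix to the target it points at.
    # A deduped entry resolves to its keeper; a normal entry resolves to itself.
    wanted = set()
    for entry in manifest:
        if entry["path"].startswith(prefix):
            wanted.add(entry.get("dedup_of", entry["path"]))

    revealed = []
    for child in children:
        # Keep the original name match and add the manifest-derived match.
        if child.hidden and (child.name.startswith(prefix) or child.name in wanted):
            child.hidden = False
            revealed.append(child.name)
    return revealed
```

With a keeper registered at `bin/busybox` and a manifest entry `sbin/init` carrying `dedup_of`, revealing `sbin` reaches the keeper even though no target is named under `sbin/`.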
What
A firmware re-packs the same binary at several paths: a FIT inner image byte-identical to the top-level cpio, a busybox hard-link farm, a package shipped in two layers. `unpack_firmware` used to mint a separate hidden target + `contains` edge for each copy, so one image's hundreds of duplicates doubled (or more) the graph for no added information.
Now it hashes each extracted ELF (sha256) and registers each unique-bytes binary once, pointing every later byte-identical path at that same target via a `dedup_of` ref in the filesystem manifest. The firmware's filesystem tree still lists every path (browsable, addable); it just resolves the duplicates to one target.
`merge_duplicates` remains the backstop for anything that slips through (e.g. dupes across separately-ingested targets); this prevents the in-extraction dupes at the source instead of folding them after.
Why now
More relevant after G01 (full-firmware extraction now reaches the duplicate-heavy inner package layers).
Changes
- `engine/targets/unpack.py`: sha256 dedup loop (register-once, reuse the target for later identical paths).
- `engine/targets/filesystem.py`: persist `dedup_of` on the duplicate manifest entries (present only on the dups, so the manifest stays lean).
- `tests/test_unpack_dedup.py`: two byte-identical paths collapse to one child (not two); the dup path carries `dedup_of` to the keeper; a distinct binary is untouched.

No model/schema change (the manifest is `metadata_json`) → no migration. No UI behavior change. Full fast tier: 1367 passed.
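For readers who want the register-once idea in concrete form, here is a minimal sketch of a sha256 dedup pass. It is not the code in `engine/targets/unpack.py`: the function names, the `path` and `sha256` manifest keys, and the choice to store the keeper's path in `dedup_of` are all assumptions made for illustration. Only the `dedup_of` key name and the "first path is the keeper" behavior come from the PR text.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_extracted(root: Path, rel_paths: list[str]) -> tuple[list[str], list[dict]]:
    """Return (paths to register as targets, manifest entries for every path)."""
    keeper_by_hash: dict[str, str] = {}   # content hash -> first path seen
    targets: list[str] = []
    manifest: list[dict] = []

    for rel in rel_paths:
        content_hash = sha256_of(root / rel)
        keeper = keeper_by_hash.get(content_hash)
        if keeper is None:
            # First time these bytes appear: register one target for them.
            keeper_by_hash[content_hash] = rel
            targets.append(rel)
            manifest.append({"path": rel, "sha256": content_hash})
        else:
            # Same bytes again: keep the path in the tree, point it at the keeper.
            # dedup_of is written only on duplicates, so the manifest stays lean.
            manifest.append({"path": rel, "sha256": content_hash, "dedup_of": keeper})

    return targets, manifest
```

Every extracted path still gets a manifest entry, so the filesystem tree stays complete, while only the first path for each distinct content hash produces a target.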