feat: dedup byte-identical extracted firmware children at unpack (F08) #249

Merged: branover merged 2 commits into main from build/f08-dedup on Jun 10, 2026

@branover

Owner

What

A firmware re-packs the same binary at several paths — a FIT inner image byte-identical to the top-level cpio, a busybox hard-link farm, a package shipped in two layers. unpack_firmware used to mint a separate hidden target + contains edge for each copy, so one image's hundreds of duplicates doubled (or more) the graph for no added information.

Now it hashes each extracted ELF (sha256) and registers each unique-bytes binary once, pointing every later byte-identical path at that same target via a dedup_of ref in the filesystem manifest. The firmware's filesystem tree still lists every path (browsable, addable) — it just resolves the duplicates to one target.
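The register-once behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in engine/targets/unpack.py: `sha256_of_file`, `dedup_entries`, the `target-N` ids, and the exact manifest entry shape are hypothetical, and it assumes `dedup_of` stores the keeper's path (the PR only says it points at the keeper).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large binaries are not loaded whole (hypothetical helper)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_entries(base: Path, rel_paths: list[str]) -> tuple[list[dict], list[str]]:
    """Return (manifest entries for every path, paths registered as unique children)."""
    keeper_by_digest: dict[str, dict] = {}
    manifest: list[dict] = []
    children: list[str] = []
    for rel in rel_paths:
        digest = sha256_of_file(base / rel)
        keeper = keeper_by_digest.get(digest)
        if keeper is None:
            # First occurrence of these bytes: register a new child target.
            entry = {"path": rel, "child_target_id": f"target-{len(children) + 1}"}
            keeper_by_digest[digest] = entry
            children.append(rel)
        else:
            # Later identical bytes: reuse the keeper's target; only duplicates carry dedup_of.
            entry = {
                "path": rel,
                "child_target_id": keeper["child_target_id"],
                "dedup_of": keeper["path"],  # assumption: the ref is the keeper's path
            }
        manifest.append(entry)
    return manifest, children
```

Every path still appears in the manifest, so the filesystem tree stays complete, while `children` holds one entry per distinct content hash.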

merge_duplicates remains the backstop for anything that slips through (e.g. dupes across separately-ingested targets); this prevents the in-extraction dupes at the source instead of folding them after.

Why now

More relevant after G01 (full-firmware extraction now reaches the duplicate-heavy inner package layers).

Changes

  • engine/targets/unpack.py — sha256 dedup loop (register-once, reuse the target for later identical paths).
  • engine/targets/filesystem.py — persist dedup_of on the duplicate manifest entries (present only on the dups, so the manifest stays lean).
  • tests/test_unpack_dedup.py — two byte-identical paths collapse to one child (not two); the dup path carries dedup_of to the keeper; a distinct binary is untouched.

No model/schema change (the manifest is metadata_json) → no migration. No UI behavior change. Full fast tier: 1367 passed.

A firmware re-packs the same binary at several paths — the FIT inner image is byte-identical to the
top-level cpio, busybox is a hard-link farm, a package ships in two layers — so unpack_firmware used
to mint a separate hidden target + contains edge for each copy, doubling (or more) the graph for no
added information. Now it hashes each extracted ELF (sha256) and registers each unique-bytes binary
ONCE, pointing every later byte-identical path at that same target via a `dedup_of` ref in the
filesystem manifest. The firmware's filesystem tree still lists every path (browsable, addable); it
just resolves the duplicates to one target. merge_duplicates remains the backstop for anything that
slips through (e.g. dupes across separately-ingested targets).

More relevant after G01 (full-firmware extraction now reaches the duplicate-heavy inner layers).
engine/targets/unpack.py (dedup loop) + engine/targets/filesystem.py (persist dedup_of). New
tests/test_unpack_dedup.py: two byte-identical paths collapse to one child (not two), the duplicate
path carries dedup_of to the keeper, a distinct binary is untouched. No model/schema change (the
manifest is metadata_json) → no migration; no UI behavior change. Full fast tier: 1367 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/hexgraph/engine/targets/unpack.py
Comment thread src/hexgraph/engine/targets/unpack.py
@branover

Owner Author

Merge-gate review — PR #249 — VERDICT: APPROVE

Independent reviewer (did not author). Ran /code-review high + /security-review on git diff origin/main...HEAD (head fb54f35). The change is small and well-scoped: unpack_firmware now hashes each extracted ELF (file_sha256 on the host file the sandbox already wrote) and registers each unique-bytes binary ONCE, pointing later byte-identical paths at the same keeper target via a dedup_of ref in the filesystem manifest.

Correctness — verified

  • Keeper = FIRST occurrence; later identical bytes reuse child_target_id and set dedup_of, and continue (no second row/edge). Correct.
  • children contains unique ELFs only, so the recon-enrichment loop in pipeline.py runs once per unique binary (no wasted re-recon on identical bytes). Good.
  • packed_containers only inspects container entries, not ELFs, so dedup does not affect it.
  • promote_file reads entry.get("child_target_id") on a deduped path → returns the keeper, still idempotent. Correct.
  • build_links_against keys on basename and already used setdefault, so collapsing duplicate-bytes libs to one target does not regress lib resolution.
  • sha256 is computed on the right host file (base / container_path, with the root/rel fallback) — the same path the pre-F08 code already opened to ingest.
  • Scoping dedup within-unpack (with merge_duplicates as the cross-target backstop) is the right call for this PR.
  • Tests pass: test_unpack_dedup test_unpack_collision test_extraction_honesty test_hidden_targets → 16 passed, 1 skipped (Docker absent).
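The point about build_links_against can be illustrated with a small sketch. It assumes, hypothetically, that library resolution boils down to a basename-keyed index built with `dict.setdefault`; `index_libs_by_basename` and the `(path, target_id)` input shape are illustrative names, not the project's API.

```python
from pathlib import PurePosixPath


def index_libs_by_basename(lib_targets: list[tuple[str, str]]) -> dict[str, str]:
    """Map each library basename to the first target id seen for it."""
    index: dict[str, str] = {}
    for path, target_id in lib_targets:
        # setdefault keeps the first target registered for a basename.
        index.setdefault(PurePosixPath(path).name, target_id)
    return index


# Pre-dedup: two targets for byte-identical copies of the same library.
before = index_libs_by_basename([("lib/libc.so.6", "t1"), ("usr/lib/libc.so.6", "t2")])
# Post-dedup: only the keeper is registered as a target.
after = index_libs_by_basename([("lib/libc.so.6", "t1")])
```

Under that assumption the first-seen target already won before the change, so registering only the keeper yields the same index.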

Security — no findings

This only hashes/reads target bytes already written to disk by the sandbox unpack. No new hostile-byte execution or parse, no subprocess/eval/deserialization, no path handling beyond what pre-F08 already did, no secret logging, and no loopback/egress/policy/gate change. The static-only and sandbox invariants are untouched.

Non-blocking findings (posted inline)

  1. MEDIUM — reveal_dir by-directory regression (unpack.py:70 / reveal.py:101-108): the keeper target is named after the FIRST occurrence and the dup path gets no row, only a dedup_of manifest ref. reveal_dir matches on the live Target.name, so reveal_dir(firmware, <dup-dir-prefix>) now silently misses a binary that exists at that path (pre-F08 it had its own target there). Consider consulting the manifest dedup_of aliases in prefix-scoped reveal, or document the behavior. Graph completeness, not data loss → not blocking.
  2. LOW — contains-edge records one path (unpack.py:77): the contains edge carries only the first occurrence's path attr; alternate paths live only in the manifest dedup_of and aren't surfaced in the graph. Reasonable tradeoff; consider folding dup rels into the keeper edge later if "appears at N paths" becomes research-relevant.

Neither is blocking. Nothing fixed in the worktree — both are design/awareness notes, handed back for your call. Verdict: APPROVE; do not merge on my behalf.

Review finding (medium): F08 dedup gives a byte-identical binary a single keeper target named after
its FIRST path; an alternate path gets only a `dedup_of` manifest ref, no row. reveal_dir matched on
live Target.name, so revealing a directory that a binary lives in ONLY via its deduped path silently
missed it (pre-F08 that path had its own target). Build the path->target set from the unpack manifest
(every entry under the prefix, incl. deduped paths) and reveal any hidden child it references, in
addition to the name match. New test reveals the directory that only the deduped path occupies.
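The manifest-aware lookup described above might look roughly like this. It is a sketch under assumptions: `targets_under_prefix`, the `names_by_target` mapping, and the manifest entry shape (`path`, `child_target_id`, optional `dedup_of`) are hypothetical stand-ins for whatever reveal.py actually uses.

```python
def targets_under_prefix(
    manifest: list[dict],
    prefix: str,
    names_by_target: dict[str, str],
) -> set[str]:
    """Collect target ids to reveal for a directory prefix."""
    prefix = prefix.rstrip("/") + "/"
    # Name match (the earlier behavior): targets whose own name sits under the prefix.
    by_name = {tid for tid, name in names_by_target.items() if name.startswith(prefix)}
    # Manifest match: every entry under the prefix, including deduped paths,
    # contributes the target it resolves to.
    by_manifest = {
        entry["child_target_id"]
        for entry in manifest
        if entry["path"].startswith(prefix) and entry.get("child_target_id")
    }
    return by_name | by_manifest


manifest = [
    {"path": "bin/busybox", "child_target_id": "t1"},
    {"path": "sbin/init", "child_target_id": "t1", "dedup_of": "bin/busybox"},
    {"path": "sbin/halt", "child_target_id": "t2"},
]
names_by_target = {"t1": "bin/busybox", "t2": "sbin/halt"}
revealed = targets_under_prefix(manifest, "sbin", names_by_target)
```

Here a name-only match on `sbin` would find just `t2`; the manifest pass also picks up `t1`, which lives under `sbin` only through its deduped path.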

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@branover merged commit e922699 into main on Jun 10, 2026
7 checks passed
@branover deleted the build/f08-dedup branch on June 10, 2026 at 19:59