feat: bound Ghidra analysis of a large monolith — size-scaled mem/tmpfs + cgroup heap + fast-profile (F13 heap half)#248

Merged
branover merged 2 commits into main from build/ghidra-heap on Jun 10, 2026

@branover

Owner

What — the deferred half of F13

A 100 MB+ ELF either crashed Ghidra at import (the BufferMgr "DB buffer" failure) or ran auto-analysis for hours. This turns that into a bounded run that returns usable recon (functions + call graph), paid once via the persistent project. Three coordinated, all-runtime fixes (no sandbox rebuild):

1. Size-scaled mem + tmpfs (sandbox/resources.py)

resource_spec_for_artifact now also raises mem and the /scratch tmpfs for a large artifact (≥64 MiB), not just the timeout. Ghidra's import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery tmpfs (which counts against the same cgroup). Bounded by a hard cap and a fraction of host RAM (never over-commits the box); only ever widens; small artifacts byte-for-byte unchanged.
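
A minimal sketch of the widening rule described above, not the code in sandbox/resources.py. The 64 MiB threshold and the 2 GiB/512 MiB base come from this description. Every other name and constant (`ResourceSpec`, `scale_for_artifact`, the growth rate, the 16 GiB hard cap, the half-of-host-RAM and third-of-mem fractions) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MIB = 1024 * 1024
GIB = 1024 * MIB

LARGE_ARTIFACT_BYTES = 64 * MIB   # threshold stated in the PR
MEM_PER_ARTIFACT_BYTE = 64        # assumed growth rate, illustrative only
MEM_HARD_CAP_BYTES = 16 * GIB     # assumed hard cap
HOST_MEM_DIVISOR = 2              # assumed: at most half of host RAM
TMPFS_MEM_DIVISOR = 3             # assumed: tmpfs at most a third of mem


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical stand-in for the sandbox resource spec."""
    mem_bytes: int
    tmpfs_bytes: int


BASE = ResourceSpec(mem_bytes=2 * GIB, tmpfs_bytes=512 * MIB)


def scale_for_artifact(base: ResourceSpec, artifact_bytes: int,
                       host_mem_bytes: Optional[int] = None) -> ResourceSpec:
    """Widen mem and tmpfs for a large artifact; never shrink below base."""
    if artifact_bytes < LARGE_ARTIFACT_BYTES:
        return base  # small artifacts keep the exact base spec

    # The ceiling is the hard cap, tightened by a fraction of host RAM if known.
    cap = MEM_HARD_CAP_BYTES
    if host_mem_bytes:
        cap = min(cap, host_mem_bytes // HOST_MEM_DIVISOR)

    wanted = base.mem_bytes + MEM_PER_ARTIFACT_BYTE * artifact_bytes
    mem = min(wanted, cap)
    if mem <= base.mem_bytes:
        return base  # the cap fell at or below base: fail safe, change nothing

    # tmpfs counts against the same cgroup, so keep it a fraction of mem.
    tmpfs = max(base.tmpfs_bytes, mem // TMPFS_MEM_DIVISOR)
    return ResourceSpec(mem_bytes=mem, tmpfs_bytes=tmpfs)
```

With these assumed constants, a 128 MiB artifact on a 64 GiB host widens to 10 GiB of memory, while the same artifact on a 4 GiB host leaves the base spec untouched because the host-fraction cap is not above base.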

2. cgroup-aware Ghidra heap (sandbox/probes/ghidra_probe.py)

The probe sets -XX:MaxRAMPercentage, so the JVM heap auto-scales to whatever --memory cap the (now bigger) container got — no hardcoded -Xmx to drift from the cap.
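
To make the scaling concrete, here is a small sketch of how a percentage-based heap tracks the container cap. `-XX:MaxRAMPercentage` is a standard HotSpot flag. The 75% value and both helper names are assumptions, and how the probe actually passes the flag to Ghidra is not shown in this PR.

```python
GIB = 1024 ** 3


def jvm_heap_flag(percent: float) -> str:
    """Build the HotSpot flag that sizes the heap relative to available RAM."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"-XX:MaxRAMPercentage={percent:.1f}"


def expected_max_heap_bytes(container_mem_bytes: int, percent: float) -> int:
    """Approximate max heap the JVM derives from a cgroup memory limit."""
    return int(container_mem_bytes * percent / 100)
```

Under the assumed 75%, a container capped at 8 GiB would get a heap of about 6 GiB, and raising the cap raises the heap with no flag change.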

3. Fast-profile + graceful bounding

run_probe advertises its wall-clock budget (HEXGRAPH_PROBE_TIMEOUT_S); the cold Ghidra path uses -analysisTimeoutPerFile to stop + save just under it (returns a partial-but-usable result instead of being killed empty), and a -preScript (large binaries only) disables the passes proven pathological on a monolith — Call-Fixup Installer (O(n²) AddressSet), the per-processor Constant Reference Analyzer + Scalar Operand References, and the decompile-every-function passes — while keeping the call-graph/reference/function analyzers. HexGraph decompiles on demand, so the batch decompile passes aren't needed for recon.
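
The selection logic of the fast profile can be sketched as a pure function over analysis option names. This is an illustration, not the contents of FAST_PROFILE_SCRIPT: the marker strings, the function names, and the sample option names are assumptions. The 100 MiB gate and the rule of skipping dotted sub-options are taken from the review of this PR.

```python
from typing import Iterable, List

MIB = 1024 * 1024
FAST_PROFILE_MIN_BYTES = 100 * MIB  # gating threshold noted in the review

# Assumed substrings identifying the passes named as pathological above.
SLOW_MARKERS = (
    "Call-Fixup Installer",
    "Constant Reference Analyzer",  # matches any per-processor variant
    "Scalar Operand References",
    "Decompiler",                   # the decompile-every-function passes
)


def use_fast_profile(artifact_bytes: int) -> bool:
    """Only large binaries get the reduced analyzer set."""
    return artifact_bytes >= FAST_PROFILE_MIN_BYTES


def is_slow(option_name: str) -> bool:
    """True for a top-level analyzer toggle that should be switched off."""
    if "." in option_name:
        return False  # dotted names are sub-options, not analyzer booleans
    return any(marker in option_name for marker in SLOW_MARKERS)


def options_to_disable(option_names: Iterable[str]) -> List[str]:
    """Return the analyzer toggles to set to false, in input order."""
    return [name for name in option_names if is_slow(name)]
```

In a pre-script, the returned names would each be set to false in the program's analysis options before auto-analysis starts. Analyzers that match no marker, such as function discovery and reference analyzers, are left enabled.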

How the root cause was found

Thread-dumping a real large-ELF analysis showed the time was a sequence of expensive passes, not one function: CallFixupAnalyzer.added → AddressSet.add (O(n²), ~25 min of CPU), then once that's disabled, the ConstantPropagationAnalyzer. -analysisTimeoutPerFile alone can't rescue it because those passes don't poll the cancel monitor mid-run — hence the fast profile.

Validation (real large ELF, private WITH_GHIDRA image)

  • Crash: at the old 2 GiB/512 MiB ceilings, import died at the BufferMgr flush (~10 min). At the scaled 9 GiB/3 GiB it imports + analyzes with headroom (no crash).
  • Result: the real probe path (fast profile applied) completed rc=0 and returned a real call graph, where the unbounded run never finished.

Changes

  • sandbox/resources.py, sandbox/runner.py, sandbox/probes/ghidra_probe.py.
  • Tests: test_size_aware_resources.py (mem/tmpfs scaling, host cap, docker flags, deadline env), test_ghidra_fast_profile.py (analysis-timeout sizing, fast-profile gating + kept/cut analyzer sets); the existing timeout test retargeted to the medium band.

No model change → no migration. No UI change. Full fast tier: 1365 passed.

…hing or hanging (F13 heap half)

The deferred half of F13: a 100 MB+ ELF either crashed Ghidra at import (the "DB buffer" /
BufferMgr failure) or ran auto-analysis for hours. Three coordinated fixes turn that into a
bounded run that returns usable recon (functions + call graph), with the persistent project so the
cost is paid once.

1. Size-scaled mem + tmpfs (sandbox/resources.py). resource_spec_for_artifact now also raises
   `mem` and the `/scratch` tmpfs for a large artifact (>=64 MiB), not just the timeout — Ghidra's
   import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery
   tmpfs (which counts against the same cgroup). Bounded by a hard cap AND a fraction of host RAM
   so it never over-commits the box; only ever widens; small artifacts unchanged. (Verified
   end-to-end: at the old 2 GiB/512 MiB ceilings import died at the BufferMgr flush; at the scaled
   9 GiB/3 GiB it imports + analyzes with headroom.)

2. cgroup-aware Ghidra heap (sandbox/probes/ghidra_probe.py). The probe sets
   `-XX:MaxRAMPercentage` so the JVM heap auto-scales to whatever `--memory` cap the (now bigger)
   container got — no hardcoded -Xmx to drift from the cap.

3. Fast-profile + graceful bounding. run_probe advertises its wall-clock budget
   (HEXGRAPH_PROBE_TIMEOUT_S); the cold Ghidra path uses `-analysisTimeoutPerFile` to stop+SAVE
   just under it (so it returns a partial-but-usable result instead of being killed empty), and a
   `-preScript` (large binaries only) disables the auto-analysis passes proven pathological on a
   monolith — Call-Fixup Installer (O(n^2) AddressSet), the per-processor Constant Reference
   Analyzer + Scalar Operand References, and the decompile-EVERY-function passes — while KEEPING
   the call-graph/reference/function analyzers. HexGraph decompiles on demand, so the batch
   decompile passes aren't needed for recon.

The root cause was identified by thread-dumping a real large-ELF analysis: the time was a SEQUENCE
of expensive passes (Call-Fixup's AddressSet O(n^2), then the Constant Reference Analyzer), not one
function. All runtime — probes/resources mount at runtime, no sandbox rebuild.

Tests: test_size_aware_resources.py (mem/tmpfs scaling, host cap, docker flags, deadline env),
test_ghidra_fast_profile.py (analysis-timeout sizing, fast-profile gating + the kept/cut analyzer
sets); the existing timeout test retargeted to the medium band. Full fast tier: 1365 passed.
No model/UI change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/hexgraph/sandbox/probes/ghidra_probe.py Outdated
@branover

Owner Author

Merge-gate review — PR #248 — VERDICT: APPROVE

Independent reviewer (did not author this PR). Ran /code-review high --comment and /security-review on git diff origin/main...HEAD, plus a manual correctness/security pass. Focused tests green (test_size_aware_resources.py, test_ghidra_fast_profile.py, test_size_aware_timeout.py): 28 passed.

Summary

F13 heap-half: make Ghidra analysis of a 100 MB+ ELF finish instead of OOM-ing (DB-buffer) or hanging. Three runtime fixes — size-scale mem/tmpfs (resources.py), advertise the wall-clock budget to the probe (runner.py), and JVM MaxRAMPercentage + -analysisTimeoutPerFile + a fast-profile preScript (ghidra_probe.py). Clean, well-tested, correctly scoped.

Security — PASS

  • Size-scaling touches only mem/tmpfs/timeout; it never derives or relaxes --network none/--cap-drop ALL/--read-only/--no-new-privileges/--user 1000/seccomp (those stay unconditional in _hardening_args), and lives in the spec, not policy.py — no execution/egress/rehost/remote gate is relaxed. Scaling only ever widens ceilings, never a security boundary.
  • _parse_bytes/_host_mem_total_bytes fail safe (unparseable → 0/None → base unchanged; never crash a probe). Host-fraction + hard caps prevent over-commit and never shrink below base.
  • FAST_PROFILE_SCRIPT is a static, trusted Jython literal with no target/user interpolation; the subprocess.run invocation is an argv list (no shell) of fixed flags + the already-validated /artifact path + str(int)/constant args. No command injection, no new mount/capability. No hostile-target byte flows into any new code path.
  • New HEXGRAPH_PROBE_TIMEOUT_S env carries an integer budget only — informational, no secret. No secret logged/stored/returned.
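
The fail-safe parsing noted in the second bullet can be sketched as follows. This is an assumed shape, not the PR's `_parse_bytes`/`_fmt_mb`: the accepted size syntax (Docker-style `512m`, `2g`) and the `widen` helper are hypothetical.

```python
import re

MIB = 1024 * 1024
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([kmg]?)b?\s*$", re.IGNORECASE)


def parse_bytes(text) -> int:
    """Parse a size like '512m' or '2g'; return 0 instead of raising."""
    if not isinstance(text, str):
        return 0
    match = _SIZE_RE.match(text)
    if not match:
        return 0
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def fmt_mb(num_bytes: int) -> str:
    """Format a byte count as whole mebibytes, e.g. '9216m'."""
    return f"{num_bytes // MIB}m"


def widen(base: str, wanted_bytes: int) -> str:
    """Return a widened size string, or base unchanged if unsure."""
    base_bytes = parse_bytes(base)
    if base_bytes == 0 or wanted_bytes <= base_bytes:
        return base  # unparseable base or no increase: keep the base value
    return fmt_mb(wanted_bytes)
```

An unparseable base value falls through to "return base", so a malformed setting can leave the limits unscaled but cannot crash the probe or shrink a limit.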

Correctness — PASS

  • Math/caps verified: monotonic, small-artifact no-op (≤32 MiB timeout / ≤64 MiB mem-tmpfs thresholds), unconstrained left untouched, tmpfs ≤ SIZE_TMPFS_MEM_FRACTION of mem, host-RAM-fraction cap, _parse_bytes/_fmt_mb round-trips. All fail-safe to base when a cap falls below base.
  • -analysisTimeoutPerFile budget = advertised HEXGRAPH_PROBE_TIMEOUT_S − 180s; that env is the same value as the external docker kill (runner.py:493/547), so analysis halts before the kill with headroom to save. Consistent.
  • Fast-profile gating (≥100 MiB), cold-only (warm -process uses -noanalysis), and the kept-vs-cut analyzer sets are correct: the call-graph/reference/function-discovery analyzers are preserved; the _slow "." in name guard correctly skips dotted sub-options and toggles only the top-level analyzer booleans. Tests pin both the disabled and kept sets.
  • Retargeted test_size_aware_timeout.py correctly pins the medium-band (48 MiB) behavior: timeout widens, mem/cpu/pids/tmpfs unchanged.

Scope checks — PASS

No model/schema/migration change (pure runtime tuning) → no migration owed. No UI behavior change → no docs/dev/ux-contract.md update owed. Single caller of the size-aware spec is run_probe; start_detached (fuzz) keeps its own hard-cap spec — verified.

Findings

  • No blocking findings. No correctness bugs, no security regressions.
  • Low / non-blocking (1, posted inline): _analysis_timeout_args() returns [] when HEXGRAPH_PROBE_TIMEOUT_S − 180 < 60 (i.e. a configured resources.sandbox.timeout < 240s). A user who lowers the sandbox timeout below 240s loses the graceful stop on a large ELF and could be torn down with nothing saved — the very failure this PR fixes at the default. Harmless at the 300s default; consider keying the floor off the size-scaled budget or documenting it. Discussion-only, not a merge blocker.

Nothing fixed by the reviewer (no blocking issue). Recommend merge once CI is green.

…lock budget (review #248)

Review nit (low): _analysis_timeout_args returned [] once HEXGRAPH_PROBE_TIMEOUT_S - 180 < 60, so
lowering resources.sandbox.timeout below 240s silently dropped the -analysisTimeoutPerFile graceful
save this PR adds for a large ELF. Now any non-trivial budget (>=120s) always gets a stop: the
analysis timeout is never set below ~half the wall-clock budget, and a large budget still keeps the
full import/save overhead as headroom. Test split: a large budget yields total minus overhead; a
200s budget now floors to 100s; only a <120s, absent, or unparseable budget yields [].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
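
The revised budget rule from the commit above can be reconstructed roughly as below. This is a sketch inferred from the commit message, not the merged `_analysis_timeout_args`. The 180s overhead and 120s minimum come from the text. Taking the env as a parameter and using integer halving for "~half" are assumptions.

```python
from typing import List, Mapping

SAVE_OVERHEAD_S = 180  # import/save headroom subtracted from a large budget
MIN_BUDGET_S = 120     # below this, no graceful stop is requested


def analysis_timeout_args(env: Mapping[str, str]) -> List[str]:
    """Build the analyzeHeadless timeout args from the advertised budget."""
    try:
        budget = int(env.get("HEXGRAPH_PROBE_TIMEOUT_S"))
    except (TypeError, ValueError):
        return []  # absent or unparseable budget: no timeout flag
    if budget < MIN_BUDGET_S:
        return []
    # Large budgets use total minus overhead; small ones floor at half.
    stop_after = max(budget - SAVE_OVERHEAD_S, budget // 2)
    return ["-analysisTimeoutPerFile", str(stop_after)]
```

For example, a 200s budget floors to 100s because 200 - 180 would leave only 20s, while a 3600s budget yields 3420s.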
@branover branover merged commit 6b710db into main Jun 10, 2026
7 checks passed
@branover branover deleted the build/ghidra-heap branch June 10, 2026 19:43