feat: bound Ghidra analysis of a large monolith — size-scaled mem/tmpfs + cgroup heap + fast-profile (F13 heap half) by branover · Pull Request #248 · branover/hexgraph

branover · 2026-06-10T19:32:32Z

What — the deferred half of F13

A 100 MB+ ELF either crashed Ghidra at import (the BufferMgr "DB buffer" failure) or ran auto-analysis for hours. This PR turns that into a bounded run that returns usable recon (functions + call graph), with the cost paid once thanks to the persistent project. Three coordinated fixes, all applied at runtime (no sandbox rebuild):

1. Size-scaled `mem` + `tmpfs` (`sandbox/resources.py`)

`resource_spec_for_artifact` now raises `mem` and the `/scratch` tmpfs for a large artifact (≥64 MiB), in addition to the timeout. Ghidra's import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery tmpfs (which counts against the same cgroup). The increase is bounded by a hard cap and by a fraction of host RAM, so it never over-commits the box. It only ever widens, and small artifacts are byte-for-byte unchanged.
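The scaling rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `sandbox/resources.py`: the `ResourceSpec` class, the `scale_for_artifact` name, the scaling factor, the hard cap, and both divisors are assumptions chosen for the example. Only the 64 MiB threshold and the "widen only, fail safe to base" behavior come from the PR text.

```python
from dataclasses import dataclass, replace

MIB = 1024 * 1024
GIB = 1024 * MIB

# Illustrative assumptions only; the real values live in sandbox/resources.py.
LARGE_ARTIFACT_BYTES = 64 * MIB   # the ">= 64 MiB" threshold from the PR text
MEM_PER_ARTIFACT_BYTE = 64        # assumed scaling factor
MEM_HARD_CAP = 16 * GIB           # assumed hard cap
HOST_MEM_DIVISOR = 2              # assumed: use at most 1/2 of host RAM
TMPFS_MEM_DIVISOR = 3             # assumed: tmpfs gets at most 1/3 of mem


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical stand-in for the sandbox resource spec."""
    mem_bytes: int
    tmpfs_bytes: int


def scale_for_artifact(base, artifact_bytes, host_mem_bytes):
    """Return a spec that is never smaller than `base`."""
    if artifact_bytes < LARGE_ARTIFACT_BYTES:
        return base  # small artifacts: unchanged
    if not host_mem_bytes:
        return base  # unknown host RAM: fail safe to base
    cap = min(MEM_HARD_CAP, host_mem_bytes // HOST_MEM_DIVISOR)
    mem = min(artifact_bytes * MEM_PER_ARTIFACT_BYTE, cap)
    if mem <= base.mem_bytes:
        return base  # cap below base: fail safe to base
    tmpfs = max(base.tmpfs_bytes, mem // TMPFS_MEM_DIVISOR)
    return replace(base, mem_bytes=mem, tmpfs_bytes=tmpfs)
```

Returning the original `base` object on every fail-safe path makes the "small artifacts unchanged" property easy to pin in a test with an identity check.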

2. cgroup-aware Ghidra heap (`sandbox/probes/ghidra_probe.py`)

The probe sets `-XX:MaxRAMPercentage`, so the JVM heap auto-scales to whatever `--memory` cap the (now bigger) container got. There is no hardcoded `-Xmx` to drift from the cap.
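To make the relationship between the container cap and the heap concrete, here is a small sketch. It assumes a cgroup v2 host (where the limit is exposed at `/sys/fs/cgroup/memory.max`) and uses an arbitrary 75% figure in the usage lines. The percentage the probe actually passes is not stated in this PR, and the helper names are hypothetical.

```python
from pathlib import Path


def cgroup_memory_limit(path="/sys/fs/cgroup/memory.max"):
    """Return the cgroup v2 memory limit in bytes, or None if unlimited/unknown."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        return None
    if not raw.isdigit():  # "max" means no limit is set
        return None
    return int(raw)


def max_ram_percentage_flag(percent):
    """Format the JVM flag; the JVM expects a floating-point percentage."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"-XX:MaxRAMPercentage={percent:.1f}"


def expected_heap_bytes(limit_bytes, percent):
    """Approximate max heap the JVM would pick for a given container cap."""
    return int(limit_bytes * percent / 100)


# Example (75% is an assumed value, not the one used by ghidra_probe.py):
flag = max_ram_percentage_flag(75)
heap = expected_heap_bytes(8 * 1024**3, 75)
```

Because the flag is relative, raising the container's `--memory` cap raises the heap ceiling without touching the probe.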

3. Fast-profile + graceful bounding

`run_probe` advertises its wall-clock budget (`HEXGRAPH_PROBE_TIMEOUT_S`). The cold Ghidra path uses `-analysisTimeoutPerFile` to stop and save just under that budget, so it returns a partial-but-usable result instead of being killed empty. A `-preScript` (large binaries only) disables the passes proven pathological on a monolith: Call-Fixup Installer (O(n²) AddressSet), the per-processor Constant Reference Analyzer + Scalar Operand References, and the decompile-every-function passes. The call-graph/reference/function analyzers are kept. HexGraph decompiles on demand, so the batch decompile passes aren't needed for recon.
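The selection logic of such a fast profile can be sketched in plain Python over a dictionary of analyzer options. This is only an illustration: the real pre-script runs inside Ghidra against the program's analysis options, and the marker list, the `"Decompiler"` marker, the function names, and the sample option names below are assumptions.

```python
# Substrings identifying the passes to cut. The first three come from the PR
# text; "Decompiler" is an assumed marker for the decompile-every-function passes.
SLOW_MARKERS = (
    "Call-Fixup Installer",
    "Constant Reference Analyzer",
    "Scalar Operand References",
    "Decompiler",
)


def is_slow(name):
    """True only for top-level analyzer toggles that match a slow marker."""
    if "." in name:
        return False  # dotted names are sub-options, not the analyzer toggle
    return any(marker in name for marker in SLOW_MARKERS)


def apply_fast_profile(options):
    """Return a copy of `options` with the slow top-level analyzers disabled."""
    result = dict(options)
    for name, value in options.items():
        if isinstance(value, bool) and is_slow(name):
            result[name] = False
    return result


# Sample option names are illustrative, not a dump of a real Ghidra project.
sample = {
    "Call-Fixup Installer": True,
    "x86 Constant Reference Analyzer": True,
    "Decompiler Switch Analysis.Some Sub-Option": True,
    "Function Start Search": True,
    "Reference": True,
}
fast = apply_fast_profile(sample)
```

Matching on substrings lets one marker cover the per-processor variants of the Constant Reference Analyzer, while the dot guard leaves sub-options untouched.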

How the root cause was found

Thread-dumping a real large-ELF analysis showed the time was a sequence of expensive passes, not one function: first `CallFixupAnalyzer.added` → `AddressSet.add` (O(n²), ~25 min of CPU), then, once that is disabled, the `ConstantPropagationAnalyzer`. `-analysisTimeoutPerFile` alone can't rescue the run because those passes don't poll the cancel monitor mid-run, hence the fast profile.

Validation (real large ELF, private WITH_GHIDRA image)

Crash: at the old 2 GiB/512 MiB ceilings, import died at the BufferMgr flush (~10 min). At the scaled 9 GiB/3 GiB it imports + analyzes with headroom (no crash).
Result: the real probe path (fast profile applied) completed rc=0 and returned a real call graph, where the unbounded run never finished.

Changes

sandbox/resources.py, sandbox/runner.py, sandbox/probes/ghidra_probe.py.
Tests: test_size_aware_resources.py (mem/tmpfs scaling, host cap, docker flags, deadline env), test_ghidra_fast_profile.py (analysis-timeout sizing, fast-profile gating + kept/cut analyzer sets); the existing timeout test retargeted to the medium band.

No model change → no migration. No UI change. Full fast tier: 1365 passed.

…hing or hanging (F13 heap half)

The deferred half of F13: a 100 MB+ ELF either crashed Ghidra at import (the "DB buffer" / BufferMgr failure) or ran auto-analysis for hours. Three coordinated fixes turn that into a bounded run that returns usable recon (functions + call graph), with the persistent project so the cost is paid once.

1. Size-scaled mem + tmpfs (sandbox/resources.py). resource_spec_for_artifact now also raises `mem` and the `/scratch` tmpfs for a large artifact (>=64 MiB), not just the timeout: Ghidra's import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery tmpfs (which counts against the same cgroup). Bounded by a hard cap AND a fraction of host RAM so it never over-commits the box; only ever widens; small artifacts unchanged. (Verified end-to-end: at the old 2 GiB/512 MiB ceilings import died at the BufferMgr flush; at the scaled 9 GiB/3 GiB it imports + analyzes with headroom.)

2. cgroup-aware Ghidra heap (sandbox/probes/ghidra_probe.py). The probe sets `-XX:MaxRAMPercentage` so the JVM heap auto-scales to whatever `--memory` cap the (now bigger) container got; no hardcoded -Xmx to drift from the cap.

3. Fast-profile + graceful bounding. run_probe advertises its wall-clock budget (HEXGRAPH_PROBE_TIMEOUT_S); the cold Ghidra path uses `-analysisTimeoutPerFile` to stop+SAVE just under it (so it returns a partial-but-usable result instead of being killed empty), and a `-preScript` (large binaries only) disables the auto-analysis passes proven pathological on a monolith (Call-Fixup Installer (O(n^2) AddressSet), the per-processor Constant Reference Analyzer + Scalar Operand References, and the decompile-EVERY-function passes) while KEEPING the call-graph/reference/function analyzers. HexGraph decompiles on demand, so the batch decompile passes aren't needed for recon.

The root cause was identified by thread-dumping a real large-ELF analysis: the time was a SEQUENCE of expensive passes (Call-Fixup's AddressSet O(n^2), then the Constant Reference Analyzer), not one function.

All runtime: probes/resources mount at runtime, no sandbox rebuild.

Tests: test_size_aware_resources.py (mem/tmpfs scaling, host cap, docker flags, deadline env), test_ghidra_fast_profile.py (analysis-timeout sizing, fast-profile gating + the kept/cut analyzer sets); the existing timeout test retargeted to the medium band. Full fast tier: 1365 passed. No model/UI change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

branover · 2026-06-10T19:36:23Z

Merge-gate review — PR #248 — VERDICT: APPROVE

Independent reviewer (did not author this PR). Ran /code-review high --comment and /security-review on git diff origin/main...HEAD, plus a manual correctness/security pass. Focused tests green: test_size_aware_resources.py test_ghidra_fast_profile.py test_size_aware_timeout.py → 28 passed.

Summary

F13 heap-half: make Ghidra analysis of a 100 MB+ ELF finish instead of OOM-ing (DB-buffer) or hanging. Three runtime fixes — size-scale mem/tmpfs (resources.py), advertise the wall-clock budget to the probe (runner.py), and JVM MaxRAMPercentage + -analysisTimeoutPerFile + a fast-profile preScript (ghidra_probe.py). Clean, well-tested, correctly scoped.

Security — PASS

Size-scaling touches only mem/tmpfs/timeout; it never derives or relaxes --network none/--cap-drop ALL/--read-only/--no-new-privileges/--user 1000/seccomp (those stay unconditional in _hardening_args), and lives in the spec, not policy.py — no execution/egress/rehost/remote gate is relaxed. Scaling only ever widens ceilings, never a security boundary.
_parse_bytes/_host_mem_total_bytes fail safe (unparseable → 0/None → base unchanged; never crash a probe). Host-fraction + hard caps prevent over-commit and never shrink below base.
FAST_PROFILE_SCRIPT is a static, trusted Jython literal with no target/user interpolation; the subprocess.run invocation is an argv list (no shell) of fixed flags + the already-validated /artifact path + str(int)/constant args. No command injection, no new mount/capability. No hostile-target byte flows into any new code path.
New HEXGRAPH_PROBE_TIMEOUT_S env carries an integer budget only — informational, no secret. No secret logged/stored/returned.
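The "argv list, no shell" point above can be illustrated with a short sketch. The function name, install path, project location, project name, and script name are hypothetical placeholders. Only `/artifact`, `-preScript`, and `-analysisTimeoutPerFile` are taken from the review text, and `-import` is the standard headless import flag.

```python
def build_headless_argv(timeout_s, use_fast_profile):
    """Build a fixed argv list for a headless Ghidra import (sketch only)."""
    argv = [
        "/opt/ghidra/support/analyzeHeadless",  # hypothetical install path
        "/scratch/project",                     # hypothetical project location
        "hexgraph",                             # hypothetical project name
        "-import", "/artifact",                 # already-validated artifact path
    ]
    if use_fast_profile:
        # Static script name: nothing from the target is interpolated here.
        argv += ["-preScript", "FastProfile.py"]  # hypothetical script name
    # str(int(...)) guarantees the only variable argument is a plain integer.
    argv += ["-analysisTimeoutPerFile", str(int(timeout_s))]
    return argv


argv = build_headless_argv(1620, use_fast_profile=True)
```

Passing such a list to `subprocess.run(argv)` without `shell=True` means each element reaches the process as a single argument, so no shell parsing is applied to any of them.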

Correctness — PASS

Math/caps verified: monotonic, small-artifact no-op (≤32 MiB for the timeout; below the 64 MiB threshold for mem/tmpfs, which scale at ≥64 MiB), an unconstrained spec is left untouched, tmpfs ≤ SIZE_TMPFS_MEM_FRACTION of mem, host-RAM-fraction cap, _parse_bytes/_fmt_mb round-trips. All fail-safe to base when a cap falls below base.
-analysisTimeoutPerFile budget = advertised HEXGRAPH_PROBE_TIMEOUT_S − 180s; that env is the same value as the external docker kill (runner.py:493/547), so analysis halts before the kill with headroom to save. Consistent.
Fast-profile gating (≥100 MiB), cold-only (warm -process uses -noanalysis), and the kept-vs-cut analyzer sets are correct: the call-graph/reference/function-discovery analyzers are preserved; the _slow "." in name guard correctly skips dotted sub-options and toggles only the top-level analyzer booleans. Tests pin both the disabled and kept sets.
Retargeted test_size_aware_timeout.py correctly pins the medium-band (48 MiB) behavior: timeout widens, mem/cpu/pids/tmpfs unchanged.
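The fail-safe parsing and round-trip property mentioned in the correctness notes could look something like the sketch below. It is not the repository's `_parse_bytes`/`_fmt_mb`: the function names, the accepted suffix set (Docker-style `k`/`m`/`g`, with an optional trailing `b`), and the choice to return 0 on bad input are assumptions modeled on the review's description.

```python
import re

MIB = 1024 * 1024
_UNITS = {"": 1, "k": 1024, "m": MIB, "g": 1024 * MIB}
_SIZE_RE = re.compile(r"\s*(\d+)\s*([kmg]?)b?\s*")


def parse_bytes(text):
    """Parse '2g', '512m', '1024' and similar into bytes; return 0 if unparseable."""
    match = _SIZE_RE.fullmatch(str(text).lower())
    if match is None:
        return 0  # fail safe: the caller keeps the base value
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def fmt_mb(num_bytes):
    """Format a byte count as a whole-MiB Docker-style size string."""
    return f"{num_bytes // MIB}m"


# Round trip holds for any whole number of MiB.
nine_gib = 9 * 1024 * MIB
round_tripped = parse_bytes(fmt_mb(nine_gib))
```

Returning 0 instead of raising keeps a malformed configuration value from crashing a probe; the caller can treat 0 as "leave the base spec alone".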

Scope checks — PASS

No model/schema/migration change (pure runtime tuning) → no migration owed. No UI behavior change → no docs/dev/ux-contract.md update owed. Single caller of the size-aware spec is run_probe; start_detached (fuzz) keeps its own hard-cap spec — verified.

Findings

No blocking findings. No correctness bugs, no security regressions.
Low / non-blocking (1, posted inline): _analysis_timeout_args() returns [] when HEXGRAPH_PROBE_TIMEOUT_S − 180 < 60 (i.e. a configured resources.sandbox.timeout < 240s). A user who lowers the sandbox timeout below 240s loses the graceful stop on a large ELF and could be torn down with nothing saved — the very failure this PR fixes at the default. Harmless at the 300s default; consider keying the floor off the size-scaled budget or documenting it. Discussion-only, not a merge blocker.

Nothing fixed by the reviewer (no blocking issue). Recommend merge once CI is green.

…lock budget (review #248)

Review nit (low): _analysis_timeout_args returned [] once HEXGRAPH_PROBE_TIMEOUT_S - 180 < 60, so lowering resources.sandbox.timeout below 240s silently dropped the -analysisTimeoutPerFile graceful save this PR adds for a large ELF. Now any non-trivial budget (>=120s) always gets a stop, floored at ~half the wall-clock (never below) while still leaving import/save headroom for a large one.

Test split: large budget = total-overhead; a 200s budget now floors to 100s; only <120s / absent / unparseable yields [].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
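The behavior described in this follow-up commit can be sketched as below. The 180s overhead, the 120s minimum, and the "half the wall-clock" floor come from the text; the exact formula (`max(total - overhead, total // 2)`), the constant names, and the function signature are assumptions about how `_analysis_timeout_args` might combine them.

```python
import os

SAVE_OVERHEAD_S = 180  # headroom for import/save, per the review discussion
MIN_BUDGET_S = 120     # below this, no graceful stop is requested


def analysis_timeout_args(env=None):
    """Return the -analysisTimeoutPerFile args for the advertised budget, or []."""
    env = os.environ if env is None else env
    try:
        total = int(env.get("HEXGRAPH_PROBE_TIMEOUT_S"))
    except (TypeError, ValueError):
        return []  # absent or unparseable budget: no flag
    if total < MIN_BUDGET_S:
        return []  # trivial budget: no flag
    # Large budgets keep the full overhead; small ones floor at half the budget.
    limit = max(total - SAVE_OVERHEAD_S, total // 2)
    return ["-analysisTimeoutPerFile", str(limit)]


large = analysis_timeout_args({"HEXGRAPH_PROBE_TIMEOUT_S": "1800"})
small = analysis_timeout_args({"HEXGRAPH_PROBE_TIMEOUT_S": "200"})
```

With this shape, an 1800s budget yields a 1620s analysis limit and a 200s budget yields 100s, which matches the test split described in the commit message.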


branover merged commit 6b710db into main Jun 10, 2026
7 checks passed

branover deleted the build/ghidra-heap branch June 10, 2026 19:43

branover mentioned this pull request Jun 10, 2026

chore(main): release hexgraph 0.8.0 #245

Merged

branover merged 2 commits into main from build/ghidra-heap
