feat: bound Ghidra analysis of a large monolith: size-scaled mem/tmpfs + cgroup heap + fast-profile (F13 heap half) #248
…hing or hanging (F13 heap half)

The deferred half of F13: a 100 MB+ ELF either crashed Ghidra at import (the "DB buffer" / BufferMgr failure) or ran auto-analysis for hours. Three coordinated fixes turn that into a bounded run that returns usable recon (functions + call graph), and the persistent project means the cost is paid once.

1. Size-scaled mem + tmpfs (sandbox/resources.py). resource_spec_for_artifact now also raises `mem` and the `/scratch` tmpfs for a large artifact (>=64 MiB), not just the timeout. Ghidra's import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery tmpfs (which counts against the same cgroup). The increase is bounded by a hard cap AND a fraction of host RAM so it never over-commits the box; it only ever widens; small artifacts are unchanged. (Verified end-to-end: at the old 2 GiB/512 MiB ceilings import died at the BufferMgr flush; at the scaled 9 GiB/3 GiB it imports + analyzes with headroom.)

2. cgroup-aware Ghidra heap (sandbox/probes/ghidra_probe.py). The probe sets `-XX:MaxRAMPercentage` so the JVM heap auto-scales to whatever `--memory` cap the (now bigger) container got, with no hardcoded -Xmx to drift from the cap.

3. Fast-profile + graceful bounding. run_probe advertises its wall-clock budget (HEXGRAPH_PROBE_TIMEOUT_S); the cold Ghidra path uses `-analysisTimeoutPerFile` to stop+SAVE just under it (so it returns a partial-but-usable result instead of being killed empty), and a `-preScript` (large binaries only) disables the auto-analysis passes proven pathological on a monolith: Call-Fixup Installer (O(n^2) AddressSet), the per-processor Constant Reference Analyzer + Scalar Operand References, and the decompile-EVERY-function passes. It KEEPS the call-graph/reference/function analyzers. HexGraph decompiles on demand, so the batch decompile passes aren't needed for recon.

The root cause was identified by thread-dumping a real large-ELF analysis: the time went to a SEQUENCE of expensive passes (Call-Fixup's AddressSet O(n^2), then the Constant Reference Analyzer), not one function. All changes are runtime: probes/resources mount at runtime, no sandbox rebuild.

Tests: test_size_aware_resources.py (mem/tmpfs scaling, host cap, docker flags, deadline env), test_ghidra_fast_profile.py (analysis-timeout sizing, fast-profile gating + the kept/cut analyzer sets); the existing timeout test is retargeted to the medium band. Full fast tier: 1365 passed. No model/UI change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
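As a rough illustration of the size-scaling rule in fix 1, the sketch below widens `mem` and the `/scratch` tmpfs only for artifacts at or above 64 MiB and clamps the result by a hard cap and a fraction of host RAM. It is not the code in sandbox/resources.py: `scaled_spec`, `docker_flags`, the scaling factors, the hard caps, and the host-RAM fraction are all hypothetical. Only the 2 GiB / 512 MiB defaults and the 64 MiB threshold come from the text above.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
GIB = 1024 * MIB

BASE_MEM = 2 * GIB            # default ceiling quoted in the commit message
BASE_TMPFS = 512 * MIB        # default /scratch tmpfs quoted in the commit message
LARGE_THRESHOLD = 64 * MIB    # "large artifact" threshold quoted in the commit message

# Everything below is an assumption made up for this sketch.
MEM_FACTOR = 64               # bytes of container memory per artifact byte
TMPFS_FACTOR = 24             # bytes of tmpfs per artifact byte
HARD_CAP_MEM = 12 * GIB
HARD_CAP_TMPFS = 4 * GIB
HOST_RAM_FRACTION = 0.5


@dataclass(frozen=True)
class ResourceSpec:
    mem_bytes: int
    tmpfs_bytes: int


def scaled_spec(artifact_bytes: int, host_ram_bytes: int) -> ResourceSpec:
    """Return a spec that only ever widens the defaults, within both caps."""
    if artifact_bytes < LARGE_THRESHOLD:
        # Small artifacts keep the defaults unchanged.
        return ResourceSpec(BASE_MEM, BASE_TMPFS)
    host_budget = int(host_ram_bytes * HOST_RAM_FRACTION)
    mem = min(artifact_bytes * MEM_FACTOR, HARD_CAP_MEM, host_budget)
    mem = max(BASE_MEM, mem)
    # tmpfs pages count against the same cgroup, so keep tmpfs well under mem.
    tmpfs = min(artifact_bytes * TMPFS_FACTOR, HARD_CAP_TMPFS, mem // 3)
    tmpfs = max(BASE_TMPFS, tmpfs)
    return ResourceSpec(mem, tmpfs)


def docker_flags(spec: ResourceSpec) -> list[str]:
    """Render the spec as `docker run` flags (sizes in bytes)."""
    return [
        "--memory", str(spec.mem_bytes),
        "--tmpfs", f"/scratch:size={spec.tmpfs_bytes}",
    ]
```

Taking the minimum of the scaled value, the hard cap, and the host budget before applying `max` against the default is what gives the "only ever widens, never over-commits" behaviour described above.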
Merge-gate review: PR #248. VERDICT: APPROVE

Independent reviewer (did not author this PR). Ran …

Summary
F13 heap-half: make Ghidra analysis of a 100 MB+ ELF finish instead of OOM-ing (DB-buffer) or hanging. Three runtime fixes: size-scale …

Security: PASS
Correctness: PASS
Scope checks: PASS
No model/schema/migration change (pure runtime tuning) → no migration owed. No UI behavior change → no …

Findings
Nothing fixed by the reviewer (no blocking issue). Recommend merge once CI is green.
…lock budget (review #248)

Review nit (low): _analysis_timeout_args returned [] once HEXGRAPH_PROBE_TIMEOUT_S - 180 < 60, so lowering resources.sandbox.timeout below 240s silently dropped the -analysisTimeoutPerFile graceful save that this PR adds for a large ELF. Now any non-trivial budget (>=120s) always gets a stop: the analysis timeout is floored at roughly half the wall-clock budget and never goes below that, while a large budget still leaves import/save headroom. Test split: a large budget yields total minus overhead; a 200s budget now floors to 100s; only a <120s, absent, or unparseable budget yields [].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
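The sizing rule in this commit message can be sketched as follows. This is a reconstruction from the numbers quoted above (180s overhead, 120s minimum budget, half-budget floor), not the actual `_analysis_timeout_args`; the function name `sketch_analysis_timeout_args` and the `env` parameter are hypothetical.

```python
import os

OVERHEAD_S = 180     # import/save headroom, from the "- 180" in the review note
MIN_BUDGET_S = 120   # below this, no graceful stop is requested


def sketch_analysis_timeout_args(env=None) -> list[str]:
    """Turn the advertised wall-clock budget into -analysisTimeoutPerFile args."""
    env = os.environ if env is None else env
    raw = env.get("HEXGRAPH_PROBE_TIMEOUT_S")
    try:
        budget = int(raw)
    except (TypeError, ValueError):
        # Absent or unparseable budget: no timeout argument at all.
        return []
    if budget < MIN_BUDGET_S:
        return []
    # Large budgets keep the full overhead; small ones floor at half the budget.
    timeout = max(budget - OVERHEAD_S, budget // 2)
    return ["-analysisTimeoutPerFile", str(timeout)]
```

With these constants the two rules meet at a 360s budget: above it the overhead subtraction wins, below it the half-budget floor does.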
What: the deferred half of F13

A 100 MB+ ELF either crashed Ghidra at import (the `BufferMgr` "DB buffer" failure) or ran auto-analysis for hours. This turns that into a bounded run that returns usable recon (functions + call graph), paid once via the persistent project. Three coordinated, all-runtime fixes (no sandbox rebuild):

1. Size-scaled `mem` + `tmpfs` (`sandbox/resources.py`)

`resource_spec_for_artifact` now also raises `mem` and the `/scratch` tmpfs for a large artifact (≥64 MiB), not just the timeout. Ghidra's import/auto-analysis of a huge ELF exhausts the 2 GiB heap and fills the 512 MiB DB/recovery tmpfs (which counts against the same cgroup). Bounded by a hard cap and a fraction of host RAM (never over-commits the box); only ever widens; small artifacts byte-for-byte unchanged.

2. cgroup-aware Ghidra heap (`sandbox/probes/ghidra_probe.py`)

The probe sets `-XX:MaxRAMPercentage`, so the JVM heap auto-scales to whatever `--memory` cap the (now bigger) container got, with no hardcoded `-Xmx` to drift from the cap.

3. Fast-profile + graceful bounding

`run_probe` advertises its wall-clock budget (`HEXGRAPH_PROBE_TIMEOUT_S`); the cold Ghidra path uses `-analysisTimeoutPerFile` to stop + save just under it (returns a partial-but-usable result instead of being killed empty), and a `-preScript` (large binaries only) disables the passes proven pathological on a monolith: Call-Fixup Installer (O(n²) `AddressSet`), the per-processor Constant Reference Analyzer + Scalar Operand References, and the decompile-every-function passes. The call-graph/reference/function analyzers are kept. HexGraph decompiles on demand, so the batch decompile passes aren't needed for recon.

How the root cause was found
Thread-dumping a real large-ELF analysis showed the time was a sequence of expensive passes, not one function: `CallFixupAnalyzer.added` → `AddressSet.add` (O(n²), ~25 min of CPU), then, once that is disabled, the `ConstantPropagationAnalyzer`. `-analysisTimeoutPerFile` alone can't rescue it because those passes don't poll the cancel monitor mid-run, hence the fast profile.

Validation (real large ELF, private WITH_GHIDRA image)
- At the old 2 GiB/512 MiB ceilings, import died at the `BufferMgr` flush (~10 min). At the scaled 9 GiB/3 GiB it imports + analyzes with headroom (no crash).
- The bounded run exited `rc=0` and returned a real call graph, where the unbounded run never finished.

Changes
- `sandbox/resources.py`, `sandbox/runner.py`, `sandbox/probes/ghidra_probe.py`.
- Tests: `test_size_aware_resources.py` (mem/tmpfs scaling, host cap, docker flags, deadline env), `test_ghidra_fast_profile.py` (analysis-timeout sizing, fast-profile gating + kept/cut analyzer sets); the existing timeout test retargeted to the medium band.

No model change → no migration. No UI change. Full fast tier: 1365 passed.
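To make the fast-profile gating concrete, here is a minimal sketch of how a cold headless command line might be assembled, with the `-preScript` added for large binaries only. The helper `sketch_headless_argv`, the script name `DisableSlowAnalyzers.java`, and reusing the 64 MiB threshold for the gating are assumptions for illustration; only the `analyzeHeadless` flags themselves (`-import`, `-preScript`, `-analysisTimeoutPerFile`) are real Ghidra options.

```python
FAST_PROFILE_THRESHOLD = 64 * 1024 * 1024   # assumed: same 64 MiB "large" cutoff
FAST_PROFILE_SCRIPT = "DisableSlowAnalyzers.java"   # hypothetical pre-script name


def sketch_headless_argv(project_dir, project_name, binary_path,
                         binary_bytes, timeout_args):
    """Build an analyzeHeadless argv; timeout_args may be an empty list."""
    argv = ["analyzeHeadless", project_dir, project_name, "-import", binary_path]
    if binary_bytes >= FAST_PROFILE_THRESHOLD:
        # Large binaries only: the pre-script runs before auto-analysis and
        # would switch off the pathological passes while keeping the rest.
        argv += ["-preScript", FAST_PROFILE_SCRIPT]
    # Graceful stop + save just under the wall-clock budget, when one applies.
    argv += list(timeout_args)
    return argv
```

Keeping the gating in the argv builder means small binaries run with Ghidra's default analyzer set and are unaffected by the fast profile.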