From 7926bee860f2a1965297ff82b250da10275dbf6a Mon Sep 17 00:00:00 2001 From: Jono Prest Date: Thu, 11 Jun 2026 11:01:52 +0000 Subject: [PATCH 1/3] dce: tooling to run reanalyze dead-code analysis on the compiler MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds scripts/dce/run-dce.sh + README to run reanalyze DCE over the compiler's own OCaml source. The vendored rescript-tools reanalyze can't do this: the compiler's OCaml is built by host OCaml 5.3 (cmt magic Caml1999I035), while the vendored ml library only reads the 4.06-era ReScript cmt format, so it dies with Cmi_format.Error. Instead we build the standalone reanalyze against host compiler-libs, using the OCaml-5.3 support from rescript-lang/reanalyze#203 (pinned, not yet merged). It works end-to-end and produces a full report. Raw output is currently dominated by entry-point false positives (dune emits only .cmti for the bsc main, and jsoo isn't built in the default profile), so making it actionable needs root/entry-point fixes — documented in the README along with the lowest-noise starting categories (redundant/unused optional args) and the CI-gating proposal. --- .gitignore | 3 ++ scripts/dce/README.md | 92 ++++++++++++++++++++++++++++++++++++++++++ scripts/dce/run-dce.sh | 52 ++++++++++++++++++++++++ 3 files changed, 147 insertions(+) create mode 100644 scripts/dce/README.md create mode 100755 scripts/dce/run-dce.sh diff --git a/.gitignore b/.gitignore index ba748abd224..d58419eeeee 100644 --- a/.gitignore +++ b/.gitignore @@ -108,3 +108,6 @@ package.tgz # AI Agents .claude/settings.local.json + +# reanalyze DCE report output +_dce/ diff --git a/scripts/dce/README.md b/scripts/dce/README.md new file mode 100644 index 00000000000..2e389fd1c20 --- /dev/null +++ b/scripts/dce/README.md @@ -0,0 +1,92 @@ +# Compiler dead-code analysis (reanalyze DCE) + +Tooling to run reanalyze's dead-code-elimination (DCE) over the **compiler's own +OCaml source**, to find removable dead code. This is the native-OCaml analog of +the `tests/analysis_tests/tests-reanalyze` suite (which analyzes ReScript code). + +## Quick start + +```bash +scripts/dce/run-dce.sh # writes _dce/report.txt +``` + +Requires the host OCaml switch (5.3) with `dune` available (`eval $(opam env)`). +The script fetches + builds a pinned standalone reanalyze on first run (cached in +`~/.cache/rescript-dce`). + +## Why the vendored `rescript-tools reanalyze` does NOT work here + +The compiler's OCaml is compiled by the **host OCaml (5.3)** via dune, producing +`.cmt`/`.cmi` in the **5.3 format** (`Caml1999I035`). The vendored +`rescript-tools reanalyze` links the ReScript `ml` library, which only understands +the old **4.06-era** ReScript cmt format (used for `.res` → `.cmt`). Pointed at the +compiler's own cmts it fails immediately with `Fatal error: Cmi_format.Error`. + +So we use the **standalone** reanalyze (`rescript-lang/reanalyze`), built against +the host `compiler-libs.common`. OCaml 5.3 support comes from +[reanalyze#203](https://github.com/rescript-lang/reanalyze/pull/203) (branch +`ocaml-5-3`), **not yet merged** — the runner pins its commit. + +## Status: it works + +Running DCE over all 529 dune-built cmts produces a full report. Category counts +from a baseline run (`-dce-cmt _build/default`): + +| Category | Count | +|---|---| +| Dead Value | ~2740 | +| Dead Type | ~386 | +| Redundant Optional Argument | ~212 | +| Dead Module | ~203 | +| Unused Argument | ~169 | +| Dead Value With Side Effects | ~108 | +| (of which) variant cases "never constructed" | ~278 | + +## The big caveat: entry-point false positives + +The raw numbers are dominated by false positives. reanalyze DCE is a whole-program +reachability analysis anchored at **roots** (entry points). The compiler's true +entry points are the executable mains — and **dune emits only `.cmti` (no `.cmt` +body) for the `bsc` main**, while the playground (`jsoo`) main isn't built in the +default profile at all. With the entry-point bodies missing, everything reachable +only from them looks dead. Concrete example: `Bs_version` is flagged a "dead +module" even though `compiler/bsc/rescript_compiler_main.ml` and +`compiler/jsoo/jsoo_playground_main.ml` use it. + +**To make this actionable, the next step is fixing roots**, via some combination of: +- getting dune to emit `.cmt` for the executable mains (so their call sites are + analyzed), and/or +- declaring entry points live with `-live-paths` / `-live-names`, and/or +- `@live`/`@dead` source annotations (precedent: ~38 already exist in + `compiler/gentype` and `compiler/syntax`), plus `-suppress` for whole subtrees. + +## Lowest-noise categories to start from + +These depend on **call-site** analysis within the set, so they're far less +sensitive to the missing-roots problem and are the best first targets: + +- **Redundant Optional Argument** — optional arg *always supplied* at every call, + e.g. `transl_apply`'s `~inlined`/`~transformed_jsx` (`compiler/ml/translcore.ml`), + `Typ.poly`'s `~loc` (`compiler/ml/ast_helper`). Can be made mandatory. +- **Unused Argument** — optional arg *never used* in the body, e.g. + `type_open_`'s `~used_slot` (`compiler/ml/typemod.ml`), + `disjoint_union`'s `~eq`/`~print` (`compiler/ext/identifiable.ml`). + +## Relationship to the manual "unreachable OCaml variant" survey + +Complementary, with a subtle boundary: + +- reanalyze **does** detect "variant case which is never constructed" (~278 found), + overlapping the manual survey — but with false positives when the only + construction site sits outside the analyzed cmt set. +- reanalyze **misses** variants that *are* constructed but only inside **unreachable + code**. Example: `Longident.Lapply` is "constructed" in `ctype.ml`'s `lid_of_path` + (a branch that never runs), so reanalyze sees it as live and does not flag it — + yet it is genuinely dead. That class still needs the manual reachability reasoning. + +## CI integration (proposed, not yet wired) + +Once roots are fixed and a clean baseline exists, gate CI on **new** dead code: +run `run-dce.sh`, diff against a checked-in baseline, fail on additions. Open +question: how to depend on the external unmerged reanalyze#203 (pin a commit, vendor +it, or wait for merge). diff --git a/scripts/dce/run-dce.sh b/scripts/dce/run-dce.sh new file mode 100755 index 00000000000..99f085d5d46 --- /dev/null +++ b/scripts/dce/run-dce.sh @@ -0,0 +1,52 @@ +#!/bin/bash +# Run reanalyze dead-code-elimination (DCE) over the compiler's OWN OCaml source. +# +# WHY A SEPARATE TOOL: the compiler's OCaml is built by the host OCaml (5.3) via +# dune, producing 5.3-format .cmt/.cmi. The vendored `rescript-tools reanalyze` +# links the ReScript `ml` library, which only understands the old (4.06-era) +# ReScript cmt format used for .res -> .cmt, so it CANNOT read the compiler's own +# cmts (fails with Cmi_format.Error). We therefore use the STANDALONE reanalyze +# built against the host compiler-libs. OCaml 5.3 support comes from +# rescript-lang/reanalyze#203 (branch `ocaml-5-3`), which is not yet merged, so we +# pin its commit here. +# +# Usage: scripts/dce/run-dce.sh [output-file] +# Output defaults to _dce/report.txt +set -euo pipefail + +REPO_ROOT="$(cd "$(dirname "$0")/../.." && pwd)" +cd "$REPO_ROOT" + +# Pinned standalone reanalyze with OCaml 5.3 support (PR #203). +REANALYZE_REPO="https://github.com/rescript-lang/reanalyze.git" +REANALYZE_REF="${REANALYZE_REF:-1327343}" # head of branch ocaml-5-3 at time of writing +REANALYZE_SRC="${REANALYZE_SRC:-$HOME/.cache/rescript-dce/reanalyze}" + +OUT="${1:-_dce/report.txt}" +mkdir -p "$(dirname "$OUT")" + +# 1. Fetch + build the standalone reanalyze (cached). +if [ ! -x "$REANALYZE_SRC/_build/default/src/Reanalyze.exe" ]; then + echo "==> Fetching standalone reanalyze ($REANALYZE_REF)" + rm -rf "$REANALYZE_SRC" + git clone --quiet "$REANALYZE_REPO" "$REANALYZE_SRC" + git -C "$REANALYZE_SRC" checkout --quiet "$REANALYZE_REF" + echo "==> Building reanalyze against $(ocaml -version)" + (cd "$REANALYZE_SRC" && dune build 2>&1 | tail -5) +fi +BIN="$REANALYZE_SRC/_build/default/src/Reanalyze.exe" + +# 2. Build the compiler so dune emits fresh .cmt/.cmti. +echo "==> dune build (producing .cmt files)" +dune build + +# 3. Run DCE over the whole dune build tree (compiler + tools + analysis + +# executables). All are host-OCaml 5.3 cmts; the ReScript runtime (4.06 cmts) +# lives outside _build/default so it is not picked up. +echo "==> Running DCE -> $OUT" +"$BIN" -dce-cmt _build/default > "$OUT" 2>&1 || true + +echo "==> Done. Summary:" +grep -oE "Warning [A-Za-z ]+" "$OUT" | sort | uniq -c | sort -rn || true +echo " never-constructed variants: $(grep -c 'never constructed' "$OUT" || echo 0)" +echo "Full report: $OUT" From d5e2cc143e2e82a24f5dcb2c3be234e5b0057173 Mon Sep 17 00:00:00 2001 From: Jono Prest Date: Thu, 11 Jun 2026 11:15:06 +0000 Subject: [PATCH 2/3] dce: use @check for entry-point cmts; document #203 reliability blocker Two findings from triaging the DCE output: 1. Roots: a plain `dune build` emits only .cmti for modules with an .mli (incl the bsc main), so reanalyze missed the entry-point bodies and over-reported. Switch the runner to `dune build @check`, which emits impl .cmt for every module. Fixes the Bs_version-class false positives (Dead Value ~2740->2130, Dead Module ~203->146). 2. Blocker: even with roots fixed, reanalyze#203 flags obviously-live core modules as dead (Ast_helper has 510 in-tree refs; Lam_compile, Js_dump, Lam_convert likewise). Per the PR, its 5.3 value-dependency tracking is derived 'just based on the type' in cmt_infos and is incomplete (rescript-lang/reanalyze#202). So cross-reference-based categories aren't trustworthy yet. Documented in the README. --- scripts/dce/README.md | 77 +++++++++++++++++++++++++----------------- scripts/dce/run-dce.sh | 10 ++++-- 2 files changed, 53 insertions(+), 34 deletions(-) diff --git a/scripts/dce/README.md b/scripts/dce/README.md index 2e389fd1c20..08dcb03fcdf 100644 --- a/scripts/dce/README.md +++ b/scripts/dce/README.md @@ -27,38 +27,53 @@ the host `compiler-libs.common`. OCaml 5.3 support comes from [reanalyze#203](https://github.com/rescript-lang/reanalyze/pull/203) (branch `ocaml-5-3`), **not yet merged** — the runner pins its commit. -## Status: it works - -Running DCE over all 529 dune-built cmts produces a full report. Category counts -from a baseline run (`-dce-cmt _build/default`): - -| Category | Count | -|---|---| -| Dead Value | ~2740 | -| Dead Type | ~386 | -| Redundant Optional Argument | ~212 | -| Dead Module | ~203 | -| Unused Argument | ~169 | -| Dead Value With Side Effects | ~108 | -| (of which) variant cases "never constructed" | ~278 | - -## The big caveat: entry-point false positives - -The raw numbers are dominated by false positives. reanalyze DCE is a whole-program -reachability analysis anchored at **roots** (entry points). The compiler's true -entry points are the executable mains — and **dune emits only `.cmti` (no `.cmt` -body) for the `bsc` main**, while the playground (`jsoo`) main isn't built in the -default profile at all. With the entry-point bodies missing, everything reachable -only from them looks dead. Concrete example: `Bs_version` is flagged a "dead -module" even though `compiler/bsc/rescript_compiler_main.ml` and -`compiler/jsoo/jsoo_playground_main.ml` use it. - -**To make this actionable, the next step is fixing roots**, via some combination of: -- getting dune to emit `.cmt` for the executable mains (so their call sites are - analyzed), and/or -- declaring entry points live with `-live-paths` / `-live-names`, and/or +## Status: it RUNS, but the output is NOT yet trustworthy + +The tooling runs end-to-end and produces a full report over all dune-built cmts. +That is the mechanical milestone. **But the results cannot currently be acted on**, +for two stacked reasons: + +### 1. (Fixed) Entry-point roots — use `dune build @check` + +reanalyze DCE is a whole-program reachability analysis anchored at **roots** (entry +points = the executable mains). A plain `dune build` does native compilation and +emits only `.cmti` (no impl `.cmt`) for modules with an `.mli` — including the `bsc` +main — so reanalyze never sees the entry-point bodies and over-reports. The runner +now uses `dune build @check`, which typecheck-builds everything and emits the impl +`.cmt` for all modules. This fixed the `Bs_version`-class false positives and cut +Dead Value ~2740 → ~2130, Dead Module ~203 → ~146. + +Residual root gap: the playground `jsoo` main is `enabled_if profile=browser` with +`(modes js wasm)` (no bytecode `.cmt`), and the dune comment says not to build it by +default (slow). So code used only by the playground still shows as dead. + +### 2. (BLOCKER) reanalyze#203's 5.3 dependency tracking is incomplete + +Even with roots fixed, the report flags **obviously-live core modules as dead** — +e.g. `Ast_helper` (510 references in-tree), `Lam_compile`, `Js_dump`, `Lam_convert`. +The cause is stated in the PR itself: reanalyze#203 derives value dependencies +*"just based on the type of what's available in the new `Cmt_format.cmt_infos`"* — +a **tentative, incomplete** method that does not capture the real use edges in 5.3 +cmts. As a result the cross-reference-based categories (Dead Value / Dead Module / +Dead Type / "never constructed") are unreliable: there are real positives buried in +there, but you can't tell which without manually re-verifying every one, which +defeats the purpose. + +Local, call-site-based categories fare better — e.g. `transl_apply`'s `~inlined` is +correctly reported as always-supplied — but they rely on the same reference tracking, +so they're only trustworthy when a function has few, simple call sites. + +**Conclusion:** the real blocker is upstream. Completing reanalyze#203's value- +dependency extraction for OCaml 5.3 cmts (issue rescript-lang/reanalyze#202) is the +prerequisite for this to find dead code reliably. Until then, treat the report as a +lead generator that requires per-item manual verification, not a worklist. + +## Making it actionable (once #203's dependency tracking is solid) + +- declare entry points live with `-live-paths` / `-live-names`, - `@live`/`@dead` source annotations (precedent: ~38 already exist in - `compiler/gentype` and `compiler/syntax`), plus `-suppress` for whole subtrees. + `compiler/gentype` and `compiler/syntax`), plus `-suppress` for whole subtrees, +- handle the `jsoo` root gap (annotate live, or a check-only browser-profile build). ## Lowest-noise categories to start from diff --git a/scripts/dce/run-dce.sh b/scripts/dce/run-dce.sh index 99f085d5d46..93800e52f6a 100755 --- a/scripts/dce/run-dce.sh +++ b/scripts/dce/run-dce.sh @@ -36,9 +36,13 @@ if [ ! -x "$REANALYZE_SRC/_build/default/src/Reanalyze.exe" ]; then fi BIN="$REANALYZE_SRC/_build/default/src/Reanalyze.exe" -# 2. Build the compiler so dune emits fresh .cmt/.cmti. -echo "==> dune build (producing .cmt files)" -dune build +# 2. Typecheck-build so dune emits fresh .cmt for EVERY module, including the +# executable mains (bsc, res_cli). A plain `dune build` does native compilation +# and emits only `.cmti` for modules that have an `.mli` (e.g. the bsc main), +# leaving reanalyze blind to the entry-point bodies and over-reporting dead code. +# `@check` is typecheck-only and emits the impl `.cmt` for all of them. +echo "==> dune build @check (producing .cmt files incl. entry points)" +dune build @check # 3. Run DCE over the whole dune build tree (compiler + tools + analysis + # executables). All are host-OCaml 5.3 cmts; the ReScript runtime (4.06 cmts) From 4183326365127d28bb8d7c03ce5275e5b1fda980 Mon Sep 17 00:00:00 2001 From: Jono Prest Date: Thu, 11 Jun 2026 11:33:04 +0000 Subject: [PATCH 3/3] dce: document the precise root cause of reanalyze#203's false positives Investigated why obviously-live modules (Ast_helper, Lam_compile, Js_dump) are flagged dead. reanalyze's DCE is location-keyed, and OCaml 5.3 broke that on two fronts: cmt_declaration_dependencies is uid-based and carries few edges, and the dominant typedtree-walk reference source relies on Texp_ident val_loc that no longer matches declaration registration cross-module. Confirmed by experiment (local reanalyze branch): a global uid->decl pre-pass is a necessary foundation but insufficient; resolving references via val_uid alone regresses (2129->2526 dead values) because declarations are still keyed by the old location. The real fix keys both declarations and references by uid - a core refactor, not a patch. --- scripts/dce/README.md | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/scripts/dce/README.md b/scripts/dce/README.md index 08dcb03fcdf..68dcb4b43ba 100644 --- a/scripts/dce/README.md +++ b/scripts/dce/README.md @@ -63,10 +63,36 @@ Local, call-site-based categories fare better — e.g. `transl_apply`'s `~inline correctly reported as always-supplied — but they rely on the same reference tracking, so they're only trustworthy when a function has few, simple call sites. -**Conclusion:** the real blocker is upstream. Completing reanalyze#203's value- -dependency extraction for OCaml 5.3 cmts (issue rescript-lang/reanalyze#202) is the -prerequisite for this to find dead code reliably. Until then, treat the report as a -lead generator that requires per-item manual verification, not a worklist. +#### Root cause (investigated) and why the fix is non-trivial + +reanalyze's DCE is **location-keyed**: declarations are registered by source +position, and a use is connected to a declaration when the use's recorded "target +location" equals the declaration's. OCaml 5.3 broke this on two fronts: + +1. `cmt_value_dependencies` (direct location-based use→def edges) was replaced by + `cmt_declaration_dependencies`, which identifies declarations by `Shape.Uid`. A + cmt's `cmt_uid_to_decl` only maps the uids it *defines*, so cross-module uids + can't be resolved locally. (#203 also only carries a small subset of edges here — + e.g. `typecore.cmt` exposes ~35 — so this is not the main reference source anyway.) +2. The dominant reference source is the **typedtree walk** (`DeadValue.collectExpr`, + `Texp_ident` → `val_loc`). In 5.3 the `val_loc` on a cross-module reference's + `value_description` no longer matches where the declaration is registered, so the + use→decl edge is dropped → widely-used modules look dead. + +**Experiments (in a local reanalyze fix branch):** +- Added a global, order-independent `Shape.Uid → declaration` table (pre-pass over + all cmts) and resolved `cmt_declaration_dependencies` against it. Correct and + necessary foundation, but ~no effect (that channel carries few edges). +- Then resolved `Texp_ident` references via the value's `val_uid` through that table + instead of `val_loc`. This **regressed** (Dead Value 2129 → 2526), because + declarations are still registered at their old location key, so moving only the + *reference* side just relocates the mismatch. + +**Conclusion:** the fix is a real refactor, not a patch — reanalyze's DCE core must +key declarations *and* references by the same canonical identity (uid-based) so the +two sides agree. Best done upstream on reanalyze#203 (issue #202), ideally with its +author. Until then, treat the report as a lead generator needing per-item manual +verification, not a worklist. ## Making it actionable (once #203's dependency tracking is solid)