Add ocamlgrep: structural search of OCaml code#2064
Conversation
Imports the LexiFi ocamlgrep tool as a sub-package of the merlin repository, both as a [merlin-lib.ocamlgrep] library and as a standalone [ocamlgrep] executable, plus an [ocamlmerlin ocamlgrep] subcommand for editor integrations. The matching logic is the upstream Match module from the LexiFi project, lightly adapted to build against merlin's vendored compiler-libs (parser_raw / lexer_raw in place of Parse, Typemod.initial_env in place of Compmisc, and a build-time-generated Standard_library module in place of Config.standard_library). Bringing this under merlin's roof gives ocamlgrep merlin's per-OCaml- version branch story and release pipeline (currently it is pinned to a single OCaml minor version), and makes the matcher reusable from other merlin-aware tooling. The new [ocamlmerlin single ocamlgrep -query <pat>] subcommand ignores the buffer on stdin, walks the project's _build/ tree for .cmt files, and returns findings as JSON. Cram tests cover basic matches, wildcards, type-constrained matches, parse errors, and a smoke test for the standalone CLI.
Add a TODO in paths.ml flagging that walking _build/ ourselves is the upstream-ocamlgrep approach but not the merlin-idiomatic one: ocaml-index lets the build system (dune via @ocaml-index) enumerate cmt files instead of discovering them by filesystem walk. Worth revisiting if ocamlgrep wants to support build systems other than Dune, or to reuse an existing .ocaml-index file.
A finding can span more than one source line, but the previous
representation collapsed everything to the first line: column ranges
were truncated at the first line break, and the snippet was a single
context string. Two consecutive multi-line findings would be
indistinguishable in plain text.
Replace [Scan.finding] with a proper region:
start_line, start_col, end_line, end_col, lines
The corresponding [ocamlgrep_finding] record in [query_protocol.ml]
mirrors this, and the JSON shape becomes
{ file, start: {line, col}, end: {line, col}, lines: [...] }
matching merlin's existing location convention.
The standalone CLI renders findings in OCaml-compiler style:
foo.ml:5:10-22:
5 | let x = List.length xs
foo.ml:6:2-8:9:
6 | match x with
7 | | None -> None
8 | | Some y -> Some y
Each finding starts with an unambiguous [file:start-end:] header so
no separator is needed between consecutive findings (per the
discussion on PR #1).
The cram tests now pipe the JSON through a small jq filter that
renders the same Format A, keeping the snapshots readable while
exercising the JSON shape end-to-end. The wildcards test was tweaked
to span three source lines and demonstrate the multi-line header.
Replace the one-line [Usage: ocamlgrep <pattern>] message with a full guide that covers every form of the pattern language: plain and numbered metavariables (__, __1, ...), suffix matching of identifiers, optional-argument controls, type constraints, set semantics for match/record clauses, and the [__.id] field shortcut. Includes the example invocations from the upstream README, the rendered output format, the [NO_COLOR] convention, and a pointer to the merlin subcommand for JSON output. Also extend the merlin [ocamlgrep] subcommand's help to mention numbered metavariables (the original text only mentioned [__]), to describe the JSON shape it returns, and to point users at [ocamlgrep --help] for the full pattern guide.
The matching algorithm and most of the surrounding code in src/ocamlgrep/ originate from Nicolás's work on the upstream ocamlgrep project at LexiFi. Add an authorship line to each [.ml]/[.mli] header so the attribution travels with the code, independently of git history or the LICENSE file.
|
I’m not a maintainer (or even a contributor) to Merlin, but I must admit I was delighted to read the proposal to integrate In my opinion, ocamlgrep should expose an API under
I’m quite surprised by the dependency lattice that was chosen. |
|
Thanks for the feedback. I was originally hoping that the only module ocamlgrep would bring in would be the |
Previously [Scan.incremental_search] knew how to walk [_build/] itself and process each cmt file it encountered, which conflated two concerns and made [Scan] dune-aware. Per the discussion on PR #1, factor this so that: - [Paths.collect_cmt_files] (new) does the dune-specific [_build/] walk and returns the list of cmt files; - [Scan.incremental_search] now takes that list as input and iterates over it without touching the filesystem. This makes [Scan] a leaf processor: a future caller (e.g. the merlin subcommand using [Mconfig.build_path], or a CLI invocation that takes an explicit cmt-file list) can plug into the matcher without going through [Paths] or its dune assumptions. Both call sites (the standalone binary and the merlin subcommand) are updated to call [Paths.collect_cmt_files] then [Scan.incremental_search]. User-visible behavior is unchanged.
Replace [Paths.identify_dune_project] with [Paths.find], which delegates to merlin's existing [Mconfig_dot.find_project_context] to locate the project root. The previous implementation walked up looking for a [_build/] directory; the new one walks up looking for a [dune-project], [dune-workspace], or [.merlin] file -- the same heuristic the rest of merlin uses. [Mconfig_dot.context] is abstract in its [.mli], so we read what we need from the second return value (the path to the project config file): [dirname] is the project root, [basename] tells us whether the configurator is dune or [.merlin]. [.merlin]-located projects are not yet supported -- the cmt enumeration in [Paths.collect_cmt_files] still assumes a dune [_build/] layout. Such cases now return a clear [Error] instead of silently failing partway through. This addresses one of the two duplications flagged in the PR review: project-root detection. The other (dune-specific cmt enumeration) remains in [Paths.collect_cmt_files] with the existing TODO; lifting that to a shared merlin abstraction is a larger design discussion (probably involving a new dot-protocol directive).
Introduce a new merlin-lib.project library (src/project/) housing
[Dune_workspace], a structured view of the output of
dune describe workspace --format=csexp --lang 0.1
The inner records are derived with [@@deriving of_sexp]; the
top-level entry-list dispatch is hand-written so unknown tags from
future dune --lang versions can be ignored without breaking us.
Use it instead of Paths.collect_cmt_files in both ocamlgrep entry
points (the standalone binary and the merlin subcommand). The
filesystem walk is gone: dune is now the source of truth for what
compiled and where, which:
- filters out external dependencies via the (local true) field,
rather than by chance-of-not-being-under-_build/;
- sees dune-generated modules (.ml-gen) with their proper
impl paths;
- relies on dune's --lang 0.1 stability contract instead of an
implicit assumption about _build/ layout.
Adds ppx_sexp_conv and sexplib0 as merlin-lib.opam dependencies
(via the ppx preprocess on the new library only).
…erge
Three changes:
* Use [string option] directly for [module_.{impl,intf,cmt,cmti}]
in [Dune_workspace] -- as suggested on the PR. sexp_conv encodes
[None] as [()] and [Some x] as [(x)], which is exactly how dune
encodes its optional path fields. Drops the [impl_path] /
[cmt_path] etc. accessor functions; callers can read the option
fields directly.
* Catch [Unix.Unix_error(ENOENT, _, _)] when spawning dune so a
missing [dune] binary becomes a friendly "could not find `dune`
in PATH" message instead of a raw OCaml exception trace. Other
Unix errors get a generic but informative message via
[Unix.error_message].
* Collapse [Paths.find] + [Paths.init] into a single
[Paths.init ()]. Callers always needed both back to back, and
the previous split was easy to forget.
|
I made changes but I'm still working on it. I'll let you know when I think it's in a good state. |
Per the PR comment: ocamlgrep_finding should use Location.t like
the rest of merlin's protocol. Apply the same change to
Scan.finding so we don't have to construct/deconstruct positions
across the lib/protocol boundary.
Scan.finding = { loc : Location.t; lines : string list }
ocamlgrep_finding = { loc : Location.t; lines : string list }
The loc carries the user-friendly project-relative source path in
loc_start.pos_fname (Scan rewrites it, so it's not whatever the
compiler recorded in the cmt). The renderer in the standalone
binary unpacks line/column from the positions; query_json uses
the existing [with_location ~with_file:true] helper, which
produces the same JSON shape we had before:
{ "file": ..., "start": {"line", "col"}, "end": {"line", "col"},
"lines": [...] }
so the cram snapshots are unchanged.
Keep only the [merlin-lib.ocamlgrep] library and the [ocamlgrep] merlin subcommand for now -- this is what the reviewer comments have been focused on, and bundling a separate opam package [ocamlgrep] in the same PR makes it longer to review. The previous state (with the standalone binary, its opam package, and a smoke test) is preserved on the [ocamlgrep-with-executable] branch in this fork, ready to revive once the library/subcommand split has landed upstream. Removed: - ocamlgrep.opam - src/ocamlgrep/bin/ (executable: main.ml + dune) - tests/test-dirs/ocamlgrep/standalone.t/ (cram smoke test) - the corresponding cram-deps stanza in tests/dune Updated the merlin subcommand's help to drop its dangling reference to [ocamlgrep --help]; users wanting the full pattern syntax are pointed at the upstream LexiFi/ocamlgrep README.
Add the [?context] parameter to [Dune_workspace.describe] (declared in the .mli) by passing through [--context <name>] to dune. dune parses both [--root] and [--context] positionally after [describe workspace], so the flag goes at the end alongside [--root].
Paths and Scan are general-purpose cmt-scanning infrastructure, not specific to ocamlgrep. Moving them to src/project/ (merlin-lib.project) makes them reusable by other tools. Scan is now polymorphic in the finding type: incremental_search takes a search function of type Cmt_format.cmt_infos -> source:string -> src_lines:string array -> 'a list and emits ''a event' values. Scan knows nothing about findings. The ocamlgrep-specific pieces move into Match: - 'finding' type (loc + context lines) - 'parse_query': parse a query string with Merlin's vendored parser - 'search_findings': ready-to-use search function for Scan, wrapping search_cmt and converting raw locations to findings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match is the only module that was left in src/ocamlgrep/lib/. Moving it to src/analysis/ integrates it properly with the rest of merlin's analysis layer, which already carries all the required compiler-lib dependencies (ocaml_typing, ocaml_preprocess, etc.). The merlin-lib.ocamlgrep public library is removed. query_commands drops the ocamlgrep dep and references Match directly via -open Merlin_analysis. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The standalone ocamlgrep binary is gone, so match.mli is now the canonical reference for the query language. The module docstring covers quick start, the full pattern syntax (wildcards, identifiers, application, type constraints, match/record set semantics, field access), examples, and the output format -- adapted from the --help text of the backup-branch executable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
query_protocol.ml: document that line numbers are 1-based and column offsets are 0-based in ocamlgrep_finding, following merlin conventions. match.mli: replace the text-format output description (which belonged to the now-removed standalone binary) with a description of the JSON output produced by the merlin subcommand, including the field layout and the line/col conventions. The Location.t change in query_protocol and the multiline JSON representation in query_json were already addressed in earlier commits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All matching functions now take the query pattern as the first argument and the typed program as the second, matching the grep convention and making partial application on the pattern natural. The internal match-with constructs are also flipped so the parse-tree constructor appears on the left and the typed-tree constructor on the right, consistent with the new (pattern, program) reading order. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The name 'Match' is too generic in an OCaml codebase. 'Expr_search' mirrors the existing 'Type_search' module: Type_search searches by type, Expr_search searches by expression structure. A future Pattern_search module could extend the same idea to OCaml patterns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Status update:
Please keep in mind that none of the above was really discussed with the project maintainers and can be removed, changed, etc. I could also split the PR into multiple PRs if it's easier to review. I'm looking forward to meeting with the project maintainers to get guidance on what's desirable and what isn't. I'm also going to rewrite the PR description by hand which will be more concise but will better reflect my understanding of the issues. |
Replace catch-all wildcard patterns with explicit enumerations of all constructors so that the compiler emits a warning when a future version of compiler-libs adds a new constructor. - path_matches_lident: enumerate (Lident|Ldot|Lapply) x (Pident|Pdot|Papply|Pextra_ty) - constructor_match: enumerate (Lident|Ldot|Lapply) x (Lident|Ldot|Lapply) - match_expr: enumerate all Pexp_* x Texp_* combinations - match_typ: enumerate all Ptyp_* x type_desc combinations - match_pat: enumerate all Ppat_* (GADT right side left as _) - match_pat_expr: enumerate all Pexp_* (GADT right side left as _) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Missing cmt files (Sys_error on read) now return false from process_one_cmt instead of propagating the exception. incremental_search tallies found vs total and emits a single warning reporting the coverage percentage and how many files are missing, with a hint to run 'dune build @check'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relative pp_source paths silently failed Sys.file_exists when the LSP server's CWD was not the project root, causing all 198 scanned files to return zero findings. Use Filename.concat paths.project_root to make all filesystem operations CWD-independent.
When cmt_sourcefile points into _build/default/ (which dune uses for preprocessed or ppx-expanded files), redirect to the real source in the project tree to avoid reading binary build artifacts as source.
When cmt_sourcefile points to a .pp.ml file, dune may have placed a binary OCaml AST blob there (for ppx processing). Reading it produces garbage. Instead, find the original .ml in the project source tree and skip the digest check (which compares against the preprocessed file, not the original). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| "sexplib0" | ||
| "ppx_sexp_conv" |
There was a problem hiding this comment.
We are very conservative when it comes to adding new dependencies to merlin, and depending on base is definitely not acceptable.
There was a problem hiding this comment.
This is for src/project/dune_workspace.ml.
The dependency on ppx_sexp_conv is the culprit. The runtime lib we really need here is csexp which is already a dependency of merlin.
To remove the dependency on ppx_sexp_conv, I tried copy-pasting the ml code generated by ppx_sexp_conv but it was ugly so I ended up with a simpler handrolled implementation (which incidentally is what Claude Code had done in the very first version of this PR before I told it to use ppx_sexp_conv instead). I think the result is reasonable now.
| ; Generate standard_library.ml exposing the OCaml standard library path. | ||
| ; The same trick is used in src/kernel/dune. Merlin's vendored Config | ||
| ; module strips the upstream Config.standard_library field, so we recover | ||
| ; it at build time from %{ocaml-config:standard_library}. | ||
| (rule | ||
| (targets standard_library.ml) | ||
| (action (write-file %{targets} | ||
| "let path = {|%{ocaml-config:standard_library}|}"))) |
There was a problem hiding this comment.
Didn't check but this is suspicious, can't we get the information from Merlin_kernel ?
| let collect_cmi_dirs paths = | ||
| let res = ref [] in | ||
| let rec walk dir = | ||
| Array.iter (fun entry -> | ||
| let entry = Filename.concat dir entry in | ||
| if Sys.is_directory entry then begin | ||
| if Filename.basename entry = "byte" | ||
| && Array.exists | ||
| (fun name -> Filename.check_suffix name ".cmi") | ||
| (Sys.readdir entry) | ||
| then | ||
| res := entry :: !res; | ||
| walk entry | ||
| end | ||
| ) (Sys.readdir dir) | ||
| in | ||
| walk paths.build_source_root; | ||
| List.rev !res |
There was a problem hiding this comment.
Why do you need this in addition to the output of dune describe ? I thought the goal was to avoid walking the tree ourselves.
There was a problem hiding this comment.
This is inherited from the original implementation. I need to figure out the cmi situation and load paths.
| The [Mconfig_dot.context] is abstract in the .mli, so we read what | ||
| we need from the second return value, which is the path of the | ||
| config file: its directory is the project root, its basename tells | ||
| us whether the configurator is dune or .merlin. |
There was a problem hiding this comment.
Since Merlin already does it as parts of the pipeline, maybe we could simply add this information to the pipeline's config and reuse it.
| This module handles the mechanics of reading cmt files, verifying | ||
| source digests, mapping compiler-internal paths back to | ||
| project-relative display paths, and emitting structured events. | ||
| The actual search logic and the finding type are supplied by the | ||
| caller. |
There was a problem hiding this comment.
I am not going deeper right now. As we discussed in the meeting, you should reuse the Cmt_cache when appropriate and might be able to reuse source-matching functionalities from the Locate module.
| { Query_protocol.findings = List.rev !findings; | ||
| warnings = List.rev !warnings | ||
| } |
There was a problem hiding this comment.
There are many way the query can fail, and some of these errors are directly imputable to the user, like query parsing errors. We should not rely on failwith and the generic catch all of merlin for these, but report them as part of the response to the query.
There was a problem hiding this comment.
This is what we get in VSCode currently for a malformed search query:
How should user errors be signaled to merlin? I think a simple exception called User_error of string would do, while all other exceptions would be treated as internal errors, similar to HTTP's 4xx vs. 5xx error families.
The exception User_error "Could not parse search expression." would be shown as just Could not parse search expression. instead of Failure ("Could not parse search expression.").
There was a problem hiding this comment.
What I did is add an errors field to the JSON response... but not to the OCaml type which is a (ocamlgrep_result, string) result. I'm not convinced that this is a great solution but it bypasses the questions of (1) representing Ok/Error in JSON and (2) a more generic error handling mechanism that would differentiate user errors from internal errors.
ppx_sexp_conv pulls in 'base' which the merlin maintainers want to avoid. The generated record-parsing boilerplate is replaced by a small set of direct Csexp.t helpers: field / string_field / bool_field / strings_field / string_opt_field / pairs_of / modules_field These work directly on Csexp.t values — the sexp_of_csexp conversion step and the Sexplib0 dependency are both gone. Unknown fields are silently ignored (more forward-compatible than the ppx behaviour). Parse errors are turned into (Error msg) via a try/with in parse_csexp. sexplib0 and ppx_sexp_conv removed from src/project/dune. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This pull request is a tentative integration of the ocamlgrep project into merlin as brought up last week.
I used Claude Code to make an initial basic integration and then I guided it to fit better in merlin. Consider that the existing code works but needs to be refined or redone until we're happy with it. It's not considered final in any way. Please treat it as a collection of suggestions given my current understanding of the project which is improving every day.
The main pieces of this PR are:
src/projectlibrary that usesdune describe workspaceto list the cmt files that are available for scanning. This is a different approach than what was used in the original ocamlgrep project. It is reusable for other project-wide scans.Expr_searchmodule (src/analysis/expr_search.mli) that takes an OCaml expression (with an optional type annotation and placeholders such as__or__42) and finds matching expressions of the expected type in a typed tree (from a cmt file). In the original ocamlgrep project, this was theMatchmodule.ocamlgrepsubcommand, which ignores standard input and scans all the available cmt files usingExpr_search.Originally, the PR included a standalone
ocamlgrepcommand that was essentially the same as invokingocamlmerlin single ocamlgrep -query ... < /dev/nullbut shorter to invoke and defaulting to human-friendly text output. This can be done provided as an external package or added to the merlin repo later.