Skip to content

Add ocamlgrep: structural search of OCaml code#2064

Open
mjambon wants to merge 43 commits into
ocaml:mainfrom
mjambon:ocamlgrep
Open

Add ocamlgrep: structural search of OCaml code#2064
mjambon wants to merge 43 commits into
ocaml:mainfrom
mjambon:ocamlgrep

Conversation

@mjambon
Copy link
Copy Markdown
Contributor

@mjambon mjambon commented May 6, 2026

This pull request is a tentative integration of the ocamlgrep project into merlin as brought up last week.

I used Claude Code to make an initial basic integration and then I guided it to fit better in merlin. Consider that the existing code works but needs to be refined or redone until we're happy with it. It's not considered final in any way. Please treat it as a collection of suggestions given my current understanding of the project which is improving every day.

The main pieces of this PR are:

  • an src/project library that uses dune describe workspace to list the cmt files that are available for scanning. This is a different approach than what was used in the original ocamlgrep project. It is reusable for other project-wide scans.
  • the Expr_search module (src/analysis/expr_search.mli) that takes an OCaml expression (with an optional type annotation and placeholders such as __ or __42) and finds matching expressions of the expected type in a typed tree (from a cmt file). In the original ocamlgrep project, this was the Match module.
  • the ocamlgrep subcommand, which ignores standard input and scans all the available cmt files using Expr_search.
  • cram tests to check that ocamlgrep finds what it's supposed to.

Originally, the PR included a standalone ocamlgrep command that was essentially the same as invoking ocamlmerlin single ocamlgrep -query ... < /dev/null but shorter to invoke and defaulting to human-friendly text output. This can be done provided as an external package or added to the merlin repo later.

mjambon added 2 commits May 6, 2026 01:32
Imports the LexiFi ocamlgrep tool as a sub-package of the merlin
repository, both as a [merlin-lib.ocamlgrep] library and as a
standalone [ocamlgrep] executable, plus an [ocamlmerlin ocamlgrep]
subcommand for editor integrations. The matching logic is the
upstream Match module from the LexiFi project, lightly adapted to
build against merlin's vendored compiler-libs (parser_raw / lexer_raw
in place of Parse, Typemod.initial_env in place of Compmisc, and a
build-time-generated Standard_library module in place of
Config.standard_library).

Bringing this under merlin's roof gives ocamlgrep merlin's per-OCaml-
version branch story and release pipeline (currently it is pinned to
a single OCaml minor version), and makes the matcher reusable from
other merlin-aware tooling.

The new [ocamlmerlin single ocamlgrep -query <pat>] subcommand
ignores the buffer on stdin, walks the project's _build/ tree for
.cmt files, and returns findings as JSON. Cram tests cover basic
matches, wildcards, type-constrained matches, parse errors, and a
smoke test for the standalone CLI.
Add a TODO in paths.ml flagging that walking _build/ ourselves is
the upstream-ocamlgrep approach but not the merlin-idiomatic one:
ocaml-index lets the build system (dune via @ocaml-index) enumerate
cmt files instead of discovering them by filesystem walk. Worth
revisiting if ocamlgrep wants to support build systems other than
Dune, or to reuse an existing .ocaml-index file.
@mjambon mjambon marked this pull request as draft May 6, 2026 02:49
Comment thread src/commands/query_json.ml Outdated
Comment thread src/frontend/query_protocol.ml
mjambon added 2 commits May 6, 2026 03:40
A finding can span more than one source line, but the previous
representation collapsed everything to the first line: column ranges
were truncated at the first line break, and the snippet was a single
context string. Two consecutive multi-line findings would be
indistinguishable in plain text.

Replace [Scan.finding] with a proper region:
  start_line, start_col, end_line, end_col, lines

The corresponding [ocamlgrep_finding] record in [query_protocol.ml]
mirrors this, and the JSON shape becomes
  { file, start: {line, col}, end: {line, col}, lines: [...] }
matching merlin's existing location convention.

The standalone CLI renders findings in OCaml-compiler style:

    foo.ml:5:10-22:
    5 |   let x = List.length xs

    foo.ml:6:2-8:9:
    6 |   match x with
    7 |   | None -> None
    8 |   | Some y -> Some y

Each finding starts with an unambiguous [file:start-end:] header so
no separator is needed between consecutive findings (per the
discussion on PR #1).

The cram tests now pipe the JSON through a small jq filter that
renders the same Format A, keeping the snapshots readable while
exercising the JSON shape end-to-end. The wildcards test was tweaked
to span three source lines and demonstrate the multi-line header.
Replace the one-line [Usage: ocamlgrep <pattern>] message with a
full guide that covers every form of the pattern language:
plain and numbered metavariables (__, __1, ...), suffix matching
of identifiers, optional-argument controls, type constraints, set
semantics for match/record clauses, and the [__.id] field shortcut.

Includes the example invocations from the upstream README, the
rendered output format, the [NO_COLOR] convention, and a pointer
to the merlin subcommand for JSON output.

Also extend the merlin [ocamlgrep] subcommand's help to mention
numbered metavariables (the original text only mentioned [__]),
to describe the JSON shape it returns, and to point users at
[ocamlgrep --help] for the full pattern guide.
@mjambon mjambon marked this pull request as ready for review May 6, 2026 04:02
The matching algorithm and most of the surrounding code in
src/ocamlgrep/ originate from Nicolás's work on the upstream
ocamlgrep project at LexiFi. Add an authorship line to each
[.ml]/[.mli] header so the attribution travels with the code,
independently of git history or the LICENSE file.
@gr-im
Copy link
Copy Markdown

gr-im commented May 6, 2026

I’m not a maintainer (or even a contributor) to Merlin, but I must admit I was delighted to read the proposal to integrate ocamlgrep into Merlin. However, I’m quite surprised because I have a strong feeling that things were done… in the wrong order.

In my opinion, ocamlgrep should expose an API under merlin-lib (taking advantage of the project concept, which is duplicated by both Merlin and ocamlgrep) by standardizing paths and locations, and should be invoked:

  • as a binary (ocamlgrep)
  • and as an analyze phase.

I’m quite surprised by the dependency lattice that was chosen.

@mjambon
Copy link
Copy Markdown
Contributor Author

mjambon commented May 6, 2026

Thanks for the feedback. I was originally hoping that the only module ocamlgrep would bring in would be the Match module and that everything related to scanning files would reuse (or add) shared merlin functionality. Let me dig deeper and iterate on this.

mjambon added 3 commits May 6, 2026 20:59
Previously [Scan.incremental_search] knew how to walk [_build/]
itself and process each cmt file it encountered, which conflated
two concerns and made [Scan] dune-aware. Per the discussion on
PR #1, factor this so that:

  - [Paths.collect_cmt_files] (new) does the dune-specific [_build/]
    walk and returns the list of cmt files;
  - [Scan.incremental_search] now takes that list as input and
    iterates over it without touching the filesystem.

This makes [Scan] a leaf processor: a future caller (e.g. the
merlin subcommand using [Mconfig.build_path], or a CLI invocation
that takes an explicit cmt-file list) can plug into the matcher
without going through [Paths] or its dune assumptions.

Both call sites (the standalone binary and the merlin subcommand)
are updated to call [Paths.collect_cmt_files] then
[Scan.incremental_search]. User-visible behavior is unchanged.
Replace [Paths.identify_dune_project] with [Paths.find], which
delegates to merlin's existing [Mconfig_dot.find_project_context]
to locate the project root. The previous implementation walked up
looking for a [_build/] directory; the new one walks up looking
for a [dune-project], [dune-workspace], or [.merlin] file -- the
same heuristic the rest of merlin uses.

[Mconfig_dot.context] is abstract in its [.mli], so we read what
we need from the second return value (the path to the project
config file): [dirname] is the project root, [basename] tells us
whether the configurator is dune or [.merlin].

[.merlin]-located projects are not yet supported -- the cmt
enumeration in [Paths.collect_cmt_files] still assumes a dune
[_build/] layout. Such cases now return a clear [Error] instead
of silently failing partway through.

This addresses one of the two duplications flagged in the PR
review: project-root detection. The other (dune-specific cmt
enumeration) remains in [Paths.collect_cmt_files] with the
existing TODO; lifting that to a shared merlin abstraction is a
larger design discussion (probably involving a new dot-protocol
directive).
Introduce a new merlin-lib.project library (src/project/) housing
[Dune_workspace], a structured view of the output of

    dune describe workspace --format=csexp --lang 0.1

The inner records are derived with [@@deriving of_sexp]; the
top-level entry-list dispatch is hand-written so unknown tags from
future dune --lang versions can be ignored without breaking us.

Use it instead of Paths.collect_cmt_files in both ocamlgrep entry
points (the standalone binary and the merlin subcommand). The
filesystem walk is gone: dune is now the source of truth for what
compiled and where, which:

  - filters out external dependencies via the (local true) field,
    rather than by chance-of-not-being-under-_build/;
  - sees dune-generated modules (.ml-gen) with their proper
    impl paths;
  - relies on dune's --lang 0.1 stability contract instead of an
    implicit assumption about _build/ layout.

Adds ppx_sexp_conv and sexplib0 as merlin-lib.opam dependencies
(via the ppx preprocess on the new library only).
Comment thread src/project/dune_workspace.mli Outdated
Comment thread src/project/dune_workspace.ml Outdated
…erge

Three changes:

* Use [string option] directly for [module_.{impl,intf,cmt,cmti}]
  in [Dune_workspace] -- as suggested on the PR. sexp_conv encodes
  [None] as [()] and [Some x] as [(x)], which is exactly how dune
  encodes its optional path fields. Drops the [impl_path] /
  [cmt_path] etc. accessor functions; callers can read the option
  fields directly.

* Catch [Unix.Unix_error(ENOENT, _, _)] when spawning dune so a
  missing [dune] binary becomes a friendly "could not find `dune`
  in PATH" message instead of a raw OCaml exception trace. Other
  Unix errors get a generic but informative message via
  [Unix.error_message].

* Collapse [Paths.find] + [Paths.init] into a single
  [Paths.init ()]. Callers always needed both back to back, and
  the previous split was easy to forget.
@mjambon
Copy link
Copy Markdown
Contributor Author

mjambon commented May 7, 2026

I made changes but I'm still working on it. I'll let you know when I think it's in a good state.

Comment thread src/frontend/query_protocol.ml
mjambon and others added 8 commits May 8, 2026 03:38
Per the PR comment: ocamlgrep_finding should use Location.t like
the rest of merlin's protocol. Apply the same change to
Scan.finding so we don't have to construct/deconstruct positions
across the lib/protocol boundary.

  Scan.finding         = { loc : Location.t; lines : string list }
  ocamlgrep_finding    = { loc : Location.t; lines : string list }

The loc carries the user-friendly project-relative source path in
loc_start.pos_fname (Scan rewrites it, so it's not whatever the
compiler recorded in the cmt). The renderer in the standalone
binary unpacks line/column from the positions; query_json uses
the existing [with_location ~with_file:true] helper, which
produces the same JSON shape we had before:

  { "file": ..., "start": {"line", "col"}, "end": {"line", "col"},
    "lines": [...] }

so the cram snapshots are unchanged.
Keep only the [merlin-lib.ocamlgrep] library and the [ocamlgrep]
merlin subcommand for now -- this is what the reviewer comments
have been focused on, and bundling a separate opam package
[ocamlgrep] in the same PR makes it longer to review.

The previous state (with the standalone binary, its opam package,
and a smoke test) is preserved on the [ocamlgrep-with-executable]
branch in this fork, ready to revive once the library/subcommand
split has landed upstream.

Removed:
  - ocamlgrep.opam
  - src/ocamlgrep/bin/ (executable: main.ml + dune)
  - tests/test-dirs/ocamlgrep/standalone.t/ (cram smoke test)
  - the corresponding cram-deps stanza in tests/dune

Updated the merlin subcommand's help to drop its dangling
reference to [ocamlgrep --help]; users wanting the full pattern
syntax are pointed at the upstream LexiFi/ocamlgrep README.
Add the [?context] parameter to [Dune_workspace.describe] (declared
in the .mli) by passing through [--context <name>] to dune. dune
parses both [--root] and [--context] positionally after [describe
workspace], so the flag goes at the end alongside [--root].
Paths and Scan are general-purpose cmt-scanning infrastructure, not
specific to ocamlgrep.  Moving them to src/project/ (merlin-lib.project)
makes them reusable by other tools.

Scan is now polymorphic in the finding type: incremental_search takes a
search function of type
  Cmt_format.cmt_infos -> source:string -> src_lines:string array -> 'a list
and emits ''a event' values.  Scan knows nothing about findings.

The ocamlgrep-specific pieces move into Match:
- 'finding' type (loc + context lines)
- 'parse_query': parse a query string with Merlin's vendored parser
- 'search_findings': ready-to-use search function for Scan, wrapping
  search_cmt and converting raw locations to findings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match is the only module that was left in src/ocamlgrep/lib/.  Moving
it to src/analysis/ integrates it properly with the rest of merlin's
analysis layer, which already carries all the required compiler-lib
dependencies (ocaml_typing, ocaml_preprocess, etc.).

The merlin-lib.ocamlgrep public library is removed.  query_commands
drops the ocamlgrep dep and references Match directly via
-open Merlin_analysis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The standalone ocamlgrep binary is gone, so match.mli is now the
canonical reference for the query language.  The module docstring
covers quick start, the full pattern syntax (wildcards, identifiers,
application, type constraints, match/record set semantics, field
access), examples, and the output format -- adapted from the --help
text of the backup-branch executable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/analysis/match.mli Outdated
query_protocol.ml: document that line numbers are 1-based and column
offsets are 0-based in ocamlgrep_finding, following merlin conventions.

match.mli: replace the text-format output description (which belonged
to the now-removed standalone binary) with a description of the JSON
output produced by the merlin subcommand, including the field layout
and the line/col conventions.

The Location.t change in query_protocol and the multiline JSON
representation in query_json were already addressed in earlier commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mjambon and others added 4 commits May 9, 2026 00:57
All matching functions now take the query pattern as the first argument
and the typed program as the second, matching the grep convention and
making partial application on the pattern natural.

The internal match-with constructs are also flipped so the parse-tree
constructor appears on the left and the typed-tree constructor on the
right, consistent with the new (pattern, program) reading order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The name 'Match' is too generic in an OCaml codebase.  'Expr_search'
mirrors the existing 'Type_search' module: Type_search searches by
type, Expr_search searches by expression structure.  A future
Pattern_search module could extend the same idea to OCaml patterns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mjambon
Copy link
Copy Markdown
Contributor Author

mjambon commented May 9, 2026

Status update:

  1. I removed the package that was providing an ocamlgrep executable. We can add it later or keep it outside of merlin; it would simply depend on merlin-lib.
  2. The main matching module where a pattern (expression AST) is scanned against a typedtree is now known as src/analysis/expr_search.mli.
  3. I didn't find general-purpose full-project scanning functionality so I adapted what we had in ocamlgrep and moved it to a new src/project folder. It offers iteration over cmt files which is used by the ocamlgrep subcommand.
  4. Identifying the cmt files for a Dune project is now done by invoking dune describe workspace. It produces well-defined output that is now available to access all the files known to a Dune build. It was previously done without invoking Dune by making assumptions about where it stores build products.

Please keep in mind that none of the above was really discussed with the project maintainers and can be removed, changed, etc. I could also split the PR into multiple PRs if it's easier to review. I'm looking forward to meeting with the project maintainers to get guidance on what's desirable and what isn't.

I'm also going to rewrite the PR description by hand which will be more concise but will better reflect my understanding of the issues.

Comment thread src/analysis/expr_search.ml Outdated
mjambon and others added 6 commits May 10, 2026 21:34
Replace catch-all wildcard patterns with explicit enumerations of all
constructors so that the compiler emits a warning when a future version
of compiler-libs adds a new constructor.

- path_matches_lident: enumerate (Lident|Ldot|Lapply) x (Pident|Pdot|Papply|Pextra_ty)
- constructor_match: enumerate (Lident|Ldot|Lapply) x (Lident|Ldot|Lapply)
- match_expr: enumerate all Pexp_* x Texp_* combinations
- match_typ: enumerate all Ptyp_* x type_desc combinations
- match_pat: enumerate all Ppat_* (GADT right side left as _)
- match_pat_expr: enumerate all Pexp_* (GADT right side left as _)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Missing cmt files (Sys_error on read) now return false from
process_one_cmt instead of propagating the exception.
incremental_search tallies found vs total and emits a single warning
reporting the coverage percentage and how many files are missing,
with a hint to run 'dune build @check'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relative pp_source paths silently failed Sys.file_exists when the
LSP server's CWD was not the project root, causing all 198 scanned
files to return zero findings. Use Filename.concat paths.project_root
to make all filesystem operations CWD-independent.
When cmt_sourcefile points into _build/default/ (which dune uses for
preprocessed or ppx-expanded files), redirect to the real source in
the project tree to avoid reading binary build artifacts as source.
When cmt_sourcefile points to a .pp.ml file, dune may have placed a
binary OCaml AST blob there (for ppx processing). Reading it produces
garbage. Instead, find the original .ml in the project source tree
and skip the digest check (which compares against the preprocessed
file, not the original).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread merlin-lib.opam Outdated
Comment on lines +17 to +18
"sexplib0"
"ppx_sexp_conv"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are very conservative when it comes to adding new dependencies to merlin, and depending on base is definitely not acceptable.

Copy link
Copy Markdown
Contributor Author

@mjambon mjambon May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for src/project/dune_workspace.ml.

The dependency on ppx_sexp_conv is the culprit. The runtime lib we really need here is csexp which is already a dependency of merlin.

To remove the dependency on ppx_sexp_conv, I tried copy-pasting the ml code generated by ppx_sexp_conv but it was ugly so I ended up with a simpler handrolled implementation (which incidentally is what Claude Code had done in the very first version of this PR before I told it to use ppx_sexp_conv instead). I think the result is reasonable now.

Comment thread src/project/dune
Comment on lines +20 to +27
; Generate standard_library.ml exposing the OCaml standard library path.
; The same trick is used in src/kernel/dune. Merlin's vendored Config
; module strips the upstream Config.standard_library field, so we recover
; it at build time from %{ocaml-config:standard_library}.
(rule
(targets standard_library.ml)
(action (write-file %{targets}
"let path = {|%{ocaml-config:standard_library}|}")))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't check but this is suspicious, can't we get the information from Merlin_kernel ?

Comment thread src/project/paths.mli Outdated
Comment thread src/project/paths.ml
Comment on lines +62 to +79
let collect_cmi_dirs paths =
let res = ref [] in
let rec walk dir =
Array.iter (fun entry ->
let entry = Filename.concat dir entry in
if Sys.is_directory entry then begin
if Filename.basename entry = "byte"
&& Array.exists
(fun name -> Filename.check_suffix name ".cmi")
(Sys.readdir entry)
then
res := entry :: !res;
walk entry
end
) (Sys.readdir dir)
in
walk paths.build_source_root;
List.rev !res
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this in addition to the output of dune describe ? I thought the goal was to avoid walking the tree ourselves.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inherited from the original implementation. I need to figure out the cmi situation and load paths.

Comment thread src/project/paths.ml
Comment on lines +86 to +89
The [Mconfig_dot.context] is abstract in the .mli, so we read what
we need from the second return value, which is the path of the
config file: its directory is the project root, its basename tells
us whether the configurator is dune or .merlin.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Merlin already does it as parts of the pipeline, maybe we could simply add this information to the pipeline's config and reuse it.

Comment thread src/analysis/expr_search.mli
Comment thread src/project/scan.mli
Comment on lines +11 to +15
This module handles the mechanics of reading cmt files, verifying
source digests, mapping compiler-internal paths back to
project-relative display paths, and emitting structured events.
The actual search logic and the finding type are supplied by the
caller.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not going deeper right now. As we discussed in the meeting, you should reuse the Cmt_cache when appropriate and might be able to reuse source-matching functionalities from the Locate module.

Comment thread src/frontend/query_commands.ml Outdated
Comment thread src/frontend/query_commands.ml Outdated
Comment on lines +956 to +958
{ Query_protocol.findings = List.rev !findings;
warnings = List.rev !warnings
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many way the query can fail, and some of these errors are directly imputable to the user, like query parsing errors. We should not rely on failwith and the generic catch all of merlin for these, but report them as part of the response to the query.

Copy link
Copy Markdown
Contributor Author

@mjambon mjambon May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what we get in VSCode currently for a malformed search query:

Image

How should user errors be signaled to merlin? I think a simple exception called User_error of string would do, while all other exceptions would be treated as internal errors, similar to HTTP's 4xx vs. 5xx error families.

The exception User_error "Could not parse search expression." would be shown as just Could not parse search expression. instead of Failure ("Could not parse search expression.").

Copy link
Copy Markdown
Contributor Author

@mjambon mjambon May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I did is add an errors field to the JSON response... but not to the OCaml type which is a (ocamlgrep_result, string) result. I'm not convinced that this is a great solution but it bypasses the questions of (1) representing Ok/Error in JSON and (2) a more generic error handling mechanism that would differentiate user errors from internal errors.

Here's what it looks like in VSCode now:
image

Comment thread src/commands/new_commands.ml
mjambon and others added 15 commits May 20, 2026 21:26
ppx_sexp_conv pulls in 'base' which the merlin maintainers want to
avoid. The generated record-parsing boilerplate is replaced by a small
set of direct Csexp.t helpers:

  field / string_field / bool_field / strings_field /
  string_opt_field / pairs_of / modules_field

These work directly on Csexp.t values — the sexp_of_csexp conversion
step and the Sexplib0 dependency are both gone. Unknown fields are
silently ignored (more forward-compatible than the ppx behaviour).
Parse errors are turned into (Error msg) via a try/with in parse_csexp.

sexplib0 and ppx_sexp_conv removed from src/project/dune.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants