Speed up parsing by parsonsmatt · Pull Request #73 · dustin/graphex

parsonsmatt · 2026-02-26T00:48:50Z

This PR speeds up parsing of Haskell modules and drops the time for generating the graph substantially on our work codebase.

$ time graphex-cabal
graphex-cabal  372.44s user 10.79s system 2334% cpu 16.415 total
$ time ~/.local/bin/graphex cabal --discover-all > dist/mwb.json
~/.local/bin/graphex cabal --discover-all > dist/mwb.json  6.24s user 3.59s system 471% cpu 2.085 total

With this change, we're dominated by time spent parsing the Cabal file.

parsonsmatt

For transparency, Claude assisted with this PR

parsonsmatt · 2026-02-26T00:49:40Z


 parseFileImports :: FilePath -> IO [Import]
-parseFileImports fp = do
-  contents <- removeBlockComments <$> TIO.readFile fp


One issue with the old parser is that we'd load the entire text file into memory, and then trash most of it, but not after trying to parse imports all throughout.

parsonsmatt · 2026-02-26T00:50:08Z

+  withFile fp ReadMode $ \h -> do
+    contents <- TLIO.hGetContents h
+    let !imports = extractImports contents
+    pure imports


This uses lazy IO, so we're not reading the entire file - just enough to parse the header and imports.

This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)

parsonsmatt · 2026-02-26T00:50:28Z

+extractImports :: TL.Text -> [Import]
+extractImports = mapMaybe (parseImportLine . TL.toStrict)
+               . filter isImportLine
+               . takeWhile isHeaderLine


This is what prevents us from loading the entire file.

ramirez7

Thank you! 🚀

ramirez7 · 2026-02-26T01:26:41Z

+    || TL.isPrefixOf "--"     s
+    || TL.isPrefixOf "{-#"    s
+    || TL.isPrefixOf "module " l
+    || TL.isPrefixOf "#"      s


There's a dummy project in this repo we use as a test case. So feel free to add files or futz with the ones there.

ramirez7 · 2026-02-26T01:25:09Z

+  withFile fp ReadMode $ \h -> do
+    contents <- TLIO.hGetContents h
+    let !imports = extractImports contents
+    pure imports


This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)

ramirez7 · 2026-02-26T01:30:16Z

-  manyTill anySingle (string "-}")
+-- | Remove lines that fall inside block comments.
+stripBlockComments :: [TL.Text] -> [TL.Text]
+stripBlockComments = go False


The fact that lazy IO rewards this direct style by doing what you want (composably with lines too!) is so cool.

Like, I guess you could implement this more explicitly with a streaming library, but list of text is much easier on the eyes 😁

I did consider using conduit or something to be really precise about it - and did in fact have resource exhausted issues with non-explicit closing! but I figured... hey... this is probably okay 😂

The takeWhile isHeaderLine approach silently dropped imports when it encountered unrecognized header patterns (standalone where, multi-line module exports, etc). Replace with simple filter over all lines - robust and still fast since the line-prefix check is essentially free. Also switch from lazy IO back to strict TIO.readFile to avoid file handle exhaustion under concurrency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add isCodeLine blacklist (data, type, newtype, class, instance, etc.) to stop scanning once we hit top-level declarations. Combined with lazy IO via withFile + hGetContents, this means only the import header of each file is read from disk. Wall clock on mwb2 (20K modules): 1.9s, down from 2.6s with strict IO. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

parsonsmatt

Prior version had some bugs, so I've reworked it and extended tests.

parsonsmatt · 2026-02-27T14:39:35Z

+extractImports :: TL.Text -> [Import]
+extractImports = mapMaybe (parseImportLine . TL.toStrict)
+               . filter isImportLine
+               . takeWhile (not . isCodeLine)
+               . stripBlockComments
+               . TL.lines


This is imo very cool about Haskell laziness. takeWhile (not . isCodeLine) . stripBlockComments at first looks like it'll go through and strip all block comments from the whole file. However, stripBlockComments is a good lazy consumer/producer which emits lines at a time, and takeWhile is also a good lazy consumer/producer. This means that we demand lines from takeWhile until the predicate triggers short-circuiting, at which point we stop loading the file into memory.

Locally, the use of lazy Text improves performance on our codebase from 2.6s to 1.9s.

parsonsmatt · 2026-02-27T14:41:17Z

+extractImports :: TL.Text -> [Import]
+extractImports = mapMaybe (parseImportLine . TL.toStrict)
+               . filter isImportLine
+               . takeWhile (not . isCodeLine)


The prior iteration had a bug where it tried to recognize a header line. But this was playing whackamole on our codebase, so we tried just strict text with no short circuit. Perf was still good - 2.6s vs 16s - but I felt like we could do better. Now, we terminate parsing/loading at the first sign of a non-header line, which is easier to recognize and also safer (if we recognize no code lines, we just parse the whole file - inefficient but correct).

parsonsmatt · 2026-02-27T15:02:38Z

+stripBlockComments :: [TL.Text] -> [TL.Text]
+stripBlockComments lines = foldr step (const []) lines False
+    where
+        step line rest inComment
+            | inComment =
+                rest (not $ TL.isInfixOf "-}" line)
+            | isOpenComment line =
+                rest (not $ TL.isInfixOf "-}" line)
+            | otherwise =
+                line : rest False


the CPS here is a little cute, but it retains laziness. unsure if this is better or if the go direct style manual recursion is nicer.

would also be 100% amenable to a non-lazy-IO solution with directly encoded streaming

parsonsmatt#1 I realized that this is one of those rare times when lazy State is viable. It seems we can use it + filterM and get something nice but equally performant!

Speed up parsing

d54698d

parsonsmatt commented Feb 26, 2026

View reviewed changes

ramirez7 approved these changes Feb 26, 2026

View reviewed changes

parsonsmatt and others added 7 commits February 25, 2026 18:52

fix bug for multiline imports

3684505

A few more tests cases

3704556

more parsing tests

b4a7d3b

legibility

8baee1f

clean up a bit

69df4bd

parsonsmatt commented Feb 27, 2026

View reviewed changes

parsonsmatt mentioned this pull request Feb 27, 2026

graphex hpack mode #74

Draft

ramirez7 mentioned this pull request Mar 2, 2026

Use lazy State for stripBlockComments parsonsmatt/graphex#1

Open

Conversation

parsonsmatt commented Feb 26, 2026

Uh oh!

parsonsmatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ramirez7 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

parsonsmatt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants