Skip to content

Speed up parsing#73

Open
parsonsmatt wants to merge 8 commits into
dustin:mainfrom
parsonsmatt:mattp/speed-up-parsing
Open

Speed up parsing#73
parsonsmatt wants to merge 8 commits into
dustin:mainfrom
parsonsmatt:mattp/speed-up-parsing

Conversation

@parsonsmatt

Copy link
Copy Markdown

This PR speeds up parsing of Haskell modules and drops the time for generating the graph substantially on our work codebase.

$ time graphex-cabal
graphex-cabal  372.44s user 10.79s system 2334% cpu 16.415 total
$ time ~/.local/bin/graphex cabal --discover-all > dist/mwb.json
~/.local/bin/graphex cabal --discover-all > dist/mwb.json  6.24s user 3.59s system 471% cpu 2.085 total

With this change, we're dominated by time spent parsing the Cabal file.

@parsonsmatt parsonsmatt left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For transparency, Claude assisted with this PR

Comment thread src/Graphex/Parser.hs

parseFileImports :: FilePath -> IO [Import]
parseFileImports fp = do
contents <- removeBlockComments <$> TIO.readFile fp

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with the old parser is that we'd load the entire text file into memory, and then trash most of it, but not after trying to parse imports all throughout.

Comment thread src/Graphex/Parser.hs Outdated
Comment on lines +103 to +106
withFile fp ReadMode $ \h -> do
contents <- TLIO.hGetContents h
let !imports = extractImports contents
pure imports

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses lazy IO, so we're not reading the entire file - just enough to parse the header and imports.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)

Comment thread src/Graphex/Parser.hs Outdated
extractImports :: TL.Text -> [Import]
extractImports = mapMaybe (parseImportLine . TL.toStrict)
. filter isImportLine
. takeWhile isHeaderLine

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what prevents us from loading the entire file.

@ramirez7 ramirez7 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! 🚀

Comment thread src/Graphex/Parser.hs Outdated
|| TL.isPrefixOf "--" s
|| TL.isPrefixOf "{-#" s
|| TL.isPrefixOf "module " l
|| TL.isPrefixOf "#" s

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a dummy project in this repo we use as a test case. So feel free to add files or futz with the ones there.

Comment thread src/Graphex/Parser.hs Outdated
Comment on lines +103 to +106
withFile fp ReadMode $ \h -> do
contents <- TLIO.hGetContents h
let !imports = extractImports contents
pure imports

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)

Comment thread src/Graphex/Parser.hs Outdated
manyTill anySingle (string "-}")
-- | Remove lines that fall inside block comments.
stripBlockComments :: [TL.Text] -> [TL.Text]
stripBlockComments = go False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that lazy IO rewards this direct style by doing what you want (composably with lines too!) is so cool.

Like, I guess you could implement this more explicitly with a streaming library, but list of text is much easier on the eyes 😁

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider using conduit or something to be really precise about it - and did in fact have resource exhausted issues with non-explicit closing! but I figured... hey... this is probably okay 😂

parsonsmatt and others added 7 commits February 25, 2026 18:52
The takeWhile isHeaderLine approach silently dropped imports when it
encountered unrecognized header patterns (standalone where, multi-line
module exports, etc). Replace with simple filter over all lines -
robust and still fast since the line-prefix check is essentially free.

Also switch from lazy IO back to strict TIO.readFile to avoid file
handle exhaustion under concurrency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add isCodeLine blacklist (data, type, newtype, class, instance, etc.)
to stop scanning once we hit top-level declarations. Combined with
lazy IO via withFile + hGetContents, this means only the import header
of each file is read from disk.

Wall clock on mwb2 (20K modules): 1.9s, down from 2.6s with strict IO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@parsonsmatt parsonsmatt left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior version had some bugs, so I've reworked it and extended tests.

Comment thread src/Graphex/Parser.hs
Comment on lines +65 to +70
extractImports :: TL.Text -> [Import]
extractImports = mapMaybe (parseImportLine . TL.toStrict)
. filter isImportLine
. takeWhile (not . isCodeLine)
. stripBlockComments
. TL.lines

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is imo very cool about Haskell laziness. takeWhile (not . isCodeLine) . stripBlockComments at first looks like it'll go through and strip all block comments from the whole file. However, stripBlockComments is a good lazy consumer/producer which emits lines at a time, and takeWhile is also a good lazy consumer/producer. This means that we demand lines from takeWhile until the predicate triggers short-circuiting, at which point we stop loading the file into memory.

Locally, the use of lazy Text improves performance on our codebase from 2.6s to 1.9s.

Comment thread src/Graphex/Parser.hs
extractImports :: TL.Text -> [Import]
extractImports = mapMaybe (parseImportLine . TL.toStrict)
. filter isImportLine
. takeWhile (not . isCodeLine)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior iteration had a bug where it tried to recognize a header line. But this was playing whackamole on our codebase, so we tried just strict text with no short circuit. Perf was still good - 2.6s vs 16s - but I felt like we could do better. Now, we terminate parsing/loading at the first sign of a non-header line, which is easier to recognize and also safer (if we recognize no code lines, we just parse the whole file - inefficient but correct).

Comment thread src/Graphex/Parser.hs
Comment on lines +77 to +86
stripBlockComments :: [TL.Text] -> [TL.Text]
stripBlockComments lines = foldr step (const []) lines False
where
step line rest inComment
| inComment =
rest (not $ TL.isInfixOf "-}" line)
| isOpenComment line =
rest (not $ TL.isInfixOf "-}" line)
| otherwise =
line : rest False

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the CPS here is a little cute, but it retains laziness. unsure if this is better or if the go direct style manual recursion is nicer.

would also be 100% amenable to a non-lazy-IO solution with directly encoded streaming

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parsonsmatt#1 I realized that this is one of those rare times when lazy State is viable. It seems we can use it + filterM and get something nice but equally performant!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants