Speed up parsing#73
Conversation
parsonsmatt
left a comment
There was a problem hiding this comment.
For transparency, Claude assisted with this PR
|
|
||
| parseFileImports :: FilePath -> IO [Import] | ||
| parseFileImports fp = do | ||
| contents <- removeBlockComments <$> TIO.readFile fp |
There was a problem hiding this comment.
One issue with the old parser is that we'd load the entire text file into memory, and then trash most of it, but not after trying to parse imports all throughout.
| withFile fp ReadMode $ \h -> do | ||
| contents <- TLIO.hGetContents h | ||
| let !imports = extractImports contents | ||
| pure imports |
There was a problem hiding this comment.
This uses lazy IO, so we're not reading the entire file - just enough to parse the header and imports.
There was a problem hiding this comment.
This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)
| extractImports :: TL.Text -> [Import] | ||
| extractImports = mapMaybe (parseImportLine . TL.toStrict) | ||
| . filter isImportLine | ||
| . takeWhile isHeaderLine |
There was a problem hiding this comment.
This is what prevents us from loading the entire file.
| || TL.isPrefixOf "--" s | ||
| || TL.isPrefixOf "{-#" s | ||
| || TL.isPrefixOf "module " l | ||
| || TL.isPrefixOf "#" s |
There was a problem hiding this comment.
There's a dummy project in this repo we use as a test case. So feel free to add files or futz with the ones there.
| withFile fp ReadMode $ \h -> do | ||
| contents <- TLIO.hGetContents h | ||
| let !imports = extractImports contents | ||
| pure imports |
There was a problem hiding this comment.
This is one of those nice lazy IO uses. Basically like Haskell mmap (maybe it does mmap under the hood?)
| manyTill anySingle (string "-}") | ||
| -- | Remove lines that fall inside block comments. | ||
| stripBlockComments :: [TL.Text] -> [TL.Text] | ||
| stripBlockComments = go False |
There was a problem hiding this comment.
The fact that lazy IO rewards this direct style by doing what you want (composably with lines too!) is so cool.
Like, I guess you could implement this more explicitly with a streaming library, but list of text is much easier on the eyes 😁
There was a problem hiding this comment.
I did consider using conduit or something to be really precise about it - and did in fact have resource exhausted issues with non-explicit closing! but I figured... hey... this is probably okay 😂
The takeWhile isHeaderLine approach silently dropped imports when it encountered unrecognized header patterns (standalone where, multi-line module exports, etc). Replace with simple filter over all lines - robust and still fast since the line-prefix check is essentially free. Also switch from lazy IO back to strict TIO.readFile to avoid file handle exhaustion under concurrency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add isCodeLine blacklist (data, type, newtype, class, instance, etc.) to stop scanning once we hit top-level declarations. Combined with lazy IO via withFile + hGetContents, this means only the import header of each file is read from disk. Wall clock on mwb2 (20K modules): 1.9s, down from 2.6s with strict IO. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parsonsmatt
left a comment
There was a problem hiding this comment.
Prior version had some bugs, so I've reworked it and extended tests.
| extractImports :: TL.Text -> [Import] | ||
| extractImports = mapMaybe (parseImportLine . TL.toStrict) | ||
| . filter isImportLine | ||
| . takeWhile (not . isCodeLine) | ||
| . stripBlockComments | ||
| . TL.lines |
There was a problem hiding this comment.
This is imo very cool about Haskell laziness. takeWhile (not . isCodeLine) . stripBlockComments at first looks like it'll go through and strip all block comments from the whole file. However, stripBlockComments is a good lazy consumer/producer which emits lines at a time, and takeWhile is also a good lazy consumer/producer. This means that we demand lines from takeWhile until the predicate triggers short-circuiting, at which point we stop loading the file into memory.
Locally, the use of lazy Text improves performance on our codebase from 2.6s to 1.9s.
| extractImports :: TL.Text -> [Import] | ||
| extractImports = mapMaybe (parseImportLine . TL.toStrict) | ||
| . filter isImportLine | ||
| . takeWhile (not . isCodeLine) |
There was a problem hiding this comment.
The prior iteration had a bug where it tried to recognize a header line. But this was playing whackamole on our codebase, so we tried just strict text with no short circuit. Perf was still good - 2.6s vs 16s - but I felt like we could do better. Now, we terminate parsing/loading at the first sign of a non-header line, which is easier to recognize and also safer (if we recognize no code lines, we just parse the whole file - inefficient but correct).
| stripBlockComments :: [TL.Text] -> [TL.Text] | ||
| stripBlockComments lines = foldr step (const []) lines False | ||
| where | ||
| step line rest inComment | ||
| | inComment = | ||
| rest (not $ TL.isInfixOf "-}" line) | ||
| | isOpenComment line = | ||
| rest (not $ TL.isInfixOf "-}" line) | ||
| | otherwise = | ||
| line : rest False |
There was a problem hiding this comment.
the CPS here is a little cute, but it retains laziness. unsure if this is better or if the go direct style manual recursion is nicer.
would also be 100% amenable to a non-lazy-IO solution with directly encoded streaming
There was a problem hiding this comment.
parsonsmatt#1 I realized that this is one of those rare times when lazy State is viable. It seems we can use it + filterM and get something nice but equally performant!
This PR speeds up parsing of Haskell modules and drops the time for generating the graph substantially on our work codebase.
With this change, we're dominated by time spent parsing the Cabal file.