Skip to content

Parsers combinators #53

@vinipsmaker

Description

@vinipsmaker

So, I've been keeping the topic of parsers combinators on the back of my head since ~2016 when you introduced me to the idea.

TBH I had some resistance to the approach of parser combinators because they're just too difficult to implement and of limited usefulness/reuse (or so I thought). However recently I was playing with re2c on a new project and in this new project I had the chance to declare several mini-parsers and jump from one to another while they reuse data pretty seamlessly. The idea immediately struck me. I was using parsers combinators.

Usually the talks around the topic focus on functional languages and the approaches I've seen are pretty complex (and limited too TBH). However on re2c I had the chance to combine parsers under an imperative approach and the result was pretty pleasant and much more natural on C++.

The experiment also made me realize that all this time I've been facing parsers combinators under the wrong angle. Usually parsers combinators are presented focusing on the reuse of matchers for mini-parsers and a few very simple combinators/glues. However it has weak appeal given the rules for mini-parsers (i.e. a few very simple regexes that could be implemented very easily by hand) (1) offer very little reuse and (2) only offer reuse for the easy logic. When I'm interested in reusing a parser I want to reuse the logic for the hard stuff (e.g. state/nested-level tracking on JSON parsers or the complex algorithm to determine the body length on HTTP). The hard logic is not on the matchers, but on the code for everything else. It's the logic from this big self-contained parser that I want to use, not a few matchers that I could have written myself.

That got me thinking: what is it that will enable me to reuse the logic from the big parser? What is it that re2c is doing to enable me to combine mini-parsers so seamlessly? The answer was right in front of me. Combining mini-parsers the re2c-way allows one to defer the handling of some token straight away to another mini-parser. Let's begin with a JSON example:

{ "hello": "world" }

Suppose you want a syntactic extension that allows one to encode dates in JSON documents:

{ "hello": %2021-07-07% }

What you want here is: when the token %2021-07-07% is reached, a different parser is invoked to handle it and then we advance the main parser as well. So, really, it's two things that are seen by the main parser:

  1. When it reaches a certain point, a hole appears. IOW, it's a chunked parser.
  2. When the hole is over, the state of the parser should be advanced as if it consumed a value token.

Trial.Protocol already offers #1. #2 can be emulated. For instance, before the parser for our syntactic date extension returns, it could call reader.next("null"). However I contend it's desirable to offer an interface that consumes the already-decoded token/value to skip some unnecessary logic. For instance, this mini date parser could be calling reader.next<token::null>() instead. Could we have this addition? I don't think it's a polemic addition to the interface.

That covers the main part of the problem. And then, for the last part, we have the matchers again: what to do on token::code::error_unexpected_token? If we're using a chunked parser then this should be a non-fatal error as it'll allow for a fallback mini-parser to be invoked to handle just that token. This change would allow one to reuse Trial.Protocol parser to parse JSON documents with comments, for instance. However this little point on non-fatal unexpected-token is polemic and it deserves way more thought than the previous points. In the meantime, could chunk_reader.next<token::null>() be added?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions