std-proposals: Re: Beyond regex

From: Justin Bassett <jbassett271_at_[hidden]>
Date: Tue, 29 Oct 2019 07:53:49 -0700

Scannerless parsing is an approach to parsing. However, lexing simplifies
the grammar, making it much easier for humans to understand. With lexing,
the parsing step thinks at a higher level, with the nonterminals being the
tokens output by the lexer. Without lexing, the parsers now simultaneously
doing the work of the lexer and the nonterminals are all the valid
characters.

It makes sense to do lexing first, as lexing is much simpler. Lexing
handles the regular language subsets of more complicated languages, and
parsing combines them together with a usually context free grammar.

I can't verify this claim, but I've heard that part of why we separate
lexing from parsing is for speed. Lexing can be done with a DFA, whereas
parsing cannot. So if we combine lexing and parsing in a scannerless
parser, we cannot yake advantage of DFAs in the regular language subsets of
our language.

See also: https://cs.stackexchange.com/a/39902

As for repeating the grammar twice, not so. The lexical "grammar" is much
simpler than the rest of the language (e.g. no nesting). It's basically a
table connecting regular expressions with the corresponding token, i.e.
trivial rules. The actual grammar has much more complicated structures.

--Justin

On Tue, Oct 29, 2019, 5:38 AM Domen Vrankar via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> On Tue, Oct 29, 2019, 9:52 AM Dejan Milosavljevic via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>
>> 1. ... doesn't have any motivation.
>> - https://en.wikipedia.org/wiki/Lex_(software)
>> - https://en.wikipedia.org/wiki/Comparison_of_parser_generators
>> Why so many libraries? C, C#, C++, Java, Haskell, Eiffel, Go.
>>
>
> I do have a question regarding this first point and the proposal in
> general but just haven't gotten arround to asking.
>
> I'm speaking from memory so I could be wrong but when boost spirit x3 came
> out I remember the author saying something regarding lexers not being
> needed (as much?) so x3 doesn't have the support for it.
>
> Old spirit had it but even back then I preferred writing parsers without
> using it. I always felt that parsing flow should be: grammar -> ast ->
> working with ast
>
> And that tokenization beforehand is repeating the grammar writing step and
> muddying the code (making it longer, more spread out and thereby less
> readable/extensible/maintainable).
>
> I haven't checked the feasability of writing x3 style parsers with the
> proposed pattern matching but my guess is that it should eventually be
> possible. This would cover all my cases without needing a lexer library.
>
> Can you explain a bit better why having a lexer instead of writing x3
> grammar direclty would be preferred and in which cases?
>
> Thanks,
> Domen
>
>> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>

Received on 2019-10-29 09:56:17