C++ Logo


Advanced search

Re: Beyond regex

From: Domen Vrankar <domen.vrankar_at_[hidden]>
Date: Mon, 4 Nov 2019 20:11:28 +0100
On Tue, Oct 29, 2019 at 3:54 PM Justin Bassett via Std-Proposals <
std-proposals_at_[hidden]> wrote:

> Scannerless parsing is an approach to parsing. However, lexing simplifies
> the grammar, making it much easier for humans to understand. With lexing,
> the parsing step thinks at a higher level, with the nonterminals being the
> tokens output by the lexer. Without lexing, the parsers now simultaneously
> doing the work of the lexer and the nonterminals are all the valid
> characters.
> It makes sense to do lexing first, as lexing is much simpler. Lexing
> handles the regular language subsets of more complicated languages, and
> parsing combines them together with a usually context free grammar.
> I can't verify this claim, but I've heard that part of why we separate
> lexing from parsing is for speed. Lexing can be done with a DFA, whereas
> parsing cannot. So if we combine lexing and parsing in a scannerless
> parser, we cannot yake advantage of DFAs in the regular language subsets of
> our language.
> See also: https://cs.stackexchange.com/a/39902
> As for repeating the grammar twice, not so. The lexical "grammar" is much
> simpler than the rest of the language (e.g. no nesting). It's basically a
> table connecting regular expressions with the corresponding token, i.e.
> trivial rules. The actual grammar has much more complicated structures.

I understand all of this (didn't know about performance as I never measured
that part) but for me writing spirit X3 grammar rules was never something
that seemed like it needs grammar simplification while on the other hand
turtles all the way down logic of writing basic rules for single elements
and then combining those rules with the same syntax into more complex
grammars seemed like a great plus.

For that reason I'd be more comfortable if the author of the proposal would
add comparison examples with spirit X3 in the proposal to outline exactly
how much easier/harder it would be to write a complete parser with one and
the other (perhaps since this is only a lexer proposal it would be nice to
see how it would integrate with either spirit X3 or something like that).

Lexer is perhaps a nice basic building block but at least for me the useful
tool for parsing go like this: string compare -> regex -> grammar parser.
So even if/when lexer is useful I don't see it as a basic building block (I
don't expect many people using it to write their own parsers on top of it
so it seems like an extremely domain specific component) - it might just be
me but perhaps others also need convincing that this is not the case.


> --Justin
> On Tue, Oct 29, 2019, 5:38 AM Domen Vrankar via Std-Proposals <
> std-proposals_at_[hidden]> wrote:
>> On Tue, Oct 29, 2019, 9:52 AM Dejan Milosavljevic via Std-Proposals <
>> std-proposals_at_[hidden]> wrote:
>>> 1. ... doesn't have any motivation.
>>> - https://en.wikipedia.org/wiki/Lex_(software)
>>> - https://en.wikipedia.org/wiki/Comparison_of_parser_generators
>>> Why so many libraries? C, C#, C++, Java, Haskell, Eiffel, Go.
>> I do have a question regarding this first point and the proposal in
>> general but just haven't gotten arround to asking.
>> I'm speaking from memory so I could be wrong but when boost spirit x3
>> came out I remember the author saying something regarding lexers not being
>> needed (as much?) so x3 doesn't have the support for it.
>> Old spirit had it but even back then I preferred writing parsers without
>> using it. I always felt that parsing flow should be: grammar -> ast ->
>> working with ast
>> And that tokenization beforehand is repeating the grammar writing step
>> and muddying the code (making it longer, more spread out and thereby less
>> readable/extensible/maintainable).
>> I haven't checked the feasability of writing x3 style parsers with the
>> proposed pattern matching but my guess is that it should eventually be
>> possible. This would cover all my cases without needing a lexer library.
>> Can you explain a bit better why having a lexer instead of writing x3
>> grammar direclty would be preferred and in which cases?
>> Thanks,
>> Domen
>>> --
>> Std-Proposals mailing list
>> Std-Proposals_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2019-11-04 13:14:23