Scannerless parsing does away with a separate lexing phase. However, lexing simplifies the grammar, making it much easier for humans to understand. With lexing, the parsing step operates at a higher level: the terminals of the grammar are the tokens output by the lexer. Without lexing, the parser is simultaneously doing the work of the lexer, and the terminals are individual characters.

It makes sense to do lexing first, as lexing is much simpler. Lexing handles the regular-language subsets of more complicated languages, and parsing combines them with a (usually) context-free grammar.

I can't verify this claim, but I've heard that part of why we separate lexing from parsing is speed. Lexing can be done with a DFA, whereas general context-free parsing cannot. So if we combine lexing and parsing in a scannerless parser, we cannot take advantage of DFAs for the regular-language subsets of our language.

See also: https://cs.stackexchange.com/a/39902

As for repeating the grammar twice: not so. The lexical "grammar" is much simpler than the rest of the language (e.g. no nesting). It's basically a table connecting regular expressions with the corresponding tokens, i.e. trivial rules. The actual grammar has much more complicated structure.

--Justin

On Tue, Oct 29, 2019, 5:38 AM Domen Vrankar via Std-Proposals <std-proposals@lists.isocpp.org> wrote:
On Tue, Oct 29, 2019, 9:52 AM Dejan Milosavljevic via Std-Proposals <std-proposals@lists.isocpp.org> wrote:
1. ... doesn't have any motivation.
- https://en.wikipedia.org/wiki/Lex_(software)
- https://en.wikipedia.org/wiki/Comparison_of_parser_generators
 Why so many libraries?  C, C#, C++, Java, Haskell, Eiffel, Go.

I do have a question regarding this first point and the proposal in general, but I just haven't gotten around to asking it.

I'm speaking from memory so I could be wrong, but when Boost Spirit X3 came out I remember the author saying something about lexers not being needed (as much?), so X3 doesn't have support for them.

Old Spirit had one, but even back then I preferred writing parsers without it. I always felt that the parsing flow should be: grammar -> AST -> working with the AST

And that tokenizing beforehand repeats the grammar-writing step and muddies the code (making it longer, more spread out, and thereby less readable/extensible/maintainable).

I haven't checked the feasibility of writing X3-style parsers with the proposed pattern matching, but my guess is that it should eventually be possible. This would cover all my cases without needing a lexer library.

Can you explain a bit better why having a lexer, instead of writing an X3 grammar directly, would be preferred, and in which cases?

Thanks,
Domen
--
Std-Proposals mailing list
Std-Proposals@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals