C++ Logo


Advanced search

Re: Beyond regex

From: Dejan Milosavljevic <dmilos_at_[hidden]>
Date: Thu, 7 Nov 2019 14:53:51 +0100
> It looks like this library creates the lexer at runtime, in contrast to the software 'lex', which generates the lexer at compile-time.
In this case that is the real question run-time(RT) vs.
compile-time(CT). Well known topic.
I see that as one of key point in here.

Lets start with reusing of `std::regex`. `std::regex` can be inspected
only in RT. So in this path we end at RT.
If we require CT new_regex must be compilable/inspect-able in CT.
We will have redundancy ( or new feature ), two regex implementation.
One for RT another for CT.

I guess we all want both.
Important feature must be cooperation between CT and RT libraries.

Appear that above comment are off-topic but it's define feature of any
kind libraries that touch subject like `std::regex`.

Spirit X3.
 - It self does not use std::regexp in any way. Predefined type int_,
char_, double_ tend to overcome that.
 - Cover Type-2 grammar, which require complexity of execution more
than constant. See https://en.wikipedia.org/wiki/Pushdown_automaton

Lex and this proposal.
 - Require usage of `std::regex`
 - Behave as Type-3 grammar. Require execution of constant complexity.
See https://en.wikipedia.org/wiki/Pushdown_automaton.

Lex( this proposal and general ) vs Spirit X3.
Even if Spirit::X3 implement cooperation with std::regex this two
libraries can not be compared.
Type-2 vs Type-3 grammar.

Let me exaggerate this comparison: Try to compare: `std::vector` and
`std::regex`. Which one is better?

On Mon, Nov 4, 2019 at 8:12 PM Domen Vrankar via Std-Proposals
<std-proposals_at_[hidden]> wrote:
> On Tue, Oct 29, 2019 at 3:54 PM Justin Bassett via Std-Proposals <std-proposals_at_[hidden]> wrote:
>> Scannerless parsing is an approach to parsing. However, lexing simplifies the grammar, making it much easier for humans to understand. With lexing, the parsing step thinks at a higher level, with the nonterminals being the tokens output by the lexer. Without lexing, the parsers now simultaneously doing the work of the lexer and the nonterminals are all the valid characters.
>> It makes sense to do lexing first, as lexing is much simpler. Lexing handles the regular language subsets of more complicated languages, and parsing combines them together with a usually context free grammar.
>> I can't verify this claim, but I've heard that part of why we separate lexing from parsing is for speed. Lexing can be done with a DFA, whereas parsing cannot. So if we combine lexing and parsing in a scannerless parser, we cannot yake advantage of DFAs in the regular language subsets of our language.
>> See also: https://cs.stackexchange.com/a/39902
>> As for repeating the grammar twice, not so. The lexical "grammar" is much simpler than the rest of the language (e.g. no nesting). It's basically a table connecting regular expressions with the corresponding token, i.e. trivial rules. The actual grammar has much more complicated structures.
> I understand all of this (didn't know about performance as I never measured that part) but for me writing spirit X3 grammar rules was never something that seemed like it needs grammar simplification while on the other hand turtles all the way down logic of writing basic rules for single elements and then combining those rules with the same syntax into more complex grammars seemed like a great plus.
> For that reason I'd be more comfortable if the author of the proposal would add comparison examples with spirit X3 in the proposal to outline exactly how much easier/harder it would be to write a complete parser with one and the other (perhaps since this is only a lexer proposal it would be nice to see how it would integrate with either spirit X3 or something like that).
> Lexer is perhaps a nice basic building block but at least for me the useful tool for parsing go like this: string compare -> regex -> grammar parser. So even if/when lexer is useful I don't see it as a basic building block (I don't expect many people using it to write their own parsers on top of it so it seems like an extremely domain specific component) - it might just be me but perhaps others also need convincing that this is not the case.
> Regards,
> Domen
>> --Justin
>> On Tue, Oct 29, 2019, 5:38 AM Domen Vrankar via Std-Proposals <std-proposals_at_[hidden]> wrote:
>>> On Tue, Oct 29, 2019, 9:52 AM Dejan Milosavljevic via Std-Proposals <std-proposals_at_[hidden]> wrote:
>>>> 1. ... doesn't have any motivation.
>>>> - https://en.wikipedia.org/wiki/Lex_(software)
>>>> - https://en.wikipedia.org/wiki/Comparison_of_parser_generators
>>>> Why so many libraries? C, C#, C++, Java, Haskell, Eiffel, Go.
>>> I do have a question regarding this first point and the proposal in general but just haven't gotten arround to asking.
>>> I'm speaking from memory so I could be wrong but when boost spirit x3 came out I remember the author saying something regarding lexers not being needed (as much?) so x3 doesn't have the support for it.
>>> Old spirit had it but even back then I preferred writing parsers without using it. I always felt that parsing flow should be: grammar -> ast -> working with ast
>>> And that tokenization beforehand is repeating the grammar writing step and muddying the code (making it longer, more spread out and thereby less readable/extensible/maintainable).
>>> I haven't checked the feasability of writing x3 style parsers with the proposed pattern matching but my guess is that it should eventually be possible. This would cover all my cases without needing a lexer library.
>>> Can you explain a bit better why having a lexer instead of writing x3 grammar direclty would be preferred and in which cases?
>>> Thanks,
>>> Domen
>>> --
>>> Std-Proposals mailing list
>>> Std-Proposals_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>> --
>> Std-Proposals mailing list
>> Std-Proposals_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals

Received on 2019-11-07 07:56:20