liaison: Re: [wg14/wg21 liaison] (SC22WG14.19243) grammar incompatibilities with lambdas

From: Aaron Ballman <aaron_at_[hidden]>
Date: Mon, 12 Apr 2021 09:44:59 -0400

On Mon, Apr 12, 2021 at 8:51 AM Jens Gustedt <jens.gustedt_at_[hidden]> wrote:
>
> Aaron,
>
> on Mon, 12 Apr 2021 07:59:03 -0400 you (Aaron Ballman
> <aaron_at_[hidden]>) wrote:
>
> > > Then the token sequence
> > >
> > > `[` `something` `]`
> > >
> > > can be two things, namely either a designator or the start of a
> > > lambda. Both of these can appear in the same context in an
> > > initialization of an array.
> > >
> > > This already needs a lookahead that is 2 or 3 tokens to
> > > disambiguate, and integrating that into the parser already needs
> > > some lifting. (And if `something` is not an ICE for an
> > > implementation, they still must add a disambiguation rule, here.)
> > >
> > > The attempted introduction of designated initializers into C++
> > > should produce the same ambiguity. But here `something` is
> > > definitively an ICE.
> > >
> > > If we add attributes to the picture, things really become
> > > "interesting". In the LALR grammar this adds about 20 shift/reduce
> > > conflicts. The token sequence
> > >
> > > `int` `A` `[` `[` `HAL`
> > >
> > > could be introducing a declaration of
> > >
> > > - a VLA where the bound will be given by a lambda
> > > (disambiguated by a following `]`, `=` or `,` token)
> > >
> > > - an `int` object with a vendor attribute for vendor "HAL" to
> > > the identifier `A` (disambiguated by a following `::` token)
> > >
> >
> > I agree that introducing lambda syntax to C would cause a parsing
> > ambiguity there. C++ has the same, it could be a regular array where
> > the bound is given by a constexpr lambda or an int object with an
> > attribute.
> >
> > C++ makes this unambiguously an attribute per [dcl.attr.grammar]p7 and
> > we do not have any such disambiguation rule yet for C (I seem to
> > recall bringing this up in one of our many discussions about the
> > syntax, and I *think* the rationale was because we didn't know of any
> > current ambiguities in C that would require the rule).
> >
> > FWIW, in C++ users can disambiguate themselves using parentheses,
> > e.g.:
> >
> >   int a [([HAL]() constexpr { return 12; }())]; // array, not attribute
>
> Indeed that would be a way for applications to clearly mark their
> intent.
>
> > > The high number of shift/reduce conflicts comes from the fact that
> > > there are already so many different possibilities for VLA, and even
> > > in two places (regular declarators and abstract declarators), and
> > > that the attribute also has two possibilities, namely also to start
> > > with a standard attribute. The worst I think is
> > >
> > > `int` `A` `[` `[` `deprecated`
> > >
> > > It could be introducing
> > >
> > > - a VLA where the bound will be given by a lambda
> > > (disambiguated by a following `=` or `,` token)
> > >
> > > - the sequence
> > >
> > > `int` `A` `[` `[` `deprecated` `]`
> > >
> > > which in turn could be introducing
> > >
> > > - a VLA where the bound will be given by a lambda
> > > (disambiguated by a following `(`, `[` or `{` token)
> > >
> > > - a deprecated `int` object `A` (disambiguated by a
> > > following `]` token)
> > >
> > > It is nowhere enshrined that C has to stay with a LALR grammar, but
> > > I think if we abandon that possibility we should at least make such
> > > a decision knowingly. What the examples above show
> > >
> > > - making attribute names keywords does not help much
> > > because of vendor specific attributes
> > >
> > > - the real culprit is the token sequence `[` `[` which
> > > introduces all of these conflicts
> > >
> > > For the latter, I tested introducing `[[` as a token for the start
> > > of attributes, and all the ambiguity disappears nicely. It has to be
> > > noted that this sequence cannot appear in a valid C17 program, so
> > > any change that we make for `[` `[` in a row does not impact
> > > existing C code. The only impact for users of C23 would be that
> > > when they want to use a lambda in an array bound (which is a new
> > > feature) they'd have to put spaces between the `[` `[`.
> >
> > C's maximal munch rule (6.4p4) would cause problems for
> > implementations that also support C-derivative languages like
> > Objective-C, where the [[ tokens appear *very* frequently due to the
> > message passing syntax that they use. We'd effectively have to "undo"
> > the formation of that token, similar to the mess we already have to go
> > through for undoing turning >> into > and > in some circumstances in
> > C++. In C++, this was pretty reasonable because the >> into two >
> > tokens only occurs in very specific contexts with declarations,
> > whereas [[ in Objective-C appears naturally as part of expressions
> > that get used much more frequently and so it's less clear to me how
> > palatable such a change would be.
>
> If we go like that, implementations that don't have these problems
> because they don't implement other languages with these double
> brackets could still use a `[[` token and map the token pair `[` `[`
> to that special token.

I think that could be a plausible implementation strategy.

> It is a bit user-unfriendly because, from basic experience with C++,
> people would probably assume that separating the two `[` should
> suffice to disambiguate.

One edge case to forming a single token is that it could technically
break code like:

[ // I don't know why
[attr]
] // anyone would ever do this,
int i; // but code formatting tools sometimes do awful things.

While this is a contrived example, a more problematic area I could
imagine cropping up is with tools like c-reduce where the fuzzing
nature of the mutations may lead to some surprisingly different parse
meanings between single and multiple tokens.
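
To make the edge case concrete, here is a minimal C++ sketch of a toy lexer. It is not taken from any real compiler: the function `tokenize` and its `fuse_double_bracket` flag are hypothetical names used only for illustration. With the flag set, two directly adjacent `[` characters are formed into one `[[` token, as maximal munch would require if `[[` were a punctuator, so a whitespace-separated spelling no longer produces that token.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy lexer (hypothetical): splits input into identifiers and
// single-character punctuators. When fuse_double_bracket is true, two
// directly adjacent '[' characters form one "[[" token. Comments and most
// of real C lexing are deliberately ignored.
std::vector<std::string> tokenize(const std::string& src,
                                  bool fuse_double_bracket) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens and is then discarded
        } else if (std::isalnum(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else if (fuse_double_bracket && c == '[' && i + 1 < src.size() &&
                   src[i + 1] == '[') {
            tokens.push_back("[[");  // longest match wins
            i += 2;
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}
```

With `fuse_double_bracket` set, `[[attr]]` begins with a single `[[` token, while the three-line spelling above yields two separate `[` tokens, which is exactly the difference in parse input at issue.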

> > > Doing so would introduce a surface incompatibility with C++. On the
> > > other hand, my guess would be that C++ had better have the same sort
> > > of disambiguation strategy, because now a called lambda can be an
> > > integer constant expression for them. So for C++ you could replace
> > > VLA above by array, and you'd be in the same sort of mess.
> >
> > C++ disambiguates differently and rather than using a new token that
> > C++ doesn't have, I'd hope that we could explore using the same
> > disambiguation strategy as C++ has already used because there's
> > significant implementation experience with the C++ formulation and
> > some known implementation concerns with the introduction of a new [[
> > token (at least for some C implementations).
>
> In this particular case C++ experience for the syntax is not so
> convincing, because the grammar concerning `[` is ultimately a bit
> different. We have different constructs with different properties
> (VLA, designators).

C++ has these same(ish) syntactic constructs (plus some more). You
don't need a VLA to hit the problem in C++, a constexpr lambda will
have the same ambiguity. Array designators are an interesting
C-specific problem area though.
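
A minimal sketch of the C++ side, assuming a C++17 compiler; the function name `bound_from_lambda` is made up for illustration. The immediately invoked constexpr lambda is a constant expression, so this declares an ordinary array rather than a VLA, and the parentheses keep the two `[` tokens from being adjacent.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the element count of an array whose bound
// is computed by an immediately invoked constexpr lambda (C++17).
constexpr std::size_t bound_from_lambda() {
    // The parentheses separate the array '[' from the lambda-introducer '['.
    // Without them the declarator would contain '[' '[' back to back, which
    // [dcl.attr.grammar] reserves for the start of an attribute-specifier.
    int a[([]() constexpr { return 12; }())] = {};
    return sizeof(a) / sizeof(a[0]);
}

static_assert(bound_from_lambda() == 12, "bound is a constant expression");
```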

> But if that is wanted I can add such a rule to the basic lambda paper.

At the very least, calling out the problem in the lambda paper would
be a great idea so that we don't forget we need to solve the problem
*somehow*. Hopefully we can get a good idea of how through the
reflectors so that the paper can propose a preferred approach.

Personally, I think the C++ approach with two tokens is a bit awkward
because https://eel.is/c++draft/dcl.attr.grammar#7 comes *awfully*
close to making [[ behave like a single token in practice, except that
whitespace is allowed between the square brackets. However, given how
commonly attributes appear in header files, I would want the same
parsing behavior in C and C++ code rather than "same-ish" parsing
behavior with some exceptions. Given that lambdas are expressions and
expressions can always be surrounded in parens to disambiguate from an
attribute, and the potential parsing complications for [[ vs [ [ for
some implementations, I think the current C++ formulation is my
preference even if I think it's a bit weird.
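
The C++ formulation can be pictured as a check on the token stream rather than on characters. The sketch below is a loose illustration in C++, assuming the lexer has already produced one token per `[`; `is_attribute_start` is a hypothetical helper, and it ignores the exception the C++ rule makes for attribute argument clauses as well as all the other context a real parser carries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical parser helper: under the C++ formulation, two consecutive
// '[' tokens introduce an attribute-specifier, however they were spaced in
// the source. Whitespace is already gone at this point, so "[[" and "[ ["
// look identical here.
bool is_attribute_start(const std::vector<std::string>& tokens,
                        std::size_t pos) {
    return pos + 1 < tokens.size() && tokens[pos] == "[" &&
           tokens[pos + 1] == "[";
}
```

A parenthesized lambda in an array bound puts a `(` between the two brackets, so a check of this kind does not fire for it.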

~Aaron

>
> Jens

Received on 2021-04-12 08:45:19