liaison: Re: [wg14/wg21 liaison] (SC22WG14.19230) grammar incompatibilities with lambdas

From: Aaron Ballman <aaron_at_[hidden]>
Date: Mon, 12 Apr 2021 07:59:03 -0400

Correcting the liaison email address so they're included.

On Sun, Apr 11, 2021 at 5:12 AM Jens Gustedt <jens.gustedt_at_[hidden]> wrote:
>
> Hello (WG and liaison),
> I ran some experiments integrating lambda expressions into the C
> grammar, namely the LALR parser for yacc and lex that can be found
> here:
>
> http://www.quut.com/c/ANSI-C-grammar-y-2011.html
>
> and I have some observations to share. (sorry, this will be a bit
> long)
>
> That grammar currently only has two shift/reduce conflicts, namely
> "dangling else" and the completely avoidable "_Atomic specifier" mess.
>
> First, to say that again, the lambda syntax is not my personal choice
> of preference, but I am trying to go there for compatibility reasons,
> only. I think introducing a new concept syntactically by reusing an
> existing punctuator token is one of the worst ideas that C++ had, and
> that they seem to be repeating with joy. This already started way back
> when `<` was reused for templates, but using `[` for attributes and
> lambdas has the same level of ⟦insert(your favorite curse)⟧.
>
> Lambda expressions already have an interesting incompatibility with
> designators in initializers compared to C17, which I only found out by
> the integration mentioned above. If we are in a world where
> (implementation-defined) some constant objects can also be ICE, such
> as
>
> int const something = 42;

<tangent>Perhaps I'm misunderstanding 6.6p10, but I don't believe that
gives you leave to turn this into an *integer constant expression*,
only a *constant expression*. ICE is a more specific term that has
further semantic meaning elsewhere, and I was not under the impression
implementations were allowed to define new kinds of integer constant
expressions. e.g., I think it's invalid for an implementation to
decide these are identical declarations: int foo[something]; and int
foo[42];</tangent>
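
For contrast, here is a minimal C++ sketch (not C) of the case Jens
mentions further down, where `something` is definitely an integral
constant expression. The names foo_named, foo_literal and
foo_named_length are made up for illustration; only `something` comes
from the quoted text.

```cpp
#include <cassert>      // assert, for the usage check below
#include <cstddef>
#include <type_traits>

// `something` mirrors the declaration quoted above.
const int something = 42;

// Hypothetical arrays: in C++ both are ordinary arrays, never VLAs.
int foo_named[something];
int foo_literal[42];

// In C++ the two declarations have the same type, int[42].
static_assert(std::is_same<decltype(foo_named), decltype(foo_literal)>::value,
              "both bounds are integral constant expressions in C++");

// Hypothetical helper: returns the element count of foo_named.
std::size_t foo_named_length() {
    return sizeof(foo_named) / sizeof(foo_named[0]);
}
```

Under these assumptions, assert(foo_named_length() == 42); holds. This
says nothing about what 6.6p10 permits a C implementation to do.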

> Then the token sequence
>
> `[` `something` `]`
>
> can be two things, namely either a designator or the start of a
> lambda. Both of these can appear in the same context in an
> initialization of an array.
>
> This already needs a lookahead that is 2 or 3 tokens to disambiguate,
> and integrating that into the parser already needs some lifting. (And
> if `something` is not an ICE for an implementation, they still must
> add a disambiguation rule, here.)
>
> The attempted introduction of designated initializers into C++ should
> produce the same ambiguity. But here `something` is definitively an
> ICE.
>
> If we add attributes to the picture, things really become
> "interesting". In the LALR grammar this adds about 20 shift/reduce
> conflicts. The token sequence
>
> `int` `A` `[` `[` `HAL`
>
> could be introducing a declaration of
>
> - a VLA where the bound will be given by a lambda (disambiguated
> by a following `]`, `=` or `,` token)
>
> - an `int` object with a vendor attribute for vendor "HAL" to
> the identifier `A` (disambiguated by a following `::` token)

I agree that introducing lambda syntax to C would cause a parsing
ambiguity there. C++ has the same ambiguity: the declaration could be a
regular array where the bound is given by a constexpr lambda, or an int
object with an attribute.

C++ makes this unambiguously an attribute per [dcl.attr.grammar]p7 and
we do not have any such disambiguation rule yet for C (I seem to
recall bringing this up in one of our many discussions about the
syntax, and I *think* the rationale was because we didn't know of any
current ambiguities in C that would require the rule).

FWIW, in C++ users can disambiguate themselves using parentheses, e.g.,

  int a [([HAL]() constexpr { return 12; }())]; // array, not attribute
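
A minimal sketch of both readings side by side, assuming C++17 or
later. The function name bound_from_lambda and the value given to HAL
are made up for illustration, and the lambda returns HAL rather than
the literal 12 so that the capture is actually used.

```cpp
#include <cassert>   // assert, for the usage check below
#include <cstddef>

// Hypothetical helper: returns the number of elements of an array whose
// bound is computed by an immediately invoked constexpr lambda.
std::size_t bound_from_lambda() {
    constexpr int HAL = 12;  // assumed value, not from the original mail

    // The parenthesis separates the two `[` tokens, so this declares an
    // array whose bound is the result of calling the lambda.
    int a[([HAL]() constexpr { return HAL; }())];

    // Two adjacent `[` tokens are always read as the start of an attribute.
    int b [[maybe_unused]] = 0;

    return sizeof(a) / sizeof(a[0]);
}
```

With these assumptions, assert(bound_from_lambda() == 12); should hold.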

> The high number of shift/reduce conflicts comes from the fact that
> there are already so many different possibilities for VLA, and even in
> two places (regular declarators and abstract declarators), and that
> the attribute also has two possibilities, namely also to start with a
> standard attribute. The worst I think is
>
> `int` `A` `[` `[` `deprecated`
>
> It could be introducing
>
> - a VLA where the bound will be given by a lambda (disambiguated
> by a following `=` or `,` token)
>
> - the sequence
>
> `int` `A` `[` `[` `deprecated` `]`
>
> which in turn could be introducing
>
> - a VLA where the bound will be given by a lambda
> (disambiguated by a following `(`, `[` or `{` token)
>
> - a deprecated `int` object `A` (disambiguated by a
> following `]` token)
>
> It is nowhere enshrined that C has to stay with a LALR grammar, but I
> think if we abandon that possibility we should at least make such a
> decision knowingly. What the examples above show
>
> - making attribute names keywords does not help much because
> of vendor specific attributes
>
> - the real culprit is the token sequence `[` `[` which
> introduces all of these conflicts
>
> For the latter, I tested introducing `[[` as a token for the start of
> attributes, and all the ambiguity disappears nicely. It has to be
> noted that this sequence cannot appear in a valid C17 program, so any
> change that we make for `[` `[` in a row does not impact existing C
> code. The only impact for users of C23 would be that when they want to
> use a lambda in an array bound (which is a new feature) they'd have to
> put spaces between the `[` `[`.

C's maximal munch rule (6.4p4) would cause problems for
implementations that also support C-derivative languages like
Objective-C, where the [[ tokens appear *very* frequently due to the
message passing syntax that they use. We'd effectively have to "undo"
the formation of that token, similar to the mess we already have to go
through for undoing turning >> into > and > in some circumstances in
C++. In C++, this was pretty reasonable because the >> into two >
tokens only occurs in very specific contexts with declarations,
whereas [[ in Objective-C appears naturally as part of expressions
that get used much more frequently and so it's less clear to me how
palatable such a change would be.
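
To make the >> precedent concrete, here is a small C++ sketch
(C++11 or later). The function name shift_and_nested_size and the
values used are made up for illustration.

```cpp
#include <cassert>   // assert, for the usage check below
#include <cstddef>
#include <vector>

// Hypothetical helper showing the two readings of the characters `>>`.
std::size_t shift_and_nested_size() {
    // Maximal munch lexes `>>` as one token; since C++11 it is treated as
    // two `>` when it closes nested template argument lists, as here.
    std::vector<std::vector<int>> grid(3, std::vector<int>(4));

    // In an ordinary expression the same characters stay a right shift.
    int shifted = 256 >> 4;  // 16

    return grid.size() + static_cast<std::size_t>(shifted);  // 3 + 16
}
```

Here assert(shift_and_nested_size() == 19); should hold. The split only
happens inside template argument lists, which is the "very specific
contexts" point above; a [[ token would need the same kind of undoing
inside ordinary Objective-C expressions.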

> Doing so would introduce a surface incompatibility with C++. On the
> other hand, my guess would be that C++ better have the same sort of
> disambiguation strategy, because now a called lambda can be an integer
> constant expression for them. So for C++ you could replace VLA above
> by array, and you'd be in the same sort of mess.

C++ disambiguates differently and rather than using a new token that
C++ doesn't have, I'd hope that we could explore using the same
disambiguation strategy as C++ has already used because there's
significant implementation experience with the C++ formulation and
some known implementation concerns with the introduction of a new [[
token (at least for some C implementations).

~Aaron

>
> Jens
>
> --
> :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
> :: ::::::::::::::: office Strasbourg : +33 368854536 ::
> :: :::::::::::::::::::::: gsm France : +33 651400183 ::
> :: ::::::::::::::: gsm international : +49 15737185122 ::
> :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

Received on 2021-04-12 06:59:18