liaison: Re: [wg14/wg21 liaison] (SC22WG14.19230) grammar incompatibilities with lambdas

From: Uecker, Martin <Martin.Uecker_at_[hidden]>
Date: Sun, 11 Apr 2021 11:15:22 +0000

Am Sonntag, den 11.04.2021, 11:12 +0200 schrieb Jens Gustedt:
> Hello (WG and liaison),
> I made some experiment of integrating lambda expressions into the C
> grammar, namely the LALR parser for yacc and lex that can be found
> here:
>
> http://www.quut.com/c/ANSI-C-grammar-y-2011.html
>
> and I have some observations to share. (sorry, this will be a bit
> long)
>
> That grammar currently only has two shift/reduce conflicts, namely
> "dangling else" and the completely avoidable "_Atomic specifier" mess.
>
> First, to say that again, the lambda syntax is not my personal choice
> of preference, but I am trying to go there for compatibility reasons,
> only. I think introducing a new concept syntactically by reusing an
> existing punctuator token is one of the worst ideas that C++ had, and
> that they seem to be repeating with joy. This already started way back
> when `<` was reused for templates, but using `[` for attributes and
> lambdas has the same level of ⟦insert(your favorite curse)⟧.

I think the C++ compilers already work completely different
in the FE than many C compilers. We really can not consider
features of the C++ grammar "implementation experience" in
the context of C.

> Lambda expressions already have a interesting incompatibility with
> designators in initializers compared to C17, which I only found out by
> the integration mentioned above. If we are in a world where
> (implementation-defined) some constant objects can also be ICE, such
> as
>
> int const something = 42;
>
> Then the token sequence
>
> `[` `something` `]`
>
> can be two things, namely either a designator or the start of a
> lambda. Both of these can appear in the same context in an
> initialization of an array.
>
> This already needs a lookahead that is 2 or 3 tokens to disambiguate,
> and integrating that into the parser already needs some lifting. (And
> if `something` is not an ICE for an implementation, they still must
> add a disambiguation rule, here.)

With type-generic lambdas, it will also not be clear what is an ICE
and what not during parsing.

> The attempted introduction of designated intitializers into C++ should
> produce the same ambiguity. But here `something` is definitively an
> ICE.
>
> If we add attributes to the picture, things really become
> "interesting". In the LALR grammar this adds about 20 shift/reduce
> conflicts. The token sequence
>
> `int` `A` `[` `[` `HAL`
>
> could be introducing a declaration of
>
> - a VLA where the bound will be given by a lambda (disabiguated
> by a following `]`, `=` or `,` token)
>
> - an `int` object with a vendor attribute for vendor "HAL" to
> the identifier `A` (disambiguated by a following `∷` token)
>
> The high amount of shift/reduce conflicts come from the fact that
> there are already so many different possibilities for VLA, and even in
> two places (regular declarators and abstract declarators), and that
> the attribute also has two possibilities, namely also to start with a
> standard attribute. The worst I think is
>
> `int` `A` `[` `[` `deprecated`
>
> It could be introducing
>
> - a VLA where the bound will be given by a lambda (disabiguated
> by a following `=` or `,` token)
>
> - the sequence
>
> `int` `A` `[` `[` `deprecated` `]`
>
> which in turn could be introducing
>
> - a VLA where the bound will be given by a lambda
> (disabiguated by a following `(`, `[` or `{` token)
>
> - a deprecated `int` object `A` (disambiguated by a
> following `]` token)
>
> It is nowhere enshrined that C has to stay with a LALR grammar, but I
> think if we abandon that possibility we should at least make such a
> decision knowingly. What the examples above show
>
> - making attribute names keywords does not help much because
> of vendor specific attributes
>
> - the real culprit is the token sequence `[` `[` which
> introduces all of these conflicts
>
> For the latter, I tested to introduce `[[` as a token for the start of
> attributes, and all the ambiguity disappears nicely. It has to be
> noted that this sequence cannot appear in a valid C17 program, so any
> change that we make for `[` `[` in a row does not impact existing C
> code. The only impact for users of C23 would be that when they want to
> use a lambda in an array bound (which is a new feature) they'd have to
> put spaces between the `[` `[`.
>
> Doing so would introduce a surface incompatibilty with C++. On the
> other hand, my guess would be that C++ better have the same sort of
> disambiguation strategy, because now a called lambda can be a integer
> constant expression for them. So for C++ you could replace VLA above
> by array, and you'd be in the same sort of mess.

My main concern with all this is less about some finite look ahead
but something else:

In C is is possible to do type checking during parsing and directly
produce an intermediate code (or in theory even machine code).
There is no need to have an AST and some compiler do not have this.
Without goto we could go more or less directly to SSA.

With type generic lambdas this is not possible anymore. You then
need to be able to parse arbitrary code without type information
and perform the type checking later when the lambda is specialized.

   [](auto x){ ... arbitrary c code with unknown types ... }

What is an ICE may also depends on type and folding then also
needs to be delayed.

At the moment only the preprocessor moves tokens around and
parsing, type checking, and ICE folding can be done in one pass
with limited look ahead.

Note, that this is a superficious issue introduced by the lambda
syntax. As the use case for type-generic lambdas is that they
are produced by a macro are instantiated directly, this could
be done differenty, i.e. the statement expressions of GCC do
not have this problem:

#define foo(x) \
({ \
  auto y = (x); // type generic argument
  ... arbitrary code using y ... \
})

After expnading the macro the code can be parsed with full type
information because the argument is inserted by the preprocessor
at a position where it can be parsed frst.

In contrast, with type-generic lambdas:

#define foo [](auto y){ \
  ... arbitrary code using y ... \
}

the code needs to be parsed without knowing the type of 'y'
and only then is the argument is parsed which yields this
information.

We already have many compiler which are at C99 and only slowly
catching up. I would rather avoid loosing part of the community
(which is underrepresented in WG14) by adding the requirement
to rewrite the FE for C23.

Best,
Martin

Received on 2021-04-11 06:15:40