liaison: Re: [wg14/wg21 liaison] adding punctuator tokens

From: Tom Scogland <scogland1_at_[hidden]>
Date: Thu, 15 Apr 2021 09:47:05 -0700

Trying to stick to strictly technical issues, or at least to challenges
for that implementation: there are some codebases that use (or at least
have used) characters in these sets in ways that would likely no longer
work with translations such as the one you propose for code point x2237.
The first example that comes to mind, one that’s public and easy to
reference, is the bootstrap path for Go from C through its original
toolchain, where the middle dot character was used throughout the code
to make C look like it has module namespacing:
https://github.com/golang/go/blob/402d3590b54e4a0df9fb51ed14b2999e85ce0b76/src/pkg/runtime/chan.c#L155

If the middle dot becomes a period, or anything other than a valid
identifier character, that code will break. It is not a common
practice, but I’ve also seen the Pa (ᐸ) and Po (ᐳ) symbols used to
make generated function names “look like generics.”

That’s not necessarily to say that nothing should be done here, but
sadly there is existing code that could be broken by decisions in this
space. If it’s something the committees decide we want to do, it would
be worth learning from previous experiences: the somewhat successful,
somewhat painful one from Fortress and, more successfully and more
recently, the one from Julia, which allows Unicode characters almost
arbitrarily but assigns meanings to a good number of symbols in its
parser here:
https://github.com/JuliaLang/julia/blob/4996445df37e526dac2772e333caf82f1ea987f0/src/julia-parser.scm#L6

I was actually surprised to find that it doesn’t include anything for
Pa, Po or middle dot. It does, however, define the Proportion character
“∷” as a comparison operator, possibly because it’s from the
mathematical block, or possibly because its classic use along with
Ratio “∶” suggests expressions like a∶b ∷ c∶d to express, or perhaps
test, proportional ratios, rather than a separator or something
otherwise equivalent to two colons.

Honestly, I think it’s things like that, more than the technical
challenge of implementing it, that make this a hard problem. Deciding
what all of the characters should mean is not a trivial task, and it
frequently results in differing opinions.

-Tom

On 15 Apr 2021, at 1:43, Jens Gustedt via Liaison wrote:

> Hi everybody,
> unfortunately for all the discussions over the years that I tried to
> initiate about this subject, it seems that I never (even in recent
> discussions with some of you) had technical feedback why adding new
> punctuators to C (or C++) would not be possible.
>
> What I did get were "we don't like that", "I never would use that" or
> "over my dead body" comments, but (at least that is my impression) I
> never had somebody pointing at a real technical difficulty.
>
> So what is this all about? This is about moving C and C++ into the
> 21st century. We have roughly 32 years of standardized C, now. Imagine
> in 32 years from now people would still not be able to use normal
> technical characters in their preferred programming language.
>
> Also, this is not about forcing implementations that do not have these
> characters in their source or execution character set, to integrate
> them or to change anything. But this is about implementations that
> already have these characters in their extended source character set
> (in particular those that use UTF-8, UTF-16 or UTF-32) to accept these
> characters as punctuator tokens.
>
> The "only" thing that implementations that have such a character would
> have to add is to produce the correct punctuator token in translation
> phase 3. So for example if the extended source character set has the
> character `∷` (codepoint x2237) this should result in a token that is
> the same or equivalent to `::` (two adjacent colons). Whatever they do
> then for the rare case of stringification of these tokens I wouldn't
> care much, as long as they are consistent with themselves.
>
> There is no conflict with existing C implementations or C source code
> out there that I know of. (And my guess is that for C++ this should be
> the same.)
>
> - All implementations have to accept these characters already in
> their extended source and execution character set, because in
> string or character literals they can be entered with the
> `\u2237` or `\U00002237` notation. So at the worst,
> stringification of such a token will result in some weird `¿`
> character in a wide string, no big deal.
>
> - Codepoints that are punctuators are not allowed in identifiers,
> so currently these characters can only be used in (wide) string
> or character literals.
>
> A positive side effect for implementations that have these characters
> and for the users that use them would be that parsing of C and C++
> would become easier.
>
> - For C (and probably also C++) adding `⟦` would avoid the
> ambiguity that we are about to introduce between array bounds,
> attributes and lambdas.
>
> - For C++, having `‹` and `›` could resolve the ambiguity of `<`
> and `>` as relational operators and as template delimiters, or
> the even weirder semantic clash between two closing templates
> `››` and a `>>` token.
>
> So I am very eager to hear your opinion about *technical* difficulties
> with this, but I'd also very much appreciate if we could not expand
> this to a general culture war about personal preferences, other
> languages than English in sources, or the keys that you have on your
> keyboard.
>
> Thanks
> Jens
>
> --
> :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
> :: ::::::::::::::: office Strasbourg : +33 368854536 ::
> :: :::::::::::::::::::::: gsm France : +33 651400183 ::
> :: ::::::::::::::: gsm international : +49 15737185122 ::
> :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
> Link to this post: http://lists.isocpp.org/liaison/2021/04/0432.php

Received on 2021-04-15 11:47:15