liaison: Re: [wg14/wg21 liaison] adding punctuator tokens

From: Steve Downey <sdowney_at_[hidden]>
Date: Thu, 15 Apr 2021 14:16:20 -0400

Middle dot (U+00B7) is currently allowed in identifiers, and remains so in C++
with P1949, so it shouldn't be an issue. Proportion (U+2237) is part of
Mathematical Symbols rather than punctuation, and isn't in the identifier
list, so isn't going to conflict.
However, giving alternate spellings for code is inevitably going to cause
confusion. I don't believe we really need more ways of spelling the same
thing. Digraphs solve a narrow technical problem, but the most common use
is creating confusing compiler errors.
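
For reference, a minimal sketch of what digraph spelling looks like in
practice. The function names (`sum_plain`, `sum_digraph`, `check_digraphs`)
are made up for illustration. Both functions lex to the same tokens, which
is exactly why a stray `<:` in ordinary code can be surprising: before
C++11, `std::vector<::Foo>` was lexed with `<:` as `[`.

```cpp
#include <cassert>
#include <cstddef>

// Conventional spelling.
int sum_plain(const int (&a)[3]) {
    int s = 0;
    for (std::size_t i = 0; i < 3; ++i) { s += a[i]; }
    return s;
}

// Same function with digraphs: <% %> stand for { }, <: :> for [ ].
int sum_digraph(const int (&a)<:3:>) <%
    int s = 0;
    for (std::size_t i = 0; i < 3; ++i) <% s += a<:i:>; %>
    return s;
%>

// Both spellings produce identical tokens, so the results must agree.
void check_digraphs() {
    const int xs<:3:> = {1, 2, 3};
    assert(sum_plain(xs) == 6);
    assert(sum_digraph(xs) == 6);
}
```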

Typing difficulty aside, I'd also rather see the math symbols remain
available for use as potential new operators, or as an extension to create
new infix operations.
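
To make the identifier versus punctuator distinction concrete, here is a
toy sketch, not how any real compiler does it. It assumes a drastically
simplified lexer in which U+00B7 is an identifier character and U+2237 is
given the spelling `::`, as in the proposal under discussion. The names
`is_ident_char`, `toy_tokens` and `check_toy_tokens` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: ASCII letters, digits, '_' and U+00B7 (middle dot)
// count as identifier characters. Real rules (XID_Start/XID_Continue)
// are far larger and distinguish the first character.
bool is_ident_char(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= U'0' && c <= U'9') || c == U'_' || c == U'\u00B7';
}

// Hypothetical tokenizer: splits into identifiers and punctuators and
// spells U+2237 (proportion) as the existing "::" punctuator.
std::vector<std::u32string> toy_tokens(const std::u32string& src) {
    std::vector<std::u32string> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (is_ident_char(src[i])) {
            std::size_t start = i;
            while (i < src.size() && is_ident_char(src[i])) ++i;
            out.push_back(src.substr(start, i - start));
        } else if (src[i] == U'\u2237') {
            out.push_back(U"::");  // assumed mapping, per the proposal
            ++i;
        } else if (src[i] == U':' && i + 1 < src.size() && src[i + 1] == U':') {
            out.push_back(U"::");
            i += 2;
        } else if (src[i] == U' ') {
            ++i;  // skip spaces
        } else {
            out.push_back(std::u32string(1, src[i]));
            ++i;
        }
    }
    return out;
}

void check_toy_tokens() {
    // Middle dot stays inside one identifier, as in the Go runtime sources.
    assert(toy_tokens(U"runtime\u00B7chansend").size() == 1);
    // U+2237 and "::" yield the same token sequence in this toy model.
    assert(toy_tokens(U"a\u2237b") == toy_tokens(U"a::b"));
}
```

If U+00B7 were dropped from `is_ident_char`, the first input would split
into three tokens, which is the breakage described in the quoted message
below.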

https://isocpp.org/files/papers/P1949R7.html - C++ Identifier Syntax using
Unicode Standard Annex 31

On Thu, Apr 15, 2021 at 12:47 PM Tom Scogland via Liaison <
liaison_at_[hidden]> wrote:

> Trying to stick to strictly technical issues, or at least challenges to
> that implementation, there are some source codebases that use (or at
> least have used) characters in these sets in ways that likely would no
> longer function with translations such as the one you propose for code
> point x2237. The first example that comes to mind, that’s public and
> easy to reference, is in the bootstrap path for Go from C through its
> original toolchain, where the middle dot character was used throughout
> the code to allow C to look like it has module namespacing:
>
> https://github.com/golang/go/blob/402d3590b54e4a0df9fb51ed14b2999e85ce0b76/src/pkg/runtime/chan.c#L155
>
> If the middle dot becomes a period, or anything other than a valid
> identifier character, that code will break. This is not a common
> practice, but I’ve also seen the Pa (ᐸ) and Po (ᐳ) symbols used to
> make generated function names “look like generics.”
>
> That’s not to say necessarily that something shouldn’t be done here,
> but sadly there is existing code that could be broken by decisions in
> this space. If it’s something the committees decide we want to do, it
> would be worth learning from previous (somewhat successful, somewhat
> painful) experiences from Fortress and, more successfully and recently,
> from Julia, which allows Unicode characters almost arbitrarily but
> does assign meanings to a good number of symbols through its parser here:
>
> https://github.com/JuliaLang/julia/blob/4996445df37e526dac2772e333caf82f1ea987f0/src/julia-parser.scm#L6
>
> I was surprised to find it doesn’t include anything for Pa, Po or
> middle dot actually. It does however define the
> Proportion character “∷” as a comparison operator, possibly
> because it’s from the mathematical block or possibly because the
> classic use along with Ratio “∶” would suggest its use in
> expressions like a∶b ∷ c∶d to express, or perhaps test,
> proportional ratios rather than as a separator or otherwise equivalent
> to two colons.
>
> Honestly I think it’s things like that which make this a harder
> problem more than the technical challenge of implementing it. Deciding
> what all of the characters should mean is not a trivial task, and
> frequently results in differing opinions.
>
> -Tom
>
> On 15 Apr 2021, at 1:43, Jens Gustedt via Liaison wrote:
>
> > Hi everybody,
> > unfortunately for all the discussions over the years that I tried to
> > initiate about this subject, it seems that I never (even in recent
> > discussions with some of you) had technical feedback why adding new
> > punctuators to C (or C++) would not be possible.
> >
> > What I did get were "we don't like that", "I never would use that" or
> > "over my dead body" comments, but (at least that is my impression) I
> > never had somebody pointing at a real technical difficulty.
> >
> > So what is this all about? This is about moving C and C++ into the
> > 21st century. We have roughly 32 years of standardized C, now. Imagine
> > if, 32 years from now, people were still not able to use normal
> > technical characters in their preferred programming language.
> >
> > Also, this is not about forcing implementations that do not have these
> > characters in their source or execution character set, to integrate
> > them or to change anything. But this is about implementations that
> > already have these characters in their extended source character set
> > (in particular those that use UTF-8, UTF-16 or UTF-32) to accept these
> > characters as punctuator tokens.
> >
> > The "only" thing that implementations that have such a character would
> > have to add is to produce the correct punctuator token in translation
> > phase 3. So for example if the extended source character set has the
> > character `∷` (codepoint x2237) this should result in a token that is
> > the same or equivalent to `::` (two adjacent colons). Whatever they do
> > then for the rare case of stringification of these tokens I wouldn't
> > care much, as long as they are consistent with themselves.
> >
> > There is no conflict with existing C implementations or C source code
> > out there that I know of. (And my guess is that for C++ this should be
> > the same.)
> >
> > - All implementations have to accept these characters already in
> > their extended source and execution character set, because in
> > string or character literals they can be entered with the
> > `\u2237` or `\U00002237` notation. So at the worst,
> > stringification of such a token will result in some weird `¿`
> > character in a wide string, no big deal.
> >
> > - Codepoints that are punctuators are not allowed in identifiers,
> > so currently these characters can only be used in (wide) string
> > or character literals.
> >
> > A positive side effect for implementations that have these characters
> > and for the users that use them would be that parsing of C and C++
> > would become easier.
> >
> > - For C (and probably also C++) adding `⟦` would avoid the
> > ambiguity that we are about to introduce between array bounds,
> > attributes and lambdas.
> >
> > - For C++, having `‹` and `›` could resolve the ambiguity of `<`
> > and `>` as relational operators and as template delimiters, or
> > the even weirder semantic clash between two closing templates
> > `››` and a `>>` token.
> >
> > So I am very eager to hear your opinion about *technical* difficulties
> > with this, but I'd also very much appreciate if we could not expand
> > this to a general culture war about personal preferences, other
> > languages than English in sources, or the keys that you have on your
> > keyboard.
> >
> > Thanks
> > Jens
> >
> > --
> > :: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
> > :: ::::::::::::::: office Strasbourg : +33 368854536 ::
> > :: :::::::::::::::::::::: gsm France : +33 651400183 ::
> > :: ::::::::::::::: gsm international : +49 15737185122 ::
> > :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
> > _______________________________________________
> > Liaison mailing list
> > Liaison_at_[hidden]
> > Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
> > Link to this post: http://lists.isocpp.org/liaison/2021/04/0432.php
> _______________________________________________
> Liaison mailing list
> Liaison_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
> Link to this post: http://lists.isocpp.org/liaison/2021/04/0445.php
>

Received on 2021-04-15 13:16:41