sg16: Re: [SG16] Redefining Lexing in terms of Unicode

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 28 May 2020 17:22:21 -0400

On 5/28/20 11:08 AM, Corentin via SG16 wrote:
>
>
> On Thu, May 28, 2020, 16:55 Hubert Tong
> <hubert.reinterpretcast_at_[hidden]
> <mailto:hubert.reinterpretcast_at_[hidden]>> wrote:
>
> Please also address all the uses of this term in the
> library section.
>
>
> There would be no change, although the basic source character set
> appears in the library in a few places (it would be redefined as
> basic execution encoding)
>
> At least allowing NUL where it was prohibited is probably not
> intended.
>
>
> Sure, and I think being more explicit in the library would be an improvement.
>
> * Source character set is redefined as being the
> Unicode character set
>
> It seems like we're encouraging homoglyph issues. Do we
> expect open source projects to maintain coding guidelines
> that restrict characters outside the ASCII range?
>
>
> This change wouldn't modify the set of characters that can
> appear in a source file.
>
> Let's not underestimate the impact of making things "first class
> citizens" of the language where they were not such before.
>
>
> Do we really expect people to ever type \uxxxx in C++20?

What has changed in C++20 that would negate prior motivation for the
feature?

Tom.

> They wouldn't be any more or less first-class citizens than they are
> today, given that we would not be changing the requirements on which
> characters must be supported by the physical character set.
>
> It seems that encoding the stuff outside the basic source
> character set as UCNs in headers is exactly how one would
> avoid per-header encoding selection given the practical
> reality.
>
> The practical reality being that most encodings that are
> intermixed have the same encoded value for most of the
> members of the basic source character set.
> Thus, we have the concept of a basic source character set
> (and we also have digraphs and C's iso646.h).
> Therefore, although not absolutely true, simply avoiding
> characters outside the basic source character set (and
> those requiring digraphs, etc.) is generally good enough
> for allowing headers to be included for compilation with
> source specified (via command line, etc.) as being in
> different encodings.
>
>
> We could define a "portable subset". Interestingly, I don't
> think this is currently the case?
>
> It's portable within "families" of encodings.
>
>
> This wouldn't change
>
> As in the current wording does not prevent a physical
> character set that doesn't contain the letter "a", for example
>
> Sure, for users that don't want to spell `char` or `template`...
> Otherwise, the physical character set might be based on one that
> does not contain the letter "a", but the compiler likely is (in
> effect) defining one that does have "a".
>
> This change wouldn't modify the portability of headers.
>
> Changing what users can or cannot do will change user behaviour,
> which can change the portability of headers.
>
> This mitigation for the problem you identified is not
> guaranteed. Lacking such mitigation, developers would be
> forced by libraries to switch to Unicode source even if
> they do not wish to.
> (Okay, there is such a thing as compiler extensions for
> __asm__ names, but their usability is limited).
>
>
> Agreed. We need to establish whether universal character names
> are actually used in identifiers in production code
>
> The body of production code in the world is not exactly something
> WG 21 has full access to.
>
>

Received on 2020-05-28 16:25:31