sg16: Re: [SG16] Redefining Lexing in terms of Unicode

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 May 2020 17:08:29 +0200

On Thu, May 28, 2020, 16:55 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> Please also address all the uses of this term in the library section.
>>>
>>
>> There would be no change, although the basic source character set appears
>> in the library in a few places (it would be redefined as the basic
>> execution encoding).
>>
> At least allowing NUL where it was prohibited is probably not intended.
>

Sure, and I think being more explicit in the library would be an improvement.

>
>
>>
>>
>>>
>>>> - Source character set is redefined as being the Unicode character
>>>> set
>>>>
>>> It seems like we're encouraging homoglyph issues. Do we expect open
>>> source projects to maintain coding guidelines that restrict characters
>>> outside the ASCII range?
>>>
>>
>> This change wouldn't modify the set of characters that can appear in a
>> source file.
>>
> Let's not underestimate the impact of making things "first class citizens"
> of the language where they were not such before.
>

Do we really expect people to ever type \uxxxx in C++20?
They wouldn't be any more or less first-class citizens than they are today,
given that we would not be changing the requirements on which characters
must be supported by the physical character set.
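
For readers unfamiliar with the \uxxxx spelling, here is a minimal sketch of
what a universal-character-name looks like in a literal and in an identifier.
It is not taken from the thread: the names e_acute, utf8_units and read_cafe
are made up for illustration, and it assumes a compiler that accepts UCNs in
identifiers by default (recent GCC and Clang do).

```cpp
// Hypothetical illustration; none of these names come from the thread.
#include <cstddef>

// A universal-character-name (UCN) is spelled with basic source characters
// only, so this line reads the same under any ASCII-compatible encoding.
constexpr char32_t e_acute = U'\u00e9';  // U+00E9, e with acute accent
static_assert(e_acute == 0xE9, "a UCN names the code point directly");

// In a UTF-8 string literal the same UCN becomes two code units,
// followed by the terminating null.
constexpr std::size_t utf8_units = sizeof(u8"\u00e9") - 1;
static_assert(utf8_units == 2, "U+00E9 takes two UTF-8 code units");

// UCNs are also permitted in identifiers: this declares a variable whose
// name is "cafe" with an acute accent on the final letter.
constexpr int caf\u00e9 = 42;

// Reads the variable back through the same UCN spelling.
constexpr int read_cafe() { return caf\u00e9; }
```

A header written this way does not depend on the source encoding chosen on
the command line, at the cost of being harder to read than the directly
typed character.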

>
>
>>
>>
>>> It seems that encoding the stuff outside the basic source character set
>>> as UCNs in headers is exactly how one would avoid per-header encoding
>>> selection given the practical reality.
>>>
>>> The practical reality being that most encodings that are intermixed have
>>> the same encoded value for most of the members of the basic source
>>> character set.
>>> Thus, we have the concept of a basic source character set (and we also
>>> have digraphs and C's iso646.h).
>>> Therefore, although not absolutely true, simply avoiding characters
>>> outside the basic source character set (and those requiring digraphs, etc.)
>>> is generally good enough for allowing headers to be included for
>>> compilation with source specified (via command line, etc.) as being in
>>> different encodings.
>>>
>>
>> We could define a "portable subset". Interestingly, I don't think this is
>> currently the case?
>>
> It's portable within "families" of encodings.
>

This wouldn't change.

>
>
>> As in, the current wording does not prevent a physical character set that
>> doesn't contain the letter "a", for example.
>>
> Sure, for users that don't want to spell `char` or `template`...
> Otherwise, the physical character set might be based on one that does not
> contain the letter "a", but the compiler likely is (in effect) defining one
> that does have "a".
>
>
>> This change wouldn't modify the portability of headers.
>>
> Changing what users can or cannot do will change user behaviour, which can
> change the portability of headers.
>
>
>>
>>
>>> This mitigation for the problem you identified is not guaranteed.
>>> Lacking such mitigation, developers would be forced by libraries to switch
>>> to Unicode source even if they do not wish to.
>>> (Okay, there is such a thing as compiler extensions for __asm__ names,
>>> but their usability is limited).
>>>
>>
>> Agreed. We need to establish whether universal character names are
>> actually used in identifiers in production code.
>>
> The body of production code in the world is not exactly something WG 21
> has full access to.
>

Received on 2020-05-28 10:11:47