sg16: Re: [SG16] Redefining Lexing in terms of Unicode

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 28 May 2020 10:55:21 -0400

>
> Please also address all the uses of this term in the library section.
>>
>
> There would be no change, although the basic source character set appears
> in the library in a few places (it would be redefined as basic execution encoding)
>
At least allowing NUL where it was prohibited is probably not intended.

>
>
>>
>>> - Source character set is redefined as being the Unicode character
>>> set
>>>
>> It seems like we're encouraging homoglyph issues. Do we expect open
>> source projects to maintain coding guidelines that restrict characters
>> outside the ASCII range?
>>
>
> This change wouldn't modify the set of characters that can appear in a
> source file.
>
Let's not underestimate the impact of making things "first class citizens"
of the language where they were not such before.

>
>
>> It seems that encoding the stuff outside the basic source character set
>> as UCNs in headers is exactly how one would avoid per-header encoding
>> selection given the practical reality.
>>
>> The practical reality being that most encodings that are intermixed have
>> the same encoded value for most of the members of the basic source
>> character set.
>> Thus, we have the concept of a basic source character set (and we also
>> have digraphs and C's iso646.h).
>> Therefore, although not absolutely true, simply avoiding characters
>> outside the basic source character set (and those requiring digraphs, etc.)
>> is generally good enough for allowing headers to be included for
>> compilation with source specified (via command line, etc.) as being in
>> different encodings.
>>
>
> We could define a "portable subset". Interestingly, I don't think this is
> currently the case?
>
It's portable within "families" of encodings.
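
As a rough illustration of what "families" means here, a small sketch using
Python's built-in codecs (the sample string, the family lists, and the
helper name `same_bytes` are illustrative choices, not anything from the
standard or the proposal). It checks whether a handful of basic source
characters encode to the same bytes within an ASCII-derived family, within
an EBCDIC family, and across the two.

```python
# Characters from the basic source character set that do not need
# digraph or trigraph spellings (illustrative sample, not exhaustive).
SAMPLE = "abcxyzABCXYZ0123456789+-*/=<>();"

# Two assumed "families" of encodings, as named by Python's codec registry.
ASCII_FAMILY = ["ascii", "latin-1", "cp1252", "utf-8"]
EBCDIC_FAMILY = ["cp037", "cp500", "cp1140"]


def same_bytes(text, codecs):
    """Return True if every codec in `codecs` encodes `text` identically."""
    encoded = {text.encode(codec) for codec in codecs}
    return len(encoded) == 1


# Within a family, the sample characters have the same encoded values.
within_ascii = same_bytes(SAMPLE, ASCII_FAMILY)
within_ebcdic = same_bytes(SAMPLE, EBCDIC_FAMILY)

# Across families they do not.
across_families = same_bytes(SAMPLE, ASCII_FAMILY + EBCDIC_FAMILY)

# A character outside the basic set differs even inside the ASCII family.
non_basic = same_bytes("\u00e9", ["latin-1", "utf-8"])

# A character that has a digraph spelling differs inside the EBCDIC family.
bracket = same_bytes("[", ["cp037", "cp500"])

print(within_ascii, within_ebcdic, across_families, non_basic, bracket)
```

The last two checks correspond to the caveats in the quoted text: characters
outside the basic source character set, and those with digraph spellings,
are the ones that stop being safe even between closely related encodings.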

> As in the current wording does not prevent a physical character set that
> doesn't contain the letter "a", for example
>
Sure, for users that don't want to spell `char` or `template`...
Otherwise, the physical character set might be based on one that does not
contain the letter "a", but the compiler likely is (in effect) defining one
that does have "a".

> This change wouldn't modify the portability of headers.
>
Changing what users can or cannot do will change user behaviour, which can
change the portability of headers.

>
>
>> This mitigation for the problem you identified is not guaranteed. Lacking
>> such mitigation, developers would be forced by libraries to switch to
>> Unicode source even if they do not wish to.
>> (Okay, there is such a thing as compiler extensions for __asm__ names,
>> but their usability is limited).
>>
>
> Agreed. We need to establish whether universal character names are
> actually used in identifiers in production code
>
The body of production code in the world is not exactly something WG 21 has
full access to.

Received on 2020-05-28 09:58:45