Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 28 May 2020 12:19:53 -0400
Ok, that sounds like your preference is to prohibit Unicode
normalization during translation phase 1. I am not opposed to such a
restriction. Note that such normalization could not occur afterwards
without introducing a new translation phase. (In practice,
implementations could always choose to provide such normalization under
translation phase 1 by defining an implementation-defined
"UTF-8-with-auto-normalization" encoding).

Tom.

On 5/28/20 12:13 PM, Alisdair Meredith wrote:
> My suggestion will faithfully reproduce UTF-8 encoded source
> using UCNs for anything not in the basic source character set.
> Normalization would come after that, and should not be pertinent
> at the level of my proposal, unless other work happening in
> parallel already demands it.
>
> AlisdairM
>
>> On May 28, 2020, at 17:09, Tom Honermann via Core
>> <core_at_[hidden] <mailto:core_at_[hidden]>> wrote:
>>
>> On 5/28/20 9:48 AM, Corentin via Core wrote:
>>> I realized this has further implications when the physical source is
>>> Unicode encoded.
>>> Even restricting a mapping to a representation of the same
>>> abstract character, an implementation could, during phase 1, apply
>>> arbitrary Unicode normalization (LATIN SMALL LETTER E + COMBINING
>>> ACUTE ACCENT and LATIN SMALL LETTER E WITH ACUTE are the same
>>> abstract character).
>>> This has interesting ramifications for P1949, which makes non-NFC
>>> identifiers ill-formed.
>>
>> We discussed the possibilities of implementations choosing to
>> NFC-normalize Unicode encoded source files during translation phase 1
>> at least once during the discussions of P1949. The conclusion was
>> that it is ok to do so, but should be discouraged because there are
>> legitimate use cases for programmers writing non-NFC-normalized text
>> in string literals.
>>
>> Alisdair, if you proceed with a paper to restrict/specify translation
>> phase 1 behavior for UTF-8 or other Unicode encoded source files, I
>> think it would make sense to address whether Unicode normalization of
>> any form should be prohibited, permitted, or required.
>>
>> Tom.
>>
>>>
>>> At the same time, I don't think we want to change the normalization
>>> of string literals when the physical source is Unicode encoded, but
>>> a normalization form has to be chosen when going from Unicode to
>>> non-Unicode (usually NFC).
>>> So maybe we should specify that if the source encoding encodes the
>>> Unicode character set, the mapping must be an identity function for
>>> each code point.
>>>
>>> On Thu, 28 May 2020 at 14:50, Corentin <corentin.jabot_at_[hidden]
>>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>>>
>>> Hello,
>>>
>>> This GCC issue
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that
>>> it is valid
>>> for an implementation to remove trailing whitespace as part of
>>> the implementation-defined mapping described in
>>> translation phase 1. [lex.phases]
>>>
>>> Is that the intent of the wording?
>>> Should it be specified that this implementation-defined mapping
>>> should preserve the semantics of each abstract character present
>>> in the physical source file?
>>> If not, is it valid for an implementation to perform arbitrary text
>>> transformations in phase 1, such as replacing "private" by
>>> "public" or replacing every "e" by a "z"?
>>>
>>> Thanks,
>>>
>>> Corentin
>>>
>>>
>>> For reference here is the definition of abstract character in
>>> Unicode 13
>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>
>>> Abstract character: A unit of information used for the
>>> organization, control, or representation of textual data.
>>> • When representing data, the nature of that data is generally
>>> symbolic as
>>> opposed to some other kind of data (for example, aural or
>>> visual). Examples of
>>> such symbolic data include letters, ideographs, digits,
>>> punctuation, technical
>>> symbols, and dingbats.
>>> • An abstract character has no concrete form and should not be
>>> confused with a
>>> glyph.
>>> • An abstract character does not necessarily correspond to what
>>> a user thinks of
>>> as a “character” and should not be confused with a grapheme.
>>> • The abstract characters encoded by the Unicode Standard are
>>> known as Unicode abstract characters.
>>> • Abstract characters not directly encoded by the Unicode
>>> Standard can often be
>>> represented by the use of combining character sequences.
>>>
>>>
>>> _______________________________________________
>>> Core mailing list
>>> Core_at_[hidden]
>>> Subscription:https://lists.isocpp.org/mailman/listinfo.cgi/core
>>> Link to this post:http://lists.isocpp.org/core/2020/05/9155.php
>>
>>
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription:https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post:http://lists.isocpp.org/core/2020/05/9159.php
>
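
Editorial illustration (not part of the original thread): the concern about
normalization in string literals can be made concrete with a small C++
sketch. It assumes UTF-8 and spells both forms of "e with acute" as explicit
byte escapes, so the result does not depend on the source file encoding or
on anything an implementation does in translation phase 1. The function
names are hypothetical and chosen only for this example.

```cpp
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, as UTF-8 bytes.
// This is the precomposed spelling (the NFC form).
inline std::string precomposed_e_acute() { return "\xC3\xA9"; }

// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT,
// as UTF-8 bytes. This is the decomposed spelling (the NFD form).
inline std::string decomposed_e_acute() { return "e\xCC\x81"; }

// Hypothetical helper: compares the raw bytes of two strings.
// std::string comparison is bytewise and knows nothing about Unicode
// canonical equivalence.
inline bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Calling same_bytes(precomposed_e_acute(), decomposed_e_acute()) yields false:
the two spellings are canonically equivalent, yet they differ in both length
and content. If a phase 1 mapping normalized the source to NFC, a literal
that the programmer typed in decomposed form would silently take on the
precomposed bytes, which is the kind of observable change discussed above.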


Received on 2020-05-28 11:22:59