Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Alisdair Meredith <alisdairm_at_[hidden]>
Date: Thu, 28 May 2020 17:24:07 +0100
Not quite - my preference is to say nothing at all about normalization
in phase 1, and simply mandate a standard faithful UTF-8 to basic
source character mapping, that subsequent phases may decide has
illegal character sequences, or might choose to normalize, etc.

I do not want to impinge on vendors’ freedom to define the
implementation-defined mapping as they like, merely require additional support for exactly
one fully specified source-to-basic encoding. I feel that is a
relatively low-cost sell, that should have no backwards compatibility
concerns. Going further scares me for losing consensus for a
proposal that vendors would not like. It is also quite possible
that I am being too conservative ;)
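
A minimal sketch of what such a fixed UTF-8 mapping might look like, written as an illustration rather than as proposed wording. The function name `utf8_to_ucn` is hypothetical. As a simplification it passes every ASCII character through unchanged and does not reject overlong encodings or surrogate code points; it only shows the shape of the idea that each code point outside ASCII becomes one universal-character-name, with no normalization applied.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 and spell every code point above U+007F
// as a universal-character-name (\uXXXX or \UXXXXXXXX).
// Assumption: ASCII is copied through as-is; overlong forms and surrogates
// are not diagnosed in this sketch.
std::string utf8_to_ucn(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in the payload bits of each continuation byte (10xxxxxx).
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            char buf[11];  // backslash + 'U' + 8 hex digits + terminator
            if (cp <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            else
                std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Because the mapping works one code point at a time, a decomposed "e" followed by U+0301 stays as two elements and a precomposed U+00E9 stays as one; any decision about normalization is left to whatever consumes the result.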


> On May 28, 2020, at 17:19, Tom Honermann <tom_at_[hidden]> wrote:
> Ok, that sounds like your preference is to prohibit Unicode normalization during translation phase 1. I am not opposed to such a restriction. Note that such normalization could not occur afterwards without introducing a new translation phase. (In practice, implementations could always choose to provide such normalization under translation phase 1 by defining an implementation-defined "UTF-8-with-auto-normalization" encoding).
> Tom.
> On 5/28/20 12:13 PM, Alisdair Meredith wrote:
>> My suggestion will faithfully reproduce UTF-8 encoded source
>> using UCNs for anything not in the basic source character set.
>> Normalization would come after that, and should not be pertinent
>> at the level of my proposal, unless other work happening in
>> parallel already demands it.
>> AlisdairM
>>> On May 28, 2020, at 17:09, Tom Honermann via Core <core_at_[hidden]> wrote:
>>> On 5/28/20 9:48 AM, Corentin via Core wrote:
>>>> I realized this has further implications when the physical source is Unicode encoded.
>>>> Even when the mapping is restricted to a representation of the same abstract character, an implementation could, during phase 1, apply arbitrary Unicode normalization (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT and LATIN SMALL LETTER E WITH ACUTE are the same abstract character).
>>>> This has interesting ramifications for P1949, which makes non-NFC identifiers ill-formed.
>>> We discussed the possibilities of implementations choosing to NFC-normalize Unicode encoded source files during translation phase 1 at least once during the discussions of P1949. The conclusion was that it is ok to do so, but should be discouraged because there are legitimate use cases for programmers writing non-NFC-normalized text in string literals.
>>> Alisdair, if you proceed with a paper to restrict/specify translation phase 1 behavior for UTF-8 or other Unicode encoded source files, I think it would make sense to address whether Unicode normalization of any form should be prohibited, permitted, or required.
>>> Tom.
>>>> At the same time, I don't think we want to change the normalization of string literals when the physical source is Unicode encoded, but a normalization form has to be chosen when going from Unicode to non-Unicode (usually NFC).
>>>> So maybe we should specify that if the source encoding encodes the Unicode character set the mapping must be an identity function for each codepoint.
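
To make the normalization concern concrete, here is a small sketch showing that the two spellings of "é" mentioned above are different code point sequences, and what an identity mapping per code point would preserve. The names `decomposed_e_acute`, `precomposed_e_acute`, and `phase1_identity` are hypothetical and exist only for this illustration.

```cpp
#include <string>

// The same abstract character written two ways, as UTF-32 code points.
std::u32string decomposed_e_acute() {
    return U"e\u0301";  // U+0065 followed by U+0301 (the NFD spelling)
}

std::u32string precomposed_e_acute() {
    return U"\u00E9";   // U+00E9 alone (the NFC spelling)
}

// Hypothetical phase 1 mapping for Unicode-encoded source: every code point
// maps to itself, so neither spelling is rewritten into the other.
std::u32string phase1_identity(const std::u32string& source) {
    return source;
}
```

The two sequences have lengths 2 and 1 and compare unequal, so a string literal containing one of them observably changes if phase 1 normalizes. Under an identity mapping, a non-NFC identifier reaches later phases unmodified and can be diagnosed there.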
>>>> On Thu, 28 May 2020 at 14:50, Corentin <corentin.jabot_at_[hidden]> wrote:
>>>> Hello,
>>>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it is valid
>>>> for an implementation to remove trailing whitespace as part of the implementation-defined mapping described in translation phase 1. [lex.phases]
>>>> Is that the intent of the wording?
>>>> Should it be specified that this implementation-defined mapping must preserve the semantics of each abstract character present in the physical source file?
>>>> If not, is it valid for an implementation to perform arbitrary text transformations in phase 1, such as replacing "private" with "public" or replacing every "e" with a "z"?
>>>> Thanks,
>>>> Corentin
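
One place where removing trailing whitespace in phase 1 becomes observable is line splicing in phase 2, which only joins lines when a backslash immediately precedes the new-line. The sketch below assumes that scenario; `strip_trailing_whitespace` and `ends_with_splice` are hypothetical helpers, not a description of any particular compiler's implementation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical phase 1 step: drop trailing spaces and tabs from one
// physical source line (the line terminator is assumed already removed).
std::string strip_trailing_whitespace(std::string_view line) {
    const std::size_t last = line.find_last_not_of(" \t");
    if (last == std::string_view::npos)
        return std::string{};  // empty line, or nothing but spaces and tabs
    return std::string(line.substr(0, last + 1));
}

// Phase 2 splices this line onto the next one only if its final character
// is a backslash.
bool ends_with_splice(std::string_view line) {
    return !line.empty() && line.back() == '\\';
}
```

For a line such as `#define X 1 \` followed by two spaces, `ends_with_splice` is false on the raw text and true after `strip_trailing_whitespace`, so the choice made in phase 1 changes how many logical lines the program has.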
>>>> For reference, here is the definition of abstract character in Unicode 13:
>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>> Abstract character: A unit of information used for the organization, control, or representation of textual data.
>>>> • When representing data, the nature of that data is generally symbolic as
>>>> opposed to some other kind of data (for example, aural or visual). Examples of
>>>> such symbolic data include letters, ideographs, digits, punctuation, technical
>>>> symbols, and dingbats.
>>>> • An abstract character has no concrete form and should not be confused with a
>>>> glyph.
>>>> • An abstract character does not necessarily correspond to what a user thinks of
>>>> as a “character” and should not be confused with a grapheme.
>>>> • The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
>>>> • Abstract characters not directly encoded by the Unicode Standard can often be
>>>> represented by the use of combining character sequences.
>>>> _______________________________________________
>>>> Core mailing list
>>>> Core_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>> Link to this post: http://lists.isocpp.org/core/2020/05/9155.php
>>> _______________________________________________
>>> Core mailing list
>>> Core_at_[hidden]
>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>> Link to this post: http://lists.isocpp.org/core/2020/05/9159.php

Received on 2020-05-28 11:27:15