Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 May 2020 18:30:40 +0200
On Thu, May 28, 2020, 18:24 Alisdair Meredith <alisdairm_at_[hidden]> wrote:

> Not quite - my preference is to say nothing at all about normalization
> in phase 1, and simply mandate a standard faithful UTF-8 to basic
> source character mapping, that subsequent phases may decide has
> illegal character sequences, or might choose to normalize, etc.

Doing normalization in phase 1 would qualify as "standard faithful utf8 to
basic source character mapping".
If we want normalization not to happen it would need to be explicitly

And that is even if phase 1 was modified to not permit any arbitrary

> I do not want to impinge on vendors’ freedom to implementation
> define as they like, merely require additional support for exactly
> one fully specified source-to-basic encoding. I feel that is a
> relatively low-cost sell, that should have no backwards compatibility
> concerns. Going further scares me for losing consensus for a
> proposal that vendors would not like. It is also quite possible
> that I am being too conservative ;)
> AlisdairM
> On May 28, 2020, at 17:19, Tom Honermann <tom_at_[hidden]> wrote:
> Ok, that sounds like your preference is to prohibit Unicode normalization
> during translation phase 1. I am not opposed to such a restriction. Note
> that such normalization could not occur afterwards without introducing a
> new translation phase. (In practice, implementations could always choose
> to provide such normalization under translation phase 1 by defining an
> implementation-defined "UTF-8-with-auto-normalization" encoding).
> Tom.
> On 5/28/20 12:13 PM, Alisdair Meredith wrote:
> My suggestion will faithfully reproduce UTF-8 encoded source
> using UCNs for anything not in the basic source character set.
> Normalization would come after that, and should not be pertinent
> at the level of my proposal, unless other work happening in
> parallel already demands it.
> AlisdairM
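To make the mapping suggested above concrete, here is a minimal sketch, assuming the pre-C++23 basic source character set (91 graphical characters plus space, tab, vertical tab, form feed and new-line). The names `is_basic_source_char` and `utf8_to_ucn` are hypothetical and come from neither the standard nor any proposal. The sketch decodes UTF-8 and spells every code point outside the basic set as a UCN, applying no normalization.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: is c a member of the (pre-C++23) basic source character set?
bool is_basic_source_char(char32_t c) {
    if (c == U' ' || c == U'\t' || c == U'\v' || c == U'\f' || c == U'\n') return true;
    if (c >= 0x80) return false;
    static const std::string graphical =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return graphical.find(static_cast<char>(c)) != std::string::npos;
}

// Hypothetical helper: decode UTF-8 and emit basic characters unchanged,
// everything else as \uXXXX or \UXXXXXXXX. Code points are mapped one-to-one;
// no normalization is performed. Throws on malformed input.
std::string utf8_to_ucn(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        assert(len >= 1 && len <= 4);
        if (i + len > in.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point encoding");
        i += len;
        if (is_basic_source_char(cp)) {
            out.push_back(static_cast<char>(cp));
        } else {
            char buf[11];  // "\UXXXXXXXX" plus terminating NUL
            if (cp <= 0xFFFF) std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(cp));
            else              std::snprintf(buf, sizeof buf, "\\U%08X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Under this sketch, a precomposed U+00E9 in the input comes out as `\u00E9`, while `e` followed by U+0301 comes out as `e\u0301`. The two spellings stay distinct, so any normalization would have to be a separate, explicit step.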
> On May 28, 2020, at 17:09, Tom Honermann via Core <core_at_[hidden]>
> wrote:
> On 5/28/20 9:48 AM, Corentin via Core wrote:
> I realized this has further implications when the physical source is
> Unicode encoded.
> Even restricting a mapping to a representation of the same
> abstract character, an implementation could during phase 1, apply arbitrary
> LETTER E WITH ACUTE are the same abstract character).
> This has interesting ramifications for P1949, which makes non-NFC
> identifiers ill-formed.
> We discussed the possibilities of implementations choosing to
> NFC-normalize Unicode encoded source files during translation phase 1 at
> least once during the discussions of P1949. The conclusion was that it is
> ok to do so, but should be discouraged because there are legitimate use
> cases for programmers writing non-NFC-normalized text in string literals.
> Alisdair, if you proceed with a paper to restrict/specify translation
> phase 1 behavior for UTF-8 or other Unicode encoded source files, I think
> it would make sense to address whether Unicode normalization of any form
> should be prohibited, permitted, or required.
> Tom.
> At the same time I don't think we want to change the normalization of
> string literals when the physical source is Unicode encoded, but a
> normalization form has to be chosen when going from Unicode to non Unicode
> (usually NFC)
> So maybe we should specify that if the source encoding encodes the Unicode
> character set the mapping must be an identity function for each codepoint.
> On Thu, 28 May 2020 at 14:50, Corentin <corentin.jabot_at_[hidden]> wrote:
>> Hello,
>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>> that it is valid
>> for an implementation to remove trailing whitespaces as part of the
>> implementation defined mapping described in translation phase 1.
>> [lex.phases]
>> Is it the intent of that wording?
>> Should it be specified that this implementation-defined mapping should
>> preserve the semantics of each abstract character present in the physical
>> source file?
>> If not, is it a valid implementation to perform arbitrary text
>> transformation in phase 1 such as replacing "private" by "public" or
>> replacing all "e" by a "z" ?
>> Thanks,
>> Corentin
>> For reference here is the definition of abstract character in Unicode 13
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>> Abstract character: A unit of information used for the organization,
>> control, or representation of textual data.
>> • When representing data, the nature of that data is generally symbolic as
>> opposed to some other kind of data (for example, aural or visual).
>> Examples of such symbolic data include letters, ideographs, digits,
>> punctuation, technical symbols, and dingbats.
>> • An abstract character has no concrete form and should not be confused
>> with a glyph.
>> • An abstract character does not necessarily correspond to what a user
>> thinks of as a “character” and should not be confused with a grapheme.
>> • The abstract characters encoded by the Unicode Standard are known as
>> Unicode abstract characters.
>> • Abstract characters not directly encoded by the Unicode Standard can
>> often be represented by the use of combining character sequences.
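As an illustration of why normalization matters here, the sketch below spells "e with acute" in two ways: as the single code point U+00E9, and as `e` followed by the combining character U+0301. The function names are hypothetical. It only shows that the two spellings differ as code point sequences and does not perform any normalization itself.

```cpp
#include <cassert>
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE as one precomposed code point (NFC form).
std::u32string precomposed_e_acute() { return U"\u00E9"; }

// The same abstract character as a combining sequence:
// U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
std::u32string decomposed_e_acute() { return U"e\u0301"; }

// Plain code-point-wise comparison, with no normalization applied.
bool same_code_points(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

// Checks that the two spellings differ in length and content.
void check_spellings_differ() {
    assert(precomposed_e_acute().size() == 1);
    assert(decomposed_e_acute().size() == 2);
    assert(!same_code_points(precomposed_e_acute(), decomposed_e_acute()));
}
```

A phase 1 mapping that is the identity on code points keeps these two spellings distinct in string literals. A mapping that applies NFC would collapse the second into the first.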
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9155.php
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9159.php

Received on 2020-05-28 11:33:58