I realized this has further implications when the physical source is Unicode-encoded.

Even if we restrict the mapping to preserving each abstract character, an implementation could, during phase 1, apply arbitrary Unicode normalization (LATIN SMALL LETTER E + COMBINING ACUTE ACCENT and LATIN SMALL LETTER E WITH ACUTE are the same abstract character).
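To make the two spellings concrete, here is a small Python sketch using the standard unicodedata module. The variable names and the code_points helper are only for illustration; the point is that the two sequences differ code point by code point yet normalize to each other.

```python
import unicodedata

# Two spellings of "e with acute" (names chosen for illustration only).
decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute accent)
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

# The sequences are not identical code point by code point...
same_code_points = decomposed == precomposed

# ...but each normalization form maps one spelling onto the other.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(code_points(decomposed), code_points(precomposed))
print(same_code_points, nfc_equal, nfd_equal)
```

A phase 1 that is only required to preserve abstract characters could apply either normalization without the source author noticing.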

This has interesting ramifications for P1949, which makes non-NFC identifiers ill-formed.

At the same time, I don't think we want to change the normalization of string literals when the physical source is Unicode-encoded, but a normalization form has to be chosen when going from Unicode to non-Unicode (usually NFC).

So maybe we should specify that, if the source encoding encodes the Unicode character set, the mapping must be an identity function for each code point.
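As a rough illustration of why the identity requirement matters for string literals, the Python sketch below models phase 1 as a function from source text to source text. The names phase1_identity, phase1_nfc and is_code_point_identity are hypothetical, and the C++ line is just sample input, not compiled here.

```python
import unicodedata

def phase1_identity(source: str) -> str:
    """Hypothetical phase 1 mapping: every code point maps to itself."""
    return source

def phase1_nfc(source: str) -> str:
    """Hypothetical phase 1 mapping that normalizes the source to NFC."""
    return unicodedata.normalize("NFC", source)

def is_code_point_identity(mapping, source: str) -> bool:
    """True if mapping leaves the code point sequence of source unchanged."""
    return list(mapping(source)) == list(source)

# Sample source line: a UTF-8 string literal written with the decomposed
# spelling (U+0065 followed by U+0301).
source = 'const char s[] = u8"e\u0301";'

# Normalizing shrinks the UTF-8 encoding of the literal's contents by one
# byte (3 bytes for e + U+0301 versus 2 bytes for U+00E9).
byte_difference = len(source.encode("utf-8")) - len(phase1_nfc(source).encode("utf-8"))

print(is_code_point_identity(phase1_identity, source))
print(is_code_point_identity(phase1_nfc, source))
print(byte_difference)
```

Under the normalizing mapping the contents of the literal change, so the size of the array would differ between implementations; under the identity mapping it cannot.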

On Thu, 28 May 2020 at 14:50, Corentin <corentin.jabot@gmail.com> wrote:

Hello,

This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it is valid
for an implementation to remove trailing whitespace as part of the implementation-defined mapping described in translation phase 1 [lex.phases].

Is that the intent of the wording?
Should it be specified that this implementation-defined mapping must preserve the semantics of each abstract character present in the physical source file?
If not, is it valid for an implementation to perform arbitrary text transformations in phase 1, such as replacing "private" with "public" or replacing every "e" with a "z"?

Thanks,

Corentin

For reference, here is the definition of abstract character in Unicode 13:
http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212

Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as
opposed to some other kind of data (for example, aural or visual). Examples of
such symbolic data include letters, ideographs, digits, punctuation, technical
symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a
glyph.
• An abstract character does not necessarily correspond to what a user thinks of
as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
• Abstract characters not directly encoded by the Unicode Standard can often be
represented by the use of combining character sequences.