I don't know what the intent is for source input, but for I/O the C standard says:

> Data read in from a text stream will necessarily compare equal to the data that were earlier
> written out to that stream only if: the data consist only of printing characters and the control
> characters horizontal tab and new-line; no new-line character is immediately preceded by
> space characters; and the last character is a new-line character. Whether space characters
> that are written out immediately before a new-line character appear when read in is
> implementation-defined.

and I seem to remember implementations (VMS, IBM mainframes?) that use fixed-length records padded with spaces as their text format and thus cannot distinguish intended from unintended spaces at the end of a line. If we want to cater for them, we have to allow implementations to ignore spaces before the end of a line. Whether they still matter to us is an open question. For me, VMS left my world 25 years ago, and I've never been involved with mainframes.
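To illustrate the problem, here is a minimal sketch of such a text format. It is not modelled on any real system: the record length and the names `kRecordLength`, `to_record` and `from_record` are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record length, chosen only for this example.
constexpr std::size_t kRecordLength = 16;

// Store one line as a fixed-length record, padded with spaces.
// Lines longer than the record are truncated in this sketch.
std::string to_record(const std::string& line) {
    std::string record = line.substr(0, kRecordLength);
    record.resize(kRecordLength, ' ');
    return record;
}

// Read a record back as a line. Padding cannot be told apart from spaces
// the author typed, so every trailing space is dropped.
std::string from_record(const std::string& record) {
    const std::size_t last = record.find_last_not_of(' ');
    return last == std::string::npos ? std::string() : record.substr(0, last + 1);
}
```

With these definitions, `from_record(to_record("int x;   "))` and `from_record(to_record("int x;"))` both yield `"int x;"`, so the trailing spaces of the first line are lost on the round trip.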

Yours,

-- Jean-Marc

From: Core <core-bounces@lists.isocpp.org> on behalf of Corentin via Core <core@lists.isocpp.org>
Sent: Thursday, May 28, 2020 14:50
To: C++ Core Language Working Group
Cc: Corentin; SG16; Alisdair Meredith
Subject: [SPAM] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

Hello,

This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it is valid for an implementation to remove trailing whitespace as part of the implementation-defined mapping described in translation phase 1 [lex.phases].

Is that the intent of the wording?

Should it be specified that this implementation-defined mapping must preserve the semantics of each abstract character present in the physical source file?

If not, is it valid for an implementation to perform arbitrary text transformations in phase 1, such as replacing "private" with "public" or every "e" with "z"?
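To make the concern concrete, here is a rough sketch of how removing trailing whitespace in phase 1 can change the meaning of a program. It is my own simulation, not GCC's code: `strip_trailing_whitespace` is a hypothetical phase 1 mapping, and `splice_lines` mimics the phase 2 rule that a backslash immediately followed by a new-line is deleted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical phase 1 mapping: drop spaces and tabs that directly
// precede a new-line character.
std::string strip_trailing_whitespace(const std::string& source) {
    std::string out;
    std::string pending;  // whitespace not yet known to be trailing
    for (char c : source) {
        if (c == ' ' || c == '\t') {
            pending += c;
        } else {
            if (c != '\n') out += pending;  // interior whitespace is kept
            pending.clear();
            out += c;
        }
    }
    out += pending;  // whitespace at end of input with no new-line is kept
    return out;
}

// Phase 2 line splicing: delete each backslash that is immediately
// followed by a new-line, together with that new-line.
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the new-line as well
        } else {
            out += source[i];
        }
    }
    return out;
}
```

Take the source text `// note \` followed by one space and a new-line, then `int x = 1;`. Without the stripping step, `splice_lines` leaves it unchanged and the declaration survives. With the stripping step applied first, the backslash ends up directly before the new-line, the two lines are spliced, and `int x = 1;` becomes part of the comment.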

Thanks,

Corentin

For reference, here is the definition of abstract character in Unicode 13:

http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212

Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a glyph.
• An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
• Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.