On 5/28/20 6:13 PM, Richard Smith via SG16 wrote:
On Thu, May 28, 2020 at 1:29 PM Tom Honermann <tom@honermann.net> wrote:
On 5/28/20 3:43 PM, Richard Smith via Core wrote:
On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot@gmail.com> wrote:
On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith@google.com> wrote:
On Thu, 28 May 2020, 05:50 Corentin via Core, <core@lists.isocpp.org> wrote:
Hello, 

This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it is valid
for an implementation to remove trailing whitespace as part of the implementation-defined mapping described in translation phase 1 ([lex.phases]).

Is that the intent of the wording?
Should it be specified that this implementation-defined mapping must preserve the semantics of each abstract character present in the physical source file?
If not, is it a valid implementation to perform arbitrary text transformations in phase 1, such as replacing "private" with "public", or replacing every "e" with a "z"?

Yes, that is absolutely valid and intended today. We intentionally permit trigraph replacement here, as agreed by EWG. And implementations take advantage of this in other ways too; Clang (for example) replaces Unicode whitespace with spaces (outside of literals) in this phase.

... also, there is no guarantee that the source file is even originally text in any meaningful way before this implementation-defined mapping. A valid implementation could perform OCR on image files and go straight from PNG to a sequence of basic source characters.


The problem is that "the compiler can do absolutely anything in phase 1" prevents us from:

I am also concerned that this reduces portability (the same file can be read completely differently by different implementations and, as Alastair pointed out, this causes a real issue for trailing whitespace).

I think there are separate questions here:

* Should a conforming implementation be required to accept source code represented as text files encoded in UTF-8?
* Should a conforming implementation be permitted to accept other things, and if so, how arbitrary is that choice?

I'm inclined to think the answer to the first question should be yes. We should have some notion of a portable C++ source file, and without a known fixed encoding it's hard to argue that such a thing exists. For that encoding we should agree on the handling of trailing whitespace etc. (though I think ignoring it outside of literals, as Clang and GCC do, is the right thing -- source code that has the same appearance should have the same behaviour).
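As a concrete illustration of what "required to accept UTF-8" could mean mechanically, here is a minimal sketch (the function name and shape are mine, not from any proposal or implementation) of the well-formedness check a phase-1 implementation might run before applying any further mapping:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Minimal sketch of a UTF-8 well-formedness check. If a conforming
// implementation were required to accept UTF-8 source, phase 1 could start
// by rejecting ill-formed byte sequences rather than applying an arbitrary
// implementation-defined mapping to them.
bool is_valid_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = bytes[i];
        std::size_t len;
        std::uint32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1Fu; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0Fu; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07u; }
        else return false;  // stray continuation byte or invalid lead byte
        if (i + len > bytes.size()) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j) {
            const unsigned char cont = bytes[i + j];
            if ((cont & 0xC0) != 0x80) return false;  // not a continuation
            cp = (cp << 6) | (cont & 0x3Fu);
        }
        // Reject overlong forms, UTF-16 surrogates, and out-of-range values.
        static constexpr std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len - 1]) return false;
        if (cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}
```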

The statement in the parenthetical has a broad scope and rather profound consequences.  Ignoring white space is one aspect of it, but taking it to an extreme would mean implementing support for Unicode canonical equivalence and compatibility equivalence (at least for some characters) from UAX #15, and treating confusables from UTS #39 as the same character.  Should we treat à (precomposed) and à (a combining character sequence) as the same character?  P1949 purports to make the latter ill-formed in an identifier.  What if the latter appears in a literal?  Should we treat ; (U+037E GREEK QUESTION MARK) the same as ; (U+003B)?  The compilation performance implications of doing so would be significant.
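To make the à question concrete: the precomposed and combining spellings render identically but are distinct code point sequences, and nothing in the language today makes them compare equal (the helper names below are illustrative):

```cpp
#include <string>

// Two visually identical spellings of "à".
// Precomposed U+00E0 LATIN SMALL LETTER A WITH GRAVE: two UTF-8 bytes.
inline std::string a_grave_precomposed() { return "\xC3\xA0"; }

// U+0061 "a" followed by U+0300 COMBINING GRAVE ACCENT: three UTF-8 bytes.
// Canonically equivalent under UAX #15, yet a distinct code point sequence,
// so literals containing the two forms differ byte-for-byte.
inline std::string a_grave_combining() { return "a\xCC\x80"; }
```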

I think we should treat the parenthetical as a goal (and one that we expect to only approximate), not as a hard requirement. And yes, I think we should either treat combined and non-combined forms as equivalent or reject one of them (take this as a refinement to my parenthetical -- I think it's fine for us to reject code that is visually identical to valid source code, but it's much less reasonable for it to be valid with different behavior).
I like this, I like this very much.  It very nicely aligns with the goals of P1949 (which SG16 approved without contention to forward to EWG this week).
For text inside literals, the pragmatic answer that we retain the original source form should presumably win out -- "preserve string literal contents" is a more important goal than "visually indistinguishable source files behave the same". We should probably either reject greek question marks or treat them as semicolons, depending on whether we perform canonical decomposition or not, but they should ideally not be valid and mean something other than a semicolon (outside of literals, per the above).
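For reference, the byte-level difference behind the greek question mark example (variable names below are illustrative):

```cpp
#include <string_view>

// U+037E GREEK QUESTION MARK vs U+003B SEMICOLON. The two are canonically
// equivalent under Unicode -- U+037E decomposes to U+003B -- but encode
// differently in UTF-8, so without canonical decomposition an
// implementation sees two distinct characters that render identically.
inline constexpr std::string_view greek_question_mark = "\xCD\xBE";  // U+037E
inline constexpr std::string_view ascii_semicolon     = ";";         // U+003B
```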
Agreed, and thank you for elaborating on how to balance the concerns in and out of literals.

Given that many editors routinely remove trailing whitespace on save, and that it is usually invisible, allowing it to have a semantic effect seems questionable.
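The trailing-whitespace hazard from the GCC report can be sketched with a line splice (the macro and names here are illustrative):

```cpp
#include <string_view>

// The backslash at the end of the next line splices the two physical lines
// into one logical line in translation phase 2. An invisible trailing space
// after the backslash is exactly the contested case: under the "anything
// goes" reading of phase 1 an implementation may strip the space (so the
// splice happens), while a stricter reading leaves it and breaks the macro.
#define GREETING "hello " \
                 "world"

inline constexpr std::string_view greeting = GREETING;
```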

Another good point.  Thank you.

Tom.



(I'm inclined to think the answer to the second question should be yes, too, with few or no restrictions. But perhaps treating such cases as a conforming extension is fine.)

I suppose the tricky part is getting rules for this that have any formal meaning. An implementation can do whatever it likes *before* phase 1 to identify the initial contents of a source file, so requiring UTF-8 has the same escape hatch we currently have, just without the documentation requirement. And I don't think we can require anything about physical files on disk, because that really does cut into existing implementation practice (e.g., builds from VFS / editor buffers, interactive use in C++ interpreters, some forms of remote compilation servers).

It might help here to distinguish between what is C++ code, and what a conforming implementation must accept. We would presumably want valid code written on classroom whiteboards to be considered C++, even if all implementations are required to accept only octet sequences encoded in UTF-8 (which the whiteboard code would presumably not be!).

Thanks, 

Corentin


For reference here is the definition of abstract character in Unicode 13 

Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a glyph.
• An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
• Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.
_______________________________________________
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2020/05/9153.php

_______________________________________________
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2020/05/9169.php