On Thu, 28 May 2020 at 21:43, Richard Smith <richardsmith@google.com> wrote:
On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot@gmail.com> wrote:
On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith@google.com> wrote:
On Thu, 28 May 2020, 05:50 Corentin via Core, <core@lists.isocpp.org> wrote:
Hello, 

This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it is valid
for an implementation to remove trailing whitespace as part of the implementation-defined mapping described in translation phase 1. [lex.phases]

Is it the intent of that wording?
Should it be specified that this implementation-defined mapping must preserve the semantics of each abstract character present in the physical source file?
If not, would it be a valid implementation to perform arbitrary text transformations in phase 1, such as replacing "private" with "public", or replacing every "e" with a "z"?

Yes, that is absolutely valid and intended today. We intentionally permit trigraph replacement here, as agreed by EWG. And implementations take advantage of this in other ways too; Clang (for example) replaces Unicode whitespace with spaces (outside of literals) in this phase.
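As a rough illustration of what such a phase-1 mapping can look like, here is a toy Python sketch (not Clang's actual implementation; the set of characters below is illustrative only, and a real compiler would leave the contents of literals alone, which this sketch does not attempt):

```python
# Toy sketch of a phase-1 style mapping that turns Unicode whitespace
# into plain spaces. The character set is illustrative, not a
# description of what any real compiler does.
UNICODE_SPACES = {"\u00A0", "\u2002", "\u2003", "\u2009", "\u3000"}

def map_unicode_whitespace(source: str) -> str:
    """Replace each character in UNICODE_SPACES with a plain space."""
    return "".join(" " if c in UNICODE_SPACES else c for c in source)

print(map_unicode_whitespace("int\u00A0x\u3000= 0;"))  # -> "int x = 0;"
```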

... also, there is no guarantee that the source file is even originally text in any meaningful way before this implementation-defined mapping. A valid implementation could perform OCR on image files and go straight from PNG to a sequence of basic source characters.


The problem is that "the compiler can do absolutely anything in phase 1" prevents us from:

I am also concerned that this reduces portability (the same file can be read completely differently by different implementations and, as Alisdair pointed out, this causes a real issue for trailing whitespace).

I think there are separate questions here:

* Should a conforming implementation be required to accept source code represented as text files encoded in UTF-8?
* Should a conforming implementation be permitted to accept other things, and if so, how arbitrary is that choice?

I'm inclined to think the answer to the first question should be yes. We should have some notion of a portable C++ source file, and without a known fixed encoding it's hard to argue that such a thing exists. For that encoding we should agree on the handling of trailing whitespace etc. (though I think ignoring it outside of literals, as Clang and GCC do, is the right thing -- source code that has the same appearance should have the same behaviour).

I think we should separate conversion to the internal encoding from other transformations such as removing trailing whitespace.
I would, for example, prefer that if something is done to trailing whitespace, it be done in phase 2, at the same time as newline handling.

In practice, the encoding of a file should not influence how trailing whitespace is handled.
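To make the ordering question concrete, here is a toy Python model (hedged: the real translation phases do much more than this) of how a hypothetical phase-1 mapping that strips trailing whitespace interacts with phase-2 line splicing (deleting backslash-newline):

```python
def phase2_splice(text: str) -> str:
    # Toy model of phase 2: delete each backslash immediately
    # followed by a newline. (The real phase 2 does more than this.)
    return text.replace("\\\n", "")

def strip_trailing(text: str) -> str:
    # A hypothetical phase-1 mapping that removes trailing whitespace.
    return "\n".join(line.rstrip() for line in text.split("\n"))

# Note the space after the backslash: visually identical source,
# but the backslash is not immediately followed by a newline.
src = "int x = 1 \\ \n    + 2;\n"

# Without trailing-whitespace stripping, no splice happens:
assert phase2_splice(src) == src

# With the stripping applied first, the two lines are joined:
assert phase2_splice(strip_trailing(src)) == "int x = 1     + 2;\n"
```

The point of the sketch is only that two implementations making different choices about this mapping tokenize visually identical files differently.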


(I'm inclined to think the answer to the second question should be yes, too, with few or no restrictions. But perhaps treating such cases as a conforming extension is fine.)

I suppose the tricky part is getting rules for this that have any formal meaning. An implementation can do whatever it likes *before* phase 1 to identify the initial contents of a source file, so requiring UTF-8 has the same escape hatch we currently have, just without the documentation requirement. And I don't think we can require anything about physical files on disk, because that really does cut into existing implementation practice (e.g., builds from VFS / editor buffers, interactive use in C++ interpreters, some forms of remote compilation servers).

This is why I was suggesting wording this in terms of sequences of abstract characters (capital letter A is still capital letter A on a piece of paper or a whiteboard) - of course, that would not preclude a phase 0 that does arbitrary things.

It might help here to distinguish between what is C++ code, and what a conforming implementation must accept. We would presumably want valid code written on classroom whiteboards to be considered C++, even if all implementations are required to accept only octet sequences encoded in UTF-8 (which the whiteboard code would presumably not be!).

Thanks, 

Corentin


For reference, here is the definition of "abstract character" in Unicode 13:

Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a glyph.
• An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a grapheme.
• The abstract characters encoded by the Unicode Standard are known as Unicode abstract characters.
• Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.
_______________________________________________
Core mailing list
Core@lists.isocpp.org
Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
Link to this post: http://lists.isocpp.org/core/2020/05/9153.php