Date: Thu, 28 May 2020 21:58:59 +0200
On Thu, 28 May 2020 at 21:43, Richard Smith <richardsmith_at_[hidden]> wrote:
> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]> wrote:
>
>> On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith_at_[hidden]>
>> wrote:
>>
>>> On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>>>> that it is valid
>>>> for an implementation to remove trailing whitespaces as part of the
>>>> implementation defined mapping described in translation phase 1.
>>>> [lex.phases]
>>>>
>>>> Is it the intent of that wording?
>>>> Should it be specified that this implementation defined mapping should
>>>> preserve the semantic of each abstract character present in the physical
>>>> source file?
>>>> If not, is it a valid implementation to perform arbitrary text
>>>> transformation in phase 1 such as replacing "private" by "public" or
>>>> replacing all "e" by a "z" ?
>>>>
>>>
>>> Yes, that is absolutely valid and intended today. We intentionally
>>> permit trigraph replacement here, as agreed by EWG. And implementations
>>> take advantage of this in other ways too; Clang (for example) replaces
>>> Unicode whitespace with spaces (outside of literals) in this phase.
>>>
>>> ... also, there is no guarantee that the source file is even originally
>>> text in any meaningful way before this implementation-defined mapping. A
>>> valid implementation could perform OCR on image files and go straight from
>>> PNG to a sequence of basic source characters.
>>>
>>
>>
>> The problem is that "the compiler can do absolutely anything in phase 1"
>> prevents us from:
>>
>> - Mandating that a compiler should at least be able to read
>> utf8-encoded files (previous attempt
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html )
>> - Mandating that files that use the Unicode character set are not
>> arbitrarily transformed (normalized for example)
>>
>>
>> I am also concerned that this reduces portability (the same file can be
>> read completely differently by different implementations and, as Alisdair
>> pointed out, this causes a real issue for trailing whitespace).
>>
>
> I think there are separate questions here:
>
> * Should a conforming implementation be required to accept source code
> represented as text files encoded in UTF-8?
> * Should a conforming implementation be permitted to accept other things,
> and if so, how arbitrary is that choice?
>
> I'm inclined to think the answer to the first question should be yes. We
> should have some notion of a portable C++ source file, and without a known
> fixed encoding it's hard to argue that such a thing exists. For that
> encoding we should agree on the handling of trailing whitespace etc (though
> I think ignoring it outside of literals, as clang and GCC do, is the right
> thing -- source code that has the same appearance should have the same
> behaviour).
>
I think we should separate conversion to the internal encoding from other
transformations such as removing trailing whitespace.
I would, for example, prefer that if something is done to trailing
whitespace, it be done in phase 2 at the same time as new-line handling.
In practice, the encoding of a file should not influence how trailing
whitespace is handled.
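
To make the phase 2 suggestion concrete, here is a minimal sketch of line
splicing that takes the trailing-whitespace decision as an explicit policy,
independent of how phase 1 decoded the file. The names `TrailingWs` and
`splice_lines` are hypothetical and not taken from any compiler or from the
standard wording, and the sketch assumes '\n' is the only new-line character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical policy switch: is horizontal whitespace between a backslash
// and the following new-line significant, or ignored?
enum class TrailingWs { Significant, Ignored };

// Sketch of phase 2 line splicing over text that phase 1 has already decoded.
// Assumption: '\n' is the only new-line character in the input.
std::string splice_lines(std::string_view src, TrailingWs policy) {
    std::string out;
    out.reserve(src.size());
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            std::size_t j = i + 1;
            if (policy == TrailingWs::Ignored) {
                // Skip spaces and tabs that sit between '\' and the new-line.
                while (j < src.size() && (src[j] == ' ' || src[j] == '\t')) ++j;
            }
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;  // drop backslash, skipped whitespace, and new-line
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    assert(out.size() <= src.size());  // splicing only ever removes characters
    return out;
}
```

With `TrailingWs::Significant`, a backslash followed by a space and a
new-line is left untouched; with `TrailingWs::Ignored`, it splices the two
lines. Either way the choice is visible in phase 2 rather than hidden in the
encoding conversion.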
>
> (I'm inclined to think the answer to the second question should be yes,
> too, with few or no restrictions. But perhaps treating such cases as a
> conforming extension is fine.)
>
> I suppose the tricky part is getting rules for this that have any formal
> meaning. An implementation can do whatever it likes *before* phase 1 to
> identify the initial contents of a source file, so requiring UTF-8 has the
> same escape hatch we currently have, just without the documentation
> requirement. And I don't think we can require anything about physical files
> on disk, because that really does cut into existing implementation practice
> (eg, builds from VFS / editor buffers, interactive use in C++ interpreters,
> some forms of remote compilation servers).
>
This is why I was suggesting wording in terms of a sequence of abstract
characters (capital letter A is still capital letter A on a piece of paper
or a whiteboard). Of course, that would not preclude a phase 0 that does
arbitrary things.
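
As an illustration of what "preserving the sequence of abstract characters"
could mean for UTF-8 input, here is a rough sketch of a phase 1 mapping that
only decodes and never transforms: no normalization and no whitespace
removal. The function name `decode_utf8` is hypothetical, and rejecting
ill-formed input outright is an assumption of this sketch, not something the
current wording requires.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Decodes UTF-8 into Unicode scalar values without altering them.
// Returns std::nullopt for ill-formed input (a choice made by this sketch).
std::optional<std::u32string> decode_utf8(std::string_view bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest value allowed for this sequence length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte
        if (bytes.size() - i < len) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return std::nullopt;                        // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt;    // surrogate
        if (cp > 0x10FFFF) return std::nullopt;                   // out of range
        out.push_back(cp);
        i += len;
    }
    assert(i == bytes.size());  // every byte was consumed exactly once
    return out;
}
```

Note that "e" followed by U+0301 COMBINING ACUTE ACCENT stays as two code
points here; composing it into U+00E9 would be exactly the kind of
transformation this sketch leaves out of the decoding step.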
>
> It might help here to distinguish between what is C++ code, and what a
> conforming implementation must accept. We would presumably want valid code
> written on classroom whiteboards to be considered C++, even if all
> implementations are required to accept only octet sequences encoded in
> UTF-8 (which the whiteboard code would presumably not be!).
>
>>>> Thanks,
>>>>
>>>> Corentin
>>>>
>>>>
>>>> For reference here is the definition of abstract character in Unicode 13
>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>>
>>>> Abstract character: A unit of information used for the organization,
>>>> control, or representation of textual data.
>>>> • When representing data, the nature of that data is generally symbolic
>>>> as opposed to some other kind of data (for example, aural or visual).
>>>> Examples of such symbolic data include letters, ideographs, digits,
>>>> punctuation, technical symbols, and dingbats.
>>>> • An abstract character has no concrete form and should not be confused
>>>> with a glyph.
>>>> • An abstract character does not necessarily correspond to what a user
>>>> thinks of as a “character” and should not be confused with a grapheme.
>>>> • The abstract characters encoded by the Unicode Standard are known as
>>>> Unicode abstract characters.
>>>> • Abstract characters not directly encoded by the Unicode Standard can
>>>> often be represented by the use of combining character sequences.
>>>> _______________________________________________
>>>> Core mailing list
>>>> Core_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php
>>>>
>>>
Received on 2020-05-28 15:02:17