sg16: Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Richard Smith <richardsmith_at_[hidden]>
Date: Thu, 28 May 2020 12:43:26 -0700

On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]> wrote:

> On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith_at_[hidden]>
> wrote:
>
>> On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]>
>> wrote:
>>
>>> Hello,
>>>
>>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>>> that it is valid for an implementation to remove trailing whitespace as
>>> part of the implementation-defined mapping described in translation
>>> phase 1 [lex.phases].
>>>
>>> Is that the intent of the wording?
>>> Should it be specified that this implementation-defined mapping must
>>> preserve the semantics of each abstract character present in the physical
>>> source file?
>>> If not, is it valid for an implementation to perform arbitrary text
>>> transformations in phase 1, such as replacing "private" with "public" or
>>> replacing every "e" with a "z"?
>>>
>>
>> Yes, that is absolutely valid and intended today. We intentionally permit
>> trigraph replacement here, as agreed by EWG. And implementations take
>> advantage of this in other ways too; Clang (for example) replaces Unicode
>> whitespace with spaces (outside of literals) in this phase.
>>
>> ... also, there is no guarantee that the source file is even originally
>> text in any meaningful way before this implementation-defined mapping. A
>> valid implementation could perform OCR on image files and go straight from
>> PNG to a sequence of basic source characters.
>>
>
>
> The problem is that "the compiler can do absolutely anything in phase 1"
> prevents us from:
>
> - Mandating that a compiler should at least be able to read
>   UTF-8-encoded files (previous attempt:
>   http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html)
> - Mandating that files that use the Unicode character set are not
>   arbitrarily transformed (normalized, for example)
>
>
> I am also concerned that this reduces portability (the same file can be
> read completely differently by different implementations and, as Alisdair
> pointed out, this causes a real issue for trailing whitespace).
>

I think there are separate questions here:

* Should a conforming implementation be required to accept source code
represented as text files encoded in UTF-8?
* Should a conforming implementation be permitted to accept other things,
and if so, how arbitrary is that choice?

I'm inclined to think the answer to the first question should be yes. We
should have some notion of a portable C++ source file, and without a known
fixed encoding it's hard to argue that such a thing exists. For that
encoding we should agree on the handling of trailing whitespace etc. (though
I think ignoring it outside of literals, as Clang and GCC do, is the right
thing: source code that has the same appearance should have the same
behaviour).
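
To make the trailing-whitespace point concrete, here is a minimal sketch of
the idea. It is not taken from Clang or GCC; the names `strip_trailing_ws`
and `splice_lines` are hypothetical, and for brevity the sketch does not
treat literals specially. The first function plays the role of a
phase-1-style mapping that drops spaces and tabs before each newline; the
second is ordinary phase-2 line splicing.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical phase-1 step: drop spaces and tabs that directly precede a
// newline (or the end of the input). Interior whitespace is preserved.
std::string strip_trailing_ws(std::string_view src) {
    std::string out;
    std::string pending;  // whitespace seen since the last non-blank character
    for (char c : src) {
        if (c == ' ' || c == '\t') {
            pending += c;
        } else if (c == '\n') {
            pending.clear();  // the whitespace was trailing: discard it
            out += c;
        } else {
            out += pending;   // the whitespace was interior: keep it
            pending.clear();
            out += c;
        }
    }
    return out;  // whitespace pending at end of input is dropped as well
}

// Phase-2 line splicing: delete each backslash that is immediately
// followed by a newline, together with that newline.
std::string splice_lines(std::string_view src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline too
        } else {
            out += src[i];
        }
    }
    return out;
}
```

Given two inputs that look identical in an editor, one ending a line with a
bare backslash and one with a backslash followed by two spaces, only the
first is spliced by `splice_lines` alone. Running `strip_trailing_ws` first
makes both inputs splice the same way, which is the "same appearance, same
behaviour" property.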

(I'm inclined to think the answer to the second question should be yes,
too, with few or no restrictions. But perhaps treating such cases as a
conforming extension is fine.)

I suppose the tricky part is getting rules for this that have any formal
meaning. An implementation can do whatever it likes *before* phase 1 to
identify the initial contents of a source file, so requiring UTF-8 has the
same escape hatch we currently have, just without the documentation
requirement. And I don't think we can require anything about physical files
on disk, because that really does cut into existing implementation practice
(e.g., builds from VFS / editor buffers, interactive use in C++ interpreters,
some forms of remote compilation servers).

It might help here to distinguish between what is C++ code, and what a
conforming implementation must accept. We would presumably want valid code
written on classroom whiteboards to be considered C++, even if all
implementations are required to accept only octet sequences encoded in
UTF-8 (which the whiteboard code would presumably not be!).
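
As an illustration of what "accept only octet sequences encoded in UTF-8"
could mean in practice, here is a rough sketch of a well-formedness check.
The function name `is_well_formed_utf8` is hypothetical, and nothing here is
proposed wording or any implementation's actual code. It rejects stray
continuation bytes, truncated sequences, overlong forms, surrogates, and
values above U+10FFFF.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is a well-formed UTF-8 octet sequence.
bool is_well_formed_utf8(std::string_view bytes) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

A check like this says nothing about how the octets reached the
implementation, which is the escape hatch described above; it only
constrains the sequence once the implementation has decided what the
initial contents of the source file are.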

>>> Thanks,
>>>
>>> Corentin
>>>
>>>
>>> For reference here is the definition of abstract character in Unicode 13
>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>
>>> Abstract character: A unit of information used for the organization,
>>> control, or representation of textual data.
>>> • When representing data, the nature of that data is generally symbolic
>>> as opposed to some other kind of data (for example, aural or visual).
>>> Examples of such symbolic data include letters, ideographs, digits,
>>> punctuation, technical symbols, and dingbats.
>>> • An abstract character has no concrete form and should not be confused
>>> with a glyph.
>>> • An abstract character does not necessarily correspond to what a user
>>> thinks of as a “character” and should not be confused with a grapheme.
>>> • The abstract characters encoded by the Unicode Standard are known as
>>> Unicode abstract characters.
>>> • Abstract characters not directly encoded by the Unicode Standard can
>>> often be represented by the use of combining character sequences.
>>> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php
>>>
>>

Received on 2020-05-28 14:46:45