sg16: Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 May 2020 22:42:36 +0200

On Thu, 28 May 2020 at 22:29, Tom Honermann <tom_at_[hidden]> wrote:

> On 5/28/20 3:43 PM, Richard Smith via Core wrote:
>
> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]> wrote:
>
>> On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith_at_[hidden]>
>> wrote:
>>
>>> On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>>>> that it is valid
>>>> for an implementation to remove trailing whitespace as part of the
>>>> implementation-defined mapping described in translation phase 1.
>>>> [lex.phases]
>>>>
>>>> Is it the intent of that wording?
>>>> Should it be specified that this implementation-defined mapping should
>>>> preserve the semantics of each abstract character present in the physical
>>>> source file?
>>>> If not, is it valid for an implementation to perform arbitrary text
>>>> transformations in phase 1, such as replacing "private" with "public" or
>>>> replacing every "e" with a "z"?
>>>>
>>>
>>> Yes, that is absolutely valid and intended today. We intentionally
>>> permit trigraph replacement here, as agreed by EWG. And implementations
>>> take advantage of this in other ways too; Clang (for example) replaces
>>> Unicode whitespace with spaces (outside of literals) in this phase.
>>>
>>> ... also, there is no guarantee that the source file is even originally
>>> text in any meaningful way before this implementation-defined mapping. A
>>> valid implementation could perform OCR on image files and go straight from
>>> PNG to a sequence of basic source characters.
>>>
>>
>>
>> The problem is that "the compiler can do absolutely anything in phase 1"
>> prevents us from:
>>
>> - Mandating that a compiler should at least be able to read
>> UTF-8-encoded files (previous attempt:
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html )
>> - Mandating that files that use the Unicode character set are not
>> arbitrarily transformed (normalized, for example)
>>
>>
>> I am also concerned that this reduces portability (the same file can be
>> read completely differently by different implementations and, as Alisdair
>> pointed out, this causes a real issue for trailing whitespace).
>>
>
> I think there are separate questions here:
>
> * Should a conforming implementation be required to accept source code
> represented as text files encoded in UTF-8?
> * Should a conforming implementation be permitted to accept other things,
> and if so, how arbitrary is that choice?
>
> I'm inclined to think the answer to the first question should be yes. We
> should have some notion of a portable C++ source file, and without a known
> fixed encoding it's hard to argue that such a thing exists. For that
> encoding we should agree on the handling of trailing whitespace etc (though
> I think ignoring it outside of literals, as clang and GCC do, is the right
> thing -- source code that has the same appearance should have the same
> behaviour).
>
> The statement in the parenthetical has a broad scope and rather profound
> consequences. Ignoring white space is one aspect of it, but taking it to
> an extreme would mean implementing support for Unicode canonical
> equivalence and compatibility equivalence (at least for some characters)
> from UAX #15 <https://unicode.org/reports/tr15/>, and treating
> confusables from UTS #39 <http://www.unicode.org/reports/tr39/> as the
> same character. Should we treat à and à as the same character? P1949
> purports to make the latter ill-formed in an identifier. What if the
> latter appears in a literal?
>

I recall we (SG16) decided to leave UTS #39
<http://www.unicode.org/reports/tr39/> as a QOI matter, and decided that
normalization should not happen. (à and à are the same (abstract)
character despite having different representations.)
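
To make the "same abstract character, different representations" point concrete, here is a minimal C++ sketch. It assumes the two forms in question are the precomposed U+00E0 and the decomposed sequence U+0061 U+0300, both encoded as UTF-8. The helper names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// UTF-8 bytes of U+00E0 LATIN SMALL LETTER A WITH GRAVE (precomposed form).
inline std::string precomposed_a_grave() { return "\xC3\xA0"; }

// UTF-8 bytes of U+0061 LATIN SMALL LETTER A followed by
// U+0300 COMBINING GRAVE ACCENT (decomposed form).
inline std::string decomposed_a_grave() { return "a\xCC\x80"; }

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is performed.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The two strings would typically render identically, yet they differ in bytes (2 versus 3) and in code points (1 versus 2). A compiler that does not normalize therefore sees two different sequences, which is why the choice to normalize or not is observable in identifiers and literals.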

> Should we treat ; the same as ;? The compilation performance implications
> of doing so would be significant.
>

No - they are not the same character, they just happen to be represented by
similar-looking glyphs in some fonts.

> Tom.
>
>
> (I'm inclined to think the answer to the second question should be yes,
> too, with few or no restrictions. But perhaps treating such cases as a
> conforming extension is fine.)
>
> I suppose the tricky part is getting rules for this that have any formal
> meaning. An implementation can do whatever it likes *before* phase 1 to
> identify the initial contents of a source file, so requiring UTF-8 has the
> same escape hatch we currently have, just without the documentation
> requirement. And I don't think we can require anything about physical files
> on disk, because that really does cut into existing implementation practice
> (eg, builds from VFS / editor buffers, interactive use in C++ interpreters,
> some forms of remote compilation servers).
>
> It might help here to distinguish between what is C++ code, and what a
> conforming implementation must accept. We would presumably want valid code
> written on classroom whiteboards to be considered C++, even if all
> implementations are required to accept only octet sequences encoded in
> UTF-8 (which the whiteboard code would presumably not be!).
>
> Thanks,
>>>>
>>>> Corentin
>>>>
>>>>
>>>> For reference here is the definition of abstract character in Unicode
>>>> 13
>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>>
>>>> Abstract character: A unit of information used for the organization,
>>>> control, or representation of textual data.
>>>> • When representing data, the nature of that data is generally symbolic
>>>> as
>>>> opposed to some other kind of data (for example, aural or visual).
>>>> Examples of
>>>> such symbolic data include letters, ideographs, digits, punctuation,
>>>> technical
>>>> symbols, and dingbats.
>>>> • An abstract character has no concrete form and should not be confused
>>>> with a
>>>> glyph.
>>>> • An abstract character does not necessarily correspond to what a user
>>>> thinks of
>>>> as a “character” and should not be confused with a grapheme.
>>>> • The abstract characters encoded by the Unicode Standard are known as
>>>> Unicode abstract characters.
>>>> • Abstract characters not directly encoded by the Unicode Standard can
>>>> often be
>>>> represented by the use of combining character sequences.
>>>> _______________________________________________
>>>> Core mailing list
>>>> Core_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php
>>>>
>>>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9169.php
>
>
>

Received on 2020-05-28 15:45:54