sg16: Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 28 May 2020 17:33:42 -0400

On 5/28/20 4:42 PM, Corentin via SG16 wrote:
>
>
> On Thu, 28 May 2020 at 22:29, Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 5/28/20 3:43 PM, Richard Smith via Core wrote:
>> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]
>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>>
>> On Thu, 28 May 2020 at 20:39, Richard Smith
>> <richardsmith_at_[hidden] <mailto:richardsmith_at_[hidden]>> wrote:
>>
>> On Thu, 28 May 2020, 05:50 Corentin via Core,
>> <core_at_[hidden] <mailto:core_at_[hidden]>> wrote:
>>
>> Hello,
>>
>> This GCC issue
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>> that it is valid
>> for an implementation to remove trailing whitespace
>> as part of the implementation-defined mapping
>> described in translation phase 1. [lex.phases]
>>
>> Is it the intent of that wording?
>> Should it be specified that this implementation-defined
>> mapping should preserve the semantics of each
>> abstract character present in the physical source file?
>> If not, is it a valid implementation to
>> perform arbitrary text transformations in phase 1, such
>> as replacing "private" with "public" or replacing every
>> "e" with a "z"?
>>
>>
>> Yes, that is absolutely valid and intended today. We
>> intentionally permit trigraph replacement here, as agreed
>> by EWG. And implementations take advantage of this in
>> other ways too; Clang (for example) replaces Unicode
>> whitespace with spaces (outside of literals) in this phase.
>>
>> ... also, there is no guarantee that the source file is
>> even originally text in any meaningful way before this
>> implementation-defined mapping. A valid implementation
>> could perform OCR on image files and go straight from PNG
>> to a sequence of basic source characters.
>>
>>
>>
>> The problem is that "the compiler can do absolutely anything
>> in phase 1" prevents us from:
>>
>> * Mandating that a compiler should at least be able to read
>> UTF-8-encoded files (previous attempt
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html )
>> * Mandating that files that use the Unicode character set
>> are not arbitrarily transformed (normalized for example)
>>
>>
>> I am also concerned that this reduces portability (the same
>> file can be read completely differently by different
>> implementations and, as Alisdair pointed out, this causes a
>> real issue for trailing whitespace).
>>
>>
>> I think there are separate questions here:
>>
>> * Should a conforming implementation be required to accept source
>> code represented as text files encoded in UTF-8?
>> * Should a conforming implementation be permitted to accept other
>> things, and if so, how arbitrary is that choice?
>>
>> I'm inclined to think the answer to the first question should be
>> yes. We should have some notion of a portable C++ source file,
>> and without a known fixed encoding it's hard to argue that such a
>> thing exists. For that encoding we should agree on the handling
>> of trailing whitespace etc (though I think ignoring it outside of
>> literals, as clang and GCC do, is the right thing -- source code
>> that has the same appearance should have the same behaviour).
>
> The statement in the parenthetical has a broad scope and rather
> profound consequences. Ignoring white space is one aspect of it,
> but taking it to an extreme would mean implementing support for
> Unicode canonical equivalence and compatibility equivalence (at
> least for some characters) from UAX #15
> <https://unicode.org/reports/tr15/>, and treating confusables from
> UTS #39 <http://www.unicode.org/reports/tr39/> as the same
> character. Should we treat à and à as the same character? P1949
> purports to make the latter ill-formed in an identifier. What if
> the latter appears in a literal?
>
>
> I recall we (SG16) decided to leave UTS #39
> <http://www.unicode.org/reports/tr39/> as a QOI matter, and decided
> that normalization should not happen. (à and à are the same
> (abstract) character despite having different representations)
>
> Should we treat ; the same as ;? The compilation performance
> implications of doing so would be significant.
>
>
> No - they are not the same character, they just happen to be
> represented by similar-looking glyphs in some fonts.

I intended these as rhetorical questions; the point being that the line
separating invisible things that should or should not have semantic
significance is fuzzy. What razor should separate them?
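
To make the à example concrete, here is a minimal C++ sketch; it is not
anything proposed in this thread. It assumes the two spellings are U+00E0
(precomposed) and U+0061 followed by U+0300 (decomposed), both encoded as
UTF-8. The names are hypothetical, and toy_compose is only a stand-in for
real Unicode normalization: it handles that single sequence and nothing else.

```cpp
#include <cassert>   // for the assert-based checks in the usage note
#include <string>

// U+00E0 LATIN SMALL LETTER A WITH GRAVE, precomposed, as UTF-8 bytes.
const std::string a_grave_precomposed = "\xC3\xA0";

// U+0061 LATIN SMALL LETTER A followed by U+0300 COMBINING GRAVE ACCENT,
// as UTF-8 bytes. Renders the same as the string above in most fonts.
const std::string a_grave_decomposed = "a\xCC\x80";

// Hypothetical toy "normalizer": rewrites the decomposed spelling of
// a-grave to the precomposed one. A real implementation would follow
// UAX #15 and cover every composable sequence; this covers exactly one.
std::string toy_compose(std::string text) {
    const std::string from = "a\xCC\x80";
    const std::string to = "\xC3\xA0";
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue scanning after the replacement
    }
    return text;
}
```

Compared byte for byte, the two spellings differ (2 code units versus 3),
so a compiler that does not normalize sees two distinct sequences:
assert(a_grave_precomposed != a_grave_decomposed) holds, while
assert(toy_compose(a_grave_decomposed) == a_grave_precomposed) shows what a
normalizing phase 1 would make of them. Whether that rewrite may also apply
inside a string literal is exactly the open question.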

Tom.

>
> Tom.
>
>>
>> (I'm inclined to think the answer to the second question should
>> be yes, too, with few or no restrictions. But perhaps treating
>> such cases as a conforming extension is fine.)
>>
>> I suppose the tricky part is getting rules for this that have any
>> formal meaning. An implementation can do whatever it likes
>> *before* phase 1 to identify the initial contents of a source
>> file, so requiring UTF-8 has the same escape hatch we currently
>> have, just without the documentation requirement. And I don't
>> think we can require anything about physical files on disk,
>> because that really does cut into existing implementation
>> practice (e.g., builds from VFS / editor buffers, interactive use
>> in C++ interpreters, some forms of remote compilation servers).
>>
>> It might help here to distinguish between what is C++ code, and
>> what a conforming implementation must accept. We would presumably
>> want valid code written on classroom whiteboards to be considered
>> C++, even if all implementations are required to accept only
>> octet sequences encoded in UTF-8 (which the whiteboard code would
>> presumably not be!).
>>
>> Thanks,
>>
>> Corentin
>>
>>
>> For reference here is the definition of abstract
>> character in Unicode 13
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>
>> Abstract character: A unit of information used for
>> the organization, control, or representation of
>> textual data.
>> • When representing data, the nature of that data is
>> generally symbolic as
>> opposed to some other kind of data (for example,
>> aural or visual). Examples of
>> such symbolic data include letters, ideographs,
>> digits, punctuation, technical
>> symbols, and dingbats.
>> • An abstract character has no concrete form and
>> should not be confused with a
>> glyph.
>> • An abstract character does not necessarily
>> correspond to what a user thinks of
>> as a “character” and should not be confused with a
>> grapheme.
>> • The abstract characters encoded by the Unicode
>> Standard are known as Unicode abstract characters.
>> • Abstract characters not directly encoded by the
>> Unicode Standard can often be
>> represented by the use of combining character sequences.
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription:
>> https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post:
>> http://lists.isocpp.org/core/2020/05/9153.php
>
>
>

Received on 2020-05-28 16:36:51