sg16: Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 28 May 2020 16:29:00 -0400

On 5/28/20 3:43 PM, Richard Smith via Core wrote:
> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]> wrote:
>
> On Thu, 28 May 2020 at 20:39, Richard Smith
> <richardsmith_at_[hidden]> wrote:
>
> On Thu, 28 May 2020, 05:50 Corentin via Core,
> <core_at_[hidden]> wrote:
>
> Hello,
>
> This GCC issue
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
> that it is valid for an implementation to remove trailing
> whitespace as part of the implementation-defined mapping
> described in translation phase 1 [lex.phases].
>
> Is that the intent of the wording?
> Should it be specified that this implementation-defined
> mapping must preserve the semantics of each abstract
> character present in the physical source file?
> If not, is it valid for an implementation to perform arbitrary
> text transformations in phase 1, such as replacing
> "private" with "public" or replacing every "e" with a "z"?
>
>
> Yes, that is absolutely valid and intended today. We
> intentionally permit trigraph replacement here, as agreed by
> EWG. And implementations take advantage of this in other ways
> too; Clang (for example) replaces Unicode whitespace with
> spaces (outside of literals) in this phase.
>
> ... also, there is no guarantee that the source file is even
> originally text in any meaningful way before this
> implementation-defined mapping. A valid implementation could
> perform OCR on image files and go straight from PNG to a
> sequence of basic source characters.
>
>
>
> The problem is that "the compiler can do absolutely anything in
> phase 1" prevents us from:
>
> * Mandating that a compiler should at least be able to read
> UTF-8-encoded files (previous attempt:
> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html)
> * Mandating that files that use the Unicode character set are
> not arbitrarily transformed (normalized, for example)
>
>
> I am also concerned that this reduces portability: the same file
> can be read completely differently by different implementations,
> and, as Alisdair pointed out, this causes a real issue for
> trailing whitespace.
>
>
> I think there are separate questions here:
>
> * Should a conforming implementation be required to accept source code
> represented as text files encoded in UTF-8?
> * Should a conforming implementation be permitted to accept other
> things, and if so, how arbitrary is that choice?
>
> I'm inclined to think the answer to the first question should be yes.
> We should have some notion of a portable C++ source file, and without
> a known fixed encoding it's hard to argue that such a thing exists.
> For that encoding we should agree on the handling of trailing
> whitespace etc (though I think ignoring it outside of literals, as
> clang and GCC do, is the right thing -- source code that has the same
> appearance should have the same behaviour).

The statement in the parenthetical has a broad scope and rather profound
consequences. Ignoring white space is one aspect of it, but taking it
to an extreme would mean implementing support for Unicode canonical
equivalence and compatibility equivalence (at least for some
characters) from UAX #15 <https://unicode.org/reports/tr15/>, and
treating confusables from UTS #39 <http://www.unicode.org/reports/tr39/>
as the same character. Should we treat à (U+00E0) and à (U+0061 U+0300)
as the same character? P1949 purports to make the latter ill-formed in
an identifier. What if the latter appears in a literal? Should we treat
; (U+003B) the same as ; (U+037E)? The compilation performance
implications of doing so would be significant.
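
To make the à example concrete, here is a minimal sketch in C++. It rests on an assumption about the example above: that the first à is the precomposed U+00E0 and the second is "a" followed by U+0300 COMBINING GRAVE ACCENT. The helper compose_a_grave is hypothetical and handles only this one pair; a real implementation would need the full normalization algorithm and data tables from UAX #15, which is where the performance cost comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Precomposed form: U+00E0 LATIN SMALL LETTER A WITH GRAVE (UTF-8: C3 A0).
const std::string precomposed = "\xC3\xA0";

// Decomposed form: U+0061 'a' followed by U+0300 COMBINING GRAVE ACCENT
// (UTF-8: 61 CC 80). It renders the same but is a different byte sequence.
const std::string decomposed = "a\xCC\x80";

// Hypothetical helper, for illustration only: rewrites the single
// decomposed sequence above into its precomposed form. This is NOT a
// general NFC implementation; it knows about exactly one pair.
std::string compose_a_grave(const std::string& s) {
    const std::string pair = "a\xCC\x80";
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s.compare(i, pair.size(), pair) == 0) {
            out += "\xC3\xA0";   // emit the precomposed character
            i += pair.size();    // skip 'a' plus the combining mark
        } else {
            out += s[i++];       // copy every other byte unchanged
        }
    }
    return out;
}
```

A byte-wise comparison (precomposed == decomposed) is false, so two spellings that look identical name different identifiers or produce different literal contents unless the implementation runs something like compose_a_grave, generalized to all of Unicode, over every source file.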

Tom.

>
> (I'm inclined to think the answer to the second question should be
> yes, too, with few or no restrictions. But perhaps treating such cases
> as a conforming extension is fine.)
>
> I suppose the tricky part is getting rules for this that have any
> formal meaning. An implementation can do whatever it likes *before*
> phase 1 to identify the initial contents of a source file, so
> requiring UTF-8 has the same escape hatch we currently have, just
> without the documentation requirement. And I don't think we can
> require anything about physical files on disk, because that really
> does cut into existing implementation practice (eg, builds from VFS /
> editor buffers, interactive use in C++ interpreters, some forms of
> remote compilation servers).
>
> It might help here to distinguish between what is C++ code, and what a
> conforming implementation must accept. We would presumably want valid
> code written on classroom whiteboards to be considered C++, even if
> all implementations are required to accept only octet sequences
> encoded in UTF-8 (which the whiteboard code would presumably not be!).
>
> Thanks,
>
> Corentin
>
>
> For reference here is the definition of abstract character
> in Unicode 13
> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>
> Abstract character: A unit of information used for the
> organization, control, or representation of textual data.
> • When representing data, the nature of that data is generally
> symbolic as opposed to some other kind of data (for example,
> aural or visual). Examples of such symbolic data include letters,
> ideographs, digits, punctuation, technical symbols, and dingbats.
> • An abstract character has no concrete form and should not be
> confused with a glyph.
> • An abstract character does not necessarily correspond to what a
> user thinks of as a “character” and should not be confused with a
> grapheme.
> • The abstract characters encoded by the Unicode Standard are
> known as Unicode abstract characters.
> • Abstract characters not directly encoded by the Unicode Standard
> can often be represented by the use of combining character
> sequences.
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription:
> https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post:
> http://lists.isocpp.org/core/2020/05/9153.php
>
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9169.php

Received on 2020-05-28 15:32:08