Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 30 May 2020 01:11:11 -0400
On 5/28/20 6:13 PM, Richard Smith via SG16 wrote:
> On Thu, May 28, 2020 at 1:29 PM Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
> On 5/28/20 3:43 PM, Richard Smith via Core wrote:
>> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]
>> <mailto:corentin.jabot_at_[hidden]>> wrote:
>> On Thu, 28 May 2020 at 20:39, Richard Smith
>> <richardsmith_at_[hidden] <mailto:richardsmith_at_[hidden]>> wrote:
>> On Thu, 28 May 2020, 05:50 Corentin via Core,
>> <core_at_[hidden] <mailto:core_at_[hidden]>> wrote:
>> Hello,
>> This GCC issue
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>> that it is valid
>> for an implementation to remove trailing whitespace
>> as part of the implementation-defined mapping
>> described in translation phase 1. [lex.phases]
>> Is it the intent of that wording?
>> Should it be specified that this implementation-defined
>> mapping should preserve the semantics of each
>> abstract character present in the physical source file?
>> If not, is it a valid implementation to
>> perform arbitrary text transformations in phase 1 such
>> as replacing "private" with "public" or replacing every
>> "e" with a "z"?
>> Yes, that is absolutely valid and intended today. We
>> intentionally permit trigraph replacement here, as agreed
>> by EWG. And implementations take advantage of this in
>> other ways too; Clang (for example) replaces Unicode
>> whitespace with spaces (outside of literals) in this phase.
>> ... also, there is no guarantee that the source file is
>> even originally text in any meaningful way before this
>> implementation-defined mapping. A valid implementation
>> could perform OCR on image files and go straight from PNG
>> to a sequence of basic source characters.
>> The problem is that "the compiler can do absolutely anything
>> in phase 1" prevents us from:
>> * Mandating that a compiler should at least be able to read
>> UTF-8-encoded files (previous attempt:
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html )
>> * Mandating that files that use the Unicode character set
>> are not arbitrarily transformed (normalized for example)
>> I am also concerned that this reduces portability (the same
>> file can be read completely differently by different
>> implementations and, as Alisdair pointed out, this causes a
>> real issue for trailing whitespace).
>> I think there are separate questions here:
>> * Should a conforming implementation be required to accept source
>> code represented as text files encoded in UTF-8?
>> * Should a conforming implementation be permitted to accept other
>> things, and if so, how arbitrary is that choice?
>> I'm inclined to think the answer to the first question should be
>> yes. We should have some notion of a portable C++ source file,
>> and without a known fixed encoding it's hard to argue that such a
>> thing exists. For that encoding we should agree on the handling
>> of trailing whitespace etc (though I think ignoring it outside of
>> literals, as Clang and GCC do, is the right thing -- source code
>> that has the same appearance should have the same behaviour).
> The statement in the parenthetical has a broad scope and rather
> profound consequences. Ignoring white space is one aspect of it,
> but taking it to an extreme would mean implementing support for
> Unicode canonical equivalence and compatibility equivalence (at
> least for some characters) from UAX #15
> <https://unicode.org/reports/tr15/>, and treating confusables from
> UTS #39 <http://www.unicode.org/reports/tr39/> as the same
> character. Should we treat à (precomposed, U+00E0) and à (decomposed,
> U+0061 U+0300) as the same character? P1949
> purports to make the latter ill-formed in an identifier. What if
> the latter appears in a literal? Should we treat ; (U+037E GREEK
> QUESTION MARK) the same as ; (U+003B SEMICOLON)? The compilation
> performance implications of doing so would be significant.
> I think we should treat the parenthetical as a goal (and one that we
> expect to only approximate), not as a hard requirement. And yes, I
> think we should either treat combined and non-combined forms as
> equivalent or reject one of them (take this as a refinement to my
> parenthetical -- I think it's fine for us to reject code that is
> visually identical to valid source code, but it's much less reasonable
> for it to be valid with different behavior).
I like this, I like this very much. It very nicely aligns with the
goals of P1949 (which SG16 approved without contention to forward to EWG
this week).
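
To make the combined versus non-combined question concrete, here is a minimal sketch of why the two spellings of à are different to a lexer that performs no normalization. The variable names and the `same_spelling` helper are hypothetical and exist only for illustration; the byte values are the standard UTF-8 encodings of the code points named in the comments.

```cpp
#include <string>

// UTF-8 bytes for U+00E0 LATIN SMALL LETTER A WITH GRAVE (precomposed form).
const std::string precomposed = "\xC3\xA0";

// UTF-8 bytes for U+0061 'a' followed by U+0300 COMBINING GRAVE ACCENT
// (decomposed form). Both forms normally render identically.
const std::string decomposed = "a\xCC\x80";

// Hypothetical helper: a plain byte-wise comparison, which is all a lexer
// does if it applies no Unicode normalization to its input.
bool same_spelling(const std::string& a, const std::string& b) {
    return a == b;
}
```

Under these assumptions `same_spelling(precomposed, decomposed)` is false, so two identifiers that look the same on screen would be distinct unless the implementation normalizes them or rejects one form.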
> For text inside literals, the pragmatic answer that we retain the
> original source form should presumably win out -- "preserve string
> literal contents" is a more important goal than "visually
> indistinguishable source files behave the same". We should probably
> either reject Greek question marks or treat them as semicolons,
> depending on whether we perform canonical decomposition or not, but
> they should ideally not be valid and mean something other than a
> semicolon (outside of literals, per the above).
Agreed, and thank you for elaborating on how to balance the concerns in
and out of literals.
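
As an illustration of the "treat them as semicolons outside of literals" option, the following is a rough sketch, not a description of what any compiler does. The function name `fold_greek_question_mark` is hypothetical. It assumes valid UTF-8 input and only understands ordinary "..." string literals; comments, character literals, and raw string literals are deliberately ignored to keep it short.

```cpp
#include <cstddef>
#include <string>

// Sketch only: replaces U+037E GREEK QUESTION MARK (UTF-8: 0xCD 0xBE) with
// ';' outside of ordinary "..." string literals, and leaves the contents of
// those literals untouched.
std::string fold_greek_question_mark(const std::string& src) {
    std::string out;
    bool in_string = false;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        if (in_string) {
            out += c;
            if (c == '\\' && i + 1 < src.size()) {
                out += src[++i];  // copy the escaped character verbatim
            } else if (c == '"') {
                in_string = false;  // closing quote
            }
        } else if (c == '"') {
            in_string = true;  // opening quote
            out += c;
        } else if (c == '\xCD' && i + 1 < src.size() && src[i + 1] == '\xBE') {
            out += ';';  // Greek question mark outside a literal
            ++i;         // skip the second byte of the sequence
        } else {
            out += c;
        }
    }
    return out;
}
```

U+037E has a canonical decomposition to U+003B, so an implementation that applied canonical decomposition everywhere would get this mapping inside literals as well; the sketch keeps literal contents intact, matching the "preserve string literal contents" goal.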
> Given that many editors routinely remove trailing whitespace on save,
> and that it is usually invisible, allowing it to have a semantic
> effect seems questionable.

Another good point. Thank you.
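
For readers who want to see how trailing whitespace can have a semantic effect, here is a small sketch of the interaction with line splicing. Both functions are hypothetical simplifications: `strip_trailing_whitespace` stands in for one possible phase 1 mapping, and `splice_lines` stands in for the backslash-newline removal of phase 2. Neither is taken from an actual compiler.

```cpp
#include <cstddef>
#include <string>

// Hypothetical phase-1-style mapping: drop spaces and tabs that directly
// precede a newline or the end of the input.
std::string strip_trailing_whitespace(const std::string& src) {
    std::string out;
    std::string pending;  // whitespace not yet known to be trailing
    for (char c : src) {
        if (c == ' ' || c == '\t') {
            pending += c;
        } else {
            if (c != '\n') out += pending;  // interior whitespace: keep it
            pending.clear();
            out += c;
        }
    }
    return out;  // whitespace still pending at end of input is dropped
}

// Simplified phase-2-style splicing: delete each backslash that is
// immediately followed by a newline, joining the two physical lines.
std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well
        } else {
            out += src[i];
        }
    }
    return out;
}
```

Given the input "// note \\ \nint x;\n" (a comment ending in a backslash followed by a space), splicing alone leaves the text unchanged, while stripping first turns it into a single line in which the declaration becomes part of the comment.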


> Tom.
>> (I'm inclined to think the answer to the second question should
>> be yes, too, with few or no restrictions. But perhaps treating
>> such cases as a conforming extension is fine.)
>> I suppose the tricky part is getting rules for this that have any
>> formal meaning. An implementation can do whatever it likes
>> *before* phase 1 to identify the initial contents of a source
>> file, so requiring UTF-8 has the same escape hatch we currently
>> have, just without the documentation requirement. And I don't
>> think we can require anything about physical files on disk,
>> because that really does cut into existing implementation
>> practice (e.g., builds from VFS / editor buffers, interactive use
>> in C++ interpreters, some forms of remote compilation servers).
>> It might help here to distinguish between what is C++ code, and
>> what a conforming implementation must accept. We would presumably
>> want valid code written on classroom whiteboards to be considered
>> C++, even if all implementations are required to accept only
>> octet sequences encoded in UTF-8 (which the whiteboard code would
>> presumably not be!).
>> Thanks,
>> Corentin
>> For reference here is the definition of abstract
>> character in Unicode 13
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>> Abstract character: A unit of information used for
>> the organization, control, or representation of
>> textual data.
>> • When representing data, the nature of that data is
>> generally symbolic as
>> opposed to some other kind of data (for example,
>> aural or visual). Examples of
>> such symbolic data include letters, ideographs,
>> digits, punctuation, technical
>> symbols, and dingbats.
>> • An abstract character has no concrete form and
>> should not be confused with a
>> glyph.
>> • An abstract character does not necessarily
>> correspond to what a user thinks of
>> as a “character” and should not be confused with a
>> grapheme.
>> • The abstract characters encoded by the Unicode
>> Standard are known as Unicode abstract characters.
>> • Abstract characters not directly encoded by the
>> Unicode Standard can often be
>> represented by the use of combining character sequences.
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription:
>> https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post:
>> http://lists.isocpp.org/core/2020/05/9153.php
>> _______________________________________________
>> Core mailing list
>> Core_at_[hidden] <mailto:Core_at_[hidden]>
>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>> Link to this post: http://lists.isocpp.org/core/2020/05/9169.php

Received on 2020-05-30 00:14:19