Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Richard Smith <richardsmith_at_[hidden]>
Date: Thu, 28 May 2020 15:13:19 -0700
On Thu, May 28, 2020 at 1:29 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 5/28/20 3:43 PM, Richard Smith via Core wrote:
>
> On Thu, 28 May 2020, 12:17 Corentin, <corentin.jabot_at_[hidden]> wrote:
>
>> On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith_at_[hidden]>
>> wrote:
>>
>>> On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>>>> that it is valid for an implementation to remove trailing whitespace as
>>>> part of the implementation-defined mapping described in translation
>>>> phase 1. [lex.phases]
>>>>
>>>> Is this the intent of that wording?
>>>> Should it be specified that this implementation-defined mapping preserves
>>>> the semantics of each abstract character present in the physical source
>>>> file?
>>>> If not, is it valid for an implementation to perform arbitrary text
>>>> transformations in phase 1, such as replacing "private" with "public" or
>>>> replacing every "e" with a "z"?
>>>>
>>>
>>> Yes, that is absolutely valid and intended today. We intentionally
>>> permit trigraph replacement here, as agreed by EWG. And implementations
>>> take advantage of this in other ways too; Clang (for example) replaces
>>> Unicode whitespace with spaces (outside of literals) in this phase.
>>>
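As a rough illustration of the kind of mapping being described (a hypothetical
sketch, assuming the source has already been decoded to code points; the list
of whitespace code points is made up for the example and is not Clang's actual
behaviour, and handling of literals is omitted):

#include <string>

// Hypothetical phase-1 mapping: replace selected Unicode whitespace
// code points with U+0020 SPACE.
bool is_unicode_space(char32_t c) {
    switch (c) {
    case U'\u00A0':  // NO-BREAK SPACE
    case U'\u2002':  // EN SPACE
    case U'\u2003':  // EM SPACE
    case U'\u3000':  // IDEOGRAPHIC SPACE
        return true;
    default:
        return false;
    }
}

std::u32string map_phase1(std::u32string in) {
    for (char32_t& c : in)
        if (is_unicode_space(c))
            c = U' ';
    return in;
}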
>>> ... also, there is no guarantee that the source file is even originally
>>> text in any meaningful way before this implementation-defined mapping. A
>>> valid implementation could perform OCR on image files and go straight from
>>> PNG to a sequence of basic source characters.
>>>
>>
>>
>> The problem is that "the compiler can do absolutely anything in phase 1"
>> prevents us from:
>>
>> - Mandating that a compiler should at least be able to read
>> UTF-8-encoded files (previous attempt:
>> http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html)
>> - Mandating that files that use the Unicode character set are not
>> arbitrarily transformed (normalized, for example)
>>
>>
>> I am also concerned that this reduces portability (the same file can be
>> read completely differently by different implementations, and as Alisdair
>> pointed out, this causes a real issue for trailing whitespace)
>>
>
> I think there are separate questions here:
>
> * Should a conforming implementation be required to accept source code
> represented as text files encoded in UTF-8?
> * Should a conforming implementation be permitted to accept other things,
> and if so, how arbitrary is that choice?
>
> I'm inclined to think the answer to the first question should be yes. We
> should have some notion of a portable C++ source file, and without a known
> fixed encoding it's hard to argue that such a thing exists. For that
> encoding we should agree on the handling of trailing whitespace etc (though
> I think ignoring it outside of literals, as clang and GCC do, is the right
> thing -- source code that has the same appearance should have the same
> behaviour).
>
> The statement in the parenthetical has a broad scope and rather profound
> consequences. Ignoring white space is one aspect of it, but taking it to
> an extreme would mean implementing support for Unicode canonical
> equivalence and compatibility equivalence (at least for some characters)
> from UAX #15 <https://unicode.org/reports/tr15/>, and treating
> confusables from UTS #39 <http://www.unicode.org/reports/tr39/> as the
> same character. Should we treat à and à as the same character? P1949
> purports to make the latter ill-formed in an identifier. What if the
> latter appears in a literal? Should we treat ; the same as ;? The
> compilation performance implications of doing so would be significant.
>
I think we should treat the parenthetical as a goal (and one that we expect
to only approximate), not as a hard requirement. And yes, I think we should
either treat combined and non-combined forms as equivalent or reject one of
them (take this as a refinement to my parenthetical -- I think it's fine
for us to reject code that is visually identical to valid source code, but
it's much less reasonable for it to be valid with different behavior). For
text inside literals, the pragmatic answer that we retain the original
source form should presumably win out -- "preserve string literal contents"
is a more important goal than "visually indistinguishable source files
behave the same". We should probably either reject Greek question marks or
treat them as semicolons, depending on whether we perform canonical
decomposition or not, but they should ideally not be valid and mean
something other than a semicolon (outside of literals, per the above).
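As a concrete illustration of why these forms differ at the source level (a
self-contained sketch; the escape sequences below stand in for characters that
would otherwise be visually indistinguishable):

#include <cassert>
#include <string_view>

int main() {
    // U+00E0 LATIN SMALL LETTER A WITH GRAVE (precomposed, NFC form)
    constexpr std::u8string_view composed   = u8"\u00E0";
    // U+0061 'a' followed by U+0300 COMBINING GRAVE ACCENT (decomposed, NFD form)
    constexpr std::u8string_view decomposed = u8"a\u0300";
    assert(composed.size() == 2 && decomposed.size() == 3);
    assert(composed != decomposed);  // canonically equivalent, yet distinct code units

    // U+037E GREEK QUESTION MARK canonically decomposes to U+003B SEMICOLON,
    // but in source text it is a different code point (0xCD 0xBE in UTF-8).
    constexpr std::u8string_view semicolon = u8"\u003B";
    constexpr std::u8string_view greek_qm  = u8"\u037E";
    assert(semicolon != greek_qm);
}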

Given that many editors routinely remove trailing whitespace on save, and
that it is usually invisible, allowing it to have a semantic effect seems
questionable.
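For instance, in the hypothetical snippet below there are three spaces after
"line1" inside a raw string literal; whether they end up in the string's value
depends entirely on whether the implementation strips trailing whitespace in
phase 1:

#include <cstdio>
#include <cstring>

// Three (invisible) spaces follow "line1" below. If trailing whitespace is
// stripped in phase 1, the literal is "line1\nline2" (length 11); if the
// source text is preserved exactly, the length is 14.
const char* r = R"(line1   
line2)";

int main() {
    std::printf("%zu\n", std::strlen(r));  // 11 or 14, depending on phase 1
}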

> Tom.
>
>
> (I'm inclined to think the answer to the second question should be yes,
> too, with few or no restrictions. But perhaps treating such cases as a
> conforming extension is fine.)
>
> I suppose the tricky part is getting rules for this that have any formal
> meaning. An implementation can do whatever it likes *before* phase 1 to
> identify the initial contents of a source file, so requiring UTF-8 has the
> same escape hatch we currently have, just without the documentation
> requirement. And I don't think we can require anything about physical files
> on disk, because that really does cut into existing implementation practice
> (e.g., builds from VFS / editor buffers, interactive use in C++ interpreters,
> some forms of remote compilation servers).
>
> It might help here to distinguish between what is C++ code, and what a
> conforming implementation must accept. We would presumably want valid code
> written on classroom whiteboards to be considered C++, even if all
> implementations are required to accept only octet sequences encoded in
> UTF-8 (which the whiteboard code would presumably not be!).
>
> Thanks,
>>>>
>>>> Corentin
>>>>
>>>>
>>>> For reference, here is the definition of "abstract character" in
>>>> Unicode 13:
>>>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>>>>
>>>> Abstract character: A unit of information used for the organization,
>>>> control, or representation of textual data.
>>>> • When representing data, the nature of that data is generally symbolic
>>>> as opposed to some other kind of data (for example, aural or visual).
>>>> Examples of such symbolic data include letters, ideographs, digits,
>>>> punctuation, technical symbols, and dingbats.
>>>> • An abstract character has no concrete form and should not be confused
>>>> with a glyph.
>>>> • An abstract character does not necessarily correspond to what a user
>>>> thinks of as a “character” and should not be confused with a grapheme.
>>>> • The abstract characters encoded by the Unicode Standard are known as
>>>> Unicode abstract characters.
>>>> • Abstract characters not directly encoded by the Unicode Standard can
>>>> often be represented by the use of combining character sequences.
>>>> _______________________________________________
>>>> Core mailing list
>>>> Core_at_[hidden]
>>>> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
>>>> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php
>>>>
>>>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9169.php
>
>
>

Received on 2020-05-28 17:16:38