Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 28 May 2020 21:17:25 +0200
On Thu, 28 May 2020 at 20:39, Richard Smith <richardsmith_at_[hidden]> wrote:

> On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]>
> wrote:
>> Hello,
>> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
>> that it is valid
>> for an implementation to remove trailing whitespace as part of the
>> implementation-defined mapping described in translation phase 1.
>> [lex.phases]
>> Is that the intent of the wording?
>> Should it be specified that this implementation-defined mapping should
>> preserve the semantics of each abstract character present in the physical
>> source file?
>> If not, is it valid for an implementation to perform arbitrary text
>> transformations in phase 1, such as replacing "private" with "public" or
>> replacing every "e" with a "z"?
> Yes, that is absolutely valid and intended today. We intentionally permit
> trigraph replacement here, as agreed by EWG. And implementations take
> advantage of this in other ways too; Clang (for example) replaces Unicode
> whitespace with spaces (outside of literals) in this phase.
> ... also, there is no guarantee that the source file is even originally
> text in any meaningful way before this implementation-defined mapping. A
> valid implementation could perform OCR on image files and go straight from
> PNG to a sequence of basic source characters.

The problem is that "the compiler can do absolutely anything in phase 1"
prevents us from:

   - Mandating that a compiler should at least be able to read UTF-8-encoded
   files (previous attempt:
   http://open-std.org/JTC1/SC22/WG21/docs/papers/2012/n3463.html)
   - Mandating that files that use the Unicode character set are not
   arbitrarily transformed (normalized, for example)
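
To make the normalization point concrete, here is a minimal sketch (mine, not
from the thread or from any implementation). The helper names are
hypothetical, and toy_nfc only handles a single combining sequence; it is not
a real NFC implementation. It shows that the precomposed and decomposed
spellings of "é" are different code unit sequences, and that a phase 1 mapping
that normalized would silently turn one into the other.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE, UTF-8 bytes C3 A9.
inline std::string precomposed() { return "\xC3\xA9"; }

// U+0065 followed by U+0301 COMBINING ACUTE ACCENT, UTF-8 bytes 65 CC 81.
inline std::string decomposed() { return "e\xCC\x81"; }

// Hypothetical phase 1 mapping (assumption): composes only the one sequence
// above. A real normalizer would need the full Unicode composition tables.
inline std::string toy_nfc(const std::string& s) {
    const std::string from = decomposed();
    const std::string to = precomposed();
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = s.find(from, pos);
        if (hit == std::string::npos) {
            out.append(s, pos, std::string::npos);  // copy the remainder
            break;
        }
        out.append(s, pos, hit - pos);  // copy text before the match
        out += to;                      // emit the composed form
        pos = hit + from.size();
    }
    return out;
}
```

With such a mapping, a string literal or identifier written with the
decomposed form would reach later phases with different contents than the
file on disk, which is the kind of transformation the second bullet is about.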

I am also concerned that this reduces portability: the same file can be
read completely differently by different implementations and, as Alisdair
pointed out, this causes a real issue for trailing whitespace.
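
As an illustration of the trailing whitespace issue, here is a small sketch
that simulates the two phases on a string. It is only a model under stated
assumptions: strip_trailing_whitespace is a hypothetical stand-in for a
phase 1 mapping that drops trailing spaces and tabs, and splice_lines models
the phase 2 rule that a backslash immediately followed by a new-line is
deleted. None of these names come from a real compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical phase 1 mapping (assumption): remove spaces and tabs that
// immediately precede a new-line. Whitespace elsewhere is kept.
inline std::string strip_trailing_whitespace(const std::string& src) {
    std::string out;
    std::string pending;  // whitespace not yet known to be trailing
    for (char c : src) {
        if (c == ' ' || c == '\t') {
            pending += c;
        } else if (c == '\n') {
            pending.clear();  // it was trailing: drop it
            out += c;
        } else {
            out += pending;   // it was interior: keep it
            pending.clear();
            out += c;
        }
    }
    out += pending;  // whitespace before end of input (no new-line) is kept
    return out;
}

// Phase 2 model: delete each backslash immediately followed by a new-line,
// splicing the two physical lines together.
inline std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the new-line as well
        } else {
            out += src[i];
        }
    }
    return out;
}

// A comment line ending in a backslash followed by one space.
inline std::string sample() {
    return "// note \\ \n"
           "int i = 0;\n";
}
```

Without the stripping step, splice_lines(sample()) leaves the text unchanged
and the declaration survives. With it, the backslash becomes the last
character on the line, the lines are spliced, and "int i = 0;" ends up inside
the comment. The same physical file then means two different programs
depending on the phase 1 mapping.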

>> Thanks,
>> Corentin
>> For reference, here is the definition of abstract character in Unicode 13:
>> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>> Abstract character: A unit of information used for the organization,
>> control, or representation of textual data.
>> • When representing data, the nature of that data is generally symbolic as
>> opposed to some other kind of data (for example, aural or visual).
>> Examples of such symbolic data include letters, ideographs, digits,
>> punctuation, technical symbols, and dingbats.
>> • An abstract character has no concrete form and should not be confused
>> with a glyph.
>> • An abstract character does not necessarily correspond to what a user
>> thinks of as a “character” and should not be confused with a grapheme.
>> • The abstract characters encoded by the Unicode Standard are known as
>> Unicode abstract characters.
>> • Abstract characters not directly encoded by the Unicode Standard can
>> often be represented by the use of combining character sequences.
>> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php

Received on 2020-05-28 14:20:42