Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Richard Smith <richardsmith_at_[hidden]>
Date: Thu, 28 May 2020 11:39:16 -0700
On Thu, 28 May 2020, 05:50 Corentin via Core, <core_at_[hidden]> wrote:

> Hello,
> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues
> that it is valid for an implementation to remove trailing whitespace as
> part of the implementation-defined mapping described in translation
> phase 1 [lex.phases].
> Is that the intent of the wording?
> Should it be specified that this implementation-defined mapping must
> preserve the semantics of each abstract character present in the physical
> source file?
> If not, is it valid for an implementation to perform arbitrary text
> transformations in phase 1, such as replacing "private" with "public" or
> replacing every "e" with a "z"?

Yes, that is absolutely valid and intended today. We intentionally permit
trigraph replacement here, as agreed by EWG. And implementations take
advantage of this in other ways too; Clang (for example) replaces Unicode
whitespace with spaces (outside of literals) in this phase.

... also, there is no guarantee that the source file is even originally
text in any meaningful way before this implementation-defined mapping. A
valid implementation could perform OCR on image files and go straight from
PNG to a sequence of basic source characters.
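
To make the trailing-whitespace case from the quoted question concrete, here
is a minimal sketch of what such a phase 1 mapping might look like. The name
strip_trailing_whitespace and the choice to treat only space and horizontal
tab as whitespace are assumptions made for illustration; this is not how GCC
or Clang implement phase 1.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical phase 1 mapping (illustration only): remove spaces and
// horizontal tabs that appear immediately before each new-line, and at the
// end of the input if the last line has no new-line.
std::string strip_trailing_whitespace(std::string_view source) {
    std::string out;
    out.reserve(source.size());

    std::size_t pos = 0;
    while (pos < source.size()) {
        // Locate the end of the current physical line.
        const std::size_t eol = source.find('\n', pos);
        const bool has_newline = (eol != std::string_view::npos);
        const std::string_view line =
            source.substr(pos, has_newline ? eol - pos : std::string_view::npos);

        // Keep everything up to and including the last non-blank character.
        const std::size_t last = line.find_last_not_of(" \t");
        if (last != std::string_view::npos) {
            out.append(line.data(), last + 1);
        }

        if (has_newline) {
            out.push_back('\n');
            pos = eol + 1;
        } else {
            pos = source.size();
        }
    }
    return out;
}
```

One observable consequence of a mapping like this: a backslash followed by
spaces and then a new-line becomes a backslash immediately followed by a
new-line, which phase 2 then splices into a single logical line.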

> Corentin
> For reference, here is the definition of abstract character in Unicode 13:
> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
> Abstract character: A unit of information used for the organization,
> control, or representation of textual data.
> • When representing data, the nature of that data is generally symbolic as
>   opposed to some other kind of data (for example, aural or visual).
>   Examples of such symbolic data include letters, ideographs, digits,
>   punctuation, technical symbols, and dingbats.
> • An abstract character has no concrete form and should not be confused
>   with a glyph.
> • An abstract character does not necessarily correspond to what a user
>   thinks of as a “character” and should not be confused with a grapheme.
> • The abstract characters encoded by the Unicode Standard are known as
>   Unicode abstract characters.
> • Abstract characters not directly encoded by the Unicode Standard can
>   often be represented by the use of combining character sequences.
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php

Received on 2020-05-28 13:42:34