sg16: Re: [SG16] [isocpp-core] To which extent characters can be replaced or removed in phase 1?

From: Nathan Sidwell <nathan_at_[hidden]>
Date: Tue, 2 Jun 2020 12:28:00 -0400

On 5/28/20 8:50 AM, Corentin via Core wrote:
> Hello,
>
> This GCC issue https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38433 argues that it
> is valid for an implementation to remove trailing whitespace as part of the
> implementation-defined mapping described in translation phase 1. [lex.phases]
>
> Is it the intent of that wording?
> Should it be specified that this implementation-defined mapping should preserve
> the semantics of each abstract character present in the physical source file?
> If not, is it a valid implementation to perform arbitrary text transformations in
> phase 1, such as replacing "private" with "public" or replacing every "e" with a "z"?

FWIW, I recently encountered this while reimplementing raw string literal lexing.
We do not get that right when this extension is in play:
    /* GNU backslash whitespace newline extension.  FIXME
       could be any sequence of non-vertical space.  When we
       can properly restore any such sequence, we should
       mark this note as handled so _cpp_process_line_notes
       doesn't warn.  */

I do not know how easy that would be to fix. I retained the comment in a
bug-compatible reworking and moved on.
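
To make the problem concrete, here is a minimal sketch of line splicing with
and without the backslash-whitespace-newline extension. It is not GCC's code:
splice_lines and its gnu_extension flag are hypothetical names used only for
illustration, and the sketch assumes "non-vertical space" means space and tab.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical model of line splicing, not taken from GCC.
// Standard behaviour: a backslash immediately followed by a newline is
// removed, joining the two physical lines.
// With gnu_extension set, blanks (space or tab, an assumption of this
// sketch) between the backslash and the newline are discarded as well.
std::string splice_lines(std::string_view src, bool gnu_extension) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\') {
            std::size_t j = i + 1;
            if (gnu_extension) {
                // Skip any run of horizontal whitespace after the backslash.
                while (j < src.size() && (src[j] == ' ' || src[j] == '\t'))
                    ++j;
            }
            if (j < src.size() && src[j] == '\n') {
                i = j;  // drop the backslash, any skipped blanks, and the newline
                continue;
            }
        }
        out += src[i];
    }
    return out;
}
```

In this model, "a\\ \nb" and "a\\\t\nb" both splice to "ab" when the extension
is on, so the spliced text alone no longer says which blanks were there. A raw
string literal needs the original characters back, which is why the dropped
sequence would have to be recorded somewhere in order to restore it.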
>
> Thanks,
>
> Corentin
>
>
> For reference here is the definition of abstract character in Unicode 13
> http://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G2212
>
> Abstract character: A unit of information used for the organization, control, or
> representation of textual data.
> • When representing data, the nature of that data is generally symbolic as
> opposed to some other kind of data (for example, aural or visual). Examples of
> such symbolic data include letters, ideographs, digits, punctuation, technical
> symbols, and dingbats.
> • An abstract character has no concrete form and should not be confused with a
> glyph.
> • An abstract character does not necessarily correspond to what a user thinks of
> as a “character” and should not be confused with a grapheme.
> • The abstract characters encoded by the Unicode Standard are known as Unicode
> abstract characters.
> • Abstract characters not directly encoded by the Unicode Standard can often be
> represented by the use of combining character sequences.
>
> _______________________________________________
> Core mailing list
> Core_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/core
> Link to this post: http://lists.isocpp.org/core/2020/05/9153.php
>

-- 
Nathan Sidwell

Received on 2020-06-02 11:31:23