sg16: Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 2 Jun 2020 23:19:20 -0400

On 6/2/20 7:57 AM, Corentin Jabot via SG16 wrote:
>
>
>
> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>
> Translation phase 1 maps source code to either a member of the
> basic character set, or a UCN corresponding to that character.
> What if there is no such UCN? Is that undefined behavior, or is
> the program ill-formed? I can find nothing on this in [lex.phases]
> where we describe processing the source through an implementation
> defined character mapping.
>
> When we get to [lex.charset] we can see it is clearly ill-formed if
> the produced UCN is invalid - is that supposed to be the resolution
> here? Source must always map to a UCN, but the UCN need not
> be valid, so we get an error parsing the (implied) UCN in a later
> phase?
>
>
> One more reason i want to rewrite phase 1.
>
> 2 things should be specified here:
>
> > Any source file character not in the basic source character set
> > <http://eel.is/c++draft/lex#def:basic_source_character_set> is
> > replaced by the universal-character-name
> > <http://eel.is/c++draft/lex#nt:universal-character-name> that
> > designates that character.
> > <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>
> This is wrong: characters may map to UCN sequences, not single UCNs.
Good point. This can happen with a few legacy encodings. For example,
Big5-HKSCS includes a few characters that map to a Unicode code point pair.
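
To illustrate the point about sequences, here is a minimal C++ sketch, not
any compiler's actual phase 1. The names map_hkscs and to_ucn_sequence are
hypothetical, and the four table entries are the Big5-HKSCS double-byte
codes commonly cited as mapping to a base letter plus a combining mark;
treat them as illustrative rather than as an authoritative mapping table.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: spell a code point sequence as UCNs, using \uXXXX
// for BMP code points and \UXXXXXXXX otherwise.
std::string to_ucn_sequence(const std::vector<char32_t>& cps) {
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (char32_t cp : cps) {
        assert(cp <= 0x10FFFF);  // outside the Unicode code space otherwise
        const bool wide = cp > 0xFFFF;
        out += wide ? "\\U" : "\\u";
        const int digits = wide ? 8 : 4;
        for (int i = digits - 1; i >= 0; --i)
            out += hex[(cp >> (4 * i)) & 0xF];
    }
    return out;
}

// Hypothetical lookup for the handful of Big5-HKSCS codes assumed here to
// have no single-code-point equivalent. Returns nullopt for anything else;
// a real decoder would consult the full mapping table instead.
std::optional<std::vector<char32_t>> map_hkscs(std::uint16_t code) {
    switch (code) {
    case 0x8862: return std::vector<char32_t>{0x00CA, 0x0304};
    case 0x8864: return std::vector<char32_t>{0x00CA, 0x030C};
    case 0x88A3: return std::vector<char32_t>{0x00EA, 0x0304};
    case 0x88A5: return std::vector<char32_t>{0x00EA, 0x030C};
    default:     return std::nullopt;
    }
}
```

With this sketch, one source character yields two UCNs, which the current
"replaced by the universal-character-name" wording does not describe.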
>
> Characters that do not have representation in Unicode should
> be ill-formed - with the caveat that implementers can do _anything_
> in phase 0

Phase 0?

Since source file encoding and phase 1 translation are
implementation-defined, I don't think the standard can really say much
about this scenario. But it seems pretty clear that, post phase 1, the
standard specifies no facilities to handle a non-Unicode character.

>
> Note that the existence of a mapping is different from the validity of
> a UCN
> It is an implementation strategy to map characters without
> representation to nothing.
> Other valid strategies would be to use the PUA to represent these
> characters
>
>
> To give you an idea of where i want to be, here is a very early draft
> of what I think phase 1 and 2 should do, pending
> a couple of design changes that EWG would have to look at
>
> 1. If the physical source character set is the Unicode character set, each
> code point in the source
> file is converted to the internal representation of that same code
> point. Codepoints that
> are surrogate codepoints or invalid codepoints are ill-formed.
> Otherwise, each abstract character in the source file is mapped in an
> implementation-defined manner to a sequence of Unicode codepoints
> representing the same abstract character (introducing new-line
> characters for end-of-line indicators if necessary).
> An implementation may use any internal encoding able to represent
> uniquely any Unicode codepoint. *If an abstract character in the
> source file is not representable in the Unicode character set, the
> program is ill-formed.*
> An implementation supports source files representing a sequence of
> UTF-8 code units.
> Any additional physical source file character sets accepted are
> implementation-defined.
> How the character set of a source file is determined is
> implementation-defined.
>
> 2. Each implementation-defined line termination sequence of characters
> is replaced by a
> LINE FEED character (U+000A). Each instance of a BACKSLASH (\) immediately
> followed by a LINE FEED or at the end of a file is deleted, splicing
> physical source
> lines to form logical source lines. Only the last backslash on any
> physical source line shall
> be eligible for being part of such a splice. Except for splices
> reverted in a raw string literal,
> if a splice results in a codepoint sequence that matches the syntax of
> a universal-character-name, the behavior is implementation-defined.
> A source file that is
> not empty and that does not end
> in a LINE FEED, or that ends in a LINE FEED immediately preceded by
> a BACKSLASH before any such splicing takes place, shall be processed
> as if an
> additional LINE FEED were appended to the file.
> Sequences of whitespace codepoints at the end of each line are removed.
> Each universal-character-name is replaced by the Unicode codepoint it
> designates.
The sentence regarding the removal of trailing whitespace doesn't
specify whether the removal occurs for physical or logical lines (before
or after splicing).
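
To make the ambiguity concrete, here is a small C++ sketch under my own
assumptions: strip_trailing_ws and splice are hypothetical helpers, only
space and tab count as whitespace, and '\n' stands in for LINE FEED. It is
not proposed wording, just a way to see that the two orderings disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Remove spaces and tabs that immediately precede each '\n' or the end
// of the input.
std::string strip_trailing_ws(const std::string& in) {
    auto is_ws = [](char c) { return c == ' ' || c == '\t'; };
    std::string out;
    for (char c : in) {
        if (c == '\n')
            while (!out.empty() && is_ws(out.back())) out.pop_back();
        out += c;
    }
    while (!out.empty() && is_ws(out.back())) out.pop_back();
    return out;
}

// Delete each backslash that is immediately followed by '\n', together
// with that '\n'.
std::string splice(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n') {
            ++i;  // skip the line feed as well
            continue;
        }
        out += in[i];
    }
    return out;
}

void demo() {
    // Physical line 1 is: a, space, backslash, two spaces.
    const std::string src = "a \\  \nb\n";
    // Strip physical lines first: the backslash becomes adjacent to the
    // line feed, so the two lines are spliced.
    assert(splice(strip_trailing_ws(src)) == "a b\n");
    // Splice first: the backslash is not adjacent to the line feed, so no
    // splice happens and the backslash survives at the end of line 1.
    assert(strip_trailing_ws(splice(src)) == "a \\\nb\n");
}
```

So a backslash followed by trailing blanks either splices or does not,
depending on which reading of the sentence is intended.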

Wording concerns aside, I think it would be helpful to list the intended
behavioral changes.

Tom.

> Corentin
>
>
> AlisdairM

Received on 2020-06-02 22:22:28