sg16: Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 3 Jun 2020 08:45:50 +0200

On Wed, Jun 3, 2020, 05:19 Tom Honermann <tom_at_[hidden]> wrote:

> On 6/2/20 7:57 AM, Corentin Jabot via SG16 wrote:
>
>
>
>
> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Translation phase 1 maps source code to either a member of the
>> basic character set, or a UCN corresponding to that character.
>> What if there is no such UCN? Is that undefined behavior, or is
>> the program ill-formed? I can find nothing on this in [lex.phases]
>> where we describe processing the source through an
>> implementation-defined character mapping.
>>
>> When we get to [lex.charset] we can see it is clearly ill-formed if
>> the produced UCN is invalid - is that supposed to be the resolution
>> here? Source must always map to a UCN, but the UCN need not
>> be valid, so we get an error parsing the (implied) UCN in a later
>> phase?
>>
>
> One more reason I want to rewrite phase 1.
>
> 2 things should be specified here:
>
> > Any source file character not in the basic source character set
> <http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced
> by the
> universal-character-name
> <http://eel.is/c++draft/lex#nt:universal-character-name> that designates
> that character. <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>
>
> This is wrong: characters may map to UCN sequences, not single UCNs.
>
> Good point. This can happen with a few legacy encodings. For example,
> Big5-HKSCS includes a few characters that map to a Unicode code point pair.
>
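To make the "UCN sequence" point concrete, here is a minimal sketch of a phase 1 lookup for a legacy double-byte encoding. The function names `to_code_points` and `spell_as_ucns` are hypothetical, and the four table entries are the Big5-HKSCS characters that, to the best of my knowledge, map to a base letter plus a combining mark; treat the table as illustrative rather than authoritative. A character with no entry yields an empty optional, which an implementation could diagnose as ill-formed.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical phase 1 lookup: one legacy character may yield several
// Unicode code points. The entries below are assumed Big5-HKSCS mappings.
std::optional<std::u32string> to_code_points(std::uint16_t legacy_char) {
    static const std::unordered_map<std::uint16_t, std::u32string> table = {
        {0x8862, U"\u00CA\u0304"},  // E with circumflex + combining macron
        {0x8864, U"\u00CA\u030C"},  // E with circumflex + combining caron
        {0x88A3, U"\u00EA\u0304"},  // e with circumflex + combining macron
        {0x88A5, U"\u00EA\u030C"},  // e with circumflex + combining caron
    };
    const auto it = table.find(legacy_char);
    if (it == table.end()) {
        return std::nullopt;  // no Unicode representation known to this table
    }
    return it->second;
}

// Spells a code point sequence as the UCN sequence that designates it.
std::string spell_as_ucns(const std::u32string& code_points) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (char32_t cp : code_points) {
        const bool short_form = cp <= 0xFFFF;
        out += short_form ? "\\u" : "\\U";
        for (int shift = short_form ? 12 : 28; shift >= 0; shift -= 4) {
            out.push_back(digits[(cp >> shift) & 0xF]);
        }
    }
    return out;
}
```

With this table, the single source character 0x8862 is spelled as two UCNs, `\u00CA\u0304`, which the current "replaced by the universal-character-name" wording does not allow for.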
>
> Characters that do not have a representation in Unicode should make the
> program ill-formed, with the caveat that implementers can do _anything_ in
> phase 0.
>
> Phase 0?
>

Sorry, let me clarify. Whatever we specify, we can't prevent implementers
from doing transformations before phase 1, so phase 1 is mostly guidance.
I don't think there is a way around that, nor should there be.

> Since source file encoding and phase 1 translation is
> implementation-defined, I don't think the standard can really say much
> about this scenario. But it seems pretty clear that, post phase 1, the
> standard specifies no facilities to handle a non-Unicode character.
>
>
> Note that the existence of a mapping is different from the validity of a
> UCN.
> It is an implementation strategy to map characters without representation
> to nothing.
> Another valid strategy would be to use the PUA to represent these
> characters.
>
>
> To give you an idea of where I want to be, here is a very early draft of
> what I think phases 1 and 2 should do, pending
> a couple of design changes that EWG would have to look at.
>
> 1. If the physical source character set is the Unicode character set, each
> code point in the source
> file is converted to the internal representation of that same code point.
> Codepoints that
> are surrogate codepoints or invalid codepoints are ill-formed.
> Otherwise, each abstract character in the source file is mapped in an
> implementation-defined manner to a sequence of Unicode codepoints
> representing the same abstract character (introducing new-line characters
> for end-of-line indicators if necessary).
> An implementation may use any internal encoding able to uniquely represent
> any Unicode codepoint.
> *If an abstract character in the source file is not representable in the
> Unicode character set, the program is ill-formed.*
> An implementation supports source files representing a sequence of UTF-8
> code units.
> Any additional physical source file character sets accepted are
> implementation-defined.
> How the character set of a source file is determined is
> implementation-defined.
>
> 2. Each implementation-defined line termination sequence of characters is
> replaced by a
> LINE FEED character (U+000A). Each instance of a BACKSLASH (\) immediately
> followed by a LINE FEED or at the end of a file is deleted, splicing
> physical source
> lines to form logical source lines. Only the last backslash on any
> physical source line shall
> be eligible for being part of such a splice. Except for splices reverted
> in a raw string literal,
> if a splice results in a codepoint sequence that matches the syntax of a
> universal-character-
> name, the behavior is implementation-defined. A source file that is not
> empty and that does not end
> in a *LINE FEED*, or that ends in a LINE FEED immediately preceded by a
> BACKSLASH before any such splicing takes place, shall be processed as if
> an
> additional LINE FEED were appended to the file.
> Sequences of whitespace codepoints at the end of each line are removed.
> Each universal-character-name is replaced by the Unicode codepoint it
> designates.
>
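As a rough illustration of the last sentence of the draft combined with the "surrogate or invalid codepoints are ill-formed" rule from item 1, the sketch below converts one UCN spelling to the code point it designates. `is_unicode_scalar_value` and `parse_ucn` are hypothetical names, and the sketch applies only the range checks mentioned in the draft; it does not model any other restriction [lex.charset] places on UCNs.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// True for code points in [0, 0x10FFFF] that are not surrogates.
bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Parses exactly one "\uXXXX" or "\UXXXXXXXX" spelling.
// Returns an empty optional when the spelling is malformed or designates
// a surrogate or out-of-range value (ill-formed under the draft).
std::optional<char32_t> parse_ucn(std::string_view text) {
    if (text.size() < 2 || text[0] != '\\') {
        return std::nullopt;
    }
    std::size_t digit_count = 0;
    if (text[1] == 'u') {
        digit_count = 4;
    } else if (text[1] == 'U') {
        digit_count = 8;
    } else {
        return std::nullopt;
    }
    if (text.size() != 2 + digit_count) {
        return std::nullopt;
    }
    char32_t value = 0;
    for (char c : text.substr(2)) {
        int digit = 0;
        if (c >= '0' && c <= '9') {
            digit = c - '0';
        } else if (c >= 'a' && c <= 'f') {
            digit = c - 'a' + 10;
        } else if (c >= 'A' && c <= 'F') {
            digit = c - 'A' + 10;
        } else {
            return std::nullopt;  // not a hexadecimal digit
        }
        value = value * 16 + static_cast<char32_t>(digit);
    }
    if (!is_unicode_scalar_value(value)) {
        return std::nullopt;
    }
    return value;
}
```

Under these assumptions `\u00CA` and `\U0001F600` are accepted, while `\uD800` and `\U00110000` are rejected.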
> The sentence regarding the removal of trailing whitespace doesn't specify
> whether the removal occurs for physical or logical lines (before or after
> splicing).
>
> Wording concerns aside, I think it would be helpful to list the intended
> behavioral changes.
>

Here: removal of trailing whitespace is mandated, and UTF-8 support is
mandated.
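
On the physical-versus-logical-line question, the order does change the result, which a small sketch can show. The helper names below are hypothetical, the input is assumed to end in a LINE FEED, and only space and tab are treated as whitespace (the draft says "whitespace codepoints" without listing them).

```cpp
#include <cstddef>
#include <string>

// Removes runs of spaces and tabs that directly precede a LINE FEED.
// Assumes the text already ends in a LINE FEED.
std::string strip_trailing_whitespace(const std::string& text) {
    std::string out;
    for (char c : text) {
        if (c == '\n') {
            while (!out.empty() && (out.back() == ' ' || out.back() == '\t')) {
                out.pop_back();
            }
        }
        out.push_back(c);
    }
    return out;
}

// Deletes each BACKSLASH that is immediately followed by a LINE FEED,
// together with that LINE FEED.
std::string splice_lines(const std::string& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == '\n') {
            ++i;  // skip the LINE FEED as well
            continue;
        }
        out.push_back(text[i]);
    }
    return out;
}

// Whitespace removed on physical lines, before splicing.
std::string strip_then_splice(const std::string& text) {
    return splice_lines(strip_trailing_whitespace(text));
}

// Whitespace removed on logical lines, after splicing.
std::string splice_then_strip(const std::string& text) {
    return strip_trailing_whitespace(splice_lines(text));
}
```

For a physical line `a\` followed by two spaces and then a line `b`, stripping first exposes the backslash to the LINE FEED and the lines are spliced into `ab`. Splicing first leaves the backslash in place, because it is not immediately followed by a LINE FEED, and the two lines stay separate.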

> Tom.
>
>
> Corentin
>
>
>
>>
>> AlisdairM
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
>
>

Received on 2020-06-03 01:49:10