sg16: Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 2 Jun 2020 21:38:28 -0400

On Tue, Jun 2, 2020 at 7:57 AM Corentin Jabot via SG16 <
sg16_at_[hidden]> wrote:

>
>
>
> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <
> sg16_at_[hidden]> wrote:
>
>> Translation phase 1 maps source code to either a member of the
>> basic character set, or a UCN corresponding to that character.
>> What if there is no such UCN? Is that undefined behavior, or is
>> the program ill-formed? I can find nothing on this in [lex.phases]
>> where we describe processing the source through an
>> implementation-defined character mapping.
>>
>> When we get to [lex.charset] we can see it is clearly ill-formed if
>> the produced UCN is invalid - is that supposed to be the resolution
>> here? Source must always map to a UCN, but the UCN need not
>> be valid, so we get an error parsing the (implied) UCN in a later
>> phase?
>>
>
> One more reason I want to rewrite phase 1.
>
> 2 things should be specified here:
>
> > Any source file character not in the basic source character set
> <http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced
> by the
> universal-character-name
> <http://eel.is/c++draft/lex#nt:universal-character-name> that designates
> that character. <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>
>
> This is wrong: characters may map to UCN sequences, not single UCNs.
>
> Characters that do not have representation in Unicode should
> be ill-formed - with the caveat that implementers can do _anything_ in
> phase 0
>
> Note that the existence of a mapping is different from the validity of a
> UCN.
> It is an implementation strategy to map characters without representation
> to nothing.
> Other valid strategies would be to use the PUA to represent these
> characters.
>
Both of these are harmful to the integrity of user strings unless (for the
second) there is a reserved space of codepoints for the implementation, to
avoid confusion between the implementation's use of a codepoint and the
user's use of the same codepoint.
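
To make the PUA concern concrete, here is a minimal sketch. The stand-in
code point `kImplStandIn` and the function `treated_as_unrepresentable` are
hypothetical names invented for illustration; no actual implementation is
being described.

```cpp
#include <cassert>

// True if cp lies in one of the three Unicode Private Use ranges.
constexpr bool is_private_use(char32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF)        // BMP Private Use Area
        || (cp >= 0xF0000 && cp <= 0xFFFFD)      // Supplementary Private Use Area-A
        || (cp >= 0x100000 && cp <= 0x10FFFD);   // Supplementary Private Use Area-B
}

// Hypothetical implementation choice: the code point used internally for a
// source character that has no Unicode representation.
constexpr char32_t kImplStandIn = 0xE000;

// Hypothetical reverse mapping applied when producing the execution encoding.
// It sees only the code point, so it cannot tell whether the value came from
// the implementation's phase 1 mapping or from the user's own text.
constexpr bool treated_as_unrepresentable(char32_t cp) {
    return cp == kImplStandIn;
}
```

A user who writes U+E000 in a string for their own purposes gets the same
answer from `treated_as_unrepresentable` as the implementation's stand-in
does, which is the confusion a reserved range would avoid.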

>
>
> To give you an idea of where I want to be, here is a very early draft of
> what I think phase 1 and 2 should do, pending
> a couple of design changes that EWG would have to look at
>
> 1. If the physical source character set is the Unicode character set, each
> code point in the source file is converted to the internal representation
> of that same code point. Codepoints that are surrogate codepoints or
> invalid codepoints are ill-formed.
> Otherwise, each abstract character in the source file is mapped in an
> implementation-defined manner to a sequence of Unicode codepoints
> representing the same abstract character (introducing new-line characters
> for end-of-line indicators if necessary).
> An implementation may use any internal encoding able to uniquely represent
> any Unicode codepoint.
> *If an abstract character in the source file is not representable in the
> Unicode character set, the program is ill-formed.*
>
I'm not sure where we are expecting this diagnostic to come into play. If a
vendor is dealing with an encoding that has such characters and it is both
the source and assumed execution character set, then I doubt they are
interested in telling their users that their strings have been outlawed by
the committee.
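
As a rough illustration of the scenario, assume a made-up single-byte
source encoding in which byte 0x80 is a vendor character with no Unicode
equivalent. The encoding, `to_unicode`, and `phase1_accepts` are all
hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical encoding: bytes 0x00-0x7F match ASCII; every other byte is
// assumed, for this sketch only, to have no Unicode representation.
std::optional<char32_t> to_unicode(std::uint8_t byte) {
    if (byte < 0x80) return static_cast<char32_t>(byte);
    return std::nullopt;
}

// Phase 1 under the draft wording: the program is rejected as soon as one
// source character cannot be mapped to Unicode.
bool phase1_accepts(std::string_view source) {
    for (char c : source) {
        if (!to_unicode(static_cast<std::uint8_t>(c))) return false;
    }
    return true;
}
```

With this rule, `phase1_accepts("a\x80")` is false even though an
implementation whose source and execution character sets are the same
encoding could have passed the byte through to the string literal unchanged.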

> An implementation supports source files representing a sequence of UTF-8
> code units.
> Any additional physical source file character sets accepted are
> implementation-defined.
> How the character set of a source file is determined is
> implementation-defined.
>
> 2. Each implementation-defined line termination sequence of characters is
> replaced by a LINE FEED character (U+000A). Each instance of a BACKSLASH (\)
> immediately followed by a LINE FEED or at the end of a file is deleted,
> splicing physical source lines to form logical source lines. Only the last
> backslash on any physical source line shall be eligible for being part of
> such a splice. Except for splices reverted in a raw string literal, if a
> splice results in a codepoint sequence that matches the syntax of a
> universal-character-name, the behavior is implementation-defined. A source
> file that is not empty and that does not end in a *LINE FEED*, or that ends
> in a LINE FEED immediately preceded by a BACKSLASH before any such splicing
> takes place, shall be processed as if an additional LINE FEED were appended
> to the file.
> Sequences of whitespace codepoints at the end of each line are removed.
> Each universal-character-name is replaced by the Unicode codepoint it
> designates.
>
> Corentin
>
>
>
>>
>> AlisdairM
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>
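
For reference, the splicing step in the quoted item 2 can be sketched as
follows. This is only a sketch under stated assumptions: line terminators
are already normalized to U+000A, the input is UTF-8 (so a byte equal to
backslash or LINE FEED is never part of a multibyte sequence), and the
LINE FEED is removed together with the backslash. It does not append a
trailing LINE FEED, strip trailing whitespace, revert splices in raw string
literals, or replace UCNs. The name `splice_lines` is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Delete each BACKSLASH that is immediately followed by LINE FEED (dropping
// the LINE FEED as well), and a BACKSLASH that is the last byte of the input.
std::string splice_lines(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\') {
            if (i + 1 == src.size()) break;             // backslash at end of file
            if (src[i + 1] == '\n') { ++i; continue; }  // skip backslash and LINE FEED
        }
        out.push_back(src[i]);
    }
    return out;
}
```

Because only a backslash directly before the LINE FEED is examined, a
backslash elsewhere on the line is copied through untouched, which matches
the "only the last backslash on any physical source line" restriction.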

Received on 2020-06-02 20:41:53