Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Wed, 3 Jun 2020 08:52:35 +0200
On Wed, Jun 3, 2020, 03:38 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
wrote:

> On Tue, Jun 2, 2020 at 7:57 AM Corentin Jabot via SG16 <
> sg16_at_[hidden]> wrote:
>
>>
>>
>>
>> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>> Translation phase 1 maps source code to either a member of the
>>> basic character set, or a UCN corresponding to that character.
>>> What if there is no such UCN? Is that undefined behavior, or is
>>> the program ill-formed? I can find nothing on this in [lex.phases]
>>> where we describe processing the source through an implementation
>>> defined character mapping.
>>>
>>> When we get to [lex.charset] we can see it is clearly ill-formed if
>>> the produced UCN is invalid - is that supposed to be the resolution
>>> here? Source must always map to a UCN, but the UCN need not
>>> be valid, so we get an error parsing the (implied) UCN in a later
>>> phase?
>>>
>>
>> One more reason I want to rewrite phase 1.
>>
>> 2 things should be specified here:
>>
>> > Any source file character not in the basic source character set
>> <http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced
>> by the
>> universal-character-name
>> <http://eel.is/c++draft/lex#nt:universal-character-name> that designates
>> that character. <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>>
>>
>> This is wrong: characters may map to UCN sequences, not single UCNs.
>>
>> Characters that do not have a representation in Unicode should make the
>> program ill-formed, with the caveat that implementers can do _anything_ in
>> phase 0.
>>
>> Note that the existence of a mapping is different from the validity of a
>> UCN.
>> It is an implementation strategy to map characters without representation
>> to nothing.
>> Another valid strategy would be to use the Private Use Area (PUA) to
>> represent these characters.
>>
> Both of these are harmful to the integrity of user strings unless (for
> the second) there is a reserved space of codepoints for the implementation
> to avoid confusion between the implementation's use of a codepoint and the
> user's use of the same codepoint.
>

>
>>
>>
>> To give you an idea of where I want to be, here is a very early draft of
>> what I think phases 1 and 2 should do, pending
>> a couple of design changes that EWG would have to look at.
>>
>> 1. If the physical source character set is the Unicode character set, each
>> code point in the source
>> file is converted to the internal representation of that same code point.
>> Codepoints that
>> are surrogate codepoints or invalid codepoints are ill-formed.
>> Otherwise, each abstract character in the source file is mapped in an
>> implementation-defined manner to a sequence of Unicode codepoints
>> representing the same abstract character (introducing new-line
>> characters for end-of-line indicators if necessary).
>> An implementation may use any internal encoding able to uniquely
>> represent any Unicode codepoint.
>> *If an abstract character in the source file is not representable in
>> the Unicode character set, the program is ill-formed.*
>>
> I'm not sure where we are expecting this diagnostic to come into play. If
> a vendor is dealing with an encoding that has such characters and it is
> both the source and assumed execution character set, then I doubt they are
> interested in telling their users that their strings have been outlawed by
> the committee.
>

The scenario here would be a file encoded in Big5, with the execution
character set also Big5, and it would only concern the few characters that
do not have a representation in Unicode.

An even less realistic scenario would be a piece of paper with a Klingon
symbol.
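
For what it is worth, the first rule of the phase 1 draft quoted above
(surrogate or invalid codepoints make the program ill-formed) can be
sketched as below. This is only an illustration, assuming C++17 and a source
file that has already been decoded into char32_t values. The names
is_scalar_value and first_ill_formed are hypothetical; they are neither
proposed wording nor any implementation's API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: true if cp is a Unicode scalar value, i.e. it lies
// within U+0000..U+10FFFF and is not a surrogate (U+D800..U+DFFF).
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical phase 1 check: returns the index of the first codepoint that
// would make the program ill-formed under the draft, or std::nullopt if
// every codepoint is acceptable.
std::optional<std::size_t> first_ill_formed(std::u32string_view src) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (!is_scalar_value(src[i])) {
            return i;
        }
    }
    return std::nullopt;
}
```

A front end following the draft would issue a diagnostic at the reported
index. The non-Unicode case (Big5 and friends) is not covered here, since it
depends on an implementation-defined mapping table.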

>
>
>> An implementation supports source files representing a sequence of UTF-8
>> code units.
>> Any additional physical source file character sets accepted are
>> implementation-defined.
>> How the character set of a source file is determined is
>> implementation-defined.
>>
>> 2. Each implementation-defined line termination sequence of characters is
>> replaced by a
>> LINE FEED character (U+000A). Each instance of a BACKSLASH (\) immediately
>> followed by a LINE FEED or at the end of a file is deleted, splicing
>> physical source
>> lines to form logical source lines. Only the last backslash on any
>> physical source line shall
>> be eligible for being part of such a splice. Except for splices reverted
>> in a raw string literal,
>> if a splice results in a codepoint sequence that matches the syntax of a
>> universal-character-
>> name, the behavior is implementation-defined. A source file that is not
>> empty and that does not end
>> in a *LINE FEED*, or that ends in a LINE FEED immediately preceded by a
>> BACKSLASH before any such splicing takes place, shall be processed as if
>> an
>> additional LINE FEED were appended to the file.
>> Sequences of whitespace codepoints at the end of each line are removed.
>> Each universal-character-name is replaced by the Unicode codepoint it
>> designates.
>>
>> Corentin
>>
>>
>>
>>>
>>> AlisdairM
>>> --
>>> SG16 mailing list
>>> SG16_at_[hidden]
>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>
>
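
The splicing part of the phase 2 draft quoted above can be sketched roughly
as follows. This is a minimal illustration under several assumptions: line
terminators have already been normalized to LINE FEED, the text is held in a
std::string, and the additional LINE FEED is appended before any splice is
applied (the draft does not spell out that ordering). Trailing whitespace
removal and universal-character-name replacement are left out. The name
splice_lines is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical phase 2 sketch: deletes each BACKSLASH immediately followed
// by LINE FEED, after appending a LINE FEED where the draft requires one.
std::string splice_lines(std::string_view src) {
    std::string text(src);

    // A non-empty file that does not end in LINE FEED, or that ends in
    // BACKSLASH LINE FEED before splicing, gets one more LINE FEED.
    const bool ends_in_lf = !text.empty() && text.back() == '\n';
    const bool ends_in_splice =
        text.size() >= 2 && ends_in_lf && text[text.size() - 2] == '\\';
    if (!text.empty() && (!ends_in_lf || ends_in_splice)) {
        text.push_back('\n');
    }

    // Under the ordering assumed here, a BACKSLASH at the very end of the
    // file is now followed by the appended LINE FEED, so it is handled by
    // the same rule as every other splice.
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\\' && i + 1 < text.size() && text[i + 1] == '\n') {
            ++i;  // skip the LINE FEED as well, joining the two lines
            continue;
        }
        out.push_back(text[i]);
    }
    return out;
}
```

For example, a file containing the characters a, BACKSLASH, LINE FEED, b,
LINE FEED comes out as a single logical line "ab" followed by LINE FEED.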

Received on 2020-06-03 01:55:54