SG16

Subject: Re: Is it an error to encounter a character without a valid UCN?
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-06-03 02:19:31


On Wed, Jun 3, 2020, 08:52 Corentin Jabot <corentinjabot_at_[hidden]> wrote:

>
>
> On Wed, Jun 3, 2020, 03:38 Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Tue, Jun 2, 2020 at 7:57 AM Corentin Jabot via SG16 <
>> sg16_at_[hidden]> wrote:
>>
>>>
>>>
>>>
>>> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <
>>> sg16_at_[hidden]> wrote:
>>>
>>>> Translation phase 1 maps source code to either a member of the
>>>> basic character set, or a UCN corresponding to that character.
>>>> What if there is no such UCN? Is that undefined behavior, or is
>>>> the program ill-formed? I can find nothing on this in [lex.phases]
>>>> where we describe processing the source through an implementation
>>>> defined character mapping.
>>>>
>>>> When we get to [lex.charset] we can see it is clearly ill-formed if
>>>> the produced UCN is invalid - is that supposed to be the resolution
>>>> here? Source must always map to a UCN, but the UCN need not
>>>> be valid, so we get an error parsing the (implied) UCN in a later
>>>> phase?
>>>>
>>>
>>> One more reason I want to rewrite phase 1.
>>>
>>> Two things should be specified here:
>>>
>>> > Any source file character not in the basic source character set
>>> <http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced
>>> by the
>>> universal-character-name
>>> <http://eel.is/c++draft/lex#nt:universal-character-name> that
>>> designates that character.
>>> <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>>>
>>>
>>> This is wrong: a character may map to a sequence of UCNs, not only to a
>>> single UCN.
>>>
>>> Characters that do not have a representation in Unicode should make the
>>> program ill-formed, with the caveat that implementers can do _anything_
>>> in phase 0.
>>>
>>> Note that the existence of a mapping is different from the validity of a
>>> UCN.
>>> It is an implementation strategy to map characters without
>>> representation to nothing.
>>> Another valid strategy would be to use the PUA to represent these
>>> characters.
>>>
>> Both of these are harmful to the integrity of user strings unless (for
>> the second) there is a reserved space of codepoints for the implementation
>> to avoid confusion between the implementation's use of a codepoint and the
>> user's use of the same codepoint.
>>
>

The standard doesn't make any such guarantee at the moment.
And it cannot make such a guarantee for non-Unicode source files, as some
transcodings are not reversible.
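
To make the reversibility point concrete, here is a minimal sketch. The
encoding is hypothetical: the byte values 0xA1 and 0xB7 and the choice of
U+2252 are made up for illustration and are not taken from a real mapping
table. Two source bytes decode to the same codepoint, so encoding back can
only ever recover one of them.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical legacy single-byte encoding (assumption, not a real table):
// bytes 0xA1 and 0xB7 are duplicate encodings of the same abstract character.
inline std::optional<char32_t> decode(std::uint8_t b) {
    if (b < 0x80) return static_cast<char32_t>(b);  // ASCII range maps 1:1
    switch (b) {
        case 0xA1:
        case 0xB7: return char32_t{0x2252};         // both map to U+2252
        default:   return std::nullopt;             // no Unicode mapping
    }
}

// The reverse direction has to pick a single byte for U+2252.
inline std::optional<std::uint8_t> encode(char32_t c) {
    if (c < 0x80) return static_cast<std::uint8_t>(c);
    if (c == char32_t{0x2252}) return std::uint8_t{0xA1};  // 0xB7 is lost
    return std::nullopt;
}

// True if decoding and then re-encoding yields the original byte.
inline bool round_trips(std::uint8_t b) {
    const auto c = decode(b);
    if (!c) return false;
    const auto back = encode(*c);
    return back && *back == b;
}
```

With this table, round_trips(0xA1) holds but round_trips(0xB7) does not,
so a string literal containing 0xB7 would not survive a trip through
Unicode unchanged.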

>
>>
>>>
>>>
>>> To give you an idea of where I want to be, here is a very early draft of
>>> what I think phases 1 and 2 should do, pending
>>> a couple of design changes that EWG would have to look at.
>>>
>>> 1. If the physical source character set is the Unicode character set,
>>> each code point in the source file is converted to the internal
>>> representation of that same code point. Codepoints that are surrogate
>>> codepoints or invalid codepoints are ill-formed.
>>> Otherwise, each abstract character in the source file is mapped in an
>>> implementation-defined manner to a sequence of Unicode codepoints
>>> representing the same abstract character (introducing new-line
>>> characters for end-of-line indicators if necessary).
>>> An implementation may use any internal encoding able to uniquely
>>> represent any Unicode codepoint.
>>> *If an abstract character in the source file is not representable in
>>> the Unicode character set, the program is ill-formed.*
>>>
>> I'm not sure where we are expecting this diagnostic to come into play. If
>> a vendor is dealing with an encoding that has such characters and it is
>> both the source and assumed execution character set, then I doubt they are
>> interested in telling their users that their strings have been outlawed by
>> the committee.
>>
>
> The scenario here would be a file encoded in Big5, with the execution
> character set also Big5, for the few characters that do not have a
> representation in Unicode.
>
> An even less realistic scenario would be a piece of paper with a Klingon
> symbol.
>

Note that this doesn't change existing behavior:

Any source file character not in the basic source character set
<http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced by
the universal-character-name
<http://eel.is/c++draft/lex#nt:universal-character-name> that designates
that character

This wording assumes a mapping always exists; that is not the case in a
limited number of scenarios.
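
As a rough sketch of what the drafted phase 1 amounts to for a non-Unicode
source file: each source character maps to a sequence of codepoints, and a
character with no mapping is diagnosed. Everything named here
(map_to_unicode, phase1, the 16-bit source character type, and the table
entries) is an assumption for illustration, not wording or an existing
implementation; the exception stands in for "the program is ill-formed".

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Codepoints accepted from a Unicode source file under the draft:
// within range and not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Hypothetical implementation-defined mapping for one legacy character set.
// The entries are illustrative only.
inline std::optional<std::u32string> map_to_unicode(std::uint16_t source_char) {
    if (source_char < 0x80)  // ASCII subset maps to itself
        return std::u32string(1, static_cast<char32_t>(source_char));
    if (source_char == 0x8862)  // one character, two codepoints
        return std::u32string(U"\u00CA\u0304");
    return std::nullopt;  // no representation in Unicode
}

// Sketch of phase 1 for that character set.
inline std::u32string phase1(const std::vector<std::uint16_t>& source) {
    std::u32string out;
    for (const std::uint16_t ch : source) {
        const auto mapped = map_to_unicode(ch);
        if (!mapped)
            throw std::runtime_error(
                "ill-formed: source character not representable in Unicode");
        out += *mapped;
    }
    return out;
}
```

The sequence-valued return type is the point of the first remark: a
mapping to a single UCN per character cannot express the second table
entry.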

>
>>
>>> An implementation supports source files representing a sequence of UTF-8
>>> code units.
>>> Any additional physical source file character sets accepted are
>>> implementation-defined.
>>> How the character set of a source file is determined is
>>> implementation-defined.
>>>
>>> 2. Each implementation-defined line termination sequence of characters
>>> is replaced by a LINE FEED character (U+000A). Each instance of a
>>> BACKSLASH (\) immediately followed by a LINE FEED or at the end of a
>>> file is deleted, splicing physical source lines to form logical source
>>> lines. Only the last backslash on any physical source line shall be
>>> eligible for being part of such a splice. Except for splices reverted
>>> in a raw string literal, if a splice results in a codepoint sequence
>>> that matches the syntax of a universal-character-name, the behavior is
>>> implementation-defined. A source file that is not empty and that does
>>> not end in a *LINE FEED*, or that ends in a LINE FEED immediately
>>> preceded by a BACKSLASH before any such splicing takes place, shall be
>>> processed as if an additional LINE FEED were appended to the file.
>>> Sequences of whitespace codepoints at the end of each line are removed.
>>> Each universal-character-name is replaced by the Unicode codepoint it
>>> designates.
>>>
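
A minimal sketch of the splicing part of the drafted phase 2, assuming
phase 1 has already produced codepoints and that line terminators are
already LINE FEED. The function name splice_lines is made up, and the
sketch deliberately leaves out trailing-whitespace removal, raw string
literals, and UCN replacement.

```cpp
#include <cstddef>
#include <string>

// Splices physical lines into logical lines on a codepoint sequence.
inline std::u32string splice_lines(std::u32string src) {
    // A non-empty file that does not end in LINE FEED, or that ends in
    // BACKSLASH LINE FEED, is treated as if one more LINE FEED followed.
    if (!src.empty()) {
        const bool ends_lf = src.back() == U'\n';
        const bool ends_bs_lf =
            ends_lf && src.size() >= 2 && src[src.size() - 2] == U'\\';
        if (!ends_lf || ends_bs_lf) src.push_back(U'\n');
    }

    // Delete every BACKSLASH that is immediately followed by LINE FEED,
    // together with that LINE FEED.
    std::u32string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == U'\\' && i + 1 < src.size() && src[i + 1] == U'\n') {
            ++i;  // skip the LINE FEED as well
            continue;
        }
        out.push_back(src[i]);
    }
    return out;
}
```

For example, the two physical lines "ab\" and "cd" become the single
logical line "abcd", and a file containing only "ab" gains a final LINE
FEED.
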
>>> Corentin
>>>
>>>
>>>
>>>>
>>>> AlisdairM
>>>> --
>>>> SG16 mailing list
>>>> SG16_at_[hidden]
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>
>>>
>>



SG16 list run by sg16-owner@lists.isocpp.org