SG16

Subject: Re: Is it an error to encounter a character without a valid UCN?
From: Steve Downey (sdowney_at_[hidden])
Date: 2020-06-04 10:15:16


While I think it's quite possible for an implementation to provide custom
mappings in phase 1, enabling the processing of Klingon strings in some
hypothetical non-Unicode encoding of Klingon, I believe that if a mapping
cannot be made, then yes it's an error, as there is no way for the rest of
lexing or translation to proceed. The current standard specifies an
otherwise non-standard Unicode Transformation Format as the internal as-if
format. Failing to map from a UCN (aka code point) into the extended
execution character set is not an error, and is already handled in the
standard.
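
A minimal sketch of the "no mapping means error" position, assuming a made-up single-byte source encoding. The names `map_source_byte`, `phase1` and `is_scalar_value` are hypothetical, and the mapping table is invented for illustration: ASCII bytes map to themselves, byte 0x80 stands in for a custom character mapped into the Private Use Area, and every other byte has no mapping at all.

```cpp
#include <optional>
#include <string>

// True for code points that are in range and are not surrogates.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical implementation-defined mapping for an invented encoding.
std::optional<char32_t> map_source_byte(unsigned char b) {
    if (b < 0x80) return static_cast<char32_t>(b);  // ASCII maps to itself
    if (b == 0x80) return char32_t{0xF8D0};         // custom mapping into the PUA
    return std::nullopt;                            // no mapping exists
}

// Returns the mapped code points, or std::nullopt where the sketch would
// diagnose an error because some source character cannot be mapped.
std::optional<std::u32string> phase1(const std::string& source) {
    std::u32string out;
    for (unsigned char b : source) {
        const std::optional<char32_t> cp = map_source_byte(b);
        if (!cp || !is_scalar_value(*cp)) return std::nullopt;
        out.push_back(*cp);
    }
    return out;
}
```

The point of returning `std::nullopt` rather than dropping the byte is that later phases never see a character that silently vanished or changed identity.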

On Thu, Jun 4, 2020 at 4:04 AM Corentin Jabot via SG16 <
sg16_at_[hidden]> wrote:

>
>
> On Thu, 4 Jun 2020 at 01:25, Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> wrote:
>
>> On Wed, Jun 3, 2020 at 3:19 AM Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jun 3, 2020, 08:52 Corentin Jabot <corentinjabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 3, 2020, 03:38 Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> On Tue, Jun 2, 2020 at 7:57 AM Corentin Jabot via SG16 <
>>>>> sg16_at_[hidden]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16 <
>>>>>> sg16_at_[hidden]> wrote:
>>>>>>
>>>>>>> Translation phase 1 maps source code to either a member of the
>>>>>>> basic character set, or a UCN corresponding to that character.
>>>>>>> What if there is no such UCN? Is that undefined behavior, or is
>>>>>>> the program ill-formed? I can find nothing on this in [lex.phases]
>>>>>>> where we describe processing the source through an implementation-
>>>>>>> defined character mapping.
>>>>>>>
>>>>>>> When we get to [lex.charset] we can see it is clearly ill-formed if
>>>>>>> the produced UCN is invalid - is that supposed to be the resolution
>>>>>>> here? Source must always map to a UCN, but the UCN need not
>>>>>>> be valid, so we get an error parsing the (implied) UCN in a later
>>>>>>> phase?
>>>>>>>
>>>>>>
>>>>>> One more reason I want to rewrite phase 1.
>>>>>>
>>>>>> 2 things should be specified here:
>>>>>>
>>>>>> > Any source file character not in the basic source character set
>>>>>> <http://eel.is/c++draft/lex#def:basic_source_character_set> is
>>>>>> replaced by the
>>>>>> universal-character-name
>>>>>> <http://eel.is/c++draft/lex#nt:universal-character-name> that
>>>>>> designates that character.
>>>>>> <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>>>>>>
>>>>>>
>>>>>> This is wrong, characters may map to UCN sequences, not single UCNs.
>>>>>>
>>>>>> Characters that do not have representation in Unicode should
>>>>>> be ill-formed - with the caveat that implementers can do _anything_ in
>>>>>> phase 0
>>>>>>
>>>>>> Note that the existence of a mapping is different from the validity
>>>>>> of a UCN
>>>>>> It is an implementation strategy to map characters without
>>>>>> representation to nothing.
>>>>>> Other valid strategies would be to use the PUA to represent these
>>>>>> characters
>>>>>>
>>>>> Both of these are harmful to the integrity of user strings unless
>>>>> (for the second) there is a reserved space of codepoints for the
>>>>> implementation to avoid confusion between the implementation's use of a
>>>>> codepoint and the user's use of the same codepoint.
>>>>>
>>>>
>>>
>>> The standard doesn't make any such guarantee at the moment.
>>> And it cannot make such a guarantee for non-Unicode source files, as some
>>> transcodings are not reversible.
>>>
>> Well, just because it currently doesn't does not mean it can't.
>>
>>
>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> To give you an idea of where I want to be, here is a very early draft
>>>>>> of what I think phase 1 and 2 should do, pending
>>>>>> a couple of design changes that EWG would have to look at
>>>>>>
>>>>>> 1. If the physical source character set is the Unicode character set,
>>>>>> each code point in the source
>>>>>> file is converted to the internal representation of that same code
>>>>>> point. Codepoints that
>>>>>> are surrogate codepoints or invalid codepoints are ill-formed.
>>>>>> Otherwise, each abstract character in the source file is mapped in an
>>>>>> implementation-
>>>>>> defined manner to a sequence of Unicode codepoints representing the
>>>>>> same abstract
>>>>>> character. (introducing new-line characters for end-of-line
>>>>>> indicators if necessary).
>>>>>> An implementation may use any internal encoding able to represent
>>>>>> uniquely any Unicode codepoint.
>>>>>> *If an abstract character in the source file is not representable in
>>>>>> the Unicode character set, the program is ill-formed.*
>>>>>>
>>>>> I'm not sure where we are expecting this diagnostic to come into play.
>>>>> If a vendor is dealing with an encoding that has such characters and it is
>>>>> both the source and assumed execution character set, then I doubt they are
>>>>> interested in telling their users that their strings have been outlawed by
>>>>> the committee.
>>>>>
>>>>
>>>> The scenario here would be a file encoded in Big5, with the execution
>>>> also in Big5 for the few characters that do not have representation in
>>>> Unicode.
>>>>
>>>> An even less realistic scenario would be a piece of paper with a
>>>> Klingon symbol.
>>>>
>> I am fairly certain that EBCDIC control characters do appear in strings
>> in some user programs. Some of these control characters have no semantic
>> equivalent in Unicode. The existence of mappings does not mean that they
>> retain the semantic value of the abstract characters. For a pair of file
>> encodings, there may be different mappings for different purposes.
>>
>
> A mapping is specified for these control characters by IBM as part of
> UTF-EBCDIC, and it is true that the semantics might not be preserved here. But
> then again, controls are not "abstract characters".
>
>>
>> I'm willing to buy that phase 1 translation can, as a fiction,
>> contextually replace such characters with numeric escape sequences or
>> stronger transformations to retain the best mapping of source character to
>> the presumed execution (narrow or wide) character set.
>>
>>
>>>
>>>
>>> Note that this doesn't change existing behavior:
>>>
>>> Any source file character not in the basic source character set
>>> <http://eel.is/c++draft/lex#def:basic_source_character_set> is replaced
>>> by the universal-character-name
>>> <http://eel.is/c++draft/lex#nt:universal-character-name> that
>>> designates that character
>>>
>>> This wording assumes a mapping always exists - that is not the case in
>>> a limited set of scenarios
>>>
>>>>
>>>>>
>>>>>> An implementation supports source files representing a sequence of
>>>>>> UTF-8 code units.
>>>>>> Any additional physical source file character sets accepted are
>>>>>> implementation-defined.
>>>>>> How the character set of a source file is determined is
>>>>>> implementation-defined.
>>>>>>
>>>>>> 2. Each implementation-defined line termination sequence of
>>>>>> characters is replaced by a
>>>>>> LINE FEED character (U+000A). Each instance of a BACKSLASH (\)
>>>>>> immediately
>>>>>> followed by a LINE FEED or at the end of a file is deleted, splicing
>>>>>> physical source
>>>>>> lines to form logical source lines. Only the last backslash on any
>>>>>> physical source line shall
>>>>>> be eligible for being part of such a splice. Except for splices
>>>>>> reverted in a raw string literal,
>>>>>> if a splice results in a codepoint sequence that matches the syntax
>>>>>> of a universal-character-
>>>>>> name, the behavior is implementation-defined. A source file that is
>>>>>> not empty and that does not end
>>>>>> in a *LINE FEED*, or that ends in a LINE FEED immediately preceded
>>>>>> by a BACKSLASH before any such splicing takes place, shall be
>>>>>> processed as if an
>>>>>> additional LINE FEED were appended to the file.
>>>>>> Sequences of whitespace codepoints at the end of each line are
>>>>>> removed.
>>>>>> Each universal-character-name is replaced by the Unicode codepoint it
>>>>>> designates.
>>>>>>
>>>>>> Corentin
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> AlisdairM
>>>>>>> --
>>>>>>> SG16 mailing list
>>>>>>> SG16_at_[hidden]
>>>>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>>>>>
>
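
The line-splicing part of the draft phase 2 wording quoted above can be sketched roughly as follows. This is one possible reading, not proposed wording: the function name `splice_lines` is hypothetical, line terminators are assumed to have already been replaced by U+000A, and the sketch deliberately ignores raw string literals, trailing-whitespace removal, and splices that form a universal-character-name.

```cpp
#include <cstddef>
#include <string>

// Deletes each BACKSLASH that is immediately followed by LINE FEED (together
// with that LINE FEED) or that is the last codepoint of the file, then makes
// sure a non-empty file ends in LINE FEED.
std::u32string splice_lines(const std::u32string& src) {
    std::u32string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == U'\\') {
            if (i + 1 == src.size()) break;  // backslash at end of file: deleted
            if (src[i + 1] == U'\n') {       // backslash + LINE FEED: spliced
                ++i;
                continue;
            }
        }
        out.push_back(src[i]);
    }
    // Treat a non-empty file as if a LINE FEED had been appended when the
    // spliced result would otherwise not end in one.
    if (!src.empty() && (out.empty() || out.back() != U'\n')) {
        out.push_back(U'\n');
    }
    return out;
}
```

Because the loop only deletes a backslash when the very next codepoint is a LINE FEED, only the last backslash on a physical line can take part in a splice, which matches the eligibility rule in the draft.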


