Re: [SG16] Is it an error to encounter a character without a valid UCN?

From: Tom Honermann <tom_at_[hidden]>
Date: Wed, 3 Jun 2020 14:59:51 -0400
On 6/3/20 2:45 AM, Corentin Jabot via SG16 wrote:
>
>
> On Wed, Jun 3, 2020, 05:19 Tom Honermann <tom_at_[hidden]
> <mailto:tom_at_[hidden]>> wrote:
>
> On 6/2/20 7:57 AM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Tue, Jun 2, 2020, 13:34 Alisdair Meredith via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>
>> Translation phase 1 maps source code to either a member of the
>> basic character set, or a UCN corresponding to that character.
>> What if there is no such UCN? Is that undefined behavior, or is
>> the program ill-formed? I can find nothing on this in
>> [lex.phases]
>> where we describe processing the source through an
>> implementation-defined character mapping.
>>
>> When we get to [lex.charset] we can see it is clearly
>> ill-formed if
>> the produced UCN is invalid - is that supposed to be the
>> resolution
>> here? Source must always map to a UCN, but the UCN need not
>> be valid, so we get an error parsing the (implied) UCN in a later
>> phase?
>>
>>
>> One more reason I want to rewrite phase 1.
>>
>> 2 things should be specified here:
>>
>> > Any source file character not in the basic source character set
>> <http://eel.is/c++draft/lex#def:basic_source_character_set> is
>> replaced by the
>> universal-character-name
>> <http://eel.is/c++draft/lex#nt:universal-character-name> that
>> designates that character.
>> <http://eel.is/c++draft/lex#phases-1.1.sentence-3>
>>
>> This is wrong: characters may map to sequences of UCNs, not just single UCNs.
> Good point. This can happen with a few legacy encodings. For
> example, Big5-HKSCS includes a few characters that map to a Unicode
> code point pair.
>>
>> Characters that do not have representation in Unicode should
>> be ill-formed - with the caveat that implementers can do
>> _anything_ in phase 0
>
> Phase 0?
>
>
> Sorry, let me clarify. Whatever we specify, we can't prevent
> implementers from performing transformations before phase 1, so
> phase 1 is mostly guidance.
> I don't think there is a way around that, nor should there be.
>
> Since source file encoding and phase 1 translation are
> implementation-defined, I don't think the standard can really say
> much about this scenario. But it seems pretty clear that, post
> phase 1, the standard specifies no facilities to handle a
> non-Unicode character.
>
>>
>> Note that the existence of a mapping is different from the
>> validity of a UCN.
>> Mapping characters that have no Unicode representation to nothing
>> is one implementation strategy.
>> Another valid strategy would be to use the PUA to represent these
>> characters.
>>
>>
>> To give you an idea of where I want to be, here is a very early
>> draft of what I think phases 1 and 2 should do, pending
>> a couple of design changes that EWG would have to look at:
>>
>> 1. If the physical source character set is the Unicode character
>> set, each code point in the source file is converted to the
>> internal representation of that same code point. Codepoints that
>> are surrogate codepoints or invalid codepoints are ill-formed.
>> Otherwise, each abstract character in the source file is mapped
>> in an implementation-defined manner to a sequence of Unicode
>> codepoints representing the same abstract character (introducing
>> new-line characters for end-of-line indicators if necessary).
>> An implementation may use any internal encoding able to represent
>> uniquely any Unicode codepoint. *If an abstract character in the
>> source file is not representable in the Unicode character set,
>> the program is ill-formed.*
>> An implementation supports source files representing a sequence
>> of UTF-8 code units.
>> Any additional physical source file character sets accepted are
>> implementation-defined.
>> How the character set of a source file is determined is
>> implementation-defined.
>>
>> 2. Each implementation-defined line termination sequence of
>> characters is replaced by a LINE FEED character (U+000A). Each
>> instance of a BACKSLASH (\) immediately followed by a LINE FEED or
>> at the end of a file is deleted, splicing physical source lines to
>> form logical source lines. Only the last backslash on any physical
>> source line shall be eligible for being part of such a splice.
>> Except for splices reverted in a raw string literal, if a splice
>> results in a codepoint sequence that matches the syntax of a
>> universal-character-name, the behavior is implementation-defined.
>> A source file that is not empty and that does not end in a LINE
>> FEED, or that ends in a LINE FEED immediately preceded by a
>> BACKSLASH before any such splicing takes place, shall be processed
>> as if an additional LINE FEED were appended to the file.
>> Sequences of whitespace codepoints at the end of each line are
>> removed.
>> Each universal-character-name is replaced by the Unicode
>> codepoint it designates.
> The sentence regarding the removal of trailing whitespace doesn't
> specify whether the removal occurs for physical or logical lines
> (before or after splicing).
>
> Wording concerns aside, I think it would be helpful to list the
> intended behavioral changes.
>
>
> Here: removal of trailing whitespace is mandated, and UTF-8 support
> is mandated.

Ok, so intentionally adopting the current gcc and Clang behavior of
removing trailing spaces on physical lines before evaluation of whether
to splice?
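
To make the question concrete, here is a minimal sketch of the two orderings: stripping trailing whitespace from each physical line before testing for a splice, versus testing first. The function name splice_lines and its strip_before_splice flag are hypothetical, invented for illustration only. The sketch treats only space and tab as whitespace, assumes line endings are already normalized to LINE FEED, and ignores raw string literals. It is not taken from any compiler or from the draft wording.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from any compiler): joins physical lines into
// logical lines. When strip_before_splice is true, trailing spaces and tabs
// are removed from each physical line BEFORE checking for a final backslash.
std::string splice_lines(std::string_view src, bool strip_before_splice) {
    std::string out;
    std::size_t pos = 0;
    while (pos < src.size()) {
        // Extract one physical line (without its LINE FEED).
        std::size_t nl = src.find('\n', pos);
        bool has_newline = (nl != std::string_view::npos);
        std::string_view line =
            src.substr(pos, has_newline ? nl - pos : std::string_view::npos);
        pos = has_newline ? nl + 1 : src.size();

        if (strip_before_splice) {
            std::size_t last = line.find_last_not_of(" \t");
            line = (last == std::string_view::npos)
                       ? std::string_view{}
                       : line.substr(0, last + 1);
        }

        if (!line.empty() && line.back() == '\\') {
            // Splice: drop the backslash and the line break.
            line.remove_suffix(1);
            out.append(line);
        } else {
            // No splice: keep the line and terminate it with LINE FEED.
            out.append(line);
            out.push_back('\n');
        }
    }
    return out;
}
```

With the input "a \ " followed by a line break and then "b", stripping first yields the single logical line "a b", whereas testing first leaves two lines because the backslash is not the last character on the physical line.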

Tom.

> Tom.
>
>> Corentin
>>
>>
>> AlisdairM
>> --
>> SG16 mailing list
>> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>
>


Received on 2020-06-03 14:03:00