sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 00:12:15 -0400

On 7/1/20 1:45 PM, Jens Maurer wrote:
> On 01/07/2020 14.19, Corentin wrote:
>>
>> On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>>
>> On 01/07/2020 10.23, Corentin wrote:
>> >
>> >
>> > On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
>> >
>> > On 01/07/2020 09.44, Corentin wrote:
>> > >
>> > >
>> > > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core_at_[hidden]> wrote:
>> >
>> > > We should be clear in the text whether an implementation is allowed to encode
>> > > a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>> > > each character is encoded separately. There was concern that "separately"
>> > > doesn't address stateful encodings, where the encoding of string character
>> > > i+1 may depend on what string character i was.
>> > >
>> > >
>> > > We should be careful not to change the behavior here.
>> > > Encoding sequences as a whole allows an implementation to encode <latin small letter e, combining acute accent> as <latin small letter e with acute>
>> >
>> > Agreed. We should probably prohibit doing that for UTF-x literals,
>> > but I'm not seeing a behavior change for ordinary and wide string
>> > literals.
>> >
>> > > Which is not the current behavior described by the standard.
>> >
>> > Could you point me to the specific place where the standard
>> > doesn't allow that, currently?
>> >
>> > [lex.string] p10
>> > "it is initialized with the given characters."
>> >
>> > for example doesn't speak to the question, in my view.
>> >
>> >
>> > My reading of the description of the size of the string http://eel.is/c++draft/lex.string#1
>>
>> A strict reading of [lex.string] p13 doesn't convey that, though:
>>
>> "The size of a narrow string literal is the total number of
>> escape sequences and other characters, plus at least one for
>> the multibyte encoding of each universal-character-name,
>> plus one for the terminating '\0'."
>>
>> Let's start with our example string:
>> "latin small letter e, combining acute accent"
>>
>> "combining acute accent" is not in the basic source character set, so it's
>> represented as a universal-character-name.
>> According to the formula, we have 1 for "e", at least 1 for universal-character-name,
>> 1 for the terminating '\0' -> at least 3.
>> Encoding this as "<latin small letter e with acute>" yields
>> 2 bytes of UTF-8 encoding plus 1 for the terminating '\0' -> 3
>> So, the requirement "at least 3" is satisfied.
>>
>>
>> If the execution encoding is UTF-8, that happens to be true. What about ISO 8859-1?
>> 1 byte for é, 1 for the terminating '\0' -> 2, which is less than 3.
> Right, so the transformation is allowed for UTF-8 execution encoding of
> ordinary string literals, but not for ISO 8859-1 execution encoding.

I think phase 1 leniency makes the above argument moot. An
implementation can treat denormalized input as though it were normalized
in phase 1. In other words, an implementation can adjust the input in
phase 1 to get whatever output it desires in phase 5.

Corentin, I suspect your objections to converting groups of
characters at a time arise because you don't want implementations to
implicitly normalize Unicode input. If that is correct, then I agree
with that goal, but I do not want to address it as part of this paper;
we can address that in future papers intended to address portability of
UTF-8 source files. If that is not correct, then I'd like to better
understand exactly what your concern is.

>
> That seems schizophrenic; the standard should have a view here that
> does not depend on the particular encoding chosen.

I agree; the encoding (however the implementation defines it) should
determine the result. This is why the proposed changes remove all of
that wording about the resulting size of the string literal.
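
As a rough illustration of the size following from the encoding, the
sketch below (assuming C++17 for the message-less static_assert)
compares an ordinary literal with a UTF-8 literal for the same
character. The names are hypothetical. The ordinary literal's size
depends on the implementation's execution encoding, so only a lower
bound is checked here.

```cpp
#include <cstddef>

// Ordinary string literal: encoded in the implementation's execution
// encoding. With a UTF-8 execution encoding this is 3; with
// ISO 8859-1 it would be 2.
constexpr std::size_t ordinary_e_acute_size = sizeof("\u00E9");

// UTF-8 string literal: always two code units plus the terminator.
constexpr std::size_t utf8_e_acute_size = sizeof(u8"\u00E9");

static_assert(utf8_e_acute_size == 3);
// At least one code unit for the character plus the terminator.
static_assert(ordinary_e_acute_size >= 2);
```

With GCC, for example, the execution encoding of ordinary literals can
be selected with -fexec-charset, which is one way to observe the two
different sizes.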

Tom.

Received on 2020-07-01 23:15:34