On Thu, Jul 2, 2020, 06:12 Tom Honermann <tom@honermann.net> wrote:
On 7/1/20 1:45 PM, Jens Maurer wrote:
> On 01/07/2020 14.19, Corentin wrote:
>>
>>      On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer@gmx.net> wrote:
>>
>>      On 01/07/2020 10.23, Corentin wrote:
>>      >
>>      >
>>      > On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer@gmx.net> wrote:
>>      >
>>      >     On 01/07/2020 09.44, Corentin wrote:
>>      >     >
>>      >     >
>>      >     > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core@lists.isocpp.org> wrote:
>>      >
>>      >     >     We should be clear in the text whether an implementation is allowed to encode
>>      >     >     a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>>      >     >     each character is encoded separately.  There was concern that "separately"
>>      >     >     doesn't address stateful encodings, where the encoding of string character
>>      >     >     i+1 may depend on what string character i was.
>>      >     >
>>      >     >
>>      >     > We should be careful not to change the behavior here.
>>      >     > Encoding sequences as a whole allows an implementation to encode <latin small letter e, combining acute accent> as <latin small letter e with acute>
>>      >
>>      >     Agreed.  We should probably prohibit doing that for UTF-x literals,
>>      >     but I'm not seeing a behavior change for ordinary and wide string
>>      >     literals.
>>      >
>>      >     > Which is not the current behavior described by the standard.
>>      >
>>      >     Could you point me to the specific place where the standard
>>      >     doesn't allow that, currently?
>>      >
>>      >     [lex.string] p10
>>      >     "it is initialized with the given characters."
>>      >
>>      >     for example doesn't speak to the question, in my view.
>>      >
>>      >
>>      > My reading is based on the description of the size of the string: http://eel.is/c++draft/lex.string#1
>>
>>      A strict reading of [lex.string] p13 doesn't convey that, though:
>>
>>      "The size of a narrow string literal is the total number of
>>      escape sequences and other characters, plus at least one for
>>      the multibyte encoding of each universal-character-name,
>>      plus one for the terminating '\0'."
>>
>>      Let's start with our example string:
>>      "latin small letter e, combining acute accent"
>>
>>      "combining acute accent" is not in the basic source character set, so it's
>>      represented as a universal-character-name.
>>      According to the formula, we have 1 for "e", at least 1 for universal-character-name,
>>      1 for the terminating '\0' -> at least 3.
>>      Encoding this as "<latin small letter e with acute>" yields
>>      2 bytes of UTF-8 encoding plus 1 for the terminating '\0'  -> 3
>>      So, the requirement "at least 3" is satisfied.
>>
>>
>> If the execution encoding is UTF-8, that happens to be true. What about ISO 8859-1?
>> 1 byte for é, 1 for \0 -> 2, which violates "at least 3".
> Right, so the transformation is allowed for UTF-8 execution encoding of
> ordinary string literals, but not for ISO 8859-1 execution encoding.
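
For concreteness, here is the arithmetic above as a code sketch. It
assumes the implementation normalizes <e, combining acute accent> to
<e with acute> while encoding; that assumed transformation is exactly
the one under discussion:

    const char s[] = "e\u0301";  // one s-char plus one universal-character-name
    // [lex.string] p13 lower bound: 1 ("e") + at least 1 (the UCN)
    //                               + 1 ('\0')  =  at least 3
    // UTF-8 execution encoding, normalized to U+00E9 (C3 A9):
    //     2 code units + '\0'  ->  sizeof(s) == 3   ("at least 3" holds)
    // ISO 8859-1 execution encoding, normalized to 0xE9:
    //     1 code unit  + '\0'  ->  sizeof(s) == 2   ("at least 3" violated)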

I think phase 1 leniency makes the above argumentation moot.  An
implementation can treat denormalized input as though it were normalized
in phase 1.  In other words, an implementation can adjust the input in
phase 1 to get the output it desires in phase 5 regardless.
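
A sketch of that phase 1 reading (whether an implementation actually
normalizes in phase 1 is implementation-defined; the behaviors shown
are possibilities, not requirements):

    // Physical source file contains the raw bytes 65 CC 81
    // ("e" followed by U+0301 COMBINING ACUTE ACCENT), not an escape:
    const char s[] = "é";
    // Phase 1 (implementation-defined mapping) may keep the input as-is
    // or treat it as the normalized U+00E9.  Phase 5 then encodes
    // whatever phase 1 produced, so with UTF-8 the result may be
    //     65 CC 81 00  ->  sizeof(s) == 4, or
    //     C3 A9 00     ->  sizeof(s) == 3,
    // even if phase 5 itself maps characters one-for-one.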

Corentin, I suspect your objections regarding conversion of groups of
characters at a time stem from not wanting implementations to
implicitly normalize Unicode input.  If that is correct, then I agree
with that goal, but I do not want to address it as part of this paper;
we can address it in future papers intended to address the portability
of UTF-8 source files.  If that is not correct, then I'd like to better
understand exactly what your concern is.

Two things.
First, I would like us to keep the phases sensible despite "anything can happen in phase 1". Yes, anything can happen in phase 1, but that should not become "anything can happen in phase 5, because it could have happened in phase 1".

Second, my concern with allowing a different number of code points in phase 5 when going from Unicode to something else is that, in practice, it is very difficult to implement, and we cannot expect implementations to do it consistently (just normalizing might not be sufficient). So I am concerned about implementation divergence, inconsistent results, and portability issues, all of which I think are worth addressing.
This is why I have become convinced that the mapping in phase 5 should be 1-1.
But I think we might need to poll it?
To be clear, what I think doesn't matter; I am just trying to make sure that we don't change the described behavior unless we intend to, and the description of that behavior is currently ambiguous. I think "ambiguous" is a fair qualification?
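
A hypothetical illustration of the divergence concern (the two
implementation behaviors below are assumptions made for the example,
not observed compiler results):

    const char s[] = "e\u0301";
    // Implementation A encodes each character separately (1-1 mapping):
    //     UTF-8: 65 CC 81 00  ->  sizeof(s) == 4
    // Implementation B normalizes the sequence while encoding:
    //     UTF-8: C3 A9 00     ->  sizeof(s) == 3
    // Code such as static_assert(sizeof(s) == 4) is then portable only
    // under the 1-1 mapping.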

>
> That seems schizophrenic; the standard should have a view here that
> does not depend on the particular encoding chosen.

I agree; the encoding (however the implementation defines it) should
determine the result.  This is why the proposed changes remove all of
that wording about the resulting size of the string literal.

I agree with that. The size should just be described as the number of code units in the encoded literal.
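
Under that description the size falls directly out of the encoded
form; for example (assuming UTF-8 as the ordinary literal encoding;
u8 literals are UTF-8 by definition):

    static_assert(sizeof(u8"\u00e9") == 3);  // 2 UTF-8 code units + '\0'
    static_assert(sizeof(U"\u00e9") == 2 * sizeof(char32_t));  // 1 code unit + '\0'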

Tom.