sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 2 Jul 2020 07:31:41 +0200

On Thu, Jul 2, 2020, 06:12 Tom Honermann <tom_at_[hidden]> wrote:

> On 7/1/20 1:45 PM, Jens Maurer wrote:
> > On 01/07/2020 14.19, Corentin wrote:
> >>
> >> On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >>
> >> On 01/07/2020 10.23, Corentin wrote:
> >> >
> >> >
> >> > On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >> >
> >> > On 01/07/2020 09.44, Corentin wrote:
> >> > >
> >> > >
> >> > > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core_at_[hidden]> wrote:
> >> >
> >> > > We should be clear in the text whether an implementation is
> >> > > allowed to encode a sequence of non-numeric-escape-sequence
> >> > > s-chars as a whole, or whether each character is encoded
> >> > > separately. There was concern that "separately" doesn't address
> >> > > stateful encodings, where the encoding of string character i+1
> >> > > may depend on what string character i was.
> >> > >
> >> > >
> >> > > We should be careful not to change the behavior here.
> >> > > Encoding sequences allows an implementation to encode
> >> > > <latin small letter e, combining acute accent> as
> >> > > <latin small letter e with acute>
> >> >
> >> > Agreed. We should probably prohibit doing that for UTF-x literals,
> >> > but I'm not seeing a behavior change for ordinary and wide string
> >> > literals.
> >> >
> >> > > Which is not the current behavior described by the standard.
> >> >
> >> > Could you point me to the specific place where the standard
> >> > doesn't allow that, currently?
> >> >
> >> > [lex.string] p10
> >> > "it is initialized with the given characters."
> >> >
> >> > for example doesn't speak to the question, in my view.
> >> >
> >> >
> >> > My reading of the description of the size of the string
> >> > http://eel.is/c++draft/lex.string#1
> >>
> >> A strict reading of [lex.string] p13 doesn't convey that, though:
> >>
> >> "The size of a narrow string literal is the total number of
> >> escape sequences and other characters, plus at least one for
> >> the multibyte encoding of each universal-character-name,
> >> plus one for the terminating '\0'."
> >>
> >> Let's start with our example string:
> >> "latin small letter e, combining acute accent"
> >>
> >> "combining acute accent" is not in the basic source character set, so
> >> it's represented as a universal-character-name.
> >> According to the formula, we have 1 for "e", at least 1 for the
> >> universal-character-name, 1 for the terminating '\0' -> at least 3.
> >> Encoding this as "<latin small letter e with acute>" yields
> >> 2 bytes of UTF-8 encoding plus 1 for the terminating '\0' -> 3
> >> So, the requirement "at least 3" is satisfied.
> >>
> >>
> >> If the execution encoding is UTF-8, that happens to be true. What about
> >> ISO 8859-1?
> >> 1 byte for é, 1 for '\0' -> 2, which is less than 3
> > Right, so the transformation is allowed for UTF-8 execution encoding of
> > ordinary string literals, but not for ISO 8859-1 execution encoding.
>
> I think phase 1 leniency makes the above argumentation moot. An
> implementation can treat denormalized input as though it was normalized
> in phase 1. In other words, an implementation can adjust the input in
> phase 1 to get the output it desires in phase 5 regardless.
>
> Corentin, I suspect your objections regarding conversion of groups of
> characters at a time are because you don't want implementations to
> implicitly normalize Unicode input. If that is correct, then I agree
> with that goal, but I do not want to address it as part of this paper;
> we can address that in future papers intended to address portability of
> UTF-8 source files. If that is not correct, then I'd like to better
> understand exactly what your concern is.
>

Two things.
I would like us to keep the phases sensible despite "anything can happen in
phase 1". Yes, anything can happen in phase 1, but that should not become
"anything can happen in phase 5 because it could have happened in phase 1".

My concern with allowing a different number of code points in phase 5 when
going from Unicode to something else is that in practice it is very
difficult to implement, and we cannot expect implementations to do it
consistently (just normalizing might not be sufficient). So I am concerned
about implementation divergence, inconsistent results, and portability
issues, all of which I think are problems worth addressing.
This is why I have become convinced that the mapping in phase 5 should be
one-to-one.
But I think we might need to poll it?
What I think doesn't matter much; I am just trying to make sure that we
don't change the described behavior if we don't intend to, and the
description of the behavior is currently ambiguous. I think "ambiguous" is a
fair characterization?
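
To make the divergence concern concrete, here is a rough sketch of the two
readings for an ISO 8859-1 execution encoding. Both helper names are made up
for illustration and are not taken from any compiler; the "composing" variant
only knows the single pair discussed in this thread, which is exactly the
kind of partial behavior I would expect to differ between implementations.
It relies on ISO 8859-1 covering U+0000 through U+00FF and needs C++17.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// U+0301 COMBINING ACUTE ACCENT
constexpr char32_t combining_acute = 0x0301;

// Hypothetical strict reading: every code point maps to exactly one
// ISO 8859-1 code unit, or the conversion fails (empty optional).
constexpr std::optional<std::size_t>
latin1_units_one_to_one(std::u32string_view in) {
    std::size_t units = 0;
    for (char32_t c : in) {
        if (c > 0xFF) {
            return std::nullopt;  // not representable in ISO 8859-1
        }
        ++units;
    }
    return units;
}

// Hypothetical lenient reading: the converter may look at a group of
// code points. This sketch composes only 'e' + U+0301 into 0xE9.
constexpr std::optional<std::size_t>
latin1_units_composing(std::u32string_view in) {
    std::size_t units = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const bool e_plus_acute = in[i] == U'e' && i + 1 < in.size()
                                  && in[i + 1] == combining_acute;
        if (e_plus_acute) {
            ++i;  // the pair becomes the single code unit 0xE9
        } else if (in[i] > 0xFF) {
            return std::nullopt;
        }
        ++units;
    }
    return units;
}

// Same input, two different answers depending on the reading.
static_assert(!latin1_units_one_to_one(U"e\u0301").has_value());
static_assert(latin1_units_composing(U"e\u0301").value() == 1);
```

With the one-to-one reading the decomposed string is simply not encodable;
with the composing reading it silently becomes one code unit.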

>
> >
> > That seems schizophrenic; the standard should have a view here that
> > does not depend on the particular encoding chosen.
>
> I agree; the encoding (however the implementation defines it) should
> determine the result. This is why the proposed changes remove all of
> that wording about the resulting size of the string literal.
>

I agree with that. The size should just be described as the number of code
units in the encoded literal.
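
As a small illustration of "size is the number of code units", the sketch
below uses only UTF-8 and UTF-32 literals, since their encodings are fixed.
The constant names are mine. The values assume the implementation does not
normalize the literal, which as far as I know is what current compilers do;
whether that should be guaranteed is the open question in this thread.

```cpp
#include <cstddef>

// U+00E9 LATIN SMALL LETTER E WITH ACUTE:
// 2 UTF-8 code units plus the terminating null.
constexpr std::size_t precomposed_utf8 = sizeof(u8"\u00e9");

// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// 1 + 2 UTF-8 code units plus the terminating null.
constexpr std::size_t decomposed_utf8 = sizeof(u8"e\u0301");

// In UTF-32 each code point is one code unit.
constexpr std::size_t precomposed_utf32 =
    sizeof(U"\u00e9") / sizeof(char32_t);
constexpr std::size_t decomposed_utf32 =
    sizeof(U"e\u0301") / sizeof(char32_t);

static_assert(precomposed_utf8 == 3, "C3 A9 00");
static_assert(decomposed_utf8 == 4, "65 CC 81 00");
static_assert(precomposed_utf32 == 2, "U+00E9, null");
static_assert(decomposed_utf32 == 3, "U+0065, U+0301, null");
```

The two spellings render the same but have different sizes, so a size rule
stated in terms of encoded code units gives a definite answer for each
without the "at least one for each universal-character-name" arithmetic.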

>
> Tom.
>
>

Received on 2020-07-02 00:35:13