sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 2 Jul 2020 01:50:56 -0400

On 7/2/20 1:31 AM, Corentin wrote:
>
>
> On Thu, Jul 2, 2020, 06:12 Tom Honermann <tom_at_[hidden]> wrote:
>
> On 7/1/20 1:45 PM, Jens Maurer wrote:
> > On 01/07/2020 14.19, Corentin wrote:
> >>
> >> On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >>
> >> On 01/07/2020 10.23, Corentin wrote:
> >> >
> >> >
> >> > On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >> >
> >> > On 01/07/2020 09.44, Corentin wrote:
> >> > >
> >> > >
> >> > > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core_at_[hidden]> wrote:
> >> >
> >> > > We should be clear in the text whether an implementation is allowed to encode
> >> > > a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
> >> > > each character is encoded separately. There was concern that "separately"
> >> > > doesn't address stateful encodings, where the encoding of string character
> >> > > i+1 may depend on what string character i was.
> >> > >
> >> > >
> >> > > We should be careful not to change the behavior here.
> >> > > Encoding sequences allows an implementation to encode
> >> > > <latin small letter e, combining acute accent> as
> >> > > <latin small letter e with acute>
> >> >
> >> > Agreed. We should probably prohibit doing that for UTF-x literals,
> >> > but I'm not seeing a behavior change for ordinary and wide string
> >> > literals.
> >> >
> >> > > Which is not the current behavior described by the standard.
> >> >
> >> > Could you point me to the specific place where the standard
> >> > doesn't allow that, currently?
> >> >
> >> > [lex.string] p10
> >> > "it is initialized with the given characters."
> >> >
> >> > for example doesn't speak to the question, in my view.
> >> >
> >> >
> >> > My reading of the description of the size of the string http://eel.is/c++draft/lex.string#1
> >>
> >> A strict reading of [lex.string] p13 doesn't convey that, though:
> >>
> >> "The size of a narrow string literal is the total number of
> >> escape sequences and other characters, plus at least one for
> >> the multibyte encoding of each universal-character-name,
> >> plus one for the terminating '\0'."
> >>
> >> Let's start with our example string:
> >> "latin small letter e, combining acute accent"
> >>
> >> "combining acute accent" is not in the basic source character set, so it's
> >> represented as a universal-character-name.
> >> According to the formula, we have 1 for "e", at least 1 for universal-character-name,
> >> 1 for the terminating '\0' -> at least 3.
> >> Encoding this as "<latin small letter e with acute>" yields
> >> 2 bytes of UTF-8 encoding plus 1 for the terminating '\0' -> 3
> >> So, the requirement "at least 3" is satisfied.
> >>
> >>
> >> If the execution encoding is UTF-8, that happens to be true. What about ISO 8859-1?
> >> 1 byte for é, 1 for \0
> > Right, so the transformation is allowed for UTF-8 execution encoding of
> > ordinary string literals, but not for ISO 8859-1 execution encoding.
>
> I think phase 1 leniency makes the above argumentation moot. An
> implementation can treat denormalized input as though it was normalized
> in phase 1. In other words, an implementation can adjust the input in
> phase 1 to get the output it desires in phase 5 regardless.
>
> Corentin, I suspect you object to converting groups of characters at a
> time because you don't want implementations to implicitly normalize
> Unicode input. If that is correct, then I agree with that goal, but I do
> not want to address it as part of this paper; we can address that in
> future papers intended to address portability of UTF-8 source files. If
> that is not correct, then I'd like to better understand exactly what
> your concern is.
>
>
> 2 things.
> I would like that we keep phases sensible despite "anything can happen
> in phase 1". Yes, anything can happen in phase 1, but that should not
> become "anything can happen in phase 5 because it could have happened
> in phase 1".
That is fair.
>
> My concern with allowing a different number of code points in phase 5
> when going from Unicode to something else is that in practice it is
> very difficult to implement, and we cannot expect implementations to
> do that consistently (just normalizing might not be sufficient). So I
> am concerned about implementation divergence, inconsistent results and
> portability issues, all of which I think are problems worth addressing.
My perspective is that, since the execution encodings are
implementation-defined anyway, implementations should have the freedom
to convert from source input however they see fit. If implementations
want to be compatible, then this effectively becomes an ABI issue and
they need to agree on how the encoding works. Requiring 1-1 mapping
isn't sufficient to ensure the same result for all inputs.
> This is why I have become convinced that the mapping in phase 5 should
> be 1-1.
> But I think we might need to poll it?
Perhaps; that will be up to the CWG chair.
> Like, what I think doesn't matter, I am just trying to make sure that
> we don't change the described behavior if we don't intend to, and the
> description of the behavior is currently ambiguous. I think ambiguous
> is a fair qualification?

I would categorize it more as implementation freedom and QOI. The
wording changes don't force a change on any implementation.
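
For concreteness, here is a minimal sketch of the size arithmetic from the
quoted discussion, assuming C++17 or later. The names decomposed_size,
precomposed_size, p13_lower_bound and latin1_precomposed_size are made up
for illustration, and the ISO 8859-1 figure is written out by hand rather
than produced by a compiler. It uses UTF-8 literals because their encoding
is fixed; the asserted sizes assume the implementation encodes each code
point separately and does not normalize.

```cpp
#include <cstddef>

// Decomposed form: U+0065 (1 code unit) + U+0301 (2 code units) + '\0'.
// Assumption: the implementation does not combine the two code points.
inline constexpr std::size_t decomposed_size = sizeof(u8"e\u0301");

// Precomposed form: U+00E9 (2 code units in UTF-8) + '\0'.
inline constexpr std::size_t precomposed_size = sizeof(u8"\u00E9");

// Lower bound implied by the [lex.string] p13 wording quoted in this thread:
// one per escape sequence or other character, at least one per
// universal-character-name, plus one for the terminating '\0'.
constexpr std::size_t p13_lower_bound(std::size_t other_chars,
                                      std::size_t ucns) {
    return other_chars + ucns + 1;
}

// Hand-computed, hypothetical result of composing to U+00E9 under an
// ISO 8859-1 execution encoding: 1 byte for the letter + 1 for '\0'.
inline constexpr std::size_t latin1_precomposed_size = 1 + 1;

static_assert(decomposed_size == 4);
static_assert(precomposed_size == 3);

// "e" plus one universal-character-name gives a bound of at least 3.
static_assert(p13_lower_bound(1, 1) == 3);

// Composing satisfies the bound in UTF-8 but not in ISO 8859-1.
static_assert(precomposed_size >= p13_lower_bound(1, 1));
static_assert(latin1_precomposed_size < p13_lower_bound(1, 1));
```

The last two assertions restate the point made in the quoted text: whether
the old size wording permits the composed form depends on the execution
encoding, which is one reason to describe the size simply as the number of
code units in the encoded literal.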

Tom.

>
> >
> > That seems schizophrenic; the standard should have a view here that
> > does not depend on the particular encoding chosen.
>
> I agree; the encoding (however the implementation defines it) should
> determine the result. This is why the proposed changes remove all of
> that wording about the resulting size of the string literal.
>
>
> I agree with that. The size should just be described as the number of
> code units in the encoded literal.
>
>
> Tom.
>

Received on 2020-07-02 00:54:12