sg16: Re: [SG16] [isocpp-core] New draft revision: D2029R2 (Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals)

From: Corentin <corentin.jabot_at_[hidden]>
Date: Wed, 1 Jul 2020 14:19:45 +0200

On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 01/07/2020 10.23, Corentin wrote:
> >
> >
> > On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer_at_[hidden]> wrote:
> >
> > On 01/07/2020 09.44, Corentin wrote:
> > >
> > >
> > > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core_at_[hidden]> wrote:
> >
> > > We should be clear in the text whether an implementation is
> allowed to encode
> > > a sequence of non-numeric-escape-sequence s-chars as a whole,
> or whether
> > > each character is encoded separately. There was concern that
> "separately"
> > > doesn't address stateful encodings, where the encoding of
> string character
> > > i+1 may depend on what string character i was.
> > >
> > >
> > > We should be careful not to change the behavior here.
> > > Encoding sequences allow an implementation to encode <latin small
> letter e, combining acute accent> as <latin small letter e with acute>
> >
> > Agreed. We should probably prohibit doing that for UTF-x literals,
> > but I'm not seeing a behavior change for ordinary and wide string
> > literals.
> >
> > > Which is not the current behavior described by the standard.
> >
> > Could you point me to the specific place where the standard
> > doesn't allow that, currently?
> >
> > [lex.string] p10
> > "it is initialized with the given characters."
> >
> > for example doesn't speak to the question, in my view.
> >
> >
> > My reading of the description of the size of the string
> http://eel.is/c++draft/lex.string#1
>
> A strict reading of [lex.string] p13 doesn't convey that, though:
>
> "The size of a narrow string literal is the total number of
> escape sequences and other characters, plus at least one for
> the multibyte encoding of each universal-character-name,
> plus one for the terminating '\0'."
>
> Let's start with our example string:
> "latin small letter e, combining acute accent"
>
> "combining acute accent" is not in the basic source character set, so it's
> represented as a universal-character-name.
> According to the formula, we have 1 for "e", at least 1 for
> universal-character-name,
> 1 for the terminating '\0' -> at least 3.
> Encoding this as "<latin small letter e with acute>" yields
> 2 bytes of UTF-8 encoding plus 1 for the terminating '\0' -> 3
> So, the requirement "at least 3" is satisfied.
>

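If the execution encoding is UTF-8, that happens to be true. What
about ISO 8859-1?
There the composed form is 1 byte for é plus 1 for '\0', i.e. 2,
which is below the "at least 3" that the formula requires.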
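
To make the arithmetic concrete, here is a minimal sketch. It assumes the
byte values below are what an implementation would produce for each
encoding; they are written out by hand so the example does not depend on
the compiler's actual execution encoding. The array names, min_size, and
meets_bound are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>

// Hand-written byte sequences (assumed, not produced by the compiler):
// UTF-8, kept decomposed: 'e' (0x65), U+0301 (0xCC 0x81), terminating 0.
constexpr unsigned char utf8_decomposed[] = {0x65, 0xCC, 0x81, 0x00};
// UTF-8, composed to U+00E9 (0xC3 0xA9), terminating 0.
constexpr unsigned char utf8_composed[] = {0xC3, 0xA9, 0x00};
// ISO 8859-1, composed to 0xE9, terminating 0.
constexpr unsigned char latin1_composed[] = {0xE9, 0x00};

// Lower bound from the quoted [lex.string] wording for "e\u0301":
// 1 for 'e' + at least 1 for the universal-character-name + 1 for '\0'.
constexpr std::size_t min_size = 1 + 1 + 1;

constexpr bool meets_bound(std::size_t n) { return n >= min_size; }

static_assert(meets_bound(sizeof(utf8_decomposed)), "4 bytes, bound holds");
static_assert(meets_bound(sizeof(utf8_composed)), "3 bytes, bound holds");
static_assert(!meets_bound(sizeof(latin1_composed)), "2 bytes, bound fails");
```

Under these assumptions, composing happens to satisfy the bound in UTF-8
but not in ISO 8859-1, which is the case I am worried about.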

>
> Jens
>

Received on 2020-07-01 07:23:10