On Wed, 1 Jul 2020 at 14:06, Jens Maurer <Jens.Maurer@gmx.net> wrote:
On 01/07/2020 10.23, Corentin wrote:
>
>
> On Wed, 1 Jul 2020 at 10:14, Jens Maurer <Jens.Maurer@gmx.net> wrote:
>
>     On 01/07/2020 09.44, Corentin wrote:
>     >
>     >
>     > On Wed, 1 Jul 2020 at 09:29, Jens Maurer via Core <core@lists.isocpp.org> wrote:
>
>     >     We should be clear in the text whether an implementation is allowed to encode
>     >     a sequence of non-numeric-escape-sequence s-chars as a whole, or whether
>     >     each character is encoded separately.  There was concern that "separately"
>     >     doesn't address stateful encodings, where the encoding of string character
>     >     i+1 may depend on what string character i was.
>     >
>     >
>     > We should be careful not to change the behavior here.
>     > Encoding sequences as a whole would allow an implementation to encode <latin small letter e, combining acute accent> as <latin small letter e with acute>.
>
>     Agreed.  We should probably prohibit doing that for UTF-x literals,
>     but I'm not seeing a behavior change for ordinary and wide string
>     literals.
>
>     > Which is not the current behavior described by the standard.
>
>     Could you point me to the specific place where the standard
>     doesn't allow that, currently?
>
>     [lex.string] p10
>     "it is initialized with the given characters."
>
>     for example doesn't speak to the question, in my view.
>
>
> My reading of the description of the size of the string: http://eel.is/c++draft/lex.string#1

A strict reading of [lex.string] p13 doesn't convey that, though:

"The size of a narrow string literal is the total number of
escape sequences and other characters, plus at least one for
the multibyte encoding of each universal-character-name,
plus one for the terminating '\0'."

Let's start with our example string:
"latin small letter e, combining acute accent"

"combining acute accent" is not in the basic source character set, so it's
represented as a universal-character-name.
According to the formula, we have 1 for "e", at least 1 for the
universal-character-name, and 1 for the terminating '\0' -> at least 3.
Encoding this as "<latin small letter e with acute>" yields
2 bytes of UTF-8 encoding plus 1 for the terminating '\0' -> 3.
So, the requirement "at least 3" is satisfied.
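
The arithmetic above can be checked at compile time. This is only a sketch: it uses u8 literals so that the encoding is fixed to UTF-8 regardless of the implementation's execution character set, and the helper name min_narrow_size is hypothetical, introduced here just to spell out the quoted formula.

```cpp
#include <cstddef>

// Hypothetical helper spelling out the quoted size wording: one per
// ordinary character or escape sequence, at least one per
// universal-character-name, plus one for the terminating '\0'.
constexpr std::size_t min_narrow_size(std::size_t plain_chars, std::size_t ucns) {
    return plain_chars + ucns + 1;
}

// 'e' (1 byte) + U+0301 as UTF-8 CC 81 (2 bytes) + '\0' (1 byte).
constexpr std::size_t decomposed_size = sizeof(u8"e\u0301");

// U+00E9 as UTF-8 C3 A9 (2 bytes) + '\0' (1 byte).
constexpr std::size_t composed_size = sizeof(u8"\u00E9");

static_assert(min_narrow_size(1, 1) == 3, "lower bound for e + one UCN");
static_assert(decomposed_size == 4, "separately encoded form");
static_assert(composed_size == 3, "composed form");
static_assert(composed_size >= min_narrow_size(1, 1),
              "the composed UTF-8 form still meets the lower bound");
```

The last assertion is the point being made: in UTF-8 the composed form is exactly as large as the minimum the formula allows, so the size wording alone does not rule it out.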

If the execution encoding is UTF-8, that happens to be true. What about ISO 8859-1?
1 byte for é, 1 for '\0' -> 2, which is less than the required 3.
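
A sketch of the ISO 8859-1 case, under one loud assumption: the numeric escape \xE9 stands in for what an implementation with an ISO 8859-1 execution encoding would store for the composed character, since the real result depends on the compiler's execution character set. The name p13_lower_bound is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper for the size wording quoted from [lex.string]:
// characters and escape sequences, at least one per
// universal-character-name, plus the terminating '\0'.
constexpr std::size_t p13_lower_bound(std::size_t plain_chars, std::size_t ucns) {
    return plain_chars + ucns + 1;
}

// Assumption: 0xE9 is the ISO 8859-1 code for "latin small letter e
// with acute", so this literal has the size the composed form would
// have under that execution encoding: 1 byte + '\0'.
constexpr std::size_t latin1_composed_size = sizeof("\xE9");

static_assert(latin1_composed_size == 2, "one byte plus terminator");
static_assert(latin1_composed_size < p13_lower_bound(1, 1),
              "composed form is smaller than the stated minimum of 3");
```

Under that assumption, composing the two characters yields an array smaller than the size wording permits, which is the conflict being pointed out.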

Jens