On 4/24/21 7:25 AM, Corentin via SG16 wrote:

Hello, 

Consider the following:

auto x = u8"\xC0";

In [dcl.init.string], we say

> An array of ordinary character type ([basic.fundamental]), char8_t array, char16_t array, char32_t array, or wchar_t array can be initialized by an ordinary string literal, UTF-8 string literal, UTF-16 string literal, UTF-32 string literal, or wide string literal, respectively [..]

The definition of "UTF-8 string literal" used here corresponds to [lex.string]p1 table 12 and that definition does not impose a well-formed encoding requirement.  How a string literal containing an escape sequence is encoded is specified in [lex.string]p10, and no such requirements are imposed there either.  [dcl.init.string]p1 states "Successive characters of the value of the string-literal initialize the elements of the array".  We can quibble over the use of the word "character" there, but I think the intent is clear; the elements of the string literal are copied.
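To make that reading concrete, here is a minimal sketch of the behavior it implies, assuming a compiler that accepts the literal (as the thread suggests current implementations do). The alias `u8_code_unit` and the array name `arr` are hypothetical and exist only so the snippet compiles both before and after C++20, where the element type changed from char to char8_t.

```cpp
#include <type_traits>

// Hypothetical alias: char before C++20, char8_t from C++20 onwards.
using u8_code_unit =
    std::remove_cv_t<std::remove_reference_t<decltype(u8""[0])>>;

// The numeric escape sequence contributes exactly one code unit, 0xC0,
// followed by the terminating null. No well-formedness check is applied.
static_assert(sizeof(u8"\xC0") == 2, "one code unit plus the terminator");
static_assert(static_cast<unsigned char>(u8"\xC0"[0]) == 0xC0,
              "the code unit is copied verbatim");

// Array initialization in the style of [dcl.init.string]p1: the elements
// of the literal initialize the elements of the array one by one.
constexpr u8_code_unit arr[] = u8"\xC0";
static_assert(sizeof(arr) == 2, "array has the same two elements");
static_assert(static_cast<unsigned char>(arr[0]) == 0xC0,
              "first element is the lone 0xC0 octet");
```

If an implementation rejected the literal as ill-formed, the static_asserts would never be reached; the sketch only records what an accepting implementation is expected to produce.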

I think the reference to ISO 10646 only serves to provide a definition of "UTF-8" and to describe how basic-s-chars, r-chars, simple-escape-sequences, and universal-character-names are encoded; I don't see how it can be interpreted to apply to numeric-escape-sequences.

Tom.


In ISO 10646, 9.1 UTF-8

> Table 3 lists all the ranges (inclusive) of the octet sequences that are well-formed in UTF-8. Any UTF-8 sequence that does not match the patterns listed in table 3 is ill-formed [..]
> As a consequence of the well-formedness conditions specified in table 9.2, the following octet values are disallowed in UTF-8: C0-C1, F5-FE

A reading of both standards would lead me to believe that the code is ill-formed.
Either the standard reflects the intent, in which case this is ill-formed and all implementations need fixing (and we might add a note to the standard), or the standard does not describe the intent.

I would argue that it should be ill-formed, precisely because an ill-formed sequence is by definition not UTF-8, and allowing one defeats the purpose of UTF-8 literals and char8_t.
This doesn't really contradict P2029: there is no value in preventing numeric escape sequences, but there should be a well-formedness check after all other transformations.
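As an illustration of what such a check could look like, here is a minimal sketch. It assumes the well-formed octet ranges as listed in Table 3-7 of the Unicode Standard, which to my understanding match the ISO 10646 table quoted above. The names `in_range` and `is_well_formed_utf8` and the sample arrays are hypothetical; nothing here is proposed wording or an existing compiler facility.

```cpp
#include <cstddef>

// Hypothetical helper: true if b lies in the inclusive range [lo, hi].
constexpr bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return lo <= b && b <= hi;
}

// Hypothetical check: true if the n octets at s form well-formed UTF-8.
constexpr bool is_well_formed_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = s[i];
        std::size_t len = 0;                 // length of this sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of 2nd octet
        if (b0 <= 0x7F) { len = 1; }
        else if (in_range(b0, 0xC2, 0xDF)) { len = 2; }
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }  // no overlong forms
        else if (in_range(b0, 0xE1, 0xEC)) { len = 3; }
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }  // no surrogates
        else if (in_range(b0, 0xEE, 0xEF)) { len = 3; }
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }  // no overlong forms
        else if (in_range(b0, 0xF1, 0xF3)) { len = 4; }
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }  // at most U+10FFFF
        else { return false; }  // 80-BF, C0-C1, F5-FF cannot lead
        if (n - i < len) return false;       // truncated sequence
        if (len >= 2 && !in_range(s[i + 1], lo, hi)) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (!in_range(s[i + k], 0x80, 0xBF)) return false;
        }
        i += len;
    }
    return true;
}

// Sample inputs (hypothetical names).
constexpr unsigned char lone_c0[] = {0xC0};               // from u8"\xC0"
constexpr unsigned char euro[] = {0xE2, 0x82, 0xAC};      // U+20AC
constexpr unsigned char surrogate[] = {0xED, 0xA0, 0x80}; // U+D800

static_assert(!is_well_formed_utf8(lone_c0, 1), "C0 is disallowed");
static_assert(is_well_formed_utf8(euro, 3), "U+20AC is well-formed");
static_assert(!is_well_formed_utf8(surrogate, 3), "surrogates rejected");
```

Under this sketch, the code units of u8"\xC0" would be rejected after escape processing, while a literal whose numeric escapes happen to spell out a well-formed sequence would still be accepted.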

But regardless of whether we agree on that design question, my reading is that the standard contradicts ISO 10646.

Have a great weekend,

Corentin