sg16: Re: [SG16] Are ill-formed UTF literals well-formed?

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 24 Apr 2021 15:57:36 -0400

On 4/24/21 7:25 AM, Corentin via SG16 wrote:
>
> Hello,
>
> Consider the following:
>
> auto x = u8"\xC0";
>
> In [dcl.init.string], we say
>
> > An array of ordinary character type ([basic.fundamental]), char8_t
> > array, char16_t array, char32_t array, or wchar_t array can be
> > initialized by an ordinary string literal, UTF-8 string literal,
> > UTF-16 string literal, UTF-32 string literal, or wide string literal,
> > respectively [..]

The definition of "UTF-8 string literal" used here corresponds to
[lex.string]p1 table 12
<http://eel.is/c++draft/lex.string#tab:lex.string.literal-row-5> and
that definition does not impose a well-formed encoding requirement. How
a string literal containing an escape sequence is encoded is specified
in [lex.string]p10 <http://eel.is/c++draft/lex.string#10>, and no such
requirements are imposed there either. [dcl.init.string]p1
<http://eel.is/c++draft/dcl.init.string#1> states "Successive characters
of the value of the string-literal initialize the elements of the
array". We can quibble over the use of the word "character" there, but
I think the intent is clear; the elements of the string literal are copied.
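As a minimal sketch of that reading (not normative wording), the array
below is initialized from the literal in question and its elements are
inspected. The names utf8_unit and copied_unit are made up for this
illustration; the element type is deduced so that the code compiles both
before C++20 (char) and from C++20 on (char8_t).

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Element type of a UTF-8 string literal: char before C++20, char8_t from C++20 on.
// (utf8_unit is a hypothetical alias introduced only for this sketch.)
using utf8_unit = std::remove_cv_t<std::remove_reference_t<decltype(u8""[0])>>;

// Returns element i of an array initialized from u8"\xC0", widened to unsigned.
// (copied_unit is a hypothetical helper, not anything from the standard.)
inline unsigned copied_unit(std::size_t i) {
    const utf8_unit x[] = u8"\xC0";  // array initialized by a UTF-8 string literal
    static_assert(sizeof(x) / sizeof(x[0]) == 2,
                  "one code unit from the escape plus the terminating null");
    assert(i < 2);
    return static_cast<unsigned char>(x[i]);
}
```

Under this reading, copied_unit(0) yields 0xC0 and copied_unit(1) yields
0: the code unit named by the numeric escape sequence is stored as-is.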

I think the reference to ISO 10646 only serves to provide a definition
of "UTF-8" and to describe how /basic-s-chars/, /r-chars/,
/simple-escape-sequences/, and /universal-character-names/ are encoded;
I don't see how it can be interpreted to apply to
/numeric-escape-sequences/.
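
To illustrate the distinction, here is a rough sketch of a checker for
the well-formed byte ranges of UTF-8 (lead bytes C0-C1 and F5-FF never
appear). The function name is_well_formed_utf8 is hypothetical and the
code is only an approximation written for this example; it treats every
element before the final terminating null as part of the sequence.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if the N-1 code units preceding the terminating
// null of a string literal form a well-formed UTF-8 sequence.
template <typename CharT, std::size_t N>
bool is_well_formed_utf8(const CharT (&s)[N]) {
    const std::size_t len = N - 1;  // ignore the terminating null
    std::size_t i = 0;
    while (i < len) {
        const unsigned b0 = static_cast<unsigned char>(s[i]);
        std::size_t trail = 0;         // number of trailing bytes expected
        unsigned lo = 0x80, hi = 0xBF; // allowed range for the first trailing byte
        if (b0 <= 0x7F) {
            trail = 0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            trail = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            trail = 2;
            if (b0 == 0xE0) lo = 0xA0;  // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;  // excludes surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            trail = 3;
            if (b0 == 0xF0) lo = 0x90;  // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;  // excludes values above U+10FFFF
        } else {
            return false;               // 80-BF, C0-C1, or F5-FF as a lead byte
        }
        if (len - i - 1 < trail) return false;  // truncated sequence
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) return false;
            lo = 0x80;                  // later trailing bytes use the full range
            hi = 0xBF;
        }
        i += trail + 1;
    }
    return true;
}
```

With this sketch, is_well_formed_utf8(u8"\u00C0") is true, since the
/universal-character-name/ is encoded as the two code units C3 80, while
is_well_formed_utf8(u8"\xC0") is false, since the numeric escape sequence
contributes the single code unit C0 unchanged.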

Tom.

>
> In ISO 10646, 9.1 UTF-8
>
> > Table 3 lists all the ranges (inclusive) of the octet sequences that
> > are well-formed in UTF-8. Any UTF-8 sequence that does not match the
> > patterns listed in table 3 is ill-formed [..]
> As a consequence of the well-formedness conditions specified in table
> 9.2, the following octet values are disallowed in UTF-8: C0-C1, F5-FE
>
> A reading of both standards would lead me to believe that the code is
> ill-formed.
> Either the standard represents the intent, this is ill-formed and all
> implementations need fixing (and we might add a note to the standard),
> or the standard does not describe the intent.
>
> I would argue that it should be ill-formed, exactly because there is
> no such thing as invalid UTF-8 and allowing that defeats the purpose
> of UTF-8 literals and char8_t.
> This doesn't really contradict P2029: no value in preventing numeric
> escape sequences, but there should be a well-formedness check after
> all other transformations.
>
> But regardless of whether we agree on that design question, my reading
> is that the standard contradicts ISO 10646.
>
> Have a great week-end,
>
> Corentin
>
>

Received on 2021-04-24 14:57:39