sg16: Re: [SG16] Are ill-formed UTF literals well-formed?

From: Steve Downey <sdowney_at_[hidden]>
Date: Sat, 24 Apr 2021 14:47:20 -0400

I believe the intent, and the wording, is to say that the associated
encoding is UTF-n.
The real issue, in my opinion, is that a string literal is not a type, and
any well-formedness guarantees about literals do not and cannot apply to
char8_t[N] or char8_t*.
char8_t s[] = {0xff, 0x0};
should not be undefined behavior.
More than that, a standard utf8->scalar_value decoder should not have
undefined behavior when handed ill-formed encodings.
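
To make that concrete, here is a rough sketch of what I mean by a decoder
with fully defined behavior on ill-formed input. Nothing here is proposed
or standard API: decode_utf8 and DecodeResult are names made up for
illustration, it takes unsigned char rather than char8_t so it also builds
before C++20, and reporting U+FFFD while consuming one byte is just one
possible error policy. The byte ranges follow the well-formed UTF-8 table
(no C0-C1, no F5-FF, no surrogates, no overlong forms).

```cpp
#include <cstddef>

// Hypothetical result type: what was decoded and how many bytes were used.
struct DecodeResult {
    char32_t value;      // scalar value, or U+FFFD if the input was ill-formed
    std::size_t length;  // bytes consumed (1 on error so callers can resync)
    bool ok;             // false when the sequence was ill-formed or truncated
};

// Hypothetical helper: decode one scalar value from the bytes [p, p + n).
// Every input, well-formed or not, produces a defined result.
inline DecodeResult decode_utf8(const unsigned char* p, std::size_t n) {
    const DecodeResult bad{0xFFFD, static_cast<std::size_t>(n ? 1 : 0), false};
    if (n == 0) return bad;

    const unsigned char b0 = p[0];
    if (b0 < 0x80) return {b0, 1, true};  // single-byte (ASCII) case

    std::size_t len = 0;
    char32_t cp = 0;
    unsigned char lo = 0x80, hi = 0xBF;   // allowed range for the second byte

    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2; cp = b0 & 0x1F;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;        // reject overlong 3-byte forms
        if (b0 == 0xED) hi = 0x9F;        // reject surrogates D800-DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;        // reject overlong 4-byte forms
        if (b0 == 0xF4) hi = 0x8F;        // reject values above U+10FFFF
    } else {
        return bad;                       // 80-BF, C0-C1, F5-FF
    }

    if (n < len) return bad;              // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = p[i];
        if (b < lo || b > hi) return bad;
        cp = (cp << 6) | (b & 0x3F);
        lo = 0x80; hi = 0xBF;             // later bytes use the plain range
    }
    return {cp, len, true};
}
```

Handed the array above (0xff, 0x0), this reports an error for the first
byte and consumes exactly one byte; at no point is there anything
undefined about it.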

On Sat, Apr 24, 2021 at 7:26 AM Corentin via SG16 <sg16_at_[hidden]>
wrote:

>
> Hello,
>
> Consider the following:
>
> auto x = u8"\xC0";
>
> In [decl.init.string], we say
>
> > An array of ordinary character type ([basic.fundamental]), char8_t
> array, char16_t array, char32_t array, or wchar_t array can be
> initialized by an ordinary string literal, UTF-8 string literal, UTF-16
> string literal, UTF-32 string literal, or wide string literal, respectively
> [..]
>
> In ISO 10646, 9.1 UTF-8
>
> > Table 3 lists all the ranges (inclusive) of the octet sequences that are
> well-formed in UTF-8. Any UTF-8 sequence that does not match the patterns
> listed in table 3 is ill-formed [..]
> As a consequence of the well-formedness conditions specified in table 9.2,
> the following octet values are disallowed in UTF-8: C0-C1, F5-FE
>
> A reading of both standards would lead me to believe that the code is
> ill-formed.
> Either the standard represents the intent, this is ill-formed and all
> implementations need fixing (and we might add a note to the standard), or
> the standard does not describe the intent.
>
> I would argue that it should be ill-formed, exactly because there is no
> such thing as invalid UTF-8 and allowing that defeats the purpose of UTF-8
> literals and char8_t.
> This doesn't really contradict P2029: there is no value in preventing
> numeric escape sequences, but there should be a well-formedness check
> after all other transformations.
>
> But regardless of whether we agree on that design question, my reading is
> that the standard contradicts ISO 10646.
>
> Have a great week-end,
>
> Corentin

Received on 2021-04-24 13:47:35