Date: Sat, 24 Apr 2021 13:25:58 +0200
Hello,
Consider the following:
auto x = u8"\xC0";
In [decl.init.string], we say
> An array of ordinary character type ([basic.fundamental]), char8_t
array, char16_t array, char32_t array, or wchar_t array can be
initialized by an ordinary string literal, UTF-8 string literal, UTF-16
string literal, UTF-32 string literal, or wide string literal, respectively
[..]
ISO 10646, 9.1 UTF-8, says
> Table 3 lists all the ranges (inclusive) of the octet sequences that are
well-formed in UTF-8. Any UTF-8 sequence that does not match the patterns
listed in table 3 is ill-formed [..]
> As a consequence of the well-formedness conditions specified in table 9.2,
the following octet values are disallowed in UTF-8: C0-C1, F5-FE
A reading of both standards would lead me to believe that the code is
ill-formed.
Either the standard reflects the intent, in which case this is ill-formed
and all implementations need fixing (and we might add a note to the
standard), or the standard does not describe the intent.
I would argue that it should be ill-formed, exactly because there is no
such thing as invalid UTF-8 and allowing that defeats the purpose of UTF-8
literals and char8_t.
This doesn't really contradict P2029: there is no value in preventing
numeric escape sequences, but there should be a well-formedness check
after all other transformations.
But regardless of whether we agree on that design question, my reading is
that the standard contradicts ISO 10646.
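To make the well-formedness check concrete, here is a minimal sketch of what it could look like at run time. It is only an illustration, not a proposed implementation: the function name is_well_formed_utf8 and both helpers are made up for this example, and the octet ranges are the ones from the Unicode Standard's table of well-formed UTF-8 byte sequences, which I assume match table 3 of ISO 10646.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the octet at position i, as an unsigned value.
static unsigned char octet_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Hypothetical helper: true if lo <= b <= hi.
static bool in_range(unsigned char b, unsigned char lo, unsigned char hi) {
    return b >= lo && b <= hi;
}

// Returns true if s consists only of well-formed UTF-8 octet sequences.
// Assumption: the ranges below mirror the table of well-formed sequences.
bool is_well_formed_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = octet_at(s, i);
        std::size_t len = 1;                 // length of the current sequence
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd octet

        if (lead <= 0x7F)                    { len = 1; }
        else if (in_range(lead, 0xC2, 0xDF)) { len = 2; }
        else if (lead == 0xE0)               { len = 3; lo = 0xA0; } // no overlong forms
        else if (lead == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (in_range(lead, 0xE1, 0xEF)) { len = 3; }            // E1..EC, EE..EF
        else if (lead == 0xF0)               { len = 4; lo = 0x90; } // no overlong forms
        else if (in_range(lead, 0xF1, 0xF3)) { len = 4; }
        else if (lead == 0xF4)               { len = 4; hi = 0x8F; } // at most U+10FFFF
        else { return false; }  // 80..BF as lead octet, C0, C1, F5 and above

        if (n - i < len) return false;       // truncated sequence
        if (len > 1 && !in_range(octet_at(s, i + 1), lo, hi)) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (!in_range(octet_at(s, i + k), 0x80, 0xBF)) return false;
        }
        i += len;
    }
    return true;
}
```

Under this sketch, the single octet C0 produced by the example above is rejected, and so is the overlong pair C0 80. A compiler doing the equivalent check on the encoded contents of a u8 literal, after escape sequences have been processed, would diagnose the example.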
Have a great weekend,
Corentin
Received on 2021-04-24 06:26:11