sg16: Re: [SG16] Are ill-formed UTF literals well-formed?

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sat, 24 Apr 2021 15:11:06 -0400

I agree with Steve.

There is no way to differentiate const char s[] = "foo"; and const
char s[] = { 'f', 'o', 'o', '\0' }; in Standard C++. Not even
extensions such as __is_builtin_pointer_p are capable of getting the
answer 100% correct. Even if we mandated that every string literal be
exactly UTF-8 in its entirety after all processing, that would not
stop arbitrary arrays of char8_t from being (1) not C strings and
(2) not UTF-8.
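
To make the first point concrete, here is a minimal sketch. The names
from_literal, from_braces and same_bytes are made up for illustration;
the checks only show that the type and the stored bytes carry no trace
of which initializer was used.

```cpp
#include <cstring>
#include <type_traits>

// Two spellings of the same array: one initialized from a string literal,
// one from a brace-enclosed list of characters.
const char from_literal[] = "foo";
const char from_braces[] = { 'f', 'o', 'o', '\0' };

// Both objects have the type const char[4]; the type system keeps no record
// of which form of initializer was written.
static_assert(std::is_same_v<decltype(from_literal), decltype(from_braces)>);
static_assert(sizeof(from_literal) == 4);

// Hypothetical helper: the stored bytes are identical as well.
inline bool same_bytes() {
    return std::memcmp(from_literal, from_braces, sizeof(from_literal)) == 0;
}
```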

This also gets in the way of writing string literals which are meant
to interoperate with legacy systems, where embedded nulls are re-coded
as (illegal) overlong UTF-8 sequences. If someone wants to force-check
validation of a string - including at compile time - they can use
compiler tools like /validate-charset (for source file checking) or
they can use library tools
(https://github.com/soasis/text/blob/main/tests/basic_compile_time/source/validate_decodable_as.unicode.explicit.cpp#L52).
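
As a rough sketch of what such a library-side check can look like
(this is not the API of the library linked above; is_well_formed_utf8
is a hypothetical name), a constexpr function can walk the code units
and apply the well-formed byte ranges for UTF-8. It is templated on
the character type so the same code accepts char, unsigned char, or
char8_t.

```cpp
#include <cstddef>

// Hypothetical helper: true if the first n code units are well-formed UTF-8.
template <typename CharT>
constexpr bool is_well_formed_utf8(const CharT* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 <= 0x7F) { ++i; continue; }   // single-byte (ASCII) sequence
        std::size_t trail = 0;               // trailing bytes expected
        unsigned char lo = 0x80;             // allowed range of the first
        unsigned char hi = 0xBF;             // trailing byte
        if (b0 >= 0xC2 && b0 <= 0xDF) { trail = 1; }
        else if (b0 == 0xE0) { trail = 2; lo = 0xA0; }   // rejects overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { trail = 2; }
        else if (b0 == 0xED) { trail = 2; hi = 0x9F; }   // rejects surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { trail = 2; }
        else if (b0 == 0xF0) { trail = 3; lo = 0x90; }   // rejects overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { trail = 3; }
        else if (b0 == 0xF4) { trail = 3; hi = 0x8F; }   // rejects > U+10FFFF
        else return false;   // 80..C1 and F5..FF never start a sequence
        if (n - i <= trail) return false;    // sequence is truncated
        for (std::size_t k = 1; k <= trail; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if (b < lo || b > hi) return false;
            lo = 0x80;       // later trailing bytes use the full range
            hi = 0xBF;
        }
        i += trail + 1;
    }
    return true;
}

// Convenience overload for arrays such as string literals; the final
// element is assumed to be the terminating null and is not checked.
template <typename CharT, std::size_t N>
constexpr bool is_well_formed_utf8(const CharT (&s)[N]) {
    return is_well_formed_utf8(s, N - 1);
}

static_assert(is_well_formed_utf8(u8"caf\u00E9"));
static_assert(!is_well_formed_utf8(u8"\xC0"));          // disallowed octet
static_assert(!is_well_formed_utf8(u8"\xC0\x80"));      // overlong null
static_assert(!is_well_formed_utf8(u8"\xED\xA0\x80"));  // encoded surrogate
```

The literals themselves still compile; only code that opts in to the
check rejects them, which leaves the overlong-null case usable for
legacy interop.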

The precondition applied to string literals is not useful here. These
preconditions belong on functions or -- better yet -- on actual types:
https://ztdtext.readthedocs.io/en/latest/api/views/decode_view.html
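
A small sketch of that idea, under stated assumptions: decode_result,
decode_one and utf8_text are hypothetical names, not the ztd.text API.
The decoding function has defined behavior for ill-formed input (it
reports U+FFFD and how many code units to skip), and the wrapper type
can only be obtained after validation, so the guarantee lives in the
type rather than in the literal.

```cpp
#include <cstddef>
#include <optional>

struct decode_result {
    char32_t code_point;   // U+FFFD when the input was ill-formed
    std::size_t consumed;  // code units to skip, always at least 1
    bool ok;               // false when the input was ill-formed
};

// Hypothetical decoder: decodes one scalar value from s[0..n), with n > 0.
template <typename CharT>
constexpr decode_result decode_one(const CharT* s, std::size_t n) {
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 <= 0x7F) return {b0, 1, true};
    std::size_t trail = 0;
    unsigned char lo = 0x80;  // allowed range of the first trailing byte
    unsigned char hi = 0xBF;
    char32_t cp = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF) { trail = 1; cp = b0 & 0x1F; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) {
        trail = 2; cp = b0 & 0x0F;
        if (b0 == 0xE0) lo = 0xA0;   // no overlong forms
        if (b0 == 0xED) hi = 0x9F;   // no surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        trail = 3; cp = b0 & 0x07;
        if (b0 == 0xF0) lo = 0x90;   // no overlong forms
        if (b0 == 0xF4) hi = 0x8F;   // nothing above U+10FFFF
    } else {
        return {0xFFFD, 1, false};   // 80..C1 and F5..FF cannot lead
    }
    for (std::size_t k = 1; k <= trail; ++k) {
        if (k >= n) return {0xFFFD, k, false};            // truncated
        const unsigned char b = static_cast<unsigned char>(s[k]);
        if (b < lo || b > hi) return {0xFFFD, k, false};  // bad trailing byte
        cp = (cp << 6) | (b & 0x3F);
        lo = 0x80;
        hi = 0xBF;
    }
    return {cp, trail + 1, true};
}

// Hypothetical wrapper: an object of this type exists only if validation
// passed, so functions taking it need no "must be UTF-8" precondition.
template <typename CharT>
class utf8_text {
public:
    static constexpr std::optional<utf8_text> from(const CharT* s, std::size_t n) {
        for (std::size_t i = 0; i < n;) {
            const decode_result r = decode_one(s + i, n - i);
            if (!r.ok) return std::nullopt;
            i += r.consumed;
        }
        return utf8_text(s, n);
    }
    constexpr const CharT* data() const { return data_; }
    constexpr std::size_t size() const { return size_; }
private:
    constexpr utf8_text(const CharT* s, std::size_t n) : data_(s), size_(n) {}
    const CharT* data_;
    std::size_t size_;
};

// The bytes from Steve's example: a perfectly valid array, just not UTF-8.
constexpr unsigned char ill_formed[] = {0xFF, 0x00};
static_assert(decode_one(ill_formed, 2).code_point == 0xFFFD);
static_assert(!utf8_text<unsigned char>::from(ill_formed, 2).has_value());
```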

Sincerely,
JeanHeyd

On Sat, Apr 24, 2021 at 2:47 PM Steve Downey via SG16
<sg16_at_[hidden]> wrote:
>
> I believe the intent, and the wording, is to say that the associated encoding is UTF-n.
> The real issue, in my opinion, is that a string literal is not a type, and any well-formedness guarantees about literals do not and cannot apply to char8_t[N] or char8_t*.
> char8_t s[] = {0xff, 0x0};
> should not be undefined behavior.
> More than that, a standard utf8->scalar_value decoder should not have undefined behavior when handed ill-formed encodings.
>
> On Sat, Apr 24, 2021 at 7:26 AM Corentin via SG16 <sg16_at_[hidden]> wrote:
>>
>>
>> Hello,
>>
>> Consider the following:
>>
>> auto x = u8"\xC0";
>>
>> In [decl.init.string], we say
>>
>> > An array of ordinary character type ([basic.fundamental]), char8_t array, char16_t array, char32_t array, or wchar_t array can be initialized by an ordinary string literal, UTF-8 string literal, UTF-16 string literal, UTF-32 string literal, or wide string literal, respectively [..]
>>
>> In ISO 10646, 9.1 UTF-8
>>
>> > Table 3 lists all the ranges (inclusive) of the octet sequences that are well-formed in UTF-8. Any UTF-8 sequence that does not match the patterns listed in table 3 is ill-formed [..]
>> > As a consequence of the well-formedness conditions specified in table 9.2, the following octet values are disallowed in UTF-8: C0-C1, F5-FE
>>
>> A reading of both standards would lead me to believe that the code is ill-formed.
>> Either the standard represents the intent, in which case this is ill-formed and all implementations need fixing (and we might add a note to the standard), or the standard does not describe the intent.
>>
>> I would argue that it should be ill-formed, exactly because there is no such thing as invalid UTF-8 and allowing that defeats the purpose of UTF-8 literals and char8_t.
>> This doesn't really contradict P2029: there is no value in preventing numeric escape sequences, but there should be a well-formedness check after all other transformations.
>>
>> But regardless of whether we agree on that design question, my reading is that the standard contradicts ISO 10646.
>>
>> Have a great weekend,
>>
>> Corentin
>>
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2021-04-24 14:11:20