Re: Comments on P2513R0 char8_t Compatibility and Portability Fixes

From: Tom Honermann <tom_at_[hidden]>
Date: Tue, 25 Jan 2022 01:23:22 -0500
> On Jan 24, 2022, at 5:04 AM, Corentin Jabot via SG16 <sg16_at_[hidden]> wrote:
>
> 
> The paper makes an excellent case for
>
> unsigned char foo [] = u8""; and char foo [] = u8"";
>
> However, the paper gives no justification for signed char. UTF-8 code units are numbers in the range [0, 255]. As such, the paper is proposing a conversion from unsigned char to char and does not specify how these signed char elements should be initialized in cases of overflow.
>
> An important point was that u8 literals would always be valid, because all code points in the sequence would have a representation in the storage. And this is the case in phase 5.
> Neither of the following seems a satisfactory solution:
> - We make the program ill-formed if there is an overflow (in effect only allowing ASCII).
> - We just copy the bits over, and now there are negative UTF-8 code units.
>
> And while it is easy to find many examples to motivate the paper in general, the case for signed char isn't motivated *at all*.

Thank you for pointing this out; we should update the paper to discuss this.

The allowance for signed char was added at my suggestion and was motivated by compatibility with C and consistency with ordinary string literals.

Since char might be a signed type, the concerns regarding overflow and negative code units apply there as well. These concerns apply for ordinary string literals when the literal encoding is UTF-8 too.
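To make the negative code unit concern concrete, here is a minimal sketch. The helper name `to_signed_unit` is hypothetical and does not come from the paper. It assumes a two's-complement implementation, which C++20 guarantees; before C++20 the result of this conversion was implementation-defined.

```cpp
// Hypothetical helper (not from P2513): reinterpret one UTF-8 code unit,
// given as an unsigned value in [0, 255], as a signed char.
constexpr signed char to_signed_unit(unsigned char unit) {
    // Since C++20 this conversion is modular (two's complement);
    // earlier standards left out-of-range results implementation-defined.
    return static_cast<signed char>(unit);
}

// U+00E9 is encoded in UTF-8 as the two code units 0xC3 0xA9.
// Both exceed 0x7F, so both become negative when stored in a signed char.
static_assert(to_signed_unit(0xC3) == -61, "0xC3 (195) wraps to -61");
static_assert(to_signed_unit(0xA9) == -87, "0xA9 (169) wraps to -87");

// ASCII code units fit in [0, 127] and are unaffected.
static_assert(to_signed_unit(0x41) == 65, "'A' keeps its value");
```

The same values result for plain char on any platform where char is signed, which is why the concern is not specific to signed char.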

Both C and C++ allow arrays of signed char to be initialized with an ordinary string literal. I think continuing to allow the same for UTF-8 string literals (assuming acceptance as a C++20 DR) makes sense just to avoid a (minor) compatibility issue.

This is also consistent with initialization of signed char with a UTF-8 character literal; that is well-formed thanks to implicit integer conversions.
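The two existing allowances mentioned above can be sketched as follows. The variable names are illustrative only, and the snippet assumes a C++17 or later compiler, since u8 character literals were added in C++17. Initializing a signed char array directly from a u8 string literal is deliberately left out, because whether a compiler accepts it depends on the language mode and on whether the proposed change has been implemented.

```cpp
// An ordinary string literal may initialize an array of signed char
// in both C and C++. The array holds 'a', 'b', 'c', and the terminating '\0'.
signed char letters[] = "abc";

// A UTF-8 character literal has type char in C++17 and char8_t in C++20.
// Either way, an implicit integral conversion produces the signed char value.
constexpr signed char from_u8_literal = u8'a';

static_assert(sizeof(letters) == 4, "three characters plus the terminator");
static_assert(from_u8_literal == 0x61, "U+0061 has the UTF-8 code unit 0x61");
```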

I think use of signed char for text is quite rare, so I don’t have strongly held opinions on this.

Tom.

>
> I would very much be in favor of the paper if the wording was changed to
>
> > Additionally, an array of char or unsigned char may be initialized by a UTF-8 string literal, or by such a string literal enclosed in braces
>
> Thanks,
>
> Corentin
>
>
>
>
>> On Sat, Jan 22, 2022 at 10:31 PM Jens Maurer via SG16 <sg16_at_[hidden]> wrote:
>> Hi,
>>
>> Here are my comments:
>>
>> - Section 3.2, fifth to last word should not have an apostrophe.
>>
>> - The title promises "fixes", but I can see only a single fix in the
>> wording: Allow initialization of an ordinary character array with
>> a UTF-8 string literal. Where are the several fixes?
>>
>> - Wording:
>>
>> "Additionally, an array of ordinary character type may be initialized by a UTF-8
>> string literal, or by a char8_t-typed string-literal enclosed in braces."
>>
>> I agree that "may" (giving permission) is the better verb here compared to
>> "can" in the preceding, existing text.
>>
>> However, we discuss here "UTF-8 string literal", and a few words later we talk
>> about a "char8_t-typed string-literal". Is there any intended difference between
>> these? If so, I need help in seeing the difference. If not, just say
>> ", or by such a string literal enclosed in braces."
>>
>> Jens
>> --
>> SG16 mailing list
>> SG16_at_[hidden]
>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16

Received on 2022-01-25 06:23:23