liaison: Re: [wg14/wg21 liaison] (SC22WG14.19417) [SG16] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 5 Jun 2021 09:50:54 -0400

> On Jun 4, 2021, at 7:47 PM, Victor Yodaiken <victor.yodaiken_at_[hidden]> wrote:
>
>
>
>> On Fri, Jun 4, 2021 at 7:06 PM Tom Honermann <tom_at_[hidden]> wrote:
>>> On 6/4/21 3:13 AM, Jens Maurer via SG16 wrote:
>>>> On 04/06/2021 00.50, Tom Honermann via SG16 wrote:
>>>>> On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
>>>>> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison_at_[hidden]> wrote:
>>>>>
>>>>> I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1) <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>. This paper revises an earlier paper, N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.
>>>>>
>>>>> The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6 <https://wg21.link/p0482r6>. The primary motivation is to maintain source code compatibility between C and C++.
>>>>>
>>>>> Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>>>>>
>>>>> 1. In C++20, char8_t is a fundamental type. The C2X proposal is for a char8_t typedef name of unsigned char. This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.
>>>>>
>>>>> One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.
>>>> Thank you, Richard. I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.
>>>>
>>>> Does the following sufficiently capture how such problems might realistically materialize? Do you have other examples?
>>> Before going into the potential C / C++ compatibility problems, it might be worthwhile
>>> to spend a paragraph explaining that "does not alias everything" is a desirable property,
>>> in general.
>
>
> Why is it not desirable?

In the updates made to the paper, I noted that there is a tradeoff between code efficiency and safety.

>
>
>> Thank you, good suggestion.
>>
>> I updated the "typedef name vs a new integer type" section and have now submitted the paper to WG14.
>>
>> Tom.
>>
>>>> Since char8_t is a distinct type in C++, casts are required before problematic situations arise there. In either language, char and unsigned char may be used to examine the underlying storage of char8_t objects regardless since they alias everything. The problematic cases therefore involve accessing non-char8_t typed objects via char8_t types. With the draft proposal, such cases could arise in C code like the following, but this is ill-formed for C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility as we've previously discussed). Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.
>>>>
>>>> void do_utf8_things(const char8_t *s) { ... }
>>>> void f(const char *presumably_utf8_text) {
>>>>     do_utf8_things(presumably_utf8_text);
>>>> }
>>> Yes, this is the issue if the "presumably_utf8_text" objects are actually
>>> char objects.
>>>
>>> Jens
>
> That seems more useful. What is the purpose of creating more type restrictions?

The usual reasons: a coherent object model enables type-based analysis for improved code generation and other forms of static analysis.
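
As a rough illustration of the code generation side, here is a minimal sketch. It is not taken from the paper, and the names my_char8_t, fill_char8, and fill_short are hypothetical. It assumes char8_t is a typedef of unsigned char, as proposed for C. Because character types may alias any object, a compiler has to assume that a store through the my_char8_t pointer could modify *count and reload it on every iteration. No such assumption is needed for the short pointer, which is how a distinct, non-aliasing char8_t behaves in C++20.

```c
#include <stddef.h>

/* Assumption: stands in for the proposed C typedef; renamed to avoid
   clashing with any char8_t that <uchar.h> might already provide. */
typedef unsigned char my_char8_t;

/* buf has a character type, so a store to buf[i] may legally modify
   *count. The compiler must reload *count on each iteration. */
int fill_char8(my_char8_t *buf, const int *count) {
    int written = 0;
    for (int i = 0; i < *count; ++i) {
        buf[i] = 0;
        ++written;
    }
    return written;
}

/* A short lvalue cannot be used to access an int object, so the
   compiler is free to load *count once before the loop. */
int fill_short(short *buf, const int *count) {
    int written = 0;
    for (int i = 0; i < *count; ++i) {
        buf[i] = 0;
        ++written;
    }
    return written;
}
```

Both functions return the same result for non-overlapping arguments; the difference is only in what the optimizer is permitted to assume, and whether it acts on that depends on the compiler and flags.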

Tom.

Received on 2021-06-05 08:51:00