liaison: Re: [wg14/wg21 liaison] (SC22WG14.19411) [SG16] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)

From: Richard Smith <richardsmith_at_[hidden]>
Date: Mon, 7 Jun 2021 17:52:41 -0700

On Thu, Jun 3, 2021 at 3:50 PM Tom Honermann <tom_at_[hidden]> wrote:

> On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
>
> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <
> liaison_at_[hidden]> wrote:
>
>> I am seeking review feedback on a draft of N2653: char8_t: A type for
>> UTF-8 characters and strings (Revision 1)
>> <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>. This
>> paper revises an earlier paper, N2231
>> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.
>>
>> The revision is a rewrite of much of the original paper and follows the
>> C++20 adoption of P0482R6 <https://wg21.link/p0482r6>. The primary
>> motivation is to maintain source code compatibility between C and C++.
>>
>> Notable differences between what was adopted in C++20 and what is
>> proposed for C2X in N2653
>> <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>>
>> 1. In C++20, char8_t is a fundamental type. The C2X proposal is for
>> a char8_t typedef name of unsigned char. This is consistent with
>> existing differences between the languages for wchar_t, char16_t, and
>> char32_t.
>>
> One of the important properties of char8_t in C++20 is that it's not an
> "aliases everything" type. Having that diverge between C and C++ seems
> likely to be problematic.
>
> Thank you, Richard. I'll update the paper to discuss that in the "typedef
> name vs a new integer type" design options section.
>
> Does the following sufficiently capture how such problems might
> realistically materialize? Do you have other examples?
>
> Since char8_t is a distinct type in C++, casts are required before
> problematic situations arise there. In either language, char and unsigned
> char may be used to examine the underlying storage of char8_t objects
> regardless since they alias everything. The problematic cases therefore
> involve accessing non-char8_t typed objects via char8_t types. With the
> draft proposal, such cases could arise in C code like the following, but
> this is ill-formed for C++ (where a copy would be required unless/until we
> introduce an explicit scoped aliasing facility as we've previously
> discussed). Granted, this code might well be written with a cast in order
> to silence warnings about changes in signedness, and in that case, UB would
> be introduced in C++.
>
> void do_utf8_things(const char8_t *s) { ... }
> void f(const char *presumably_utf8_text) {
>   do_utf8_things(presumably_utf8_text);
> }
>
Right, I'd be thinking of something like:

enum MyUTF8 : unsigned char {};
void f(const MyUTF8 *presumably_utf8_text) {
  do_utf8_things((const char8_t*)presumably_utf8_text);
}

... which would likely be defined (and require a correct cast) in C but
undefined in C++.
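
For the C side of this, here is a minimal sketch of why the cast version is expected to be well-defined under the draft proposal. It assumes char8_t ends up as a typedef of unsigned char; `my_char8_t` and `count_code_points` are hypothetical stand-ins (not names from the paper), used so the example compiles without relying on C2X library support.

```c
#include <stddef.h>

/* Hypothetical stand-in for the proposed C typedef of char8_t. */
typedef unsigned char my_char8_t;

/* Count code points by counting bytes that are not UTF-8
   continuation bytes (10xxxxxx). Assumes well-formed input. */
size_t count_code_points(const my_char8_t *s) {
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

size_t f(const char *presumably_utf8_text) {
    /* The cast silences the pointer-signedness diagnostic. In C the
       access is fine because unsigned char may alias any object.
       With a distinct char8_t type, as in C++20, the same access
       would not have that guarantee. */
    return count_code_points((const my_char8_t *)presumably_utf8_text);
}
```

Calling `f` on a NUL-terminated UTF-8 buffer reads the char storage through the unsigned char typedef, which is the pattern that would diverge between the two languages.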

> Tom.
>
>
>> 2. In C++20, a UTF-8 string literal may no longer be used to
>> initialize an array of char, signed char, or unsigned char. The C2X
>> proposal retains these initializations. This is also consistent with
>> existing differences for array initialization by a string literal with a
>> mismatched encoding prefix.
>>
>> The Design Options section discusses these design decisions in more
>> detail.
>>
>> I intend to submit this revision to WG14 later this week. Any feedback
>> is appreciated.
>>
>> Tom.
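
As an illustration of the string-literal difference listed above, the following sketch shows the initializations that C accepts today and that the proposal is described as retaining. The array names and the `arrays_match` helper are hypothetical; the same declarations are ill-formed in C++20.

```c
#include <string.h>

/* C allows a UTF-8 string literal to initialize an array of any
   character type: char, signed char, or unsigned char. */
static const char plain_bytes[] = u8"hi";
static const unsigned char unsigned_bytes[] = u8"hi";
static const signed char signed_bytes[] = u8"hi";

/* Returns 1 if all three arrays hold the same three bytes
   ('h', 'i', and the terminating NUL). */
int arrays_match(void) {
    return sizeof plain_bytes == 3
        && sizeof unsigned_bytes == 3
        && sizeof signed_bytes == 3
        && memcmp(plain_bytes, unsigned_bytes, sizeof plain_bytes) == 0
        && memcmp(plain_bytes, signed_bytes, sizeof plain_bytes) == 0;
}
```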

Received on 2021-06-07 19:52:57