Re: [wg14/wg21 liaison] [SG16] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Fri, 4 Jun 2021 09:13:49 +0200
On 04/06/2021 00.50, Tom Honermann via SG16 wrote:
> On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
>> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison_at_[hidden]> wrote:
>>
>> I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1) <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>. This paper revises an earlier paper, N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.
>>
>> The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6 <https://wg21.link/p0482r6>. The primary motivation is to maintain source code compatibility between C and C++.
>>
>> Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>>
>> 1. In C++20, char8_t is a fundamental type. The C2X proposal is for a char8_t typedef name of unsigned char. This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.
>>
>> One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.
>
> Thank you, Richard. I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.
>
> Does the following sufficiently capture how such problems might realistically materialize? Do you have other examples?

Before going into the potential C / C++ compatibility problems, it might be worthwhile
to spend a paragraph explaining that "does not alias everything" is a desirable property,
in general.
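
To make that concrete, here is a minimal sketch in C of what "aliases everything" costs. The names n2653_char8_t, struct utf8_buf, and fill are hypothetical and appear nowhere in the paper; the typedef only stands in for the one N2653 proposes. Because the element type is unsigned char, a store through b->data may legally modify any object, including *b itself, so a compiler generally has to assume b->data and b->len can change on every iteration unless it can prove otherwise.

```c
#include <stddef.h>

/* Hypothetical stand-in for the typedef proposed in N2653; named
   differently so it cannot clash with any existing char8_t. */
typedef unsigned char n2653_char8_t;

/* Hypothetical buffer type, used only for illustration. */
struct utf8_buf {
    n2653_char8_t *data;
    size_t len;
};

/* Each store below goes through an unsigned char lvalue, which may
   alias any object. The compiler therefore cannot assume, in general,
   that b->data and b->len are unchanged by the store, and may have
   to reload them on every iteration. */
void fill(struct utf8_buf *b, n2653_char8_t v) {
    for (size_t i = 0; i < b->len; ++i) {
        b->data[i] = v;
    }
}
```

With a distinct, non-aliasing element type (as char8_t is in C++20), the same loop would not carry that assumption.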

> Since char8_t is a distinct type in C++, casts are required before problematic situations arise there. In either language, char and unsigned char may be used to examine the underlying storage of char8_t objects regardless, since they alias everything. The problematic cases therefore involve accessing non-char8_t typed objects via char8_t types. With the draft proposal, such cases could arise in C code like the following, but this is ill-formed in C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility as we've previously discussed). Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.
>
> void do_utf8_things(const char8_t *s) { ... }
> void f(const char *presumably_utf8_text) {
>     do_utf8_things(presumably_utf8_text);
> }

Yes, this is the issue if the "presumably_utf8_text" objects are actually
char objects.
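
For reference, a sketch of the cast variant under the proposed C typedef. The typedef name n2653_char8_t and the body of count_code_points are assumptions made for illustration; the original do_utf8_things body is elided. In C this is well-defined, because the access happens through unsigned char, which may alias the char objects. The same cast to C++20's distinct char8_t type compiles, but reading the char objects through it is the undefined behavior discussed above.

```c
#include <stddef.h>

/* Hypothetical stand-in for the typedef proposed in N2653; named
   differently so it cannot clash with any existing char8_t. */
typedef unsigned char n2653_char8_t;

/* Illustrative replacement for do_utf8_things: counts the bytes that
   are not UTF-8 continuation bytes (10xxxxxx), which equals the number
   of code points in well-formed UTF-8. */
size_t count_code_points(const n2653_char8_t *s) {
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

size_t f(const char *presumably_utf8_text) {
    /* The cast silences the pointer-signedness diagnostic. With the
       typedef, the callee reads char objects via unsigned char, which
       C permits. */
    return count_code_points((const n2653_char8_t *)presumably_utf8_text);
}
```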

Jens

Received on 2021-06-04 02:13:57