Subject: Re: [wg14/wg21 liaison] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)
From: Jens Maurer (Jens.Maurer_at_[hidden])
Date: 2021-06-04 02:13:49

On 04/06/2021 00.50, Tom Honermann via SG16 wrote:
> On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
>> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison_at_[hidden]> wrote:
>>> I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1) <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>.  This paper revises an earlier paper, N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.
>>> The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6 <https://wg21.link/p0482r6>.  The primary motivation is to maintain source code compatibility between C and C++.
>>> Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>>> 1. In C++20, char8_t is a fundamental type.  The C2X proposal is for a char8_t typedef name of unsigned char.  This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.
>> One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.
> Thank you, Richard.  I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.
> Does the following sufficiently capture how such problems might realistically materialize?  Do you have other examples?

Before going into the potential C / C++ compatibility problems, it might be worthwhile
to spend a paragraph explaining that "does not alias everything" is a desirable property,
in general.

> Since char8_t is a distinct type in C++, casts are required before problematic situations arise there.  In either language, char and unsigned char may be used to examine the underlying storage of char8_t objects regardless since they alias everything.  The problematic cases therefore involve accessing non-char8_t typed objects via char8_t types.  With the draft proposal, such cases could arise in C code like the following, but this is ill-formed for C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility as we've previously discussed).  Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.
> void do_utf8_things(const char8_t *s) { ... }
> void f(const char *presumably_utf8_text) {
>   do_utf8_things(presumably_utf8_text);
> }

Yes, this is the issue if the "presumably_utf8_text" objects are actually
char objects.
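
For illustration, the quoted snippet can be filled out into a compilable C sketch. This is not taken from the paper: the typedef name utf8_unit is a hypothetical stand-in for the proposed char8_t typedef (renamed so it cannot clash with a char8_t that a newer <uchar.h> might already declare), the code-point-counting body of do_utf8_things is invented, and f_copy is an assumed name for the copy-based alternative mentioned in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: stand-in for the typedef proposed in the N2653 draft
   (char8_t as a typedef name of unsigned char). */
typedef unsigned char utf8_unit;

/* Hypothetical body: counts code points by skipping UTF-8
   continuation bytes (bit pattern 10xxxxxx). */
size_t do_utf8_things(const utf8_unit *s) {
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

/* Typedef approach in C: the cast only silences the signedness
   diagnostic. Reading char objects through unsigned char is allowed
   in C. With C++20's distinct char8_t, the equivalent cast followed
   by a read is the undefined behavior discussed above. */
size_t f(const char *presumably_utf8_text) {
    return do_utf8_things((const utf8_unit *)presumably_utf8_text);
}

/* Copy-based variant: the bytes are copied into caller-provided
   utf8_unit storage first, so no char object is read through a
   utf8_unit lvalue. buf_size counts elements, including room for
   the terminating zero. */
size_t f_copy(const char *presumably_utf8_text,
              utf8_unit *buf, size_t buf_size) {
    size_t len = strlen(presumably_utf8_text) + 1; /* with terminator */
    assert(len <= buf_size);
    memcpy(buf, presumably_utf8_text, len);
    return do_utf8_things(buf);
}
```

Both functions return the same count in C. The difference only matters once the parameter type is a distinct, non-aliasing type as in C++20, where the f_copy shape stays valid and the cast in f does not.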


SG16 list run by sg16-owner@lists.isocpp.org