
Subject: Re: [wg14/wg21 liaison] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)
From: Tom Honermann (tom_at_[hidden])
Date: 2021-06-04 18:01:23

On 6/4/21 3:13 AM, Jens Maurer via SG16 wrote:
> On 04/06/2021 00.50, Tom Honermann via SG16 wrote:
>> On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
>>> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison_at_[hidden]> wrote:
>>> I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1) <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>.  This paper revises an earlier paper, N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.
>>> The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6 <https://wg21.link/p0482r6>.  The primary motivation is to maintain source code compatibility between C and C++.
>>> Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>>> 1. In C++20, char8_t is a fundamental type.  The C2X proposal is for a char8_t typedef name of unsigned char.  This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.
>>> One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.
>> Thank you, Richard.  I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.
>> Does the following sufficiently capture how such problems might realistically materialize?  Do you have other examples?
> Before going into the potential C / C++ compatibility problems, it might be worthwhile
> to spend a paragraph explaining that "does not alias everything" is a desirable property,
> in general.

Thank you, good suggestion.

I updated the "typedef name vs a new integer type"
section and have now submitted the paper to WG14.


>> Since char8_t is a distinct type in C++, casts are required before problematic situations arise there.  In either language, char and unsigned char may be used to examine the underlying storage of char8_t objects regardless since they alias everything.  The problematic cases therefore involve accessing non-char8_t typed objects via char8_t types.  With the draft proposal, such cases could arise in C code like the following, but this is ill-formed for C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility as we've previously discussed).  Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.
>> void do_utf8_things(const char8_t *s) { ... }
>> void f(const char *presumably_utf8_text) {
>>   do_utf8_things(presumably_utf8_text);
>> }
> Yes, this is the issue if the "presumably_utf8_text" objects are actually
> char objects.
> Jens
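
For readers following the thread, here is a minimal C sketch of the cast variant described in the quoted text. It assumes the draft's design, in which char8_t is a typedef name for unsigned char; count_code_points is a hypothetical helper invented for illustration (it does not come from the paper) and performs no UTF-8 validation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the draft N2653 design, where char8_t is a typedef
   name for unsigned char (not a distinct type as in C++20). */
typedef unsigned char char8_t;

/* Hypothetical helper: counts code points by skipping UTF-8 continuation
   bytes (bit pattern 10xxxxxx). It does not validate the input. */
size_t count_code_points(const char8_t *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0u) != 0x80u) {
            ++n;
        }
    }
    return n;
}

size_t f(const char *presumably_utf8_text) {
    /* The cast silences the signedness diagnostic. With the typedef above,
       the reads go through unsigned char, which may alias any object, so
       this is well-defined in C. With the distinct C++20 char8_t type,
       reading char objects through the cast pointer would be undefined. */
    return count_code_points((const char8_t *)presumably_utf8_text);
}
```

Under these assumptions, f("abc") returns 3 and f("\xC3\xA9") (a single two-byte sequence) returns 1. The same source compiled as C++20 would need the cast to compile and would then read char objects through a type that does not alias them, which is the divergence under discussion.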
