Re: [wg14/wg21 liaison] [SG16] Draft WG14 N2653: char8_t: A type for UTF-8 characters and strings (Revision 1)

From: Tom Honermann <tom_at_[hidden]>
Date: Thu, 3 Jun 2021 18:50:26 -0400
On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
> On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison
> <liaison_at_[hidden]> wrote:
>
> I am seeking review feedback on a draft of N2653: char8_t: A type
> for UTF-8 characters and strings (Revision 1)
> <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>.
> This paper revises an earlier paper, N2231
> <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from
> 2018.
>
> The revision is a rewrite of much of the original paper and
> follows the C++20 adoption of P0482R6
> <https://wg21.link/p0482r6>. The primary motivation is to
> maintain source code compatibility between C and C++.
>
> Notable differences between what was adopted in C++20 and what is
> proposed for C2X in N2653
> <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:
>
> 1. In C++20, char8_t is a fundamental type. The C2X proposal is
> for a char8_t typedef name of unsigned char. This is
> consistent with existing differences between the languages for
> wchar_t, char16_t, and char32_t.
>
> One of the important properties of char8_t in C++20 is that it's not
> an "aliases everything" type. Having that diverge between C and C++
> seems likely to be problematic.

Thank you, Richard. I'll update the paper to discuss that in the
"typedef name vs a new integer type" design options section.

Does the following sufficiently capture how such problems might
realistically materialize? Do you have other examples?

Since char8_t is a distinct type in C++, problematic situations can only
arise there once a cast is involved. In either language, char and
unsigned char may be used to examine the underlying storage of char8_t
objects regardless, since they alias everything. The problematic cases
therefore involve accessing objects of non-char8_t type through char8_t
lvalues. With the draft proposal, such cases could arise in C code like
the following, which is ill-formed in C++ (where a copy would be
required unless/until we introduce an explicit scoped aliasing facility,
as we've previously discussed). Granted, this code might well be written
with a cast in order to silence warnings about the change in signedness,
and in that case the same source would compile as C++ but have undefined
behavior there.

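void do_utf8_things(const char8_t *s) { /* ... */ }
void f(const char *presumably_utf8_text) {
   /* No cast: C compilers typically warn about the pointer signedness
      change here; C++ rejects the call. */
   do_utf8_things(presumably_utf8_text);
}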
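
To make the cast variant concrete, here is a minimal C sketch under the
draft proposal's assumption that char8_t is a typedef name for unsigned
char. The typedef is declared locally rather than taken from <uchar.h>
so that it builds with pre-C2X compilers, and count_code_points is a
hypothetical stand-in for do_utf8_things, not something from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: models the draft N2653 proposal, where char8_t is a typedef
   name for unsigned char. Declared locally; no C2X <uchar.h> is required. */
typedef unsigned char char8_t;

/* Hypothetical stand-in for do_utf8_things: counts code points by counting
   the bytes that are not UTF-8 continuation bytes (bit pattern 10xxxxxx).
   Assumes s points to a null-terminated, well-formed UTF-8 sequence. */
size_t count_code_points(const char8_t *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != 0; ++s) {
        if ((*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

/* The explicit cast silences the signedness warning. With the typedef,
   the reads in count_code_points go through unsigned char, which may
   alias any object, so this is well-defined in C. */
size_t f(const char *presumably_utf8_text) {
    return count_code_points((const char8_t *)presumably_utf8_text);
}
```

With these definitions, f("na\xC3\xAFve") returns 5, since the two-byte
sequence C3 AF contributes one lead byte and one continuation byte.
Compiled as C++20 with the built-in char8_t instead of the typedef, the
same cast would still be accepted, but the reads through char8_t would be
the undefined behavior described above.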

Tom.

> 2. In C++20, a UTF-8 string literal may no longer be used to
> initialize an array of char, signed char, or unsigned char.
> The C2X proposal retains these initializations. This is also
> consistent with existing differences for array initialization
> by a string literal with a mismatched encoding prefix.
>
> The Design Options section discusses these design decisions in
> more detail.
>
> I intend to submit this revision to WG14 later this week. Any
> feedback is appreciated.
>
> Tom.
>
> _______________________________________________
> Liaison mailing list
> Liaison_at_[hidden]
> Subscription: https://lists.isocpp.org/mailman/listinfo.cgi/liaison
> Link to this post: http://lists.isocpp.org/liaison/2021/05/0597.php
>
>


Received on 2021-06-03 17:50:32