On Thu, Jun 3, 2021 at 3:50 PM Tom Honermann <tom@honermann.net> wrote:
On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison@lists.isocpp.org> wrote:

I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1).  This paper revises an earlier paper, N2231, from 2018.

The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6.  The primary motivation is to maintain source code compatibility between C and C++.

Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 are:

  1. In C++20, char8_t is a fundamental type.  The C2X proposal is for a char8_t typedef name of unsigned char.  This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.
One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.

Thank you, Richard.  I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.

Does the following sufficiently capture how such problems might realistically materialize?  Do you have other examples?

Since char8_t is a distinct type in C++, casts are required before problematic situations arise there.  In either language, char and unsigned char may still be used to examine the underlying storage of char8_t objects, since they alias everything.  The problematic cases therefore involve accessing non-char8_t typed objects via char8_t types.  With the draft proposal, such cases could arise in C code like the following, but this is ill-formed for C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility as we've previously discussed).  Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.

void do_utf8_things(const char8_t *s) { ... }
void f(const char *presumably_utf8_text) {
  do_utf8_things(presumably_utf8_text);
}

Right, I'd be thinking of something like:

enum MyUTF8 : unsigned char {};
void f(const MyUTF8 *presumably_utf8_text) {
  do_utf8_things((const char8_t*)presumably_utf8_text);
}

... which would likely be defined (and require a correct cast) in C but undefined in C++.
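
To make the divergence concrete, here is a minimal C sketch, assuming char8_t is the typedef of unsigned char that the draft proposes. The typedef below is only a stand-in for what the draft would provide, and both function names are hypothetical. Because unsigned char may access any object in C, reading signed char storage through a const char8_t * is defined under that assumption; with the distinct C++20 char8_t type, the same access after the cast would be undefined.

```c
#include <stddef.h>

/* Assumption: stand-in for the typedef proposed in the draft. */
typedef unsigned char char8_t;

/* Hypothetical helper: counts UTF-8 continuation bytes (bit pattern 10xxxxxx). */
static size_t count_continuation_bytes(const char8_t *s, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((s[i] & 0xC0) == 0x80) {
            ++count;
        }
    }
    return count;
}

/* Hypothetical caller holding text in a non-char8_t character type.
   The cast silences the pointer-signedness diagnostic. With the typedef
   above, the reads inside the helper go through unsigned char lvalues,
   which C allows for any object. */
static size_t continuation_bytes_in_signed(const signed char *text, size_t n) {
    return count_continuation_bytes((const char8_t *)text, n);
}
```

The same source compiled as C++20 would still build because of the cast, which is exactly why the difference in aliasing rules is easy to miss.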


  2. In C++20, a UTF-8 string literal may no longer be used to initialize an array of char, signed char, or unsigned char.  The C2X proposal retains these initializations.  This is also consistent with existing differences for array initialization by a string literal with a mismatched encoding prefix.

The Design Options section discusses these design decisions in more detail.
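
As an illustration of the second difference, here is a small C sketch of the initializations that C11 already permits and that the proposal retains. The array and function names are hypothetical. C++20 as adopted via P0482R6 makes each of the three initializations ill-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UTF-8 string literal initializing arrays of each character type.
   The literal spells "caf" followed by U+00E9, which UTF-8 encodes
   as the two bytes 0xC3 0xA9. */
static const char          plain_text[]    = u8"caf\u00e9";
static const signed char   signed_text[]   = u8"caf\u00e9";
static const unsigned char unsigned_text[] = u8"caf\u00e9";

/* Three ASCII bytes, two bytes for U+00E9, and the terminating null. */
static_assert(sizeof plain_text == 6, "unexpected UTF-8 length");

/* Hypothetical helper: number of bytes before the terminating null. */
static size_t utf8_byte_length(const unsigned char *s) {
    return strlen((const char *)s);
}
```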

I intend to submit this revision to WG14 later this week.  Any feedback is appreciated.


Liaison mailing list
Link to this post: http://lists.isocpp.org/liaison/2021/05/0597.php