On 6/4/21 3:13 AM, Jens Maurer via SG16 wrote:
On 04/06/2021 00.50, Tom Honermann via SG16 wrote:
On 6/2/21 1:47 PM, Richard Smith via SG16 wrote:
On Sun, May 30, 2021 at 6:33 PM Tom Honermann via Liaison <liaison@lists.isocpp.org> wrote:

    I am seeking review feedback on a draft of N2653: char8_t: A type for UTF-8 characters and strings (Revision 1) <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html>.  This paper revises an earlier paper, N2231 <http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm>, from 2018.

    The revision is a rewrite of much of the original paper and follows the C++20 adoption of P0482R6 <https://wg21.link/p0482r6>.  The primary motivation is to maintain source code compatibility between C and C++.

    Notable differences between what was adopted in C++20 and what is proposed for C2X in N2653 <https://rawgit.com/sg16-unicode/sg16/master/papers/n2653.html> are:

     1. In C++20, char8_t is a fundamental type.  The C2X proposal is for a char8_t typedef name of unsigned char.  This is consistent with existing differences between the languages for wchar_t, char16_t, and char32_t.

One of the important properties of char8_t in C++20 is that it's not an "aliases everything" type. Having that diverge between C and C++ seems likely to be problematic.
Thank you, Richard.  I'll update the paper to discuss that in the "typedef name vs a new integer type" design options section.

Does the following sufficiently capture how such problems might realistically materialize?  Do you have other examples?
Before going into the potential C / C++ compatibility problems, it might be worthwhile
to spend a paragraph explaining that "does not alias everything" is a desirable property,
in general.

Thank you, good suggestion.

I updated the "typedef name vs a new integer type" section and have now submitted the paper to WG14.

Tom.


Since char8_t is a distinct type in C++, casts are required before problematic situations can arise there.  In either language, char and unsigned char may be used to examine the underlying storage of char8_t objects, since they alias everything.  The problematic cases therefore involve accessing objects that are not of type char8_t through char8_t lvalues.  With the draft proposal, such cases could arise in C code like the following, which is ill-formed in C++ (where a copy would be required unless/until we introduce an explicit scoped aliasing facility, as we've previously discussed).  Granted, this code might well be written with a cast in order to silence warnings about changes in signedness, and in that case, UB would be introduced in C++.

void do_utf8_things(const char8_t *s) { ... }
void f(const char *presumably_utf8_text) {
  do_utf8_things(presumably_utf8_text);
}
Yes, this is the issue if the "presumably_utf8_text" objects are actually
char objects.

Jens
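
For reference, the cast variant mentioned above might look roughly like the following in C. This is only a sketch under stated assumptions: the typedef mirrors what the draft proposes (char8_t as a typedef name of unsigned char), and the body of do_utf8_things, along with its size_t return type, is hypothetical and not taken from the paper. It counts code points by skipping UTF-8 continuation bytes so that the function has something observable to do.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: mirrors the draft proposal, where char8_t is a typedef
   name of unsigned char rather than a distinct type. */
typedef unsigned char char8_t;

/* Hypothetical body: counts code points by skipping UTF-8
   continuation bytes (bit pattern 10xxxxxx). */
size_t do_utf8_things(const char8_t *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != 0; ++s) {
        if ((*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

size_t f(const char *presumably_utf8_text) {
    /* The explicit cast silences the pointer-signedness diagnostic.
       In C the access is well-defined, because unsigned char may be
       used to access any object. With the distinct char8_t type of
       C++20, the equivalent reinterpret_cast would read char objects
       through a type that does not alias them. */
    return do_utf8_things((const char8_t *)presumably_utf8_text);
}
```

With the typedef approach, the cast changes nothing about how the bytes are read. The same source compiled as C++20 would need a reinterpret_cast, and that is where the undefined behavior discussed above would come in.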