On Sat, Sep 18, 2021 at 2:33 AM Jens Maurer <Jens.Maurer@gmx.net> wrote:

On 17/09/2021 23.24, Hubert Tong via Lib-Ext wrote:
> P1885 does not exist in a vacuum. And the existing wording does place a requirement between the narrow and wide execution encodings. I am somewhat convinced that P1885 is not the place to address the wchar_t problems re: UCS-2 versus UTF-16, but I will point out that P1885 theoretically exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not perfectly valid. Previously, only the UTF-16 case was clearly misaligned with the standard; with P1885, the UCS-2 case is also misaligned.

Regardless of outcome, I'd like to see an explanation of the issue in the prose
text of the paper. I understand why UTF-16 is incompatible with the standard's
assumptions for wchar_t, but I'd appreciate some succinct statement why a
requirement (which one?) of P1885 also makes UTF-8 and UCS-2 invalid.

I think it's two sides of the same coin. [basic.fundamental] p8:
The values of type wchar_t can represent distinct codes for all members of the largest extended character set specified among the supported locales.

If claiming UTF-8 as the narrow encoding means that the coded character set is UCS, then UCS-2 does not meet the requirement. Since the need/desire to consider claiming UTF-8 narrow encoding comes from P1885, it is P1885 that makes this a more acute problem than before.
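To make the [basic.fundamental] p8 argument concrete, here is a minimal sketch of the range check involved. It assumes that "the coded character set is UCS" means every code point up to U+10FFFF needs its own distinct value. The names kMaxUcsCodePoint, can_hold_all_of_ucs, and covers_ucs are hypothetical helpers for illustration only, not anything from P1885 or the standard.

```cpp
#include <cstdint>
#include <limits>

// Highest code point in the UCS codespace (U+10FFFF).
constexpr std::uint32_t kMaxUcsCodePoint = 0x10FFFF;

// Hypothetical helper: can a code unit type whose largest value is
// max_code_unit_value give every UCS code point a distinct value?
constexpr bool can_hold_all_of_ucs(std::uint64_t max_code_unit_value) {
    return max_code_unit_value >= kMaxUcsCodePoint;
}

// Hypothetical helper: apply the check to a concrete character type.
template <typename CharT>
constexpr bool covers_ucs() {
    return can_hold_all_of_ucs(
        static_cast<std::uint64_t>(std::numeric_limits<CharT>::max()));
}

// A 16-bit code unit (the UCS-2 case) tops out at 0xFFFF, so it cannot
// represent distinct codes for all members of UCS.
static_assert(!covers_ucs<char16_t>(), "16-bit code units cannot cover UCS");

// A 32-bit code unit can.
static_assert(covers_ucs<char32_t>(), "32-bit code units cover UCS");
```

On a given implementation, covers_ucs<wchar_t>() reports which side of the line wchar_t falls on; the result is platform-dependent, which is why no assertion is made about it here.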

> User expectations of something completely novel are rather hard to guess at. Should the narrow and wide EBCDIC versions of the same character set be called the same charset? For cases where there are no multibyte characters, most indications are "yes". For cases where there are multibyte characters, it seems to be more up in the air. If the answer is "no", then I imagine we end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the endianness is always big endian).

I'd like to point out that the paper talks about encodings, not charsets.

"charset"s as defined by IETF in relation to the IANA character set registry are encodings with octets as code units. See https://datatracker.ietf.org/doc/html/rfc2978#section-1.3.

In my view, obviously narrow and wide encodings for EBCDIC (with no multibyte
characters) are different encodings, because the target type differs.
(There might be value in claiming this and similar cases mentioned above to be
"the same" encoding, but maybe we can come up with a different term, then.)

There are whole other threads starting from https://lists.isocpp.org/sg16/2021/09/2584.php about how to abstractly apply the "charset" concept to C++ types. Technically the target type differs as well between csUnicode and a UCS-2 wchar_t: one is an octet-based native endian UCS-2 encoding scheme and the other is a serialization of the encoding form using 16-bit code units.
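As a rough illustration of the difference between a sequence of 16-bit code units and an octet-based serialization of it, here is a minimal sketch. The function to_big_endian_octets is a hypothetical helper, and big-endian order is chosen purely for the example; it is not a claim about what csUnicode or any particular wchar_t implementation uses.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize 16-bit code units into a sequence of
// octets, most significant octet first (big-endian, chosen for illustration).
std::vector<std::uint8_t> to_big_endian_octets(const std::u16string& units) {
    std::vector<std::uint8_t> octets;
    octets.reserve(units.size() * 2);
    for (char16_t unit : units) {
        octets.push_back(static_cast<std::uint8_t>(unit >> 8));    // high octet
        octets.push_back(static_cast<std::uint8_t>(unit & 0xFF));  // low octet
    }
    return octets;
}
```

The input is a sequence of 16-bit values with no byte order of its own; the output is a sequence of octets in which a byte order has been fixed. The two have different element types even though they carry the same code points.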

Jens