Re: [SG16] [isocpp-lib-ext] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sat, 18 Sep 2021 11:27:03 -0400
On Sat, Sep 18, 2021 at 2:33 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 17/09/2021 23.24, Hubert Tong via Lib-Ext wrote:
> > P1885 does not exist in a vacuum. And the existing wording does place a
> requirement between the narrow and wide execution encodings. I am somewhat
> convinced that P1885 is not the place to address the wchar_t problems re:
> UCS-2 versus UTF-16, but I will point out that P1885 theoretically
> exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not
> perfectly valid. Previously, only the UTF-16 case was clearly misaligned
> with the standard; with P1885, the UCS-2 case is also misaligned.
>
> Regardless of outcome, I'd like to see an explanation of the issue in the
> prose text of the paper. I understand why UTF-16 is incompatible with the
> standard's assumptions for wchar_t, but I'd appreciate some succinct
> statement why a requirement (which one?) of P1885 also makes UTF-8 and
> UCS-2 invalid.
>

I think it's two sides of the same coin. [basic.fundamental] p8:
The values of type wchar_t can represent distinct codes for all members of
the largest extended character set specified among the supported locales.

If claiming UTF-8 as the narrow encoding means that the coded character set
is UCS, then UCS-2 does not meet the requirement. Since the need/desire to
consider claiming UTF-8 narrow encoding comes from P1885, it is P1885 that
makes this a more acute problem than before.
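
As a rough illustration of the [basic.fundamental] p8 requirement, the
following sketch checks whether a single value of a given code unit type can
give a distinct code to every UCS code point (up to U+10FFFF). The helper
name can_hold_all_ucs and the constant kMaxUcsCodePoint are hypothetical and
not part of P1885 or the standard; the checks assume char16_t is 16 bits
wide, as on mainstream implementations.

```cpp
#include <cstdint>
#include <limits>

// Largest code point in the UCS (ISO/IEC 10646) code space.
constexpr std::uint32_t kMaxUcsCodePoint = 0x10FFFF;

// Hypothetical helper (illustration only): true if one value of CodeUnit
// can hold a distinct code for every UCS code point.
template <typename CodeUnit>
constexpr bool can_hold_all_ucs() {
    return static_cast<std::uint64_t>(std::numeric_limits<CodeUnit>::max())
           >= kMaxUcsCodePoint;
}

// A 16-bit code unit (the UCS-2 case) cannot cover the whole UCS.
static_assert(!can_hold_all_ucs<char16_t>(), "16-bit units stop at U+FFFF");

// A 32-bit code unit can.
static_assert(can_hold_all_ucs<char32_t>(), "32-bit units cover the UCS");

// For wchar_t the answer is implementation-defined: typically false where
// wchar_t is 16 bits wide and true where it is 32 bits wide.
bool wchar_t_can_hold_all_ucs() {
    return can_hold_all_ucs<wchar_t>();
}
```

If the narrow encoding is claimed to be UTF-8 and the coded character set is
therefore UCS, an implementation where wchar_t_can_hold_all_ucs() returns
false is the misaligned case described above.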


>
> > User expectations of something completely novel are rather hard to guess
> at. Should the narrow and wide EBCDIC versions of the same character set be
> called the same charset? For cases where there are no multibyte characters,
> most indications are "yes". For cases where there are multibyte characters,
> it seems to be more up in the air. If the answer is "no", then I imagine we
> end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the
> endianness is always big endian).
>
> I'd like to point out that the paper talks about encodings, not charsets.
>

"charset"s as defined by IETF in relation to the IANA character set
registry are encodings with octets as code units. See
https://datatracker.ietf.org/doc/html/rfc2978#section-1.3.


> In my view, obviously narrow and wide encodings for EBCDIC (with no
> multibyte characters) are different encodings, because the target type
> differs. (There might be value in claiming this and similar cases mentioned
> above to be "the same" encoding, but maybe we can come up with a different
> term, then.)
>

There are whole other threads starting from
https://lists.isocpp.org/sg16/2021/09/2584.php about how to abstractly
apply the "charset" concept to C++ types. Technically the target type
differs as well between csUnicode and a UCS-2 wchar_t: one is an
octet-based native endian UCS-2 encoding scheme and the other is the UCS-2
encoding form itself, using 16-bit code units.
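
The difference between a sequence of 16-bit code units and its octet
serialization can be sketched as follows. The function to_native_octets is a
hypothetical name for illustration only; the sketch assumes char16_t occupies
exactly two octets and uses std::u16string as a stand-in for a 16-bit wide
string.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): serialize 16-bit code units into
// octets in the byte order of the machine running the code.
// Assumes sizeof(char16_t) == 2 with 8-bit bytes.
std::vector<unsigned char> to_native_octets(const std::u16string& units) {
    std::vector<unsigned char> octets(units.size() * sizeof(char16_t));
    if (!octets.empty()) {
        // Copying the object representation yields the native endian
        // octet sequence for the same code units.
        std::memcpy(octets.data(), units.data(), octets.size());
    }
    return octets;
}
```

The input is what a 16-bit wchar_t string holds (code units); the output is
what an octet-based charset label describes (a byte sequence whose order
depends on the platform).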


>
> Jens
>

Received on 2021-09-18 10:27:31