Re: [SG16] [isocpp-lib-ext] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sat, 18 Sep 2021 08:33:14 +0200

On 17/09/2021 23.24, Hubert Tong via Lib-Ext wrote:
> P1885 does not exist in a vacuum. And the existing wording does place a requirement between the narrow and wide execution encodings. I am somewhat convinced that P1885 is not the place to address the wchar_t problems re: UCS-2 versus UTF-16, but I will point out that P1885 theoretically exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not perfectly valid. Previously, only the UTF-16 case was clearly misaligned with the standard; with P1885, the UCS-2 case is also misaligned.

Regardless of outcome, I'd like to see an explanation of the issue in the prose
text of the paper. I understand why UTF-16 is incompatible with the standard's
assumptions for wchar_t, but I'd appreciate some succinct statement why a
requirement (which one?) of P1885 also makes UTF-8 and UCS-2 invalid.
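
For readers who want the UTF-16 half of that spelled out: the standard expects
a single wchar_t to hold any member of the wide execution character set, while
UTF-16 needs two code units for anything outside the Basic Multilingual Plane.
A minimal sketch of that arithmetic follows; the function name to_utf16 is
made up for illustration and is not taken from P1885 or from any library. It
says nothing about the UCS-2 question, which is the part I am asking about.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from P1885): encode one Unicode code point
// as UTF-16 code units. No validation is done; the input is assumed to
// be a valid scalar value (not a surrogate, at most U+10FFFF).
std::vector<char16_t> to_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // BMP code point: one 16-bit code unit is enough.
        return { static_cast<char16_t>(cp) };
    }
    // Supplementary code point: split the 20-bit offset into a
    // high surrogate and a low surrogate.
    cp -= 0x10000;
    return { static_cast<char16_t>(0xD800 + (cp >> 10)),
             static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) };
}
```

With a 16-bit wchar_t, U+0041 fits in one object, but U+1F600 comes out as the
pair 0xD83D 0xDE00, so no single wchar_t value can represent it.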

> User expectations of something completely novel are rather hard to guess at. Should the narrow and wide EBCDIC versions of the same character set be called the same charset? For cases where there are no multibyte characters, most indications are "yes". For cases where there are multibyte characters, it seems to be more up in the air. If the answer is "no", then I imagine we end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the endianness is always big endian).

I'd like to point out that the paper talks about encodings, not charsets.
In my view, obviously narrow and wide encodings for EBCDIC (with no multibyte
characters) are different encodings, because the target type differs.
(There might be value in claiming this and similar cases mentioned above to be
"the same" encoding, but maybe we can come up with a different term, then.)

Jens

Received on 2021-09-18 01:33:23