Re: [SG16] [isocpp-lib-ext] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 18 Sep 2021 08:49:54 +0200
On Sat, Sep 18, 2021 at 8:33 AM Jens Maurer <Jens.Maurer_at_[hidden]> wrote:

> On 17/09/2021 23.24, Hubert Tong via Lib-Ext wrote:
> > P1885 does not exist in a vacuum. And the existing wording does place a
> > requirement between the narrow and wide execution encodings. I am somewhat
> > convinced that P1885 is not the place to address the wchar_t problems re:
> > UCS-2 versus UTF-16, but I will point out that P1885 theoretically
> > exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not
> > perfectly valid. Previously, only the UTF-16 case was clearly misaligned
> > with the standard; with P1885, the UCS-2 case is also misaligned.
>
> Regardless of outcome, I'd like to see an explanation of the issue in the
> prose text of the paper. I understand why UTF-16 is incompatible with the
> standard's assumptions for wchar_t, but I'd appreciate some succinct
> statement why a requirement (which one?) of P1885 also makes UTF-8 and
> UCS-2 invalid.
>

Sure, I can add more prose.
To clarify your second sentence, P1885 does not preclude returning UCS-2.
However, the wide encoding on Windows is documented to be UTF-16, and
that is probably what the implementation would want to return:
https://docs.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings
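
As a rough illustration of the UCS-2 versus UTF-16 distinction (a sketch only,
not part of P1885; the helper names has_surrogates and code_unit_count are
made up for this example, and char16_t is used instead of wchar_t because the
width of wchar_t differs between platforms): a character outside the BMP takes
two 16-bit code units in UTF-16, which UCS-2 cannot express.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: true if any code unit is a UTF-16 surrogate
// (0xD800 through 0xDFFF). Such code units carry no character in UCS-2.
constexpr bool has_surrogates(std::u16string_view s) {
    for (char16_t unit : s) {
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: number of 16-bit code units in the string.
constexpr std::size_t code_unit_count(std::u16string_view s) {
    return s.size();
}

// U+20AC (euro sign) is in the BMP: one code unit, same in UCS-2 and UTF-16.
static_assert(code_unit_count(u"\u20AC") == 1);
static_assert(!has_surrogates(u"\u20AC"));

// U+1F600 is outside the BMP: UTF-16 needs a surrogate pair, so one
// character no longer fits in a single 16-bit code unit.
static_assert(code_unit_count(u"\U0001F600") == 2);
static_assert(has_surrogates(u"\U0001F600"));
```

The second pair of checks is the case where a single 16-bit wchar_t object
cannot hold the whole character.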


>
> > User expectations of something completely novel are rather hard to guess
> > at. Should the narrow and wide EBCDIC versions of the same character set be
> > called the same charset? For cases where there are no multibyte characters,
> > most indications are "yes". For cases where there are multibyte characters,
> > it seems to be more up in the air. If the answer is "no", then I imagine we
> > end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the
> > endianness is always big endian).
>
> I'd like to point out that the paper talks about encodings, not charsets.
> In my view, obviously narrow and wide encodings for EBCDIC (with no
> multibyte characters) are different encodings, because the target type
> differs. (There might be value in claiming this and similar cases mentioned
> above to be "the same" encoding, but maybe we can come up with a different
> term, then.)
>

Yes, they are different encodings.


>
> Jens
>

Received on 2021-09-18 01:50:13