Re: [SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Peter Brett <pbrett_at_[hidden]>
Date: Tue, 14 Sep 2021 09:09:30 +0000
Hi Hubert,

I am extremely unclear about the circumstances in which any of the points you raise would be a problem with either implementing or using P1885 in practice.

Please could you give some specific examples of the kinds of situation that you are worried about?

Best regards,

                  Peter

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Sent: 14 September 2021 05:38
To: SG16 <sg16_at_[hidden]>; C++ Library Evolution Working Group <lib-ext_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>; Peter Brett <pbrett_at_[hidden]>
Subject: P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

P1885 is heavily based on the IANA character set registry, which has a concept termed a "charset". According to RFC 2978 <https://datatracker.ietf.org/doc/html/rfc2978>, a "charset" is "a method of converting a sequence of octets into a sequence of characters". This means that the code units of a "charset" are necessarily limited to 256 distinct values. Since P1885 intends to provide the ability to identify the encodings associated with the translation or execution environments via a "charset"-based abstraction, there is an issue of how to handle encodings whose code units are not octets. This arises both for the "narrow" encoding (when CHAR_BIT is greater than 8) and, more generally, for wide encodings.
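
For concreteness, the mismatch is just a matter of code unit width; a minimal sketch (the output is entirely implementation-dependent, e.g. 8 and 32 on a typical Linux target):

    #include <climits>
    #include <cstdio>

    int main() {
        // An RFC 2978 "charset" maps octets to characters, so it can
        // distinguish at most 256 code unit values.
        std::printf("char:    %d bits per code unit\n", CHAR_BIT);
        std::printf("wchar_t: %zu bits per code unit\n", sizeof(wchar_t) * CHAR_BIT);
    }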

There are at least two plausible interpretations of what the encodings in P1885 represent:
1. Each "encoding" expresses how code unit sequences are interpreted into a sequence of characters; each char/wchar_t element encodes a single code unit.
2. Each "encoding" expresses how octet sequences, obtained through some implementation-defined process of decomposing char/wchar_t elements, are interpreted into a sequence of characters.

The paper should provide clarity in its proposed wording as to what interpretation is intended (see below for some considerations and my opinion).

Under the first interpretation, the IANA character set registry is not suitable for the maintenance of encodings for C++, due to the inherent category error of naming encodings whose code units can take more than 256 values with a registry of octet-based "charsets".

Under the second interpretation, the IANA character set registry could, in theory, be used; however, there would be a proliferation of wide encodings varying in endianness and width unless we acknowledge that the decomposition process may be non-trivial.
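
To see why (assuming the trivial, memcpy-style decomposition): the same wide character yields a different octet sequence for each wchar_t width and byte order, and each such form would need its own registered name:

    #include <cstdio>
    #include <cstring>

    int main() {
        wchar_t c = L'A';
        unsigned char bytes[sizeof c];
        std::memcpy(bytes, &c, sizeof c);
        // e.g. "41 00 00 00" (32-bit LE), "00 00 00 41" (32-bit BE),
        // or "41 00" (16-bit LE) -- one registry entry per variation.
        for (unsigned char b : bytes)
            std::printf("%02X ", static_cast<unsigned>(b));
        std::printf("\n");
    }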

More practically, under the first interpretation, many registered character sets that do not involve multibyte characters are suitable for describing the wide-oriented environment encoding; however, essentially none of the registered character sets involving multibyte characters are (because that interpretation would have each wchar_t hold an individual octet value).

Under the second interpretation, very few registered character sets are suitable for describing the wide-oriented environment encoding without non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4, csUTF32LE, and csUTF32BE: the first two assume native endianness and the last two are BOM-agnostic (unlike csUTF32).
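
For example, a 32-bit wchar_t holding UTF-32 in native byte order could, under a trivial decomposition, be mapped onto those aliases roughly as follows (a C++20 sketch only; the helper name is hypothetical and this is not proposed wording):

    #include <bit>
    #include <climits>
    #include <string_view>

    // Hypothetical helper: pick one of the registry aliases named above for a
    // wide encoding that is UTF-32 in native byte order, assuming a trivial
    // decomposition of wchar_t into octets.
    constexpr std::string_view wide_utf32_charset_name() {
        if constexpr (sizeof(wchar_t) * CHAR_BIT != 32)
            return {};                                 // not a 32-bit wchar_t
        else if constexpr (std::endian::native == std::endian::little)
            return "csUTF32LE";
        else if constexpr (std::endian::native == std::endian::big)
            return "csUTF32BE";
        else
            return "csUCS4";                           // native (mixed) byte order
    }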

Overall, it is questionable whether there is widespread practice of having an identifying name for the wide-oriented environment encoding. GNU iconv provides "WCHAR_T" but arguably does not include names for wide encodings (the Unicode encoding schemes being byte-oriented). Nevertheless, I think a combination of the second interpretation plus explicitly acknowledging the possibility of non-trivial "decomposition" (akin to conversion from UTF-32 to UTF-8) would successfully navigate us past the friction that this note started with. Of the non-Unicode wide encodings that I am aware of, all are closely related to a corresponding "narrow" encoding.
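
For what it's worth, GNU iconv's "WCHAR_T" name already performs exactly that kind of decomposition; a minimal sketch (glibc or GNU libiconv assumed, error handling omitted):

    #include <iconv.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        // "WCHAR_T" names the object representation of the wide execution
        // encoding; converting it to a byte-oriented encoding such as UTF-8
        // is the non-trivial "decomposition" discussed above.
        iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
        if (cd == (iconv_t)-1)
            return 1;

        wchar_t in[] = L"wide text";
        char out[64];
        char* inp = reinterpret_cast<char*>(in);
        char* outp = out;
        std::size_t inleft = sizeof in - sizeof(wchar_t);  // exclude the terminator
        std::size_t outleft = sizeof out;

        iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);

        std::fwrite(out, 1, sizeof out - outleft, stdout);
        std::printf("\n");
    }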

Received on 2021-09-14 04:09:59