[SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 14 Sep 2021 00:37:38 -0400
P1885 is heavily based on the IANA character set registry, which has a
concept termed a "charset". According to RFC 2978
<https://datatracker.ietf.org/doc/html/rfc2978>, a "charset" is "a method
of converting a sequence of octets into a sequence of characters". This
means that the variety of code units for a "charset" is necessarily limited
to 256 different code units. Since P1885 intends to provide the ability to
identify encodings associated with the translation or execution
environments with a "charset" based abstraction, there is a bit of an issue
on how to manage encodings whose code units are not octets. This arises
both for the "narrow" encoding (when CHAR_BIT is greater than 8) and, more
generally, for wide encodings.
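
To make the mismatch concrete, here is a minimal sketch that checks whether a given character type can be described as an octet at all. The helper names `code_unit_bits` and `code_unit_is_octet` are made up for illustration and are not part of P1885.

```cpp
#include <cassert>
#include <climits>  // CHAR_BIT
#include <limits>   // std::numeric_limits

// Number of value bits in one code unit of type T (hypothetical helper).
// numeric_limits<T>::digits excludes the sign bit, so add it back.
template <typename T>
constexpr unsigned code_unit_bits() {
    return static_cast<unsigned>(std::numeric_limits<T>::digits) +
           (std::numeric_limits<T>::is_signed ? 1u : 0u);
}

// True when a T code unit has exactly 8 value bits, that is, when it can
// take at most 256 values and so fits the RFC 2978 notion of an octet.
template <typename T>
constexpr bool code_unit_is_octet() {
    return code_unit_bits<T>() == 8;
}
```

On a typical implementation with CHAR_BIT == 8, `code_unit_is_octet<char>()` is true and `code_unit_is_octet<wchar_t>()` is false. With a larger CHAR_BIT, both are false.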

There are at least two plausible interpretations of what the encodings in
P1885 represent:
1. Each "encoding" expresses how code unit sequences are interpreted into a
   sequence of characters; each char/wchar_t element encodes a single code
   unit.
2. Each "encoding" expresses how octet sequences, obtained through some
   implementation-defined process to decompose char/wchar_t elements, are
   interpreted into a sequence of characters.
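
As a rough sketch of the difference between the two interpretations, the code below treats the same wide string first as a code unit sequence and then as an octet sequence produced by a trivial decomposition. It assumes a 32-bit wide character type and uses char32_t as a stand-in for wchar_t so that the width is fixed. The names `byte_order`, `as_code_units`, and `decompose` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum class byte_order { little, big };  // hypothetical, for illustration

// Interpretation 1: each element is itself one code unit, so nothing is
// split. The result has one entry per element.
std::vector<std::uint32_t> as_code_units(const std::u32string& s) {
    return std::vector<std::uint32_t>(s.begin(), s.end());
}

// Interpretation 2 with a trivial decomposition: each 32-bit element is
// split into four octets in the requested byte order. The "charset" then
// describes how this octet sequence maps to characters.
std::vector<std::uint8_t> decompose(const std::u32string& s, byte_order order) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 4);
    for (char32_t c : s) {
        for (int i = 0; i < 4; ++i) {
            const int shift = (order == byte_order::little) ? 8 * i : 8 * (3 - i);
            out.push_back(static_cast<std::uint8_t>((c >> shift) & 0xFF));
        }
    }
    return out;
}
```

For the one-element string U"A", the first function yields the single code unit 0x41, while the second yields 41 00 00 00 or 00 00 00 41 depending on the byte order. A byte-oriented name would therefore have to distinguish the two orders.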

The paper should provide clarity in its proposed wording as to what
interpretation is intended (see below for some considerations and my
opinion).

Under the first interpretation, the IANA character set registry is not
suitable for the maintenance of encodings for C++ due to the inherent
category error involved with encodings having code units that involve more
than 256 values.

Under the second interpretation, the IANA character set registry could, in
theory, be used; however, there would be a proliferation of wide encodings
that include variations in endianness and width unless we acknowledge
that the decomposition process may be non-trivial.

More practically, under the first interpretation, many registered character
sets that do not involve multibyte characters are suitable for describing
the wide-oriented environment encoding; however, basically none of the
registered character sets involving multibyte characters are (because the
interpretation involves having each wchar_t taken as holding an individual
octet value).

Under the second interpretation, very few registered character sets are
suitable for describing the wide-oriented environment encoding without
non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4,
csUTF32LE, and csUTF32BE: the first two assume native endianness and the
last two are BOM-agnostic (unlike csUTF32).

Overall, it is questionable whether there is widespread practice in being
able to have an identifying name for the wide-oriented environment
encoding. GNU iconv provides "WCHAR_T" but arguably does not include names
for wide encodings (the Unicode encoding schemes being byte-oriented).
Nevertheless, I think a combination of the second interpretation plus
explicitly acknowledging the possibility of non-trivial "decomposition"
(akin to conversion from UTF-32 to UTF-8) would successfully navigate us
past the friction that this note started with. Of the non-Unicode wide
encodings that I am aware of, all are closely related to a corresponding
"narrow" encoding.
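
For illustration, a non-trivial decomposition "akin to conversion from UTF-32 to UTF-8" could look like the sketch below. It assumes the wide elements hold valid Unicode scalar values and performs no validation of surrogates or out-of-range values. char32_t again stands in for a 32-bit wchar_t, and `to_utf8_octets` is a made-up name, not something proposed in P1885.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decompose each wide element into one to four octets using the UTF-8
// encoding form. Assumes every element is a valid Unicode scalar value.
std::vector<std::uint8_t> to_utf8_octets(const std::u32string& s) {
    std::vector<std::uint8_t> out;
    for (char32_t c : s) {
        assert(c <= 0x10FFFF);  // precondition only, not full validation
        if (c < 0x80) {
            out.push_back(static_cast<std::uint8_t>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<std::uint8_t>(0xF0 | (c >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With a decomposition of this kind, the octet sequence matches a registered byte-oriented "charset" (here UTF-8) regardless of the width or endianness of the wide character type.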

Received on 2021-09-13 23:38:08