sg16: Re: [SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Corentin <corentin.jabot_at_[hidden]>
Date: Tue, 14 Sep 2021 11:07:03 +0200

On Tue, Sep 14, 2021 at 6:38 AM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> P1885 is heavily based on the IANA character set registry, which has a
> concept termed a "charset". According to RFC 2978
> <https://datatracker.ietf.org/doc/html/rfc2978>, a "charset" is "a method
> of converting a sequence of octets into a sequence of characters". This
> means that the variety of code units for a "charset" is necessarily limited
> to 256 different code units. Since P1885 intends to provide the ability to
> identify encodings associated with the translation or execution
> environments with a "charset" based abstraction, there is a bit of an issue
> on how to manage encodings whose code units are not octets. This arises
> both for the "narrow" encoding (when CHAR_BIT is greater than 8) and for
> more generally for wide encodings.
>

A code unit does not need to be a byte.
UTF-16, for example, is considered to have 16-bit code units.
Decomposing code units into bytes then only requires knowing the byte order.
Note that, with the exception of UTF-16BE, UTF-16LE, etc., considerations
about byte order are usually left to the application.
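
To make the byte-order point concrete, here is a minimal sketch of that
decomposition step. It is not part of P1885; the names `ByteOrder` and
`to_octets` are hypothetical and only serve the illustration. It assumes the
code units are already 16-bit values and only shows that the same code unit
sequence yields two different octet sequences depending on byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical names, for illustration only.
enum class ByteOrder { Big, Little };

// Serialize 16-bit code units into octets using the requested byte order.
std::vector<std::uint8_t> to_octets(std::u16string_view units, ByteOrder order) {
    std::vector<std::uint8_t> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}

// Usage: the single code unit 0x0041 ('A') serialized both ways.
void check_to_octets() {
    const std::vector<std::uint8_t> big = to_octets(u"A", ByteOrder::Big);
    const std::vector<std::uint8_t> little = to_octets(u"A", ByteOrder::Little);
    assert(big.size() == 2 && big[0] == 0x00 && big[1] == 0x41);
    assert(little.size() == 2 && little[0] == 0x41 && little[1] == 0x00);
}
```

The encoding (UTF-16) is the same in both calls; only the serialization to
octets differs, which is why it can be left to the application.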

As such, an encoding is a mechanism that represents elements of text in a
sequence of code units, the nature of which is tied to that specific
encoding.
This is completely independent of a specific implementation, let alone C++
character types.

> There are at least two plausible interpretations to what the encodings in
> P1885 represent:
> Each "encoding" expresses how code unit sequences are interpreted into a
> sequence of characters; each char/wchar_t element encodes a single code
> unit.
> Each "encoding" expresses how octet sequences, obtained through some
> implementation-defined process to decompose char/wchar_t elements, are
> interpreted into a sequence of characters.
>
> The paper should provide clarity in its proposed wording as to what
> interpretation is intended (see below for some considerations and my
> opinion).
>
> Under the first interpretation, the IANA character set registry is not
> suitable for the maintenance of encodings for C++ due to the inherent
> category error involved with encodings having code units that involve more
> than 256 values.
>
> Under the second interpretation, the IANA character set registry could, in
> theory, be used; however, there would be a proliferation of wide encodings
> that include variations in endianness and width unless if we acknowledge
> that the decomposition process may be non-trivial.
>
> More practically, under the first interpretation, many registered
> character sets that do not involve multibyte characters are suitable for
> describing the wide-oriented environment encoding; however, basically none
> of the registered character sets involving multibyte characters are
> (because the interpretation involves having each wchar_t taken as holding
> an individual octet value).
>

More generally, the wide execution encoding would have been chosen by the
platform to be suitable for use with wchar_t, or the size of wchar_t would
have been chosen by the implementation to be suitable for the platform.
This ignores the issue that existing practice and the standard disagree
(on Windows, a single wchar_t code unit cannot represent every character of
the associated encoding).
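
As an illustration of that parenthetical, the sketch below encodes a Unicode
code point as UTF-16. The function name `encode_utf16` is hypothetical, and
the code assumes its input is a valid Unicode scalar value (not a surrogate,
and no greater than U+10FFFF). Any code point above U+FFFF needs two 16-bit
code units, so a 16-bit wchar_t cannot hold such a character in one element.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, for illustration only.
// Assumes cp is a valid Unicode scalar value.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a high and a low surrogate.
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Usage: U+00E9 fits in one code unit, U+1F600 needs a surrogate pair.
void check_encode_utf16() {
    const std::u16string bmp = encode_utf16(U'\u00E9');
    assert(bmp.size() == 1 && bmp[0] == 0x00E9);

    const std::u16string astral = encode_utf16(U'\U0001F600');
    assert(astral.size() == 2);
    assert(astral[0] == 0xD83D && astral[1] == 0xDE00);
}
```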

> Under the second interpretation, very few registered character sets are
> suitable for describing the wide-oriented environment encoding without
> non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4,
> csUTF32LE, and csUTF32BE: the first two assumes native endianness and the
> last two are BOM-agnostic (unlike csUTF32).
>

> Overall, it is questionable whether there is widespread practice in being
> able to have an identifying name for the wide-oriented environment
> encoding. GNU iconv provides "WCHAR_T" but arguably does not include names
> for wide encodings (the Unicode encoding schemes being byte-oriented).
> Nevertheless, I think a combination of the second interpretation plus
> explicitly acknowledging the possibility of non-trivial "decomposition"
> (akin to conversion from UTF-32 to UTF-8) would successfully navigate us
> past the friction that this note started with. Of the non-Unicode wide
> encodings that I am aware of, all are closely related to a corresponding
> "narrow" encoding.
>

A more general observation is that encodings with code units bigger than
one byte are few and far between.
Most Shift-JIS variants are variable-width, for example.
There are rare fixed-width encodings, and I believe both IBM and FreeBSD
might be able to use non-Unicode encodings for wchar_t.
Beyond that, the API is useful to distinguish between UTF-16 and UTF-32.
But yes, the wide interface is many times less useful than the narrow one.

Received on 2021-09-14 04:07:17