Re: [SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 14 Sep 2021 12:00:39 -0400
On Tue, Sep 14, 2021 at 5:07 AM Corentin <corentin.jabot_at_[hidden]> wrote:

> On Tue, Sep 14, 2021 at 6:38 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>> P1885 is heavily based on the IANA character set registry, which has a
>> concept termed a "charset". According to RFC 2978
>> <https://datatracker.ietf.org/doc/html/rfc2978>, a "charset" is "a
>> method of converting a sequence of octets into a sequence of characters".
>> This means that the variety of code units for a "charset" is necessarily
>> limited to 256 different code units. Since P1885 intends to provide the
>> ability to identify encodings associated with the translation or execution
>> environments with a "charset" based abstraction, there is a bit of an issue
>> on how to manage encodings whose code units are not octets. This arises
>> both for the "narrow" encoding (when CHAR_BIT is greater than 8) and,
>> more generally, for wide encodings.
> A code unit does not need to be a byte.
> In particular, UTF-16 is considered to have 16-bit code units, for example.

As a character encoding scheme, csUTF16 involves octets and comes with
BOM-related baggage.

> Decomposition to bytes then only requires knowing the byte order.
> Note that, with the exception of UTF-16BE, UTF-16LE, etc., considerations
> about byte order are usually left to the application.
> As such, an encoding is a mechanism that represents elements of text in a
> sequence of code units, the nature of which is tied to that specific
> encoding.
> This is completely independent of a specific implementation, let alone C++
> character types.

At which point (as already noted in the original post to this thread), the
IANA character set registry is not fit for purpose for P1885 in relation to
wide encodings.

>> There are at least two plausible interpretations of what the encodings in
>> P1885 represent:
>> 1. Each "encoding" expresses how code unit sequences are interpreted into
>> a sequence of characters; each char/wchar_t element encodes a single code
>> unit.
>> 2. Each "encoding" expresses how octet sequences, obtained through some
>> implementation-defined process to decompose char/wchar_t elements, are
>> interpreted into a sequence of characters.
>> The paper should provide clarity in its proposed wording as to what
>> interpretation is intended (see below for some considerations and my
>> opinion).
>> Under the first interpretation, the IANA character set registry is not
>> suitable for the maintenance of encodings for C++ due to the inherent
>> category error involved with encodings having code units that involve more
>> than 256 values.
>> Under the second interpretation, the IANA character set registry could,
>> in theory, be used; however, there would be a proliferation of wide
>> encodings that include variations in endianness and width unless we
>> acknowledge that the decomposition process may be non-trivial.
>> More practically, under the first interpretation, many registered
>> character sets that do not involve multibyte characters are suitable for
>> describing the wide-oriented environment encoding; however, basically none
>> of the registered character sets involving multibyte characters are
>> (because that interpretation takes each wchar_t as holding an individual
>> octet value).
> More generally, the wide execution encoding would have been chosen by the
> platform to be suitable to use by wchar_t, or the size of wchar_t would
> have been chosen by the implementation to be suitable for the platform.
> This ignores the issue that existing practice and the standard disagree
> (single wchar_t code units cannot represent every character of the
> associated encoding on Windows).
>> Under the second interpretation, very few registered character sets are
>> suitable for describing the wide-oriented environment encoding without
>> non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4,
>> csUTF32LE, and csUTF32BE: the first two assume native endianness and the
>> last two are BOM-agnostic (unlike csUTF32).
>> Overall, it is questionable whether there is widespread practice of
>> having an identifying name for the wide-oriented environment
>> encoding. GNU iconv provides "WCHAR_T" but arguably does not include names
>> for wide encodings (the Unicode encoding schemes being byte-oriented).
>> Nevertheless, I think a combination of the second interpretation plus
>> explicitly acknowledging the possibility of non-trivial "decomposition"
>> (akin to conversion from UTF-32 to UTF-8) would successfully navigate us
>> past the friction that this note started with. Of the non-Unicode wide
>> encodings that I am aware of, all are closely related to a corresponding
>> "narrow" encoding.
> A more general observation is that encodings with code units bigger than
> one byte are few and far between.

I am not sure it is a point in favour of the paper that it tries to apply
solutions for encodings historically used for storage or interchange to a
problem space involving other encodings.

> Most Shift-JIS variants are variable-width, for example.
> There are rare fixed-width encodings, and I believe both IBM and FreeBSD
> might be able to use non-Unicode encodings for wchar_t.
> Beyond that, the API is useful to distinguish between UTF-16 and UCS-2.

I am not denying that there is utility to this facility. I am noting that
it seems rather less grounded in existing practice than the corresponding
facility for narrow encodings. In terms of reasonable behaviour from an
implementation, I think the technical specification is both unclear and
cannot be read as-is to support the desired behaviour in various cases
(such as those involving wide EBCDIC).

> But yes, the wide interface is many times less useful than the narrow one.

Received on 2021-09-14 11:01:10