Re: [SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 14 Sep 2021 12:00:39 -0400
On Tue, Sep 14, 2021 at 5:07 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Tue, Sep 14, 2021 at 6:38 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> P1885 is heavily based on the IANA character set registry, which has a
>> concept termed a "charset". According to RFC 2978
>> <https://datatracker.ietf.org/doc/html/rfc2978>, a "charset" is "a
>> method of converting a sequence of octets into a sequence of characters".
>> This means that the variety of code units for a "charset" is necessarily
>> limited to 256 different code units. Since P1885 intends to provide the
>> ability to identify encodings associated with the translation or execution
>> environments with a "charset" based abstraction, there is a bit of an issue
>> on how to manage encodings whose code units are not octets. This arises
>> both for the "narrow" encoding (when CHAR_BIT is greater than 8) and,
>> more generally, for wide encodings.
>>
>
> A code unit does not need to be a byte.
> UTF-16, for example, is considered to have 16-bit code units.
>

As a character encoding scheme, csUTF16 involves octets and comes with
BOM-related baggage.
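To make that concrete, here is a minimal sketch (my own illustration, not anything from P1885) of the BOM handling the csUTF16 encoding *scheme* entails: the serialized form is octets, and a leading byte order mark selects the deserialization order, with big-endian the default absent a BOM, per RFC 2781. The function name is illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Deserialize a csUTF16 octet stream into 16-bit code units.
// A leading 0xFE 0xFF selects big-endian, 0xFF 0xFE little-endian;
// with no BOM, RFC 2781 says to assume big-endian.
std::vector<std::uint16_t> deserialize_utf16(const std::vector<std::uint8_t>& b) {
    bool big_endian = true;  // default when no BOM is present
    std::size_t start = 0;
    if (b.size() >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF) { big_endian = true;  start = 2; }
        else if (b[0] == 0xFF && b[1] == 0xFE) { big_endian = false; start = 2; }
    }
    std::vector<std::uint16_t> units;
    for (std::size_t i = start; i + 1 < b.size(); i += 2) {
        units.push_back(big_endian
            ? static_cast<std::uint16_t>((b[i] << 8) | b[i + 1])
            : static_cast<std::uint16_t>((b[i + 1] << 8) | b[i]));
    }
    return units;
}
```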


> Decomposition to bytes then only requires knowing the byte order.
> Note that, with the exception of UTF-16BE, UTF-16LE, etc., considerations
> about byte order are usually left to the application.
>
> As such, an encoding is a mechanism that represents elements of text in a
> sequence of code units, the nature of which is tied to that specific
> encoding.
> This is completely independent of any specific implementation, let alone
> C++ character types.
>

At which point (as already noted in the original post to this thread), the
IANA character set registry is not fit for purpose for P1885 in relation to
wide encodings.
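Corentin's framing above (a code unit sequence, with byte order a separate, application-level choice) can be sketched as follows; the decomposition step is exactly where the octet-based "charset" model and the code-unit model meet. The function name is illustrative.

```cpp
#include <cstdint>
#include <vector>

// Decompose 16-bit code units into octets once a byte order is chosen.
// The code unit sequence is the encoding-level artifact; the octet
// sequence is merely one serialization of it.
std::vector<std::uint8_t> to_octets(const std::vector<std::uint16_t>& units,
                                    bool big_endian) {
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) { out.push_back(hi); out.push_back(lo); }
        else            { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}
```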


>
>
> There are at least two plausible interpretations of what the encodings in
>> P1885 represent:
>> 1. Each "encoding" expresses how code unit sequences are interpreted into
>> a sequence of characters; each char/wchar_t element encodes a single code
>> unit.
>> 2. Each "encoding" expresses how octet sequences, obtained through some
>> implementation-defined process to decompose char/wchar_t elements, are
>> interpreted into a sequence of characters.
>>
>> The paper should provide clarity in its proposed wording as to what
>> interpretation is intended (see below for some considerations and my
>> opinion).
>>
>> Under the first interpretation, the IANA character set registry is not
>> suitable for the maintenance of encodings for C++ due to the inherent
>> category error involved with encodings whose code units can take more
>> than 256 values.
>>
>> Under the second interpretation, the IANA character set registry could,
>> in theory, be used; however, there would be a proliferation of wide
>> encodings that include variations in endianness and width unless we
>> acknowledge that the decomposition process may be non-trivial.
>>
>> More practically, under the first interpretation, many registered
>> character sets that do not involve multibyte characters are suitable for
>> describing the wide-oriented environment encoding; however, basically none
>> of the registered character sets involving multibyte characters are
>> (because the interpretation involves having each wchar_t taken as holding
>> an individual octet value).
>>
>
> More generally, the wide execution encoding would have been chosen by the
> platform to be suitable to use by wchar_t, or the size of wchar_t would
> have been chosen by the implementation to be suitable for the platform.
> This ignores the issue that existing practice and the standard disagree
> (single wchar_t code units cannot represent every character of the
> associated encoding on Windows).
>
>
>> Under the second interpretation, very few registered character sets are
>> suitable for describing the wide-oriented environment encoding without
>> non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4,
>> csUTF32LE, and csUTF32BE: the first two assume native endianness and the
>> last two are BOM-agnostic (unlike csUTF32).
>>
>
>> Overall, it is questionable whether there is widespread practice of
>> having an identifying name for the wide-oriented environment
>> encoding. GNU iconv provides "WCHAR_T" but arguably does not include names
>> for wide encodings (the Unicode encoding schemes being byte-oriented).
>> Nevertheless, I think a combination of the second interpretation plus
>> explicitly acknowledging the possibility of non-trivial "decomposition"
>> (akin to conversion from UTF-32 to UTF-8) would successfully navigate us
>> past the friction that this note started with. Of the non-Unicode wide
>> encodings that I am aware of, all are closely related to a corresponding
>> "narrow" encoding.
>>
>
> A more general observation is that encodings with code units bigger than
> one byte are few and far between.
>

I am not sure it is a point in favour of the paper that it tries to apply
solutions for encodings historically used for storage or interchange to a
problem space involving other encodings.
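For what a non-trivial "decomposition" akin to UTF-32-to-UTF-8 conversion looks like, here is a self-contained sketch using the standard UTF-8 bit layout (the function name is my own; surrogate/range validation is omitted for brevity):

```cpp
#include <cstdint>
#include <vector>

// Encode a single code point as a UTF-8 octet sequence: a decomposition
// of one wide code unit into several octets that is clearly more than a
// byte-order choice.
std::vector<std::uint8_t> utf32_to_utf8(std::uint32_t cp) {
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```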


> Most such encodings, Shift-JIS for example, are variable width.
> Fixed-width encodings are rare, and I believe both IBM and FreeBSD
> might be able to use non-Unicode encodings for wchar_t.
> Beyond that, the API is useful to distinguish between UTF-16 and UCS-2.
>

I am not denying that there is utility to this facility. I am noting that
it seems rather less grounded in existing practice than the corresponding
facility for narrow encodings. In terms of reasonable behaviour from an
implementation, I think the technical specification is unclear and, as
written, cannot be read to support the desired behaviour in various cases
(such as those involving wide EBCDIC).
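As an aside, the UTF-16/UCS-2 distinction mentioned above is observable in how the same code unit sequence decodes: under UTF-16 surrogate pairs combine, whereas UCS-2 treats each 16-bit unit as a character. A minimal sketch of the UTF-16 side (my own illustration, following the Unicode surrogate-pair arithmetic):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode 16-bit code units as UTF-16, combining surrogate pairs into
// supplementary-plane code points; a UCS-2 decoder would not do this.
std::vector<std::uint32_t> decode_utf16(const std::vector<std::uint16_t>& us) {
    std::vector<std::uint32_t> cps;
    for (std::size_t i = 0; i < us.size(); ++i) {
        std::uint16_t u = us[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < us.size() &&
            us[i + 1] >= 0xDC00 && us[i + 1] <= 0xDFFF) {
            cps.push_back(0x10000u
                + (static_cast<std::uint32_t>(u - 0xD800) << 10)
                + static_cast<std::uint32_t>(us[i + 1] - 0xDC00));
            ++i;  // the pair consumed two code units
        } else {
            cps.push_back(u);  // BMP character (or unpaired surrogate)
        }
    }
    return cps;
}
```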


> But yes, the wide interface is many times less useful than the narrow one.
>
>

Received on 2021-09-14 11:01:10