On Tue, Sep 14, 2021 at 5:07 AM Corentin <corentin.jabot@gmail.com> wrote:


On Tue, Sep 14, 2021 at 6:38 AM Hubert Tong <hubert.reinterpretcast@gmail.com> wrote:
P1885 is heavily based on the IANA character set registry, which has a concept termed a "charset". According to RFC 2978, a "charset" is "a method of converting a sequence of octets into a sequence of characters". This means that the variety of code units for a "charset" is necessarily limited to 256 different code units. Since P1885 intends to provide the ability to identify encodings associated with the translation or execution environments with a "charset"-based abstraction, there is a bit of an issue in how to manage encodings whose code units are not octets. This arises both for the "narrow" encoding (when CHAR_BIT is greater than 8) and, more generally, for wide encodings.

A code unit does not need to be a byte.
UTF-16, for example, is considered to have 16-bit code units.

As a character encoding scheme, csUTF16 involves octets and comes with BOM-related baggage.
 
Decomposition to bytes then only requires knowing the byte order.
Note that, with the exception of UTF-16BE, UTF-16LE, etc., considerations about byte order are usually left to the application.
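
For example, something like this (just a sketch of what I mean by decomposition, not anything from the paper):

    #include <array>
    #include <cstdint>

    // Split one UTF-16 code unit into two octets; the only decision
    // needed is the byte order.
    std::array<std::uint8_t, 2> to_octets(char16_t unit, bool big_endian)
    {
        const std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
        return big_endian ? std::array<std::uint8_t, 2>{hi, lo}
                          : std::array<std::uint8_t, 2>{lo, hi};
    }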

As such, an encoding is a mechanism that represents elements of text as a sequence of code units, the nature of which is tied to that specific encoding.
This is completely independent of any specific implementation, let alone C++ character types.

At which point (as already noted in the original post to this thread), the IANA character set registry is not fit for purpose for P1885 in relation to wide encodings.

There are at least two plausible interpretations of what the encodings in P1885 represent:
1. Each "encoding" expresses how code unit sequences are interpreted as a sequence of characters; each char/wchar_t element encodes a single code unit.
2. Each "encoding" expresses how octet sequences, obtained through some implementation-defined process that decomposes char/wchar_t elements, are interpreted as a sequence of characters.

The paper should provide clarity in its proposed wording as to which interpretation is intended (see below for some considerations and my opinion).
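
To make the difference concrete, here is a rough sketch (my own illustration; the function names are made up and not from the paper):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Interpretation 1: each wchar_t element is one code unit of the
    // wide encoding; there is no decomposition step.
    std::vector<std::uint32_t> as_code_units(const wchar_t* s, std::size_t n)
    {
        return std::vector<std::uint32_t>(s, s + n);
    }

    // Interpretation 2: each wchar_t element is first decomposed into
    // octets by some implementation-defined process (here, naively, its
    // object representation), and the resulting octet sequence is what
    // the registered "charset" is taken to describe.
    std::vector<std::uint8_t> as_octets(const wchar_t* s, std::size_t n)
    {
        std::vector<std::uint8_t> out(n * sizeof(wchar_t));
        if (n != 0)
            std::memcpy(out.data(), s, out.size());
        return out;
    }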

Under the first interpretation, the IANA character set registry is not suitable for the maintenance of encodings for C++ because of the inherent category error: it cannot describe encodings whose code units take more than 256 distinct values.

Under the second interpretation, the IANA character set registry could, in theory, be used; however, there would be a proliferation of wide encodings varying in endianness and width unless we acknowledge that the decomposition process may be non-trivial.

More practically, under the first interpretation, many registered character sets that do not involve multibyte characters are suitable for describing the wide-oriented environment encoding; however, essentially none of the registered character sets involving multibyte characters are (because that interpretation would have each wchar_t hold an individual octet value).

More generally, the wide execution encoding would have been chosen by the platform to be suitable for use with wchar_t, or the size of wchar_t would have been chosen by the implementation to be suitable for the platform. This ignores the issue that existing practice and the standard disagree (on Windows, a single wchar_t code unit cannot represent every character of the associated encoding).
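
As a minimal illustration of the Windows point (nothing here is specific to P1885):

    #include <cstdio>

    int main()
    {
        // U+1F600 is outside the BMP; with a 16-bit wchar_t (as on
        // Windows, where the wide encoding is UTF-16) it occupies two
        // code units (a surrogate pair), so a single wchar_t cannot
        // represent every character of the associated encoding.
        const wchar_t s[] = L"\U0001F600";
        std::printf("code units, excluding the terminator: %zu\n",
                    sizeof s / sizeof(wchar_t) - 1);  // 2 on Windows, 1 where wchar_t is 32-bit
    }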


Under the second interpretation, very few registered character sets are suitable for describing the wide-oriented environment encoding without non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4, csUTF32LE, and csUTF32BE: the first two assume native endianness and the last two are BOM-agnostic (unlike csUTF32).
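
If an implementation wanted to avoid relying on the native-endianness assumption, choosing between the byte-order-explicit identifiers is mechanical; a sketch (purely illustrative, nothing like this is being proposed):

    #include <bit>
    #include <string_view>

    // Pick the byte-order-explicit registry identifier for a wide
    // encoding whose 32-bit code units are written out in native
    // byte order (C++20 std::endian).
    constexpr std::string_view utf32_charset_id()
    {
        return std::endian::native == std::endian::little ? "csUTF32LE"
                                                          : "csUTF32BE";
    }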

Overall, it is questionable whether there is widespread existing practice of providing an identifying name for the wide-oriented environment encoding. GNU iconv provides "WCHAR_T" but arguably does not include names for wide encodings (the Unicode encoding schemes being byte-oriented). Nevertheless, I think a combination of the second interpretation plus an explicit acknowledgement that the "decomposition" may be non-trivial (akin to conversion from UTF-32 to UTF-8) would successfully navigate us past the friction that this note started with. Of the non-Unicode wide encodings that I am aware of, all are closely related to a corresponding "narrow" encoding.
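
By "non-trivial decomposition" I mean something on the order of the following rough UTF-32-to-UTF-8 sketch (error handling for surrogates and out-of-range values omitted):

    #include <cstdint>
    #include <vector>

    // Decompose one UTF-32 code point into UTF-8 octets; this is the
    // kind of per-character transformation alluded to above
    // (validity checks omitted for brevity).
    void append_utf8(char32_t c, std::vector<std::uint8_t>& out)
    {
        if (c < 0x80) {
            out.push_back(static_cast<std::uint8_t>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<std::uint8_t>(0xF0 | (c >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        }
    }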

A more general observation is that encodings with code units bigger than one byte are few and far between.

I am not sure it is a point in favour of the paper that it tries to apply solutions for encodings historically used for storage or interchange to a problem space involving other encodings.
 
Most Shift-JIS variants are variable-width, for example.
There are rare fixed-width encodings, and I believe both IBM and FreeBSD might be able to use non-Unicode encodings for wchar_t.
Beyond that, the API is useful to distinguish between UTF-16 and UCS-2.
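
E.g. something along these lines, using the names as currently proposed in the paper (the exact spelling of wide_environment / mib / id might differ):

    #include <text_encoding>   // proposed in P1885

    // Does the wide execution environment use UTF-16 (as opposed to
    // UCS-2)? The member names below follow the paper's proposal and
    // may not match whatever is eventually standardized.
    bool wide_environment_is_utf16()
    {
        return std::text_encoding::wide_environment().mib()
               == std::text_encoding::id::UTF16;
    }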

I am not denying that there is utility to this facility. I am noting that it seems rather less grounded in existing practice than the corresponding facility for narrow encodings. In terms of reasonable behaviour from an implementation, I think the technical specification is unclear and cannot be read as-is to support the desired behaviour in various cases (such as those involving wide EBCDIC).
 
But yes, the wide interface is many times less useful than the narrow one.