Re: [SG16] P1885: Naming text encodings: problem+solution re: charsets, octets, and wide encodings

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 14 Sep 2021 10:17:10 -0400
On Tue, Sep 14, 2021 at 5:09 AM Peter Brett <pbrett_at_[hidden]> wrote:

> Hi Hubert,
>
>
>
> I am extremely unclear about the circumstances in which any of the points
> you raise would be a problem with either implementing or using P1885 in
> practice.
>
>
>
> Please could you give some specific examples of the kinds of situation
> that you are worried about?
>

If a wchar_t is taken as a single code unit (or, alternatively, if a
wchar_t is taken as a number of octets based on the object representation)
and each IANA registered character set is supposed to be an encoding scheme
involving code units being octets (as per the definition), then each
IANA-registered EBCDIC encoding involving multibyte characters in its
narrow form (using shift state controlled by SO/SI) expresses only the
narrow encoding form. There is neither a conventional "MIME" name nor a
conventional iconv name for the wide form where each single-or-multibyte
character is placed into a wchar_t.
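
To make the "wide form" above concrete, here is a rough sketch of widening an SO/SI-controlled byte stream so that each single-byte or double-byte character lands in one wchar_t. The function name widen_so_si and the choice to pack a double-byte character as (first << 8) | second are assumptions for illustration only; an actual platform may assign wchar_t values differently.

```cpp
#include <cstddef>
#include <optional>
#include <string>

constexpr unsigned char SO = 0x0E;  // shift out: switch to double-byte mode
constexpr unsigned char SI = 0x0F;  // shift in: switch back to single-byte mode

// Hypothetical helper: place each single- or double-byte character of a
// stateful SO/SI stream into one wchar_t. The (b1 << 8) | b2 packing is an
// assumption, not a documented platform mapping.
// Returns std::nullopt if a double-byte character is cut short.
std::optional<std::wstring> widen_so_si(const std::string& narrow) {
    std::wstring wide;
    bool double_byte = false;
    for (std::size_t i = 0; i < narrow.size(); ++i) {
        const unsigned char b1 = static_cast<unsigned char>(narrow[i]);
        if (b1 == SO) { double_byte = true; continue; }
        if (b1 == SI) { double_byte = false; continue; }
        if (!double_byte) {
            wide.push_back(static_cast<wchar_t>(b1));
            continue;
        }
        if (i + 1 >= narrow.size()) return std::nullopt;  // truncated pair
        const unsigned char b2 = static_cast<unsigned char>(narrow[++i]);
        wide.push_back(static_cast<wchar_t>((b1 << 8) | b2));
    }
    return wide;
}
```

The shift state disappears in the result, and the wchar_t sequence is no longer an octet sequence that the registered charset name describes.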

This problem also occurs with the wide encoding used with IBM-eucTW on AIX.
The narrow encoding supports the 16 planes of CNS 11643 (but character
assignments may be limited) and the wide encoding has the sixteen 94x94
planes "flattened" starting at 0x100. Again, there is no conventional name
for this encoding. The names of the coded character set have already been
co-opted to mean the expected octet-based character encoding scheme.
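
As a sketch of what "flattened" could mean here: the code below linearizes (plane, row, column) coordinates into a single value starting at 0x100. The function name flatten, the 1-based coordinates, and the plane-major ordering are all assumptions for illustration; this is not the documented AIX mapping.

```cpp
#include <cstdint>
#include <optional>

constexpr std::uint32_t kBase = 0x100;  // first flattened value, per the text
constexpr std::uint32_t kCells = 94;    // each plane is a 94x94 grid
constexpr std::uint32_t kPlanes = 16;

// Hypothetical linearization: plane-major, then row, then column.
// All three coordinates are assumed to be 1-based.
constexpr std::optional<std::uint32_t> flatten(std::uint32_t plane,
                                               std::uint32_t row,
                                               std::uint32_t col) {
    if (plane < 1 || plane > kPlanes) return std::nullopt;
    if (row < 1 || row > kCells) return std::nullopt;
    if (col < 1 || col > kCells) return std::nullopt;
    return kBase + ((plane - 1) * kCells + (row - 1)) * kCells + (col - 1);
}
```

Whatever the exact ordering, the resulting values are not octet sequences of the narrow encoding, so the registered name does not describe them without a non-trivial transformation.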

To me, the paper seems to be expecting that the return value for
wide_environment be something related to the coded character set or the
Unicode encoding form (as opposed to a character encoding scheme) even if
the return value technically represents a character encoding scheme that is
not the wide-oriented encoding under either of the interpretations (without
a non-trivial transformation).


>
>
> Best regards,
>
>
>
> Peter
>
>
>
> *From:* Hubert Tong <hubert.reinterpretcast_at_[hidden]>
> *Sent:* 14 September 2021 05:38
> *To:* SG16 <sg16_at_[hidden]>; C++ Library Evolution Working Group <
> lib-ext_at_[hidden]>; Corentin <corentin.jabot_at_[hidden]>; Peter
> Brett <pbrett_at_[hidden]>
> *Subject:* P1885: Naming text encodings: problem+solution re: charsets,
> octets, and wide encodings
>
>
>
> P1885 is heavily based on the IANA character set registry, which has a
> concept termed a "charset". According to RFC 2978
> <https://datatracker.ietf.org/doc/html/rfc2978>,
> a "charset" is "a method of converting a sequence of octets into a sequence
> of characters". This means that the variety of code units for a "charset"
> is necessarily limited to 256 different code units. Since P1885 intends to
> provide the ability to identify encodings associated with the translation
> or execution environments with a "charset" based abstraction, there is a
> bit of an issue on how to manage encodings whose code units are not octets.
> This arises both for the "narrow" encoding (when CHAR_BIT is greater than
> 8) and, more generally, for wide encodings.
>
>
>
> There are at least two plausible interpretations of what the encodings in
> P1885 represent:
>
> 1. Each "encoding" expresses how code unit sequences are interpreted into a
>    sequence of characters; each char/wchar_t element encodes a single code
>    unit.
>
> 2. Each "encoding" expresses how octet sequences, obtained through some
>    implementation-defined process to decompose char/wchar_t elements, are
>    interpreted into a sequence of characters.
>
>
>
> The paper should provide clarity in its proposed wording as to what
> interpretation is intended (see below for some considerations and my
> opinion).
>
>
>
> Under the first interpretation, the IANA character set registry is not
> suitable for the maintenance of encodings for C++ due to the inherent
> category error involved with encodings having code units that involve more
> than 256 values.
>
>
>
> Under the second interpretation, the IANA character set registry could, in
> theory, be used; however, there would be a proliferation of wide encodings
> that include variations in endianness and width unless we acknowledge
> that the decomposition process may be non-trivial.
>
>
>
> More practically, under the first interpretation, many registered
> character sets that do not involve multibyte characters are suitable for
> describing the wide-oriented environment encoding; however, basically none
> of the registered character sets involving multibyte characters are
> (because the interpretation involves having each wchar_t taken as holding
> an individual octet value).
>
>
>
> Under the second interpretation, very few registered character sets are
> suitable for describing the wide-oriented environment encoding without
> non-trivial decomposition. Some of the suitable ones are csUnicode, csUCS4,
> csUTF32LE, and csUTF32BE: the first two assume native endianness and the
> last two are BOM-agnostic (unlike csUTF32).
>
>
>
> Overall, it is questionable whether there is widespread practice in being
> able to have an identifying name for the wide-oriented environment
> encoding. GNU iconv provides "WCHAR_T" but arguably does not include names
> for wide encodings (the Unicode encoding schemes being byte-oriented).
> Nevertheless, I think a combination of the second interpretation plus
> explicitly acknowledging the possibility of non-trivial "decomposition"
> (akin to conversion from UTF-32 to UTF-8) would successfully navigate us
> past the friction that this note started with. Of the non-Unicode wide
> encodings that I am aware of, all are closely related to a corresponding
> "narrow" encoding.
>

Received on 2021-09-14 09:17:46