sg16: Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Sun, 19 Sep 2021 00:30:10 -0400

On Sat, Sep 18, 2021 at 3:36 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Fri, Sep 17, 2021 at 11:24 PM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Thu, Sep 16, 2021 at 5:04 PM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:23 PM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> Given the many, many encodings, a lot are only separated by one or two
>>>>> codepoints. How in that context do we define variants?
>>>>>
>>>>
>>>> I think you are raising a new question about the design: If a system
>>>> has more than one implementation of the same encoding, should it return the
>>>> same registered character set to represent more than one of those encodings?
>>>>
>>>
>>> Well, 2 implementations of the same encoding would be the same, and for
>>> registered character sets, aliases would cover that use case.
>>>
>>
>> Sorry (my fault for not being clear): I meant the question not for the
>> "same encoding" but for encodings separated by a small percentage of
>> differences. For example, near matches for a registered character set but
>> one being modified for the euro sign.
>>
>
> No need to apologize!
> Whether an implementation would choose to have a specific encoding
> advertise itself as another will depend a lot on what the current practices
> on that specific platform are.
> Most of the "European" encodings that have been modified for the euro have
> been restandardized and republished under a different name (8859-15 and 8859-1
> for example).
>

I think I can work with that; thanks.

>
>> P1885 does not exist in a vacuum. And the existing wording does place a
>> requirement between the narrow and wide execution encodings. I am somewhat
>> convinced that P1885 is not the place to address the wchar_t problems re:
>> UCS-2 versus UTF-16, but I will point out that P1885 theoretically
>> exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not
>> perfectly valid. Previously, only the UTF-16 case was clearly misaligned
>> with the standard; with P1885, the UCS-2 case is also misaligned.
>>
>
> Sure, although I am not aware of platforms for which UCS-2 is currently
> assumed.
>

AIX is considered to use UCS-2 for the 2-byte version of `wchar_t` (Clang
and GCC may use UTF-16 in place of UCS-2 in encoding literals). The same
for z/OS. It is one of those implementations where `wcstombs` for some
locales gives you CESU-8 encoded output when surrogate pairs are
encountered (but `mbrtoc32` won't interpret the surrogate pairs, so the
narrow encoding is not CESU-8).
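
To make the CESU-8 point concrete, here is a minimal sketch (not AIX's or z/OS's actual `wcstombs`; all function names are made up for illustration) of how converting each UTF-16 code unit on its own produces CESU-8, while combining the surrogate pair first produces UTF-8:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 style byte sequence for one value in [0, 0x10FFFF].
// Surrogate values are deliberately not rejected: encoding them as
// three-byte sequences is what distinguishes CESU-8 from UTF-8.
inline void append_utf8_style(std::string& out, std::uint32_t v) {
    assert(v <= 0x10FFFF);
    if (v < 0x80) {
        out += static_cast<char>(v);
    } else if (v < 0x800) {
        out += static_cast<char>(0xC0 | (v >> 6));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else if (v < 0x10000) {
        out += static_cast<char>(0xE0 | (v >> 12));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (v >> 18));
        out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    }
}

// CESU-8 style: every 16-bit code unit is converted independently,
// so a surrogate pair becomes two three-byte sequences.
inline std::string to_cesu8(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) append_utf8_style(out, u);
    return out;
}

// UTF-8: a high surrogate followed by a low surrogate is combined into
// one code point first. Unpaired surrogates are passed through unchanged;
// this sketch does no validation.
inline std::string to_utf8(const std::u16string& units) {
    std::string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        std::uint32_t v = units[i];
        if (v >= 0xD800 && v <= 0xDBFF && i + 1 < units.size()) {
            std::uint32_t low = units[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                v = 0x10000 + ((v - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8_style(out, v);
    }
    return out;
}
```

For U+1F600 (code units 0xD83D 0xDE00), `to_cesu8` yields six bytes and `to_utf8` yields four.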

>
>
>>
>>
>>>
>>> - Return an encoding that is different from that used by mbstowcs,
>>> for example (the wide environment is rather an environment that you would
>>> expect wprintf could consume without creating mojibake)
>>>
>> This particular intent has additional limitations: The understanding of
>> locales with the same name is not consistent in practice on various
>> platforms between 32-bit and 64-bit processes.
>>
>
> Do you have specific examples in mind? I am not aware of platforms where
> wchar_t would be 64 bits. Or maybe the size of wchar_t is not your concern,
> can you clarify?
>

The size of wchar_t for the 31/32-bit ABI (which has more history) and for
the 64-bit ABI is different on z/OS and on AIX. For both platforms, wchar_t
is 2 bytes for a 31/32-bit process and 4 bytes for a 64-bit process.

>> User expectations of something completely novel are rather hard to guess
>> at. Should the narrow and wide EBCDIC versions of the same character set be
>> called the same charset? For cases where there are no multibyte characters,
>> most indications are "yes". For cases where there are multibyte characters,
>> it seems to be more up in the air. If the answer is "no", then I imagine we
>> end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the
>> endianness is always big endian).
>>
>
> Despite poor naming, IANA specifically registers encodings, hence the
> class `text_encoding`.
> In your scenario, narrow and wide EBCDIC would have different sequences of
> code units and different code unit types and as such would be ideally
> considered different encodings.
>

For EBCDIC without multibyte characters in the narrow encoding, the
sequences of code units are the same (if ignoring the code unit type
difference and correcting for endianness when read from a file). The
guidance that they are ideally considered different encodings is reasonable
and I would suggest that the paper's wide EBCDIC example show an
unregistered "x-EBCDIC-US-4byte" as a potential answer. I suppose being
implicitly "native endian" makes some sense.
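
As a rough sketch of that naming idea (the helper name and the exact scheme are hypothetical, not proposed wording), an unregistered wide name could be derived from the narrow character set name and the width of `wchar_t`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds an unregistered ("x-" prefixed) name for the
// wide form of a character set, with a "-2byte" or "-4byte" suffix.
// Endianness is not encoded in the name; it is implicitly native.
inline std::string wide_variant_name(const std::string& charset,
                                     std::size_t unit_bytes = sizeof(wchar_t)) {
    // Only the two widths discussed in this thread are handled.
    assert(unit_bytes == 2 || unit_bytes == 4);
    return "x-" + charset + "-" + std::to_string(unit_bytes) + "byte";
}
```

With this, `wide_variant_name("EBCDIC-US", 4)` gives "x-EBCDIC-US-4byte", and the same locale in a process with a 2-byte `wchar_t` would get the "-2byte" name instead.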

> I will admit it has been difficult for me to find any information about
> wide EBCDIC, so I don't know if and how it is currently referred to by IBM
> implementations, or whether multiple character sets/encodings are defined
> as some flavor of wide EBCDIC, etc.
>

As far as I know, these encodings just exist and are not talked about or
really named as something separate from the coded character set.

>
>
>> The placement of the "implementation-defined" in the currently proposed
>> wording for environment() is hard for me to read this way. Also, the
>> wording for literal() does not say "implementation-defined".
>>
>
> The literal encoding is already implementation-defined, and we do not need
> as much implementation freedom here.
> I am happy to reword that sentence if you think it would be clearer.
>

I think that wording is fine with the addition below.

>
>
>> Minimal wording (for the general/synopsis section):
>> How a text_encoding object is determined to be representative of a
>> character encoding implemented in the translation or execution environment
>> is implementation-defined.
>>
>
> I would be happy to add that
>

Thanks.

>
>> I would again emphasize that some of the issues are with the novelty of
>> trying to name wide encodings where there has not been sufficient need that
>> there is established existing practice. Do you have a list of existing APIs
>> that provide names for wide encodings out of locale information?
>>
>
> On Windows this is a documented property of the platform: UTF-16.
> On some POSIX platforms (Linux, Mac), this will always be UTF-32.
> On others (like FreeBSD), it is not documented beyond being
> "implementation-defined", and may be some wide Shift-JIS or fixed-width
> EUC. There is currently no API to infer what these wide encodings are.
>
> Possible implementation strategies for FreeBSD include
>
> - Maintaining a mapping of narrow -> wide encodings, as these
> platforms are specified to have one.
>
User-defined locales on AIX can mess with that. The wide encoding (when
not UCS-2 or UTF-32) is realized by replacement implementations of
mbstowcs, wcstombs, etc.

>
> - Modifying the libc to expose the name of the wide encoding in a way
> similar to nl_langinfo(CODESET) - for use by the C++ library. I hope this
> will be the long-term outcome.
> - return id::unknown - which is what I expect these platforms to do
> initially.
>
> I suspect the answer for ebcdic platforms might be very similar?
>

Pretty much the mapping case.
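
For illustration only, the mapping strategy with a fallback to "unknown" might look something like the sketch below. It assumes C++17. The table entries and function name are hypothetical and not taken from any real platform; on a POSIX system the narrow name would presumably come from `nl_langinfo(CODESET)`.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical narrow -> wide table. The entries are illustrative; a real
// implementation would need one maintained per platform (and, per the
// discussion above, possibly per wchar_t width).
inline constexpr std::array<std::pair<std::string_view, std::string_view>, 2>
    narrow_to_wide{{
        {"UTF-8", "UTF-32"},
        {"EBCDIC-US", "x-EBCDIC-US-4byte"},
    }};

// Returns the wide encoding name associated with a narrow codeset name,
// or std::nullopt when nothing is known, in which case the library would
// report the wide encoding as unknown.
inline std::optional<std::string_view> wide_name_for(
    std::string_view narrow_codeset) {
    assert(!narrow_codeset.empty());
    for (const auto& [narrow, wide] : narrow_to_wide) {
        if (narrow == narrow_codeset) return wide;
    }
    return std::nullopt;
}
```

A user-defined locale that replaces `mbstowcs` would not appear in such a table, so it would fall through to the unknown case.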

I'm not sure that any of the strategies above other than "unknown" avoids
inventing new names for wide encodings. It seems the title of the paper is
more applicable than I had appreciated until now. :)

Received on 2021-09-18 23:30:41