sg16: Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 18 Sep 2021 09:36:46 +0200

On Fri, Sep 17, 2021 at 11:24 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Thu, Sep 16, 2021 at 5:04 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Thu, Sep 16, 2021 at 10:23 PM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> Given the many, many encodings, a lot are only separated by one or two
>>>> codepoints. How in that context do we define variants?
>>>>
>>>
>>> I think you are raising a new question about the design: If a system has
>>> more than one implementation of the same encoding, should it return the
>>> same registered character set to represent more than one of those encodings?
>>>
>>
>> Well, 2 implementations of the same encoding would be the same, and for
>> registered character sets, aliases would cover that use case.
>>
>
> Sorry (my fault for not being clear): I meant the question not for the
> "same encoding" but for encodings separated by a small percentage of
> differences. For example, near matches for a registered character set but
> one being modified for the euro sign.
>

No need to apologize!
Whether an implementation would choose to have a specific encoding
advertise itself as another will depend a lot on what the current practices
on that specific platform are.
Most of the "European" encodings that have been modified for the euro have
been restandardized and republished under a different name (8859-15 as the
euro-enabled successor of 8859-1, for example).
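
To make the 8859-1 / 8859-15 case concrete, here is a minimal C++ sketch
(not part of P1885) that decodes one byte under each encoding. The function
names are hypothetical; the eight differing positions are the ones where
ISO-8859-15 departs from ISO-8859-1.

```cpp
#include <cstdint>

// Hypothetical helper: ISO-8859-1 maps every byte to the code point of the
// same numeric value.
constexpr char32_t decode_8859_1(std::uint8_t b) { return b; }

// Hypothetical helper: ISO-8859-15 is identical to ISO-8859-1 except for
// eight positions, one of which carries the euro sign.
constexpr char32_t decode_8859_15(std::uint8_t b) {
    switch (b) {
        case 0xA4: return 0x20AC; // EURO SIGN (CURRENCY SIGN in 8859-1)
        case 0xA6: return 0x0160; // LATIN CAPITAL LETTER S WITH CARON
        case 0xA8: return 0x0161; // LATIN SMALL LETTER S WITH CARON
        case 0xB4: return 0x017D; // LATIN CAPITAL LETTER Z WITH CARON
        case 0xB8: return 0x017E; // LATIN SMALL LETTER Z WITH CARON
        case 0xBC: return 0x0152; // LATIN CAPITAL LIGATURE OE
        case 0xBD: return 0x0153; // LATIN SMALL LIGATURE OE
        case 0xBE: return 0x0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
        default:   return b;      // everything else matches 8859-1
    }
}

// The same byte yields different characters depending on the encoding.
static_assert(decode_8859_1(0xA4) == 0x00A4, "currency sign in 8859-1");
static_assert(decode_8859_15(0xA4) == 0x20AC, "euro sign in 8859-15");
```

A platform that reported 8859-1 for data actually encoded as 8859-15 would
silently turn every euro sign into a currency sign, which is why the two
are registered under separate names.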

>
>>>
>>>>
>>>> > In addition to the encoding used, the implementation may further
>>>> restrict the set of valid characters
>>>>
>>>> I am not sure I understand the goal of this sentence. P1885 is
>>>> purposefully somewhat removed from precise character sets. For which sets
>>>> of operations would that restriction apply?
>>>>
>>>
>>> This mainly occurs in the 2-byte wchar_t case. Some implementations take
>>> the strategy of using UTF-8 encoding but consider only scalar values in the
>>> BMP range to be valid characters.
>>>
>>
>> UTF-8 happens to be one of the encodings that are precisely defined and
>> specified.
>> An encoding that would not map to all scalar values would not fit the
>> definition of UTF-8 - Likewise, WTF-8, CESU-8, BOCU, etc are NOT UTF-8
>> Can an implementation still advertise UTF-8? Sure, I don't see a value in
>> trying to prevent hostile implementations
>> Do I want to specifically bless that behavior? Nope
>>
>
> I guess (from later statements below) that we just want to chalk this all
> up to a "wchar_t" problem.
>

Yes, we do not want to spread this problem (I hope we fix it with another
paper which would align the standard with the status quo, namely by stating
that wchar_t does not have to represent all codepoints of its encoding, but
explicitly stating that the standard wide ctype and locale functions cannot
cope with codepoints represented by multiple code units. Not a great
place to be in, but representative of the status quo).
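
As an illustration of the multiple-code-unit problem, the following C++
sketch counts how many 16-bit code units UTF-16 needs for a scalar value.
The function names are hypothetical. The underlying point is that functions
such as iswalpha receive a single wide value per call, so with a 16-bit
wchar_t they never see a whole character outside the BMP.

```cpp
#include <cstddef>

// Number of 16-bit code units UTF-16 uses to encode one Unicode scalar value.
// Values below U+10000 (the BMP) take one unit; the rest take a surrogate pair.
constexpr std::size_t utf16_code_units(char32_t scalar) {
    return scalar < 0x10000 ? 1 : 2;
}

// Hypothetical predicate: can this scalar value be passed as one 16-bit unit
// to a function that classifies a single wide character at a time?
constexpr bool fits_single_16bit_unit(char32_t scalar) {
    return utf16_code_units(scalar) == 1;
}

static_assert(fits_single_16bit_unit(0x00E9), "U+00E9 is in the BMP");
static_assert(!fits_single_16bit_unit(0x1F600), "U+1F600 needs two units");
```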

>>
>>>
>>>
>>>>
>>>> > In the absence of a conventional name for the encoding as
>>>> restricted, implementations should provide a return value without regard
>>>> for the restriction,
>>>>
>>>> Again, how do you define what's a conventional name?
>>>>
>>>
>>> I'm happier with overt handwaving than less obvious handwaving. This
>>> sentence is meant to allow "UTF-8" as the result even in implementations
>>> where not all Unicode scalar values are supported by mbstowcs.
>>>
>>
>> We do not mention mbstowcs anywhere.
>>
>> P1885 is not the place to address that the constraints the standard
>> places on wchar_t are not representative of existing practice.
>> P1885 also puts no requirements of relation between the narrow and wide
>> literals nor does it mention representability.
>> So returning UTF-8 for narrow and UCS-2 for wide would be perfectly
>> valid, with the proposed wording.
>>
>
> P1885 does not exist in a vacuum. And the existing wording does place a
> requirement between the narrow and wide execution encodings. I am somewhat
> convinced that P1885 is not the place to address the wchar_t problems re:
> UCS-2 versus UTF-16, but I will point out that P1885 theoretically
> exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not
> perfectly valid. Previously, only the UTF-16 case was clearly misaligned
> with the standard; with P1885, the UCS-2 case is also misaligned.
>

Sure, although I am not aware of platforms for which UCS-2 is currently
assumed.

>
>
>>
>>>
>>>>
>>>> Trying to constrain implementation freedom in a field that is
>>>> plagued by 70+ years of legacy, special cases and exceptions is a minefield.
>>>> I would like to better understand
>>>>
>>>> - What useful scenarios are allowed by this wording
>>>>
>>>
>>> See above.
>>>
>>>
>>>> - What problematic scenarios are prevented by this wording
>>>>
>>>
>>> Implementations choosing to invent new names because a strict reading
>>> says the registered name is not okay.
>>>
>>
>> The wording intent is to allow an implementation to
>>
>> - Return unknown
>> - Return an encoding that is different from that used by mbstowcs,
>> for example (the wide environment is rather an environment that you would
>> expect wprintf could consume without creating mojibake)
>>
> This particular intent has additional limitations: The understanding of
> locales with the same name is not consistent in practice on various
> platforms between 32-bit and 64-bit processes.
>

Do you have specific examples in mind? I am not aware of platforms where
wchar_t would be 64 bits. Or maybe the size of wchar_t is not your concern;
can you clarify?

>
>> - Return an encoding that does not fit into a single wide code unit
>> - Return an encoding that matches that used by other components of
>> the platforms and/or the user expectation of that platform.
>>
> User expectations of something completely novel are rather hard to guess
> at. Should the narrow and wide EBCDIC versions of the same character set be
> called the same charset? For cases where there are no multibyte characters,
> most indications are "yes". For cases where there are multibyte characters,
> it seems to be more up in the air. If the answer is "no", then I imagine we
> end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the
> endianness is always big endian).
>

Despite poor naming, IANA specifically registers encodings, hence the class
`text_encoding`.
In your scenario, narrow and wide EBCDIC would have different sequences of
code units and different code unit types and as such would be ideally
considered different encodings.
I will admit it has been difficult for me to find any information about
wide EBCDIC, so I don't know if and how it is currently referred to by IBM
implementations, whether multiple character sets/encodings are defined as
some flavor of wide EBCDIC, etc.

>
>> I do believe that "implementation-defined encoding" gives us a better
>> outcome than trying to constrain either a relation between narrow and wide
>> (especially given the state of the standard), or trying to force
>> implementation to return a registered name when they'd rather not, or
>> return an unregistered name when they'd rather not.
>>
>
> The placement of the "implementation-defined" in the currently proposed
> wording for environment() is hard for me to read this way. Also, the
> wording for literal() does not say "implementation-defined".
>

The literal encoding is already implementation-defined, and we do not need
as much implementation freedom here.
I am happy to reword that sentence if you think it would be clearer.

> Minimal wording (for the general/synopsis section):
> How a text_encoding object is determined to be representative of a
> character encoding implemented in the translation or execution environment
> is implementation-defined.
>

I would be happy to add that

>
>
>>
>> For example, on Windows the implementation will probably want to return
>> UTF-16 and we do not want to disallow that.
>>
>> And I don't think it's necessary, nor possible, to add some wording that
>> would encourage implementations not to lie,
>> because it may be that they have to choose between 2 lies (is Windows
>> Big5 exactly the Big5 intended by IANA? Maybe not. Is that the answer users
>> expect anyway? Maybe!)
>>
>> There are some historical oddities to contend with and there is a balance
>> to be found between portability and existing practice.
>>
>
> I would again emphasize that some of the issues are with the novelty of
> trying to name wide encodings where there has not been sufficient need that
> there is established existing practice. Do you have a list of existing APIs
> that provide names for wide encodings out of locale information?
>

On Windows this is a documented property of the platform: UTF-16.
On some POSIX platforms (Linux, macOS), this will always be UTF-32.
On others (like FreeBSD), it is not documented beyond being
"implementation-defined", and may be some wide Shift-JIS or fixed-width
EUC. There is currently no API to infer what these wide encodings are.

Possible implementation strategies for FreeBSD include

   - Maintaining a mapping of narrow -> wide encodings, as these platforms
   are specified to have one.
   - Modifying the libc to expose the name of the wide encoding in a way
   similar to nl_langinfo(CODESET) - for use by the C++ library. I hope this
   will be the long term outcome.
   - Returning id::unknown - which is what I expect these platforms to do
   initially.
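
A minimal C++ sketch of the first strategy, with the third as its fallback,
could look like the following. Everything here is an assumption made for
illustration: the struct, the function name, and above all the table
entries, which are placeholders and not a statement of what FreeBSD does.
In a real library the narrow name would presumably come from something like
nl_langinfo(CODESET).

```cpp
#include <string_view>

// Hypothetical table entry: a narrow codeset name and the name of the wide
// encoding the platform pairs with it.
struct encoding_mapping {
    std::string_view narrow;
    std::string_view wide;
};

// Placeholder entries for illustration only, not actual platform behaviour.
constexpr encoding_mapping narrow_to_wide[] = {
    {"UTF-8", "UTF-32"},
    {"US-ASCII", "UTF-32"},
};

// Returns the wide encoding name paired with `narrow`, or an empty view when
// the pairing is not known. A caller would report id::unknown in that case.
constexpr std::string_view wide_encoding_name(std::string_view narrow) {
    for (const auto& entry : narrow_to_wide) {
        if (entry.narrow == narrow) {
            return entry.wide;
        }
    }
    return {};
}
```

The lookup here is an exact, case-sensitive comparison to keep the sketch
short; matching real codeset names would need the looser alias comparison
that text_encoding uses.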

I suspect the answer for ebcdic platforms might be very similar?

>
>
>> Especially as the number of problematic scenarios is, thankfully,
>> extremely small.
>>
>>
>>
>>
>>>
>>>
>>>>
>>>> Thanks a lot for your feedback,
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>> For users it means that implementing a function that would return some
>>>>>> kind of transcoder from a name requires special care
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>

Received on 2021-09-18 02:37:00