
Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 18 Sep 2021 15:06:59 +0200
On Sat, Sep 18, 2021 at 9:36 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Fri, Sep 17, 2021 at 11:24 PM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Thu, Sep 16, 2021 at 5:04 PM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Thu, Sep 16, 2021 at 10:23 PM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot_at_[hidden]>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> Given the many encodings, a lot are separated by only one or two
>>>>> code points. How, in that context, do we define variants?
>>>>>
>>>>
>>>> I think you are raising a new question about the design: If a system
>>>> has more than one implementation of the same encoding, should it return the
>>>> same registered character set to represent more than one of those encodings?
>>>>
>>>
>>> Well, two implementations of the same encoding would be the same, and for
>>> registered character sets, aliases would cover that use case.
>>>
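To make the alias point concrete, here is a minimal sketch assuming the
text_encoding interface proposed in P1885 (construction from a name,
comparison by registered id); "csUTF8" is a registered IANA alias of "UTF-8":

    // Two names that are IANA aliases of the same registered character set
    // identify the same encoding and compare equal.
    #include <text_encoding>  // header as proposed in P1885

    int main() {
        std::text_encoding a("UTF-8");
        std::text_encoding b("csUTF8");  // registered alias of UTF-8 (MIBenum 106)
        return (a == b) ? 0 : 1;         // expected: 0 (they compare equal)
    }
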
>>
>> Sorry (my fault for not being clear): I meant the question not for the
>> "same encoding" but for encodings separated by a small percentage of
>> differences. For example, near matches for a registered character set but
>> one being modified for the euro sign.
>>
>
> No need to apologize!
> Whether an implementation would choose to advertise a specific encoding as
> another will depend a lot on what the current practices on that specific
> platform are.
> Most of the "European" encodings that have been modified for the euro have
> been restandardized and republished under a different name (for example,
> 8859-15 is the euro-updated revision of 8859-1).
>
>
>>
>>>>
>>>>>
>>>>> > In addition to the encoding used, the implementation may further
>>>>> restrict the set of valid characters
>>>>>
>>>>> I am not sure I understand the goal of this sentence. P1885 is
>>>>> purposefully somewhat removed from precise character sets. For which sets
>>>>> of operations would that restriction apply?
>>>>>
>>>>
>>>> This mainly occurs in the 2-byte wchar_t case. Some implementations
>>>> take the strategy of using UTF-8 encoding but consider only scalar values
>>>> in the BMP range to be valid characters.
>>>>
>>>
>>> UTF-8 happens to be one of the encodings that are precisely defined and
>>> specified.
>>> An encoding that would not map to all scalar values would not fit the
>>> definition of UTF-8; likewise, WTF-8, CESU-8, BOCU, etc. are NOT UTF-8.
>>> Can an implementation still advertise UTF-8? Sure, I don't see value
>>> in trying to prevent hostile implementations.
>>> Do I want to specifically bless that behavior? Nope.
>>>
>>
>> I guess (from later statements below) that we just want to chalk this all
>> up to a "wchar_t" problem.
>>
>
> Yes, we do not want to spread this problem. (I hope we fix it with another
> paper that would align the standard with the status quo, namely by stating
> that wchar_t does not have to represent all code points of its encoding,
> while explicitly stating that the standard wide ctype and locale functions
> cannot cope with code points represented with multiple code units. Not a
> great place to be in, but representative of the status quo.)
>
>
>>>
>>>>
>>>>
>>>>>
>>>>> > In the absence of a conventional name for the encoding as
>>>>> restricted, implementations should provide a return value without regard
>>>>> for the restriction,
>>>>>
>>>>> Again, how do you define what's a conventional name?
>>>>>
>>>>
>>>> I'm happier with overt handwaving than less obvious handwaving. This
>>>> sentence is meant to allow "UTF-8" as the result even in implementations
>>>> where not all Unicode scalar values are supported by mbstowcs.
>>>>
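As an illustration of the behaviour described above (a sketch of what such an
implementation might do, not something the standard requires): with a UTF-8
narrow encoding and a 2-byte, BMP-only wchar_t, converting a
supplementary-plane character can simply fail:

    // Assumes a UTF-8 narrow locale. On an implementation whose 2-byte
    // wchar_t only supports BMP scalar values, mbstowcs may reject valid
    // UTF-8 input that encodes a character outside the BMP.
    #include <clocale>
    #include <cstdlib>

    int main() {
        std::setlocale(LC_ALL, "");               // pick up the environment locale
        const char* emoji = "\xF0\x9F\x98\x80";   // U+1F600, outside the BMP
        wchar_t out[4];
        std::size_t n = std::mbstowcs(out, emoji, 4);
        return n == static_cast<std::size_t>(-1); // 1 if the conversion failed
    }
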
>>>
>>> We do not mention mbstowcs anywhere.
>>>
>>> P1885 is not the place to address the fact that the constraints the standard
>>> places on wchar_t are not representative of existing practice.
>>> P1885 also puts no requirement on the relation between the narrow and wide
>>> literals, nor does it mention representability.
>>> So returning UTF-8 for narrow and UCS-2 for wide would be perfectly
>>> valid, with the proposed wording.
>>>
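For illustration, a minimal sketch assuming the names in the P1885 revision
under discussion (literal() and a wide_literal() counterpart); nothing in the
proposed wording ties the two results together:

    // e.g. "UTF-8" for the narrow literal encoding together with "UCS-2"
    // (or "UTF-16") for the wide one would be a conforming pair of answers.
    #include <cstdio>
    #include <text_encoding>  // header as proposed in P1885

    int main() {
        std::text_encoding narrow = std::text_encoding::literal();
        std::text_encoding wide   = std::text_encoding::wide_literal();
        std::printf("%s / %s\n", narrow.name(), wide.name());
    }
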
>>
>> P1885 does not exist in a vacuum. And the existing wording does place a
>> requirement between the narrow and wide execution encodings. I am somewhat
>> convinced that P1885 is not the place to address the wchar_t problems re:
>> UCS-2 versus UTF-16, but I will point out that P1885 theoretically
>> exacerbates the problem because the UTF-8 narrow and UCS-2 wide case is not
>> perfectly valid. Previously, only the UTF-16 case was clearly misaligned
>> with the standard; with P1885, the UCS-2 case is also misaligned.
>>
>
> Sure, although I am not aware of platforms for which UCS-2 is currently
> assumed.
>
>
>>
>>
>>>
>>>>
>>>>>
>>>>> Trying to constrain implementation freedom in a field that is
>>>>> plagued by 70+ years of legacy, special cases and exceptions is a minefield.
>>>>> I would like to better understand
>>>>>
>>>>> - What useful scenarios are allowed by this wording
>>>>>
>>>>
>>>> See above.
>>>>
>>>>
>>>>> - What problematic scenarios are prevented by this wording
>>>>>
>>>>
>>>> Implementations choosing to invent new names because a strict reading
>>>> says the registered name is not okay.
>>>>
>>>
>>> The wording intent is to allow an implementation to
>>>
>>> - Return unknown
>>>    - Return an encoding that is different from that used by mbstowcs,
>>>    for example (the wide environment is rather the encoding that you
>>>    would expect wprintf to consume without creating mojibake)
>>>
>> This particular intent has additional limitations: The understanding of
>> locales with the same name is not consistent in practice on various
>> platforms between 32-bit and 64-bit processes.
>>
>
> Do you have specific examples in mind? I am not aware of platforms where
> wchar_t would be 64 bits. Or maybe the size of wchar_t is not your concern;
> can you clarify?
>
>>
>>> - Return an encoding that does not fit into a single wide code unit
>>>    - Return an encoding that matches that used by other components of
>>>    the platform and/or the user expectations of that platform.
>>>
>> User expectations of something completely novel are rather hard to guess
>> at. Should the narrow and wide EBCDIC versions of the same character set be
>> called the same charset? For cases where there are no multibyte characters,
>> most indications are "yes". For cases where there are multibyte characters,
>> it seems to be more up in the air. If the answer is "no", then I imagine we
>> end up with some "x-" prefixes and "-2byte" or "-4byte" suffixes (the
>> endianness is always big endian).
>>
>
> Despite poor naming, IANA specifically registers encodings, hence the
> class `text_encoding`.
> In your scenario, narrow and wide EBCDIC would have different sequences of
> code units and different code unit types, and as such would ideally be
> considered different encodings.
> I will admit it has been difficult for me to find any information about
> wide EBCDIC, so I don't know if or how it is currently referred to by IBM
> implementations, whether there are multiple character sets/encodings
> defined as some flavor of wide EBCDIC, etc.
>
>
>
>>
>>> I do believe that "implementation-defined encoding" gives us a better
>>> outcome than trying to constrain either a relation between narrow and wide
>>> (especially given the state of the standard), or trying to force
>>> implementations to return a registered name when they'd rather not, or to
>>> return an unregistered name when they'd rather not.
>>>
>>
>> The placement of the "implementation-defined" in the currently proposed
>> wording for environment() is hard for me to read this way. Also, the
>> wording for literal() does not say "implementation-defined".
>>
>
> The literal encoding is already implementation-defined, and we do not need
> as much implementation freedom here.
> I am happy to reword that sentence if you think it would be clearer.
>
>
>> Minimal wording (for the general/synopsis section):
>> How a text_encoding object is determined to be representative of a
>> character encoding implemented in the translation or execution environment
>> is implementation-defined.
>>
>
> I would be happy to add that
>
>>
>>
>>>
>>> For example, on Windows the implementation will probably want to return
>>> UTF-16, and we do not want to disallow that.
>>>
>>> And I don't think it's necessary, or even possible, to add wording that
>>> would encourage implementations not to lie,
>>> because it may be that they have to choose between two lies (is Windows
>>> Big5 exactly the Big5 intended by IANA? Maybe not. Is that the answer users
>>> expect anyway? Maybe!)
>>>
>>> There are some historical oddities to contend with and there is a
>>> balance to be found between portability and existing practice.
>>>
>>
>> I would again emphasize that some of the issues are with the novelty of
>> trying to name wide encodings where there has not been sufficient need that
>> there is established existing practice. Do you have a list of existing APIs
>> that provide names for wide encodings out of locale information?
>>
>
> On Windows this is a documented property of the platform: UTF-16.
> On some POSIX platforms (Linux, macOS), this will always be UTF-32.
> On others (like FreeBSD), it is not documented beyond being
> "implementation-defined", and may be some wide Shift-JIS or fixed-width
> EUC. There is currently no API to infer what these wide encodings are.
>
> Possible implementation strategies for FreeBSD include
>
>    - Maintaining a mapping of narrow -> wide encodings, as these
>    platforms are specified to have one (a sketch of this approach follows
>    below).
>    - Modifying the libc to expose the name of the wide encoding in a way
>    similar to nl_langinfo(CODESET), for use by the C++ library. I hope this
>    will be the long-term outcome.
>    - Returning id::unknown, which is what I expect these platforms to do
>    initially.
>
> I suspect the answer for EBCDIC platforms might be very similar?
>
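To make the first strategy above concrete, here is a minimal sketch;
nl_langinfo(CODESET) is the existing POSIX call, while the table contents and
the wide encoding names shown are hypothetical placeholders:

    // Sketch of the narrow -> wide mapping strategy, for platforms whose wide
    // encoding is determined by the narrow locale codeset. Assumes the
    // environment locale has been selected, e.g. via setlocale(LC_ALL, "").
    #include <langinfo.h>
    #include <optional>
    #include <string_view>
    #include <utility>

    std::optional<std::string_view> wide_environment_name() {
        static constexpr std::pair<std::string_view, std::string_view> table[] = {
            {"UTF-8", "UTF-32"},              // placeholder entries only
            {"eucJP", "fixed-width EUC-JP"},  // hypothetical wide name
        };
        std::string_view narrow = nl_langinfo(CODESET);
        for (const auto& [n, w] : table)
            if (n == narrow)
                return w;
        return std::nullopt;  // corresponds to returning id::unknown
    }
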


I have been thinking: would a note along these lines
alleviate your concerns?
"The encoding represented by the returned object [of wide_environment()], if
any, is not required to meet the preconditions of all the standard wide
character functions."

I think it would be a reasonable approach until WG21 has the bandwidth to
address the wchar_t requirements issue more generally.
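
For example, a hypothetical user-side guard (using the wide_environment()
name from the P1885 revision under discussion; nothing in the proposal would
require this) might look like:

    // Even if the reported wide encoding is UTF-16, characters outside the
    // BMP need two wchar_t code units and cannot be handed to the
    // single-code-unit wide ctype functions.
    #include <cwctype>
    #include <text_encoding>  // header as proposed in P1885

    bool wide_is_alpha(char32_t c) {
        using te = std::text_encoding;
        if (te::wide_environment().mib() == te::id::UTF16 && c > 0xFFFF)
            return false;  // would need a surrogate pair; skip iswalpha
        return std::iswalpha(static_cast<wint_t>(c)) != 0;
    }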



>
>
>
>
>>
>>
>>> Especially as the number of problematic scenarios is, thankfully,
>>> extremely small.
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Thanks a lot for your feedback,
>>>>>
>>>>> Corentin
>>>>>
>>>>>
>>>>>
>>>>>> For users it means that implementing a function that would return
>>>>>> some kind of transcoder from a name requires special care

Received on 2021-09-18 08:07:13