sg16: Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Corentin <corentin.jabot_at_[hidden]>
Date: Thu, 16 Sep 2021 23:04:19 +0200

On Thu, Sep 16, 2021 at 10:23 PM Hubert Tong <
hubert.reinterpretcast_at_[hidden]> wrote:

> On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot_at_[hidden]> wrote:
>
>>
>>
>> On Tue, Sep 14, 2021 at 7:30 AM Hubert Tong <
>> hubert.reinterpretcast_at_[hidden]> wrote:
>>
>>> On Mon, Sep 13, 2021 at 2:27 AM Corentin <corentin.jabot_at_[hidden]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Sep 13, 2021 at 7:44 AM Hubert Tong <
>>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>>
>>>>> In P1885, a registered character set is one that is in (at the point
>>>>> when the paper was written) the IANA character set registry. P1885 also
>>>>> provides static functions to query about the encoding used in either the
>>>>> translation or the execution environment. In some cases (involving subsets
>>>>> or supersets), there are questions of when an implementation should return
>>>>> a registered character set as the result of such static functions.
>>>>>
>>>>> The environment-implements-superset case presents itself in relation
>>>>> to csBig5. The system encodings for "big5" on Windows and AIX contain
>>>>> characters that are not part of the common base of Big5; however, both are
>>>>> also missing characters from Big5-2003:
>>>>> Big5-2003 has U+7881 as F9 D6 and U+2460 as C6 A1.
>>>>> Windows has U+7881 as F9 D6 but not U+2460 as C6 A1.
>>>>> AIX does not have U+7881 as F9 D6 but does have U+2460 as C6 A1.
>>>>>
>>>>> So, the environment-implements-superset case can, in practical terms,
>>>>> be generalized as being about divergent implementations of "charsets".
>>>>> Of course, that generalization could also account for some
>>>>> environment-implements-subset cases; however, in addition to more mundane
>>>>> reasons, the environment-implements-subset case also arises from a
>>>>> technicality: It is questionable whether or not a POSIX environment that
>>>>> uses a UTF-8 encoding paired with a 2-byte (UCS-2) wchar_t can be said to
>>>>> have UTF-8 as the environment text encoding because the characters outside
>>>>> of the BMP cannot (based on wchar_t-representability) be considered members
>>>>> of the character set associated with the environment.
>>>>>
>>>>> So it seems we have some questions:
>>>>> Are the design goals better met or not by allowing divergent
>>>>> implementations of "charsets" to be identified as being the same registered
>>>>> character set?
>>>>> When an implementation indicates a specific environment encoding, do
>>>>> the design goals require that all associated characters or members of the
>>>>> associated code space be wchar_t-representable?
>>>>>
>>>>> It may be useful to characterize the questions as whether the results
>>>>> of the static functions are meant to be more of a hint (with few
>>>>> guarantees) or more of a promise.
>>>>>
>>>>
>>>>
>>>> I think we talked about this before, but as you outlined, mapping an
>>>> encoding name to a specific charset or encoder sometimes
>>>> requires out-of-band information about the platform where the text was
>>>> created.
>>>> The web platform also has yet another definition of big5
>>>> https://encoding.spec.whatwg.org/big5.html
>>>>
>>>> IANA implies uniqueness, and some encodings are registered with a
>>>> precise mapping (RFC 2978), although in a few cases tracking down what
>>>> that mapping is proves difficult.
>>>>
>>>> > Each assigned name MUST uniquely identify a single charset. All
>>>> charset names MUST be suitable for use as the value of a MIME content
>>>> type charset parameter and hence MUST conform to MIME parameter value
>>>> syntax. This applies even if the specific charset being registered
>>>> is not suitable for use with the "text" media type.
>>>>
>>>> Big5-HKSCS registration points to a document (which wasn't exactly easy
>>>> to find
>>>> http://web.archive.org/web/20030324074656/http://www.info.gov.hk/digital21/eng/hkscs/download/e_hkscs.pdf
>>>> )
>>>> But that is unfortunately not the case for Big5.
>>>> The issue is that these things were registered after being widely
>>>> deployed by several vendors, so we are left
>>>> with minor implementation divergence.
>>>>
>>>> I do not think it needs wording, or special care.
>>>> If a vendor considers that their character set maps to a registered
>>>> IANA character set, they should be able to express it under P1885. I don't
>>>> think that will lead to more abuse
>>>> than the current situation.
>>>>
>>>
>>> Having the standard written as if the ambiguity does not or should not
>>> exist when we fully intend that it does (because we can't practically
>>> prevent it) is not helpful. Also, "should be able to" is different from
>>> "should".
>>>
>>> I believe wording should be present:
>>> An implementation may provide a return value representing a registered
>>> character set in lieu of one representing an unregistered variant. When the
>>> unregistered variant is the traditional realization of the registered
>>> character set in the context of the implementation, an implementation
>>> should provide a return value representing the registered character set. In
>>> addition to the encoding used, the implementation may further restrict the
>>> set of valid characters. In the absence of a conventional name for the
>>> encoding as restricted, implementations should provide a return value
>>> without regard for the restriction.
>>>
>>>
>> I for some reason missed this email
>>
>> > When the unregistered variant is the traditional realization of the
>> registered character set in the context of the implementation, an
>> implementation should provide a return value representing the registered
>> character set.
>>
>> I am struggling to understand what we are allowing here
>>
>
> The general allowance was in the sentence before this one. The answers to
> your further questions below would hopefully clarify.
>
>
>>
>> Can, e.g., "Big5" be returned on Windows? Yes, there is no precise
>> definition of Big5 that can comprehensively account for all code points
>> without considering the platforms. And this is probably the right answer to
>> give to users.
>>
>
> The intended answer is "yes". This second sentence moves beyond allowing
> that return value to encouraging it (which clarifies between the
> "precision" versus "utility" design intentions).
>
>
>>
>> Can, e.g., "Shift-JIS" be returned on Windows? Yes, technically. But the
>> Windows implementation of Shift-JIS is very well documented, with a
>> specific name that is registered. Implementations and users will expect
>> that to be returned instead.
>>
>
> The sentence happens to cover this case. The variant is registered (and
> the sentence speaks of "unregistered").
>
>
>>
>> > unregistered variant is the traditional realization of the registered
>> character set.
>>
>> This is ill-defined. How do you define that when not all registered
>> charsets have an actual associated charset? Many times only a name is
>> registered.
>> What is considered a variant and what is considered traditional?
>>
>
> If the system realization can reasonably be considered the registered
> character set (as opposed to a variant), then we don't really reach this
> point in the logic.
>
>
>>
>> > an implementation should provide a return value representing the
>> registered character set
>> Maybe. Depending on the specific encoding, an implementation may want to
>> do something different that more closely matches the expectations of users
>> on that platform.
>>
>
> I think that is what the wording is trying to say. I am guessing that you
> are speaking to a case where the registered character set should also have
> an alias that matches the expectations of users on that platform.
>
>
>>
>> Given the many many encodings, a lot are only separated by one or two
>> codepoints. How in that context do we define variants?
>>
>
> I think you are raising a new question about the design: If a system has
> more than one implementation of the same encoding, should it return the
> same registered character set to represent more than one of those encodings?
>

Well, 2 implementations of the same encoding would be the same, and for
registered character sets, aliases would cover that use case.
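
To make the alias point concrete, here is a minimal sketch of a loose name comparison in the spirit of the Unicode UTS #22 charset alias matching rules (ignore anything that is not a letter or digit, ignore case, ignore leading zeros in numbers). The namespace and the names same_charset_name and cursor are made up for illustration; this is not the P1885 interface, and the exact treatment of zeros is my reading of the rule, so treat it as an assumption.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
constexpr char to_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Walks a name and yields only its significant characters.
struct cursor {
    std::string_view s;
    std::size_t pos = 0;
    bool prev_digit = false;  // was the last character we kept a digit?

    // Returns the next significant character (lowercased), or '\0' at the end.
    constexpr char next() {
        while (pos < s.size()) {
            const char c = s[pos++];
            if (is_alpha(c)) {
                prev_digit = false;
                return to_lower(c);
            }
            if (is_digit(c)) {
                if (c == '0' && !prev_digit) continue;  // drop a leading zero
                prev_digit = true;
                return c;
            }
            // Anything else ('-', '_', ' ', ...) is ignored.
        }
        return '\0';
    }
};

// True if both names are equal under the loose comparison above.
constexpr bool same_charset_name(std::string_view a, std::string_view b) {
    cursor ca{a};
    cursor cb{b};
    for (;;) {
        const char x = ca.next();
        const char y = cb.next();
        if (x != y) return false;
        if (x == '\0') return true;
    }
}

}  // namespace sketch
```

With this, "UTF-8", "utf8" and "UTF_08" all compare equal, while "ISO-8859-1" and "ISO-8859-10" stay distinct. Note that this only compares names; it says nothing about whether two platforms implement the same mapping under that name.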

>
>
>>
>> > In addition to the encoding used, the implementation may further
>> restrict the set of valid characters
>>
>> I am not sure I understand the goal of this sentence. P1885 is
>> purposefully somewhat removed from precise character sets. For which sets
>> of operations would that restriction apply?
>>
>
> This mainly occurs in the 2-byte wchar_t case. Some implementations take
> the strategy of using UTF-8 encoding but consider only scalar values in the
> BMP range to be valid characters.
>

UTF-8 happens to be one of the encodings that are precisely defined and
specified.
An encoding that does not map to all scalar values does not fit the
definition of UTF-8. Likewise, WTF-8, CESU-8, BOCU, etc. are NOT UTF-8.
Can an implementation still advertise UTF-8? Sure, I don't see any value in
trying to prevent hostile implementations.
Do I want to specifically bless that behavior? Nope.
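
For illustration, a small sketch of why the 2-byte wchar_t case is a restriction of UTF-8 rather than UTF-8 itself: every Unicode scalar value has a UTF-8 encoding, but only those in the BMP fit into a single 16-bit code unit. The function names are hypothetical, and the 16-bit limit models the wchar_t width assumed in this thread, not any particular platform.

```cpp
namespace sketch {

// A Unicode scalar value is any code point except the surrogates.
constexpr bool is_scalar_value(char32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Number of UTF-8 code units needed for c, or 0 if c is not a scalar value.
constexpr int utf8_length(char32_t c) {
    if (!is_scalar_value(c)) return 0;
    if (c < 0x80) return 1;
    if (c < 0x800) return 2;
    if (c < 0x10000) return 3;
    return 4;
}

// True if c can be stored in one 16-bit code unit (i.e. it is in the BMP).
constexpr bool fits_in_one_16bit_unit(char32_t c) {
    return is_scalar_value(c) && c <= 0xFFFF;
}

}  // namespace sketch
```

U+7881 needs three UTF-8 code units and fits in one 16-bit unit; U+1F600 needs four UTF-8 code units and does not fit. An implementation that rejects the latter accepts a strict subset of what UTF-8 defines.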

>
>
>>
>> > In the absence of a conventional name for the encoding as restricted,
>> implementations should provide a return value without regard for the
>> restriction.
>>
>> Again, how do you define what's a conventional name?
>>
>
> I'm happier with overt handwaving than less obvious handwaving. This
> sentence is meant to allow "UTF-8" as the result even in implementations
> where not all Unicode scalar values are supported by mbstowcs.
>

We do not mention mbstowcs anywhere.

P1885 is not the place to address the fact that the constraints the standard
places on wchar_t are not representative of existing practice.
P1885 also places no requirements on the relation between the narrow and wide
literals, nor does it mention representability.
So returning UTF-8 for narrow and UCS-2 for wide would be perfectly valid
with the proposed wording.

>
>
>>
>> Trying to constrain implementation freedom in a field that is plagued by
>> 70+ years of legacy, special cases and exceptions is a minefield.
>> I would like to better understand
>>
>> - What useful scenarios are allowed by this wording
>>
>
> See above.
>
>
>> - What problematic scenarios are prevented by this wording
>>
>
> Implementations choosing to invent new names because a strict reading says
> the registered name is not okay.
>

The intent of the wording is to allow an implementation to:

   - Return unknown
   - Return an encoding that is different from the one used by mbstowcs, for
   example (the wide environment encoding is rather the encoding that you
   would expect wprintf to consume without creating mojibake)
   - Return an encoding that does not fit into a single wide code unit
   - Return an encoding that matches the one used by other components of the
   platform and/or the user expectations on that platform.

I do believe that "implementation-defined encoding" gives us a better
outcome than trying to constrain the relation between narrow and wide
(especially given the state of the standard), or trying to force
implementations to return a registered name when they'd rather not, or
to return an unregistered name when they'd rather not.

For example, on Windows the implementation will probably want to return
UTF-16 and we do not want to disallow that.

And I don't think it's necessary, or even possible, to add wording that
would encourage implementations not to lie,
because it may be that they have to choose between two lies (is Windows Big5
exactly the Big5 intended by IANA? Maybe not. Is that the answer users
expect anyway? Maybe!)
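
As a rough sketch of the "two lies" problem, the snippet below records only the two data points given earlier in this thread (F9 D6 and C6 A1) for three realizations that all go by the name "big5". The enum, struct and function names are hypothetical, and nothing beyond those two probes is modeled; in particular it does not say what Windows or AIX map those byte sequences to instead.

```cpp
namespace sketch {

// Hypothetical labels for three realizations sharing the name "big5".
enum class big5_flavor { big5_2003, windows, aix };

// Only the two byte sequences mentioned in this thread are modeled.
struct probes {
    bool f9d6_is_u7881;  // does F9 D6 decode to U+7881?
    bool c6a1_is_u2460;  // does C6 A1 decode to U+2460?
};

constexpr probes probes_for(big5_flavor f) {
    switch (f) {
        case big5_flavor::big5_2003: return {true, true};
        case big5_flavor::windows:   return {true, false};
        case big5_flavor::aix:       return {false, true};
    }
    return {false, false};  // unreachable for valid enumerators
}

// True if two flavors decode both probe sequences the same way.
constexpr bool agree_on_probes(big5_flavor a, big5_flavor b) {
    return probes_for(a).f9d6_is_u7881 == probes_for(b).f9d6_is_u7881
        && probes_for(a).c6a1_is_u2460 == probes_for(b).c6a1_is_u2460;
}

}  // namespace sketch
```

No two of the three flavors agree on both probes, so a user who wants an exact transcoder has to key it on the name plus out-of-band knowledge of the platform, whatever name the implementation reports.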

There are some historical oddities to contend with and there is a balance
to be found between portability and existing practice.
Especially as the number of problematic scenarios is, thankfully,
extremely small.

>
>
>>
>> Thanks a lot for your feedback,
>>
>> Corentin
>>
>>
>>
>>>> For users it means that implementing a function that would return some
>>>> kind of transcoder from a name requires special care
>>>>
>>>>
>>>>
>>>>
>>>>
>>>

Received on 2021-09-16 16:04:32