sg16: Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Tue, 14 Sep 2021 01:29:58 -0400

On Mon, Sep 13, 2021 at 2:27 AM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Mon, Sep 13, 2021 at 7:44 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> In P1885, a registered character set is one that is in (at the point when
>> the paper was written) the IANA character set registry. P1885 also provides
>> static functions to query about the encoding used in either the translation
>> or the execution environment. In some cases (involving subsets or
>> supersets), there are questions of when an implementation should return a
>> registered character set as the result of such static functions.
>>
>> The environment-implements-superset case presents itself in relation to
>> csBig5. The system encodings for "big5" on Windows and AIX contain
>> characters that are not part of the common base of Big5; however, both are
>> also missing characters from Big5-2003:
>> Big5-2003 has U+7881 as F9 D6 and U+2460 as C6 A1.
>> Windows has U+7881 as F9 D6 but not U+2460 as C6 A1.
>> AIX does not have U+7881 as F9 D6 but does have U+2460 as C6 A1.
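
To make the divergence concrete, here is a minimal sketch of the three mappings listed above. The enum `big5_variant` and the function `decode` are hypothetical names for illustration only; they are not part of P1885 or of any platform API. Only the two byte sequences mentioned above are modelled.

```cpp
#include <cstdint>

// Hypothetical labels for three things that all get called "Big5".
enum class big5_variant { big5_2003, windows, aix };

// Returns the code point that Big5-2003 assigns to the two-byte sequence
// if the given variant agrees with that assignment, and 0 otherwise.
// A result of 0 does not distinguish "unmapped" from "mapped to something
// else"; the data above does not say which applies.
constexpr char32_t decode(big5_variant v, std::uint8_t lead, std::uint8_t trail) {
    const bool f9d6 = (lead == 0xF9 && trail == 0xD6);
    const bool c6a1 = (lead == 0xC6 && trail == 0xA1);
    if (f9d6 && v != big5_variant::aix) return U'\u7881';      // AIX lacks this one
    if (c6a1 && v != big5_variant::windows) return U'\u2460';  // Windows lacks this one
    return 0;
}
```

Under these assumptions, neither platform variant is a superset or a subset of Big5-2003, or of the other; each agrees with it on one of the two sequences.
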
>>
>> So, the environment-implements-superset case can, in practical terms, be
>> generalized as being about divergent implementations of "charsets".
>> Of course, that generalization could also account for some
>> environment-implements-subset cases; however, in addition to more mundane
>> reasons, the environment-implements-subset case also arises from a
>> technicality: It is questionable whether or not a POSIX environment that
>> uses a UTF-8 encoding paired with a 2-byte (UCS-2) wchar_t can be said to
>> have UTF-8 as the environment text encoding because the characters outside
>> of the BMP cannot (based on wchar_t-representability) be considered members
>> of the character set associated with the environment.
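
The wchar_t-representability technicality can be checked with a short sketch. It assumes that wchar_t values are Unicode code points (as on platforms that define `__STDC_ISO_10646__`); the names `wchar_can_hold` and `all_of_unicode_fits_in_wchar` are hypothetical.

```cpp
#include <cstdint>
#include <limits>

// True if a single wchar_t can hold the code point, under the assumption
// that wchar_t values are Unicode code points. Both sides are widened so
// the comparison is valid whether wchar_t is signed or unsigned.
constexpr bool wchar_can_hold(char32_t cp) {
    return static_cast<std::uint64_t>(cp) <=
           static_cast<std::uint64_t>(std::numeric_limits<wchar_t>::max());
}

// False for a 16-bit wchar_t: characters outside the BMP are not
// representable, which is the case in question.
constexpr bool all_of_unicode_fits_in_wchar = wchar_can_hold(0x10FFFF);
```

With a 2-byte wchar_t, `all_of_unicode_fits_in_wchar` is false even though the narrow encoding is nominally UTF-8.
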
>>
>> So it seems we have some questions:
>> Are the design goals better met or not by allowing divergent
>> implementations of "charsets" to be identified as being the same registered
>> character set?
>> When an implementation indicates a specific environment encoding, do the
>> design goals require that all associated characters or members of the
>> associated code space be wchar_t-representable?
>>
>> It may be useful to characterize the questions as whether the results of
>> the static functions are meant to be more of a hint (with few guarantees)
>> or more of a promise.
>>
>
>
> I think we talked about this before, but as you outlined, mapping an
> encoding name to a specific charset or encoder sometimes
> requires out-of-band information about the platform where the text was
> created.
> The web platform also has yet another definition of big5
> https://encoding.spec.whatwg.org/big5.html
>
> IANA implies uniqueness, and some encodings are registered with a precise
> mapping (RFC 2978), although in a few cases tracking down what that mapping
> is proves difficult.
>
> > Each assigned name MUST uniquely identify a single charset. All
> > charset names MUST be suitable for use as the value of a MIME content
> > type charset parameter and hence MUST conform to MIME parameter value
> > syntax. This applies even if the specific charset being registered
> > is not suitable for use with the "text" media type.
>
> The Big5-HKSCS registration points to a document (which wasn't exactly easy
> to find):
> http://web.archive.org/web/20030324074656/http://www.info.gov.hk/digital21/eng/hkscs/download/e_hkscs.pdf
> But that is unfortunately not the case for Big5.
> The issue is that these things were registered after being widely deployed
> by several vendors, so we are left
> with minor implementation divergence.
>
> I do not think it needs wording, or special care.
> If a vendor considers that their character set maps to a registered IANA
> character set, they should be able to express it under P1885 - I don't
> think that will lead to more abuse than the current situation.
>

Having the standard written as if the ambiguity does not or should not
exist when we fully intend that it does (because we can't practically
prevent it) is not helpful. Also, "should be able to" is different from
"should".

I believe wording should be present:
An implementation may provide a return value representing a registered
character set in lieu of one representing an unregistered variant. When the
unregistered variant is the traditional realization of the registered
character set in the context of the implementation, an implementation
should provide a return value representing the registered character set. In
addition to the encoding used, the implementation may further restrict the
set of valid characters. In the absence of a conventional name for the
encoding as restricted, implementations should provide a return value
without regard for the restriction.
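
As a rough illustration of how an implementation might act on that wording, here is a minimal sketch. Everything in it is hypothetical: `charset_id`, `environment_encoding`, and `reported_id` are illustrative names and not the interface proposed in P1885, and the "may" and "should" provisions are collapsed into a single fixed policy.

```cpp
// Hypothetical identifiers; `other` stands for "no registered character set".
enum class charset_id { other, Big5, UTF8 };

// Hypothetical description of what the environment actually uses.
struct environment_encoding {
    charset_id exact;              // registered set matched exactly, else other
    charset_id traditional_base;   // registered set this variant traditionally
                                   // realizes on this implementation, else other
    bool restricts_characters;     // valid characters are further restricted
    charset_id restricted_name;    // conventional name for the encoding as
                                   // restricted, else other
};

constexpr charset_id reported_id(environment_encoding e) {
    // A conventional name for the restricted form, if any, takes precedence.
    if (e.restricts_characters && e.restricted_name != charset_id::other)
        return e.restricted_name;
    // Otherwise report without regard for the restriction.
    if (e.exact != charset_id::other)
        return e.exact;
    // An unregistered variant that is the traditional realization of a
    // registered character set is reported as that registered set.
    return e.traditional_base;
}
```

For example, a platform "big5" with no exact match but with Big5 as its traditional base would report Big5, and UTF-8 paired with a restricted wchar_t and no conventional name for the restriction would still report UTF-8.
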

> For users, it means that implementing a function that would return some
> kind of transcoder from a name requires special care.
>

Received on 2021-09-14 00:30:30