Re: [SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Thu, 16 Sep 2021 16:23:28 -0400
On Thu, Sep 16, 2021 at 1:42 PM Corentin <corentin.jabot_at_[hidden]> wrote:

>
>
> On Tue, Sep 14, 2021 at 7:30 AM Hubert Tong <
> hubert.reinterpretcast_at_[hidden]> wrote:
>
>> On Mon, Sep 13, 2021 at 2:27 AM Corentin <corentin.jabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Mon, Sep 13, 2021 at 7:44 AM Hubert Tong <
>>> hubert.reinterpretcast_at_[hidden]> wrote:
>>>
>>>> In P1885, a registered character set is one that is in (at the point
>>>> when the paper was written) the IANA character set registry. P1885 also
>>>> provides static functions to query about the encoding used in either the
>>>> translation or the execution environment. In some cases (involving subsets
>>>> or supersets), there are questions of when an implementation should return
>>>> a registered character set as the result of such static functions.
>>>>
>>>> The environment-implements-superset case presents itself in relation to
>>>> csBig5. The system encodings for "big5" on Windows and AIX contain
>>>> characters that are not part of the common base of Big5; however, both are
>>>> also missing characters from Big5-2003:
>>>> Big5-2003 has U+7881 as F9 D6 and U+2460 as C6 A1.
>>>> Windows has U+7881 as F9 D6 but not U+2460 as C6 A1.
>>>> AIX does not have U+7881 as F9 D6 but does have U+2460 as C6 A1.
>>>>
>>>> So, the environment-implements-superset case can, in practical terms,
>>>> be generalized as being about divergent implementations of "charsets".
>>>> Of course, that generalization could also account for some
>>>> environment-implements-subset cases; however, in addition to more mundane
>>>> reasons, the environment-implements-subset case also arises from a
>>>> technicality: It is questionable whether or not a POSIX environment that
>>>> uses a UTF-8 encoding paired with a 2-byte (UCS-2) wchar_t can be said to
>>>> have UTF-8 as the environment text encoding because the characters outside
>>>> of the BMP cannot (based on wchar_t-representability) be considered members
>>>> of the character set associated with the environment.
>>>>
>>>> So it seems we have some questions:
>>>> Are the design goals better met or not by allowing divergent
>>>> implementations of "charsets" to be identified as being the same registered
>>>> character set?
>>>> When an implementation indicates a specific environment encoding, do
>>>> the design goals require that all associated characters or members of the
>>>> associated code space be wchar_t-representable?
>>>>
>>>> It may be useful to characterize the questions as whether the result of
>>>> the static functions are meant to be more of a hint (with few guarantees)
>>>> or more of a promise.
>>>>
>>>
>>>
>>> I think we talked about this before, but as you outlined, mapping an
>>> encoding name to a specific charset or encoder sometimes
>>> requires out-of-band information about the platform where the text was
>>> created.
>>> The web platform also has yet another definition of big5
>>> https://encoding.spec.whatwg.org/big5.html
>>>
>>> IANA implies uniqueness and some encodings are registered with a precise
>>> mapping (RFC 2978) - although in a few cases tracking down what that
>>> mapping is can be difficult.
>>>
>>> > Each assigned name MUST uniquely identify a single charset. All
>>> charset names MUST be suitable for use as the value of a MIME content
>>> type charset parameter and hence MUST conform to MIME parameter value
>>> syntax. This applies even if the specific charset being registered
>>> is not suitable for use with the "text" media type.
>>>
>>> Big5-HKSCS registration points to a document (which wasn't exactly easy
>>> to find
>>> http://web.archive.org/web/20030324074656/http://www.info.gov.hk/digital21/eng/hkscs/download/e_hkscs.pdf
>>> )
>>> But that is unfortunately not the case for Big5.
>>> The issue is that these things were registered after being widely
>>> deployed by several vendors, so we are left
>>> with minor implementation divergence.
>>>
>>> I do not think it needs wording, or special care.
>>> If a vendor considers that their character set maps to a registered IANA
>>> character set, they should be able to express it under P1885 - I don't
>>> think that will lead to more abuse
>>> than the current situation.
>>>
>>
>> Having the standard written as if the ambiguity does not or should not
>> exist when we fully intend that it does (because we can't practically
>> prevent it) is not helpful. Also, "should be able to" is different from
>> "should".
>>
>> I believe wording should be present:
>> An implementation may provide a return value representing a registered
>> character set in lieu of one representing an unregistered variant. When the
>> unregistered variant is the traditional realization of the registered
>> character set in the context of the implementation, an implementation
>> should provide a return value representing the registered character set. In
>> addition to the encoding used, the implementation may further restrict the
>> set of valid characters. In the absence of a conventional name for the
>> encoding as restricted, implementations should provide a return value
>> without regard for the restriction.
>>
>>
> I for some reason missed this email
>
> > When the unregistered variant is the traditional realization of the
> registered character set in the context of the implementation, an
> implementation should provide a return value representing the registered
> character set.
>
> I am struggling to understand what we are allowing here
>

The general allowance was in the sentence before this one. The answers to
your further questions below would hopefully clarify.


>
> Can, e.g., "Big5" be returned on Windows? Yes, there is no precise definition
> of Big5 that can comprehensively account for all code points without
> considering the platforms. And this is probably the right answer to give to
> users.
>

The intended answer is "yes". This second sentence moves beyond allowing
that return value to encouraging it (which clarifies the choice between the
"precision" and "utility" design intentions).
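
To make the divergence concrete, here is a rough sketch (not proposed
wording, and not any implementation's actual tables) that models only the
two code points mentioned earlier in the thread. The names Big5Realization,
reported_name, and decode are made up for illustration; the point is just
that all three realizations would be reported as "Big5" while decoding the
same bytes differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical model of a "Big5" realization, restricted to the two
// mappings discussed in this thread.
struct Big5Realization {
    std::string_view platform;
    bool maps_f9d6;  // F9 D6 -> U+7881
    bool maps_c6a1;  // C6 A1 -> U+2460
};

// Data taken from the comparison earlier in the thread.
inline constexpr std::array<Big5Realization, 3> realizations{{
    {"Big5-2003", true, true},
    {"Windows", true, false},
    {"AIX", false, true},
}};

// Assumption: under the suggested wording, each realization is reported
// under the registered name rather than an invented one.
constexpr std::string_view reported_name(const Big5Realization&) {
    return "Big5";
}

// Decodes a two-byte code, or returns std::nullopt when this realization
// (as modelled here) has no mapping for it.
std::optional<char32_t> decode(const Big5Realization& r, std::uint16_t code) {
    assert((code >> 8) >= 0x81);  // precondition: plausible Big5 lead byte
    if (code == 0xF9D6 && r.maps_f9d6) return char32_t{0x7881};
    if (code == 0xC6A1 && r.maps_c6a1) return char32_t{0x2460};
    return std::nullopt;
}
```

A user who builds a transcoder from the name alone would get one of these
behaviours without knowing which, which is the "hint versus promise"
question again.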


>
> Can, e.g., "Shift-JIS" be returned on Windows? Yes, technically. But the
> Windows implementation of Shift-JIS is very well documented, with a
> specific name that is registered. Implementations and users will expect
> that to be returned instead.
>

The sentence happens to cover this case. The variant is registered (and the
sentence speaks of "unregistered").


>
> > unregistered variant is the traditional realization of the registered
> character set.
>
> This is ill-defined. How do you define that when not all registered
> charsets have an actual associated charset? Many times only a name is
> registered.
> What is considered a variant and what is considered traditional?
>

If the system realization can reasonably be considered the registered
character set (as opposed to a variant), then we don't really reach this
point in the logic.


>
> > an implementation should provide a return value representing the
> registered character set
> Maybe. Depending on the specific encoding, an implementation may want to
> do something different that more closely matches the expectations of users
> on that platform.
>

I think that is what the wording is trying to say. I am guessing that you
are speaking to a case where the registered character set should also have
an alias that matches the expectations of users on that platform.


>
> Given the many many encodings, a lot are only separated by one or two
> codepoints. How in that context do we define variants?
>

I think you are raising a new question about the design: If a system has
more than one implementation of the same encoding, should it return the
same registered character set to represent more than one of those encodings?


>
> > In addition to the encoding used, the implementation may further
> restrict the set of valid characters
>
> I am not sure I understand the goal of this sentence. P1885 is
> purposefully somewhat removed from precise character sets. For which sets
> of operations would that restriction apply?
>

This mainly occurs in the 2-byte wchar_t case. Some implementations take
the strategy of using UTF-8 encoding but consider only scalar values in the
BMP range to be valid characters.
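
As a minimal sketch of that restriction (assuming a 16-bit wchar_t that
holds one UCS-2 code unit; the helper names below are hypothetical and not
part of P1885 or any library), the following decodes one UTF-8 sequence and
checks whether the resulting scalar value would fit. The bytes are
well-formed UTF-8 in both cases; only the wchar_t-representability differs.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: decodes the first UTF-8 sequence in `s`.
// Returns std::nullopt for empty, truncated, overlong, surrogate, or
// otherwise ill-formed input.
std::optional<char32_t> decode_first_scalar(std::string_view s) {
    if (s.empty()) return std::nullopt;
    const auto b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) return static_cast<char32_t>(b0);

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest value that needs `len` bytes
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return std::nullopt;

    if (s.size() < len) return std::nullopt;
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return std::nullopt;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return std::nullopt;
    return cp;
}

// Assumption: a 2-byte wchar_t can represent exactly the BMP scalar values.
bool representable_in_16bit_wchar(char32_t scalar) {
    assert(scalar <= 0x10FFFF);  // precondition: a Unicode code point
    return scalar <= 0xFFFF;
}
```

With this, U+7881 (E7 A2 81) is accepted and U+1F600 (F0 9F 98 80) is not,
even though both are valid UTF-8; the question for the wording is whether
such an environment may still report "UTF-8".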


>
> > In the absence of a conventional name for the encoding as restricted,
> implementations should provide a return value without regard for the
> restriction.
>
> Again, how do you define what's a conventional name?
>

I'm happier with overt handwaving than less obvious handwaving. This
sentence is meant to allow "UTF-8" as the result even in implementations
where not all Unicode scalar values are supported by mbstowcs.


>
> Trying to constrain implementation freedom in a field that is plagued by
> 70+ years of legacy, special cases and exceptions is a minefield.
> I would like to better understand
>
> - What useful scenarios are allowed by this wording
>

See above.


> - What problematic scenarios are prevented by this wording
>

Implementations choosing to invent new names because a strict reading says
the registered name is not okay.


>
> Thanks a lot for your feedback,
>
> Corentin
>
>
>
>> For users it means that implementing a function that would return some
>>> kind of transcoder from a name requires special care
>>>
>>>
>>>
>>>
>>>
>>

Received on 2021-09-16 15:23:58