Date: Thu, 26 Mar 2020 11:46:10 -0400
On 3/25/20 3:41 PM, Steven R. Loomis via SG16 wrote:
>
>> On Mar 24, 2020, at 8:42 AM, Corentin
>> <corentin.jabot_at_[hidden] <mailto:corentin.jabot_at_[hidden]>> wrote:
>> On Tue, 24 Mar 2020 at 15:42, Steven R. Loomis <srl295_at_[hidden]
>> <mailto:srl295_at_[hidden]>> wrote:
>>
>> Corentin,
>> Please see some of the work done in ICU on encodings.
>>
>> In particular, IANA does not specify the actual mapping. So we
>> have found the IANA names insufficient to distinguish two actual
>> encodings, shift_jis is an example. Comment and datafile:
>> https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt#L93
>>
>> So while IANA names are widely used from a spec point of view, in
>> practice there are many, many challenges with their use in
>> implementation.
>>
>>
>> This proposal is solely about names and not encoding conversion
>> facilities
>
> I understand, but that is exactly how we get into compatibility
> problems today. I mentioned Shift_Jis
> <https://en.wikipedia.org/wiki/Shift_JIS#Multiple_versions> - standard
> name, incompatible implementations. There are many other issues which
> are visible from the mapping table, where an IANA name alone is not
> sufficient.
It seems that Big5 suffers from a similar issue. If my research is correct,
IANA recognizes Big5 and Big5-HKSCS, but the Big5 variant in the
Encoding Standard is a merged version of them that is not a superset of
either.
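
To make the Shift_JIS case concrete, here is a minimal C++ sketch. It
assumes two variants that are both commonly labelled "Shift_JIS": one whose
single-byte half follows JIS X 0201 Roman (byte 0x5C is U+00A5 YEN SIGN,
byte 0x7E is U+203E OVERLINE) and one that follows Windows code page 932
(both bytes keep their ASCII meaning). The enum and function names are
hypothetical and are not part of any library or of the proposal.

```cpp
#include <cstdint>

// Hypothetical tags for two mapping tables that share one registered name.
enum class SjisVariant { JisX0201Roman, WindowsCp932 };

// Decode a single byte in the range 0x00..0x7F under the chosen variant.
// Only the two code points that differ are special-cased; lead bytes and
// half-width katakana (0x80 and above) are out of scope for this sketch.
constexpr char32_t decode_low_byte(SjisVariant v, std::uint8_t b) {
    if (v == SjisVariant::JisX0201Roman) {
        if (b == 0x5C) return U'\u00A5';  // YEN SIGN
        if (b == 0x7E) return U'\u203E';  // OVERLINE
    }
    return static_cast<char32_t>(b);      // identical to ASCII otherwise
}
```

The same byte in a file path, 0x5C, decodes to a backslash under one table
and to a yen sign under the other, so the name alone does not tell a
program which result to expect.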
>
> Giving only names without specifying encoding conversion is less than
> helpful, indeed harmful.
> We know there are incompatibilities. Why give a false sense of
> security for something that’s clearly underspecified?
The initial motivation for this feature was to allow a C++
implementation to communicate to a program the encoding used to encode
character and string literals, the encoding used by the system, and the
locale dependent encoding used by the C and C++ standard libraries.
This goal can't be accomplished without some encoding name or
identifier. One of the goals was to enable this identifier to be used
in order to select a (compatible) encoding when interoperating with
iconv, ICU, Windows APIs, etc...
It sounds like your perspective is that such goals should be
accomplished in some other way. For example, by having the
implementation provide a codec rather than an identifier; ideally a
codec that could be used in interaction with iconv, ICU, etc... (though
this would clearly require enhancements to those code bases).
Do you have other suggestions for how to think about this?
>
> At this point in history, I would recommend using the WHATWG names and
> behaviors exactly. Anything further requires a specific repository of
> mappings and behaviors. Perhaps there could be a namespaced use, such
> as “icu:ibm-1251_P100-1995” or “15897:ISO-8859-1” which precisely
> specifies one table.
WHATWG specifies a more limited set of encodings than IANA does. I'm not
sure how to square this comment with your later one stating that the
IANA mappings are insufficient. If IANA is insufficient, what is it
about the WHATWG standard that would make it sufficient?
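
For illustration only, the namespaced identifiers Steven suggests could be
handled along these lines. This is a sketch under assumptions: the
"registry:name" syntax is taken from his two examples, while the struct,
the function names, and the rule that unqualified names never compare
equal are hypothetical choices made here, not anything specified in the
thread.

```cpp
#include <string_view>

// Hypothetical "registry:name" identifier, e.g. "icu:ibm-1251_P100-1995".
struct EncodingId {
    std::string_view registry;  // e.g. "icu" or "15897"; empty if no prefix
    std::string_view name;      // table name within that registry
};

// Split at the first ':'. An identifier without a prefix gets an empty
// registry, which marks it as not tied to one specific mapping table.
constexpr EncodingId parse_encoding_id(std::string_view id) {
    const auto colon = id.find(':');
    if (colon == std::string_view::npos) {
        return EncodingId{std::string_view{}, id};
    }
    return EncodingId{id.substr(0, colon), id.substr(colon + 1)};
}

// Two identifiers name the same table only if both are qualified and
// agree on registry and name. Bare names are treated as ambiguous.
constexpr bool same_table(const EncodingId& a, const EncodingId& b) {
    return !a.registry.empty() && a.registry == b.registry && a.name == b.name;
}
```

Under this sketch, "15897:ISO-8859-1" and a bare "ISO-8859-1" would not be
considered the same table, which reflects the concern that a name without
a registry does not pin down a mapping.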
>
>> On Mar 24, 2020, at 8:02 AM, keld--- via SG16
>> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
>>
>> iso 15897 provides actual mappings to iso 10646 in posix compatible
>> charmap format.
>> names are compatible with iana, built from some of the same sources.
>> unicode inc. wanted to reinvent the wheel.
>
> Hi, Keld. Actually, this work is based on IBM mapping tables and the
> customer need to explicitly specify character encoding mappings. We
> have critical customer data that would be damaged if we only used IANA
> mappings. The mappings needed aren’t in the 15897 registry.
Thanks, I think this is useful information that the IANA registry is
insufficient in practice for known use cases.
Tom.
> However, the CDRA/CCSID and ICU converter tables are widely
> implemented. For one thing, POSIX charmaps supported the substitution
> controls and multi-way fallback behavior that was needed for some
> tables. That’s my recollection. Also, many converters are better
> specified as algorithms than tables.
>
> --
> Steven R. Loomis | @srl295 | git.io/srl295 <http://git.io/srl295>
>
>
>
Received on 2020-03-26 10:49:02