On 3/25/20 3:41 PM, Steven R. Loomis via SG16 wrote:

On Mar 24, 2020, at 8:42 AM, Corentin <corentin.jabot@gmail.com> wrote:

On Tue, 24 Mar 2020 at 15:42, Steven R. Loomis <srl295@gmail.com> wrote:

Corentin,
Please see some of the work done in ICU on encodings.

In particular, IANA does not specify the actual mapping. As a result, we have found the IANA names insufficient to distinguish between actual encodings that differ; shift_jis is an example. Comment and datafile:

https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt#L93

So while IANA names are widely used from a spec point of view, in practice there are many, many challenges with their use in implementation.

This proposal is solely about names and not encoding conversion facilities

I understand, but that is exactly how we get into compatibility problems today. I mentioned Shift_JIS <https://en.wikipedia.org/wiki/Shift_JIS#Multiple_versions> - standard name, incompatible implementations. There are many other issues which are visible from the mapping table, where an IANA name alone is not sufficient.
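To make the "standard name, incompatible implementations" point concrete, here is a minimal sketch using Python's bundled codecs. It assumes CPython, where "shift_jis" follows the JIS X 0208 mapping and "cp932" is the Microsoft variant that is often served under the same Shift_JIS label. The byte pair 0x81 0x60 is picked only as an illustration; it is not taken from the ICU data file.

```python
# Illustration only: two decoders that are both commonly reached through
# the label "Shift_JIS" in practice. Assumes CPython's bundled codecs.
data = b"\x81\x60"

as_jis = data.decode("shift_jis")  # JIS X 0208 based mapping
as_ms = data.decode("cp932")       # Microsoft code page 932 mapping


def codepoints(text):
    """Format each character of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Same bytes, same nominal encoding name, different Unicode results:
# U+301C WAVE DASH versus U+FF5E FULLWIDTH TILDE.
print("shift_jis:", codepoints(as_jis))
print("cp932:    ", codepoints(as_ms))
```

A program that is told only "Shift_JIS" cannot know which of the two results the producer of the bytes intended.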

It seems that Big5 suffers from a similar issue. If my research is correct, IANA recognizes Big5 and Big5-HKSCS, but the Big5 variant in the Encoding Standard is a merged version of them that is not a superset of either.

Giving only names without specifying encoding conversion is less than helpful, indeed harmful.

We know there are incompatibilities. Why give a false sense of security for something that’s clearly underspecified?

The initial motivation for this feature was to allow a C++ implementation to communicate to a program the encoding used to encode character and string literals, the encoding used by the system, and the locale dependent encoding used by the C and C++ standard libraries. This goal can't be accomplished without some encoding name or identifier. One of the goals was to enable this identifier to be used in order to select a (compatible) encoding when interoperating with iconv, ICU, Windows APIs, etc...

It sounds like your perspective is that such goals should be accomplished in some other way. For example, by having the implementation provide a codec rather than an identifier; ideally a codec that could be used in interaction with iconv, ICU, etc... (though this would clearly require enhancements to those code bases).

Do you have other suggestions for how to think about this?

At this point in history, I would recommend using the WHATWG names and behaviors exactly. Anything further requires a specific repository of mappings and behaviors. Perhaps there could be a namespaced use, such as “icu:ibm-1251_P100-1995” or “15897:ISO-8859-1” which precisely specifies one table.
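One way to picture the namespaced form is a small parser that splits a registry prefix from the table name. This is only a sketch: the `EncodingId` type and `parse_encoding_id` function are hypothetical names invented here, and the rules (first colon separates the registry, registry compared case-insensitively, an identifier without a colon has no registry) are assumptions, not part of any proposal.

```python
from typing import NamedTuple, Optional


class EncodingId(NamedTuple):
    """Hypothetical (registry, name) pair, e.g. ("icu", "ibm-1251_P100-1995")."""
    registry: Optional[str]  # None when the identifier carries no namespace
    name: str


def parse_encoding_id(text: str) -> EncodingId:
    """Split 'registry:name' at the first colon (an assumed syntax)."""
    registry, sep, name = text.partition(":")
    if not sep:
        # No colon at all: an unqualified name such as "UTF-8".
        return EncodingId(None, text)
    if not registry or not name:
        raise ValueError(f"malformed encoding identifier: {text!r}")
    # Assumption: registry prefixes are case-insensitive, table names are not.
    return EncodingId(registry.lower(), name)


print(parse_encoding_id("icu:ibm-1251_P100-1995"))
print(parse_encoding_id("15897:ISO-8859-1"))
print(parse_encoding_id("UTF-8"))
```

The point of the prefix is that a consumer can refuse, or fall back explicitly, when it does not have the named registry, rather than silently substituting a similarly named table.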

WHATWG specifies a more limited set of encodings than IANA does. I'm not sure how to square this comment with your later one stating that the IANA mappings are insufficient. If IANA is insufficient, what is it about the WHATWG standard that would make it sufficient?

On Mar 24, 2020, at 8:02 AM, keld--- via SG16 <sg16@lists.isocpp.org> wrote:

iso 15897 provides actual mappings to iso 10646 in posix compatible charmap format.
names are compatible with iana, built from some of the same sources.
unicode inc. wanted to reinvent the wheel.

Hi, Keld. Actually, this work is based on IBM mapping tables and the customer need to explicitly specify character encoding mappings. We have critical customer data that would be damaged if we only used IANA mappings. The mappings needed aren’t in the 15897 registry.

Thanks, I think this is useful information that the IANA registry is insufficient in practice for known use cases.

Tom.

However, the CDRA/CCSID and ICU converter tables are widely implemented. For one thing, POSIX charmaps supported the substitution controls and multi-way fallback behavior that was needed for some tables. That’s my recollection. Also, many converters are better specified as algorithms than tables.

--
Steven R. Loomis | @srl295 | git.io/srl295