SG16

Subject: Re: UK national body concerns about P1885R1 'Naming Text Encodings to Demystify Them'
From: Steven R. Loomis (srl295_at_[hidden])
Date: 2020-03-25 14:41:36


> On Mar 24, 2020, at 8:42 AM, Corentin <corentin.jabot_at_[hidden]> wrote:
> On Tue, 24 Mar 2020 at 15:42, Steven R. Loomis <srl295_at_[hidden]> wrote:
> Corentin,
> Please see some of the work done in ICU on encodings.
>
> In particular, IANA does not specify the actual mapping. So we have found the IANA names insufficient to distinguish two actual encodings; shift_jis is an example. Comment and datafile:
> https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt#L93
>
> So while IANA names are widely used from a spec point of view, in practice there are many, many challenges with their use in implementation.
>
> This proposal is solely about names and not encoding conversion facilities

I understand, but that is exactly how we get into compatibility problems today. I mentioned Shift_JIS <https://en.wikipedia.org/wiki/Shift_JIS#Multiple_versions> - standard name, incompatible implementations. There are many other issues which are visible from the mapping table, where an IANA name alone is not sufficient.

Giving only names without specifying encoding conversion is less than helpful, indeed harmful.
We know there are incompatibilities. Why give a false sense of security for something that’s clearly underspecified?
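
A minimal sketch of the Shift_JIS point, using Python's bundled codecs as stand-ins (an assumption for illustration; neither ICU nor the proposal is involved here). Python's `shift_jis` and `cp932` codecs are both tables that get labeled "Shift_JIS" in practice, yet they decode the same two bytes to different code points:

```python
# The same byte sequence decoded by two tables that both travel under
# the name "Shift_JIS" in the wild.
data = b"\x81\x60"

as_shift_jis = data.decode("shift_jis")  # Python's JIS X 0208-based table
as_cp932 = data.decode("cp932")          # Microsoft's Windows code page 932 variant

# Print the code point each table produced, then whether they agree.
print(f"shift_jis -> U+{ord(as_shift_jis):04X}")
print(f"cp932     -> U+{ord(as_cp932):04X}")
print("same text?", as_shift_jis == as_cp932)
```

One table yields WAVE DASH and the other FULLWIDTH TILDE, so a round trip through the "wrong" Shift_JIS silently changes the text even though both sides agreed on the name.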

At this point in history, I would recommend using the WHATWG names and behaviors exactly. Anything further requires a specific repository of mappings and behaviors. Perhaps there could be a namespaced use, such as “icu:ibm-1251_P100-1995” or “15897:ISO-8859-1” which precisely specifies one table.
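
To make the namespaced idea concrete, here is a rough sketch of how such a label might be split into a registry and a table name. The `EncodingId` type and `parse_encoding_id` function are hypothetical names invented for this example; no existing library or proposal defines them, and the choice of ":" as separator simply follows the two examples above.

```python
from typing import NamedTuple, Optional


class EncodingId(NamedTuple):
    """Hypothetical pair: which registry defines the table, and its name there."""
    registry: Optional[str]  # e.g. "icu" or "15897"; None for a bare name
    name: str


def parse_encoding_id(label: str) -> EncodingId:
    """Split 'registry:name'; a label without a registry is returned as a bare name."""
    registry, sep, name = label.partition(":")
    if sep and registry and name:
        # Registry prefixes are compared case-insensitively in this sketch.
        return EncodingId(registry.lower(), name)
    return EncodingId(None, label)


print(parse_encoding_id("icu:ibm-1251_P100-1995"))
print(parse_encoding_id("15897:ISO-8859-1"))
print(parse_encoding_id("shift_jis"))
```

A bare name such as `shift_jis` comes back with no registry, which is exactly the underspecified case: the caller still has to decide which table it means.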

> On Mar 24, 2020, at 8:02 AM, keld--- via SG16 <sg16_at_[hidden]> wrote:
>
> iso 15897 provides actual mappings to iso 10646 in posix compatible charmap format.
> names are compatible with iana, built from some of the same sources.
> unicode inc. wanted to reinvent the wheel.

Hi, Keld. Actually, this work is based on IBM mapping tables and the customer need to explicitly specify character encoding mappings. We have critical customer data that would be damaged if we only used IANA mappings. The mappings needed aren’t in the 15897 registry.
However, the CDRA/CCSID and ICU converter tables are widely implemented. For one thing, POSIX charmaps supported the substitution controls and multi-way fallback behavior that was needed for some tables. That’s my recollection. Also, many converters are better specified as algorithms than tables.
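
As one illustration of a converter that is more naturally an algorithm than a table, UTF-8 encoding can be written in a few lines of bit arithmetic. This is only a sketch; `utf8_encode_code_point` is a name made up for this example and is not taken from ICU or any registry.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using bit arithmetic only."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in encoder for a few code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode_code_point(cp) == chr(cp).encode("utf-8")
```

A mapping table covering the same range would need more than a million entries, whereas the algorithm is fully specified by four cases.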

--
Steven R. Loomis | @srl295 | git.io/srl295


SG16 list run by herb.sutter at gmail.com