SG16

Subject: Re: UK national body concerns about P1885R1 'Naming Text Encodings to Demystify Them'
From: Corentin (corentin.jabot_at_[hidden])
Date: 2020-03-26 11:41:41


On Thu, 26 Mar 2020 at 16:46, Tom Honermann <tom_at_[hidden]> wrote:

> On 3/25/20 3:41 PM, Steven R. Loomis via SG16 wrote:
>
>
> On Mar 24, 2020, at 8:42 AM, Corentin <corentin.jabot_at_[hidden]>
> wrote:
> On Tue, 24 Mar 2020 at 15:42, Steven R. Loomis <srl295_at_[hidden]> wrote:
>
>> Corentin,
>> Please see some of the work done in ICU on encodings.
>>
>> In particular, IANA does not specify the actual mapping. So we have found
>> the IANA names insufficient to distinguish two actual encodings; shift_jis
>> is an example. Comment and datafile:
>>
>> https://github.com/unicode-org/icu/blob/master/icu4c/source/data/mappings/convrtrs.txt#L93
>>
>> So while IANA names are widely used from a spec point of view, in
>> practice there are many, many challenges with their use in implementation.
>>
>
> This proposal is solely about names and not encoding conversion facilities
>
>
> I understand, but that is exactly how we get into compatibility problems
> today. I mentioned Shift_Jis <
> https://en.wikipedia.org/wiki/Shift_JIS#Multiple_versions> - standard
> name, incompatible implementations. There are many other issues which are
> visible from the mapping table, where an IANA name alone is not sufficient.
>
> It seems that Big5 suffers a similar issue. If my research is correct,
> IANA recognizes Big5 and Big5-HKSCS, but the Big5 variant in the Encoding
> Standard is a merged version of them that is not a superset of either.
>
>
> Giving only names without specifying encoding conversion is less than
> helpful, indeed harmful.
> We know there are incompatibilities. Why give a false sense of security
> for something that’s clearly underspecified?
>
> The initial motivation for this feature was to allow a C++ implementation
> to communicate to a program the encoding used to encode character and
> string literals, the encoding used by the system, and the locale dependent
> encoding used by the C and C++ standard libraries. This goal can't be
> accomplished without some encoding name or identifier. One of the goals
> was to enable this identifier to be used in order to select a (compatible)
> encoding when interoperating with iconv, ICU, Windows APIs, etc...
>

+1

> It sounds like your perspective is that such goals should be accomplished
> in some other way. For example, by having the implementation provide a
> codec rather than an identifier; ideally a codec that could be used in
> interaction with iconv, ICU, etc... (though this would clearly require
> enhancements to those code bases).
>
Not being tied to an encoder is key to that proposal, both because we are
trying to solve the black-box problem that the C functions have, and because
this is intended to be low cost and freestanding.

> Do you have other suggestions for how to think about this?
>
>
> At this point in history, I would recommend using the WHATWG names and
> behaviors exactly. Anything further requires a specific repository of
> mappings and behaviors. Perhaps there could be a namespaced use, such as
> “icu:ibm-1251_P100-1995” or “15897:ISO-8859-1” which precisely specifies
> one table.
>
> An implementation could return icu:ibm-1251_P100-1995

> WHATWG specifies a more limited set of encodings than IANA does. I'm not
> sure how to square this comment with your later one stating that the IANA
> mappings are insufficient. If IANA is insufficient, what is it about the
> WHATWG standard that would make it sufficient?
>

In particular, WHATWG doesn't begin to cover the set of encodings supported by
compilers and systems.

On Mar 24, 2020, at 8:02 AM, keld--- via SG16 <sg16_at_[hidden]>
> wrote:
>
> ISO 15897 provides actual mappings to ISO 10646 in POSIX-compatible
> charmap format.
> Names are compatible with IANA, built from some of the same sources.
> Unicode Inc. wanted to reinvent the wheel.
>
>
> Hi, Keld. Actually, this work is based on IBM mapping tables and the
> customer need to explicitly specify character encoding mappings. We have
> critical customer data that would be damaged if we only used IANA mappings.
> The mappings needed aren’t in the 15897 registry.
>
> Thanks, I think this is useful information that the IANA registry is
> insufficient in practice for known use cases.
>
> Tom.
>
> However, the CDRA/CCSID and ICU converter tables are widely implemented.
> For one thing, POSIX charmaps supported the substitution controls and
> multi-way fallback behavior that was needed for some tables. That’s my
> recollection. Also, many converters are better specified as algorithms
> than tables.
>
> --
> Steven R. Loomis | @srl295 | git.io/srl295
>
>
>
>



SG16 list run by herb.sutter at gmail.com