Date: Tue, 16 Nov 2021 16:33:33 -0600
Hi, Tom.
> Can anyone offer an explanation for these conflicts?
I’m not sure what kind of explanation you are looking for, so my apologies in advance if this is obvious or doesn’t answer your question. These conflicts basically show why the “provider” tag mechanism was added to ICU, and why the IBM CDRA naming scheme uses verbose names such as “ibm-33722_P120-1999”. In other words, the strings “Shift_JIS” and “TIS-620” (and, from another thread, I could add GB18030) are insufficient to select an actual encoding, because conflicting implementations exist. The tags are useful to disambiguate these cases: if you know the “user” is working in a Java context, or with IANA names (Content-Type), or with a Windows codepage, then a different encoding may be preferred.
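As a rough illustration of the idea (a toy model, not ICU's implementation), the lookup can be thought of as keyed on the pair (alias, provider) rather than on the alias alone. The sample entries below are taken from rows of Tom's table (quoted below) that list exactly two aliases, and they assume that the first alias in such a row belongs to the left-hand converter and the second to the right-hand one. The function names and the use of `None` for "untagged" are made up for this sketch.

```python
# Toy model of provider-tagged alias lookup; this is NOT how ICU stores its data.
# Assumption: in each two-alias row of the audit table, the first alias belongs
# to the left-hand converter and the second alias to the right-hand converter.
ALIASES = {
    ("cp932", "WINDOWS"): "ibm-943_P15A-2003",
    ("cp932", None): "ibm-942_P12A-1999",        # None stands for "untagged"
    ("windows-950", "WINDOWS"): "windows-950-2000",
    ("windows-950", None): "ibm-1373_P100-2002",
    ("KS_C_5601-1989", "WINDOWS"): "windows-949-2000",
    ("KS_C_5601-1989", "IANA"): "ibm-1363_P11B-1998",
}


def candidates(alias):
    """Return every converter that some provider associates with this alias."""
    return {conv for (name, _provider), conv in ALIASES.items() if name == alias}


def is_ambiguous(alias):
    """True if the bare alias is not enough to pick a single converter."""
    return len(candidates(alias)) > 1


def resolve(alias, provider=None):
    """Look up an alias in one provider's context; None if there is no entry."""
    return ALIASES.get((alias, provider))


if __name__ == "__main__":
    for name in ("cp932", "windows-950", "KS_C_5601-1989"):
        print(name, "ambiguous:", is_ambiguous(name), sorted(candidates(name)))
    print(resolve("cp932", "WINDOWS"))
```

The point of the sketch is only that the bare string is ambiguous while the (alias, provider) pair is not. In ICU itself, the corresponding query is `ucnv_getCanonicalName()`, which takes the alias together with a standard name such as "IANA" or "WINDOWS".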
If you are curious about very specific differences, such as the family of TIS-620 differences, please see the comments in the convrtrs.txt file itself. These resulted from painstaking comparisons and equally painful user experience problems.
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt
Hope this helps
-s
PS: I have not reviewed P1885 in detail (will print it off and take a look), but I wouldn’t consider IANA as *the* primary source for ICU. IBM CDRA is the primary source for ICU.
From P1885:
> The intent of this proposal is that the names refer to the character encodings as described by IANA…
> This blanket wording allows an implementation to offer a behavior that matches existing expectations.
> … "Shift-JIS” refers to several “slightly different” encodings.
It’s great to try to improve things in this space, but allowing for slightly different behavior may not move things forward.
The WHATWG Encoding Standard <https://encoding.spec.whatwg.org> actually specifies the behavior of, for example, Shift-JIS.
I’d recommend actually locking down behavior to something such as ICU behavior.
There was some discussion of an XML format for charsets, which could lead to some kind of a common repository for encodings. That effort is not active.
> On Nov 15, 2021, at 6:20 PM, Tom Honermann via SG16 <sg16_at_[hidden]> wrote:
>
> I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885 <https://wg21.link/p1885> would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. However, I found that the same alias is used for different encodings in multiple cases as described in the table below. These can be verified with ICU Converter Explorer <https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.
>
> I did not scrape the ICU Converter Explorer page to perform the audit. The data I worked off of was produced with ICU 70.1 by running uconv -l --canon and then massaging the output.
>
> Each row of the table describes a conflict between two ICU encodings, which are named in the leftmost and rightmost columns respectively. The inner columns list the specific aliases that conflict and which provider they correspond to.
>
> For at least some of these, one has to wonder if the ICU data is simply incorrect. Cases that only involve a conflict with an untagged alias are illustrated in gray so that the others stand out.
>
> Can anyone offer an explanation for these conflicts? Do these reflect defects in ICU (particularly for the cases where the untagged aliases disagree)?
>
> ICU encoding | Conflicting encoding aliases (provider) | ICU encoding
> ibm-943_P15A-2003 | cp932 (Windows); cp932 (Untagged) | ibm-942_P12A-1999
> ibm-943_P130-1999 | ibm-943 (IBM); ibm-943 (Java); ibm-943 (Untagged) | ibm-943_P15A-2003
> ibm-943_P130-1999 | Shift_JIS (Untagged); Shift_JIS (Windows); Shift_JIS (Java); Shift_JIS (IANA); Shift_JIS (MIME) | ibm-943_P15A-2003
> ibm-33722_P120-1999 | ibm-33722 (IBM); ibm-33722 (Java); ibm-33722 (Untagged) | ibm-33722_P12A_P12A-2009_U2
> ibm-33722_P120-1999 | ibm-5050 (IBM); ibm-5050 (Untagged) | ibm-33722_P12A_P12A-2009_U2
> windows-950-2000 | windows-950 (Windows); windows-950 (Untagged) | ibm-1373_P100-2002
> ibm-5471_P100-2006 | Big5-HKSCS (Untagged); Big5-HKSCS (Java); Big5-HKSCS (IANA) | ibm-1375_P100-2008
> windows-936-2000 | windows-936 (Windows); windows-936 (Java); windows-936 (IANA); windows-936 (Untagged) | ibm-1386_P100-2001
> ibm-949_P11A-1999 | ibm-949 (Untagged); ibm-949 (IBM); ibm-949 (Java) | ibm-949_P110-1999
> ibm-1363_P11B-1998 | KS_C_5601-1987 (IANA); KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
> ibm-1363_P11B-1998 | KSC_5601 (IANA); KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
> ibm-1363_P11B-1998 | 5601 (Untagged); 5601 (Java) | ibm-970_P110_P110-2006_U2
> ibm-1363_P110-1997 | ibm-1363 (IBM); ibm-1363 (Untagged) | ibm-1363_P11B-1998
> windows-949-2000 | windows-949 (Windows); windows-949 (Java); windows-949 (Untagged) | ibm-1363_P11B-1998
> windows-949-2000 | KS_C_5601-1987 (Windows); KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
> windows-949-2000 | KS_C_5601-1989 (Windows); KS_C_5601-1989 (IANA) | ibm-1363_P11B-1998
> windows-949-2000 | KSC_5601 (Windows); KSC_5601 (MIME); KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
> windows-949-2000 | csKSC56011987 (Windows); csKSC56011987 (IANA) | ibm-1363_P11B-1998
> windows-949-2000 | korean (Windows); korean (IANA) | ibm-1363_P11B-1998
> windows-949-2000 | iso-ir-149 (Windows); iso-ir-149 (IANA) | ibm-1363_P11B-1998
> ibm-874_P100-1995 | TIS-620 (Java); TIS-620 (IANA); TIS-620 (Windows) | windows-874-2000
> ibm-1250_P100-1995 | windows-1250 (Untagged); windows-1250 (Windows); windows-1250 (Java); windows-1250 (IANA) | ibm-5346_P100-1998
> ibm-1251_P100-1995 | windows-1251 (Untagged); windows-1251 (Windows); windows-1251 (Java); windows-1251 (IANA) | ibm-5347_P100-1998
> ibm-1252_P100-2000 | windows-1252 (Untagged); windows-1252 (Windows); windows-1252 (Java); windows-1252 (IANA) | ibm-5348_P100-1997
> ibm-1253_P100-1995 | windows-1253 (Untagged); windows-1253 (Windows); windows-1253 (Java); windows-1253 (IANA) | ibm-5349_P100-1998
> ibm-1254_P100-1995 | windows-1254 (Untagged); windows-1254 (Windows); windows-1254 (Java); windows-1254 (IANA) | ibm-5350_P100-1998
> ibm-5351_P100-1998 | windows-1255 (Untagged); windows-1255 (Windows); windows-1255 (Java); windows-1255 (IANA) | ibm-9447_P100-2002
> ibm-5352_P100-1998 | windows-1256 (Untagged); windows-1256 (Windows); windows-1256 (Java); windows-1256 (IANA) | ibm-9448_X100-2005
> ibm-5353_P100-1998 | windows-1257 (Untagged); windows-1257 (Windows); windows-1257 (Java); windows-1257 (IANA) | ibm-9449_P100-2002
> ibm-1258_P100-1997 | windows-1258 (Untagged); windows-1258 (Windows); windows-1258 (Java); windows-1258 (IANA) | ibm-5354_P100-1998
>
> Tom.
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
Received on 2021-11-16 16:33:37