Hi, Tom.
 > Can anyone offer an explanation for these conflicts?

I’m not sure what kind of explanation you are looking for, so my apologies in advance if this is obvious or doesn’t answer your question. These conflicts basically show why the “provider” tag mechanism was added to ICU, and why the IBM CDRA naming scheme uses verbose names such as “ibm-33722_P120-1999”. Put plainly, the strings “Shift_JIS” and “TIS-620” (and, from another thread, I could add GB18030) are insufficient to select an actual encoding, because conflicting implementations exist. The tags disambiguate these cases: if you know the “user” is working in a Java context, or with IANA names (Content-Type), or with a Windows codepage, then a different encoding may be preferred.
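
To make the tag idea concrete, here is a minimal sketch in Python. It is not ICU's API: the table `TAGGED_ALIASES` and the function `resolve` are hypothetical names, the entries are copied from rows of the audit table quoted below, and the fallback to the untagged entry is my own assumption about a reasonable policy.

```python
# Hypothetical (alias, provider) -> ICU converter table, copied from rows of
# the audit table quoted in this thread. None stands for an untagged alias.
TAGGED_ALIASES = {
    ("cp932", "Windows"): "ibm-943_P15A-2003",
    ("cp932", None): "ibm-942_P12A-1999",
    ("windows-950", "Windows"): "windows-950-2000",
    ("windows-950", None): "ibm-1373_P100-2002",
    ("KS_C_5601-1987", "IANA"): "ibm-1363_P11B-1998",
    ("KS_C_5601-1987", "Java"): "ibm-970_P110_P110-2006_U2",
    ("KS_C_5601-1987", "Windows"): "windows-949-2000",
}


def resolve(alias, provider=None):
    """Pick a converter for alias, preferring the entry tagged with provider.

    Falling back to the untagged entry is an assumption of this sketch.
    Returns None when neither a tagged nor an untagged entry exists.
    """
    tagged = TAGGED_ALIASES.get((alias, provider))
    if tagged is not None:
        return tagged
    return TAGGED_ALIASES.get((alias, None))


if __name__ == "__main__":
    # The same alias selects different converters depending on the provider.
    for provider in ("IANA", "Java", "Windows"):
        print(provider, "->", resolve("KS_C_5601-1987", provider))
```

The point is only that the alias string alone is not a usable key; the pair of alias and provider is.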

If you are curious about very specific differences, such as the family of TIS-620 differences, please see the comments in the convrtrs.txt file itself. These resulted from painstaking comparisons and equally painful user experience problems.

https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/convrtrs.txt

Hope this helps

-s

PS: I have not reviewed P1885 in detail (will print it off and take a look), but I wouldn’t consider IANA as *the* primary source for ICU.  IBM CDRA is the primary source for ICU.

From P1885:
> The intent of this proposal is that the names refer to the character encodings as described by IANA…
> This blanket wording allows an implementation to offer a behavior that matches existing expectations.
> … “Shift-JIS” refers to several “slightly different” encodings.

It’s great to try to improve things in this space, but allowing for slightly different behavior may not move things forward.
The WHATWG Encoding Standard <https://encoding.spec.whatwg.org> actually specifies the behavior of, for example, Shift-JIS.
I’d recommend actually locking down behavior to something such as ICU behavior.

There was some discussion of an XML format for charsets, which could lead to some kind of a common repository for encodings. That effort is not active.


On Nov 15, 2021, at 6:20 PM, Tom Honermann via SG16 <sg16@lists.isocpp.org> wrote:

I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885 would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. However, I found that the same alias is used for different encodings in multiple cases as described in the table below. These can be verified with ICU Converter Explorer.
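
For readers who have not looked at the loose matching rules, here is a rough sketch in Python. It assumes COMP_NAME follows the charset alias matching rules of UTS #22 (drop everything except ASCII letters and digits, fold case, and drop each 0 that is not preceded by a digit); the exact wording in a given revision of P1885 may differ, and the function name `comp_name` is just a label for this example.

```python
def comp_name(name: str) -> str:
    """Return a loose-matching key in the style of UTS #22 alias matching.

    This is a sketch under the assumption stated above, not the normative
    P1885 wording.
    """
    kept = []
    for ch in name:
        if not (ch.isascii() and ch.isalnum()):
            continue  # rule 1: keep only ASCII letters and digits
        ch = ch.lower()  # rule 2: fold case
        if ch == "0" and not (kept and kept[-1].isdigit()):
            continue  # rule 3: drop a 0 that does not follow a kept digit
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    for alias in ("Shift_JIS", "shift-jis", "TIS-620", "UTF-8", "u.t.f-008"):
        print(alias, "->", comp_name(alias))
```

Two names are treated as the same alias when their keys are equal, so "Shift_JIS" and "shift-jis" match while "utf-80" and "UTF-8" do not.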

I did not scrape the ICU Converter Explorer page to perform the audit. The data I worked off of was produced with ICU 70.1 by running uconv -l --canon and then massaging the output.
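
Once the data is reduced to (converter, alias, provider) triples, the audit itself amounts to grouping by alias and reporting any alias that reaches more than one converter. The following Python sketch illustrates that step only; `SAMPLE` is a small hand-copied subset of the table below, `find_conflicts` is a hypothetical helper, and the default `str.lower` key is a stand-in for whatever loose-matching normalisation is actually used.

```python
from collections import defaultdict

# (ICU converter, alias, provider) triples: a hand-copied sample, not the full data.
SAMPLE = [
    ("ibm-943_P15A-2003", "cp932", "Windows"),
    ("ibm-942_P12A-1999", "cp932", "Untagged"),
    ("ibm-33722_P120-1999", "ibm-5050", "IBM"),
    ("ibm-33722_P12A_P12A-2009_U2", "ibm-5050", "Untagged"),
    ("windows-950-2000", "windows-950", "Windows"),
    ("ibm-1373_P100-2002", "windows-950", "Untagged"),
]


def find_conflicts(triples, key=str.lower):
    """Map each normalised alias to its converters when there is more than one."""
    converters_by_alias = defaultdict(set)
    for converter, alias, _provider in triples:
        converters_by_alias[key(alias)].add(converter)
    return {
        alias: sorted(converters)
        for alias, converters in converters_by_alias.items()
        if len(converters) > 1
    }


if __name__ == "__main__":
    for alias, converters in find_conflicts(SAMPLE).items():
        print(alias, "->", ", ".join(converters))
```

The provider is ignored when grouping on purpose: the question being asked is whether the bare alias, without its tag, identifies a single converter.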

Each row of the table describes a conflict between two ICU encodings, which are named in the leftmost and rightmost columns respectively. The inner columns list the specific aliases that conflict and which provider they correspond to.

For at least some of these, one has to wonder if the ICU data is simply incorrect. Cases that only involve a conflict with an untagged alias are illustrated in gray so that the others stand out.

Can anyone offer an explanation for these conflicts? Do these reflect defects in ICU (particularly for the cases where the untagged aliases disagree with the tagged ones)?

ICU encoding
  Encoding alias (provider)
  Encoding alias (provider)
ICU encoding

ibm-943_P15A-2003
  cp932 (Windows)
  cp932 (Untagged)
ibm-942_P12A-1999

ibm-943_P130-1999
  ibm-943 (IBM)
  ibm-943 (Java)
  ibm-943 (Untagged)
ibm-943_P15A-2003

ibm-943_P130-1999
  Shift_JIS (Untagged)
  Shift_JIS (Windows)
  Shift_JIS (Java)
  Shift_JIS (IANA)
  Shift_JIS (MIME)
ibm-943_P15A-2003

ibm-33722_P120-1999
  ibm-33722 (IBM)
  ibm-33722 (Java)
  ibm-33722 (Untagged)
ibm-33722_P12A_P12A-2009_U2

ibm-33722_P120-1999
  ibm-5050 (IBM)
  ibm-5050 (Untagged)
ibm-33722_P12A_P12A-2009_U2

windows-950-2000
  windows-950 (Windows)
  windows-950 (Untagged)
ibm-1373_P100-2002

ibm-5471_P100-2006
  Big5-HKSCS (Untagged)
  Big5-HKSCS (Java)
  Big5-HKSCS (IANA)
ibm-1375_P100-2008

windows-936-2000
  windows-936 (Windows)
  windows-936 (Java)
  windows-936 (IANA)
  windows-936 (Untagged)
ibm-1386_P100-2001

ibm-949_P11A-1999
  ibm-949 (Untagged)
  ibm-949 (IBM)
  ibm-949 (Java)
ibm-949_P110-1999

ibm-1363_P11B-1998
  KS_C_5601-1987 (IANA)
  KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P11B-1998
  KSC_5601 (IANA)
  KSC_5601 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P11B-1998
  5601 (Untagged)
  5601 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P110-1997
  ibm-1363 (IBM)
  ibm-1363 (Untagged)
ibm-1363_P11B-1998

windows-949-2000
  windows-949 (Windows)
  windows-949 (Java)
  windows-949 (Untagged)
ibm-1363_P11B-1998

windows-949-2000
  KS_C_5601-1987 (Windows)
  KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2

windows-949-2000
  KS_C_5601-1989 (Windows)
  KS_C_5601-1989 (IANA)
ibm-1363_P11B-1998

windows-949-2000
  KSC_5601 (Windows)
  KSC_5601 (MIME)
  KSC_5601 (Java)
ibm-970_P110_P110-2006_U2

windows-949-2000
  csKSC56011987 (Windows)
  csKSC56011987 (IANA)
ibm-1363_P11B-1998

windows-949-2000
  korean (Windows)
  korean (IANA)
ibm-1363_P11B-1998

windows-949-2000
  iso-ir-149 (Windows)
  iso-ir-149 (IANA)
ibm-1363_P11B-1998

ibm-874_P100-1995
  TIS-620 (Java)
  TIS-620 (IANA)
  TIS-620 (Windows)
windows-874-2000

ibm-1250_P100-1995
  windows-1250 (Untagged)
  windows-1250 (Windows)
  windows-1250 (Java)
  windows-1250 (IANA)
ibm-5346_P100-1998

ibm-1251_P100-1995
  windows-1251 (Untagged)
  windows-1251 (Windows)
  windows-1251 (Java)
  windows-1251 (IANA)
ibm-5347_P100-1998

ibm-1252_P100-2000
  windows-1252 (Untagged)
  windows-1252 (Windows)
  windows-1252 (Java)
  windows-1252 (IANA)
ibm-5348_P100-1997

ibm-1253_P100-1995
  windows-1253 (Untagged)
  windows-1253 (Windows)
  windows-1253 (Java)
  windows-1253 (IANA)
ibm-5349_P100-1998

ibm-1254_P100-1995
  windows-1254 (Untagged)
  windows-1254 (Windows)
  windows-1254 (Java)
  windows-1254 (IANA)
ibm-5350_P100-1998

ibm-5351_P100-1998
  windows-1255 (Untagged)
  windows-1255 (Windows)
  windows-1255 (Java)
  windows-1255 (IANA)
ibm-9447_P100-2002

ibm-5352_P100-1998
  windows-1256 (Untagged)
  windows-1256 (Windows)
  windows-1256 (Java)
  windows-1256 (IANA)
ibm-9448_X100-2005

ibm-5353_P100-1998
  windows-1257 (Untagged)
  windows-1257 (Windows)
  windows-1257 (Java)
  windows-1257 (IANA)
ibm-9449_P100-2002

ibm-1258_P100-1997
  windows-1258 (Untagged)
  windows-1258 (Windows)
  windows-1258 (Java)
  windows-1258 (IANA)
ibm-5354_P100-1998

Tom.

--
SG16 mailing list
SG16@lists.isocpp.org
https://lists.isocpp.org/mailman/listinfo.cgi/sg16