[SG16] P1885: Naming text encodings: Curation and provenance of aliases

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Wed, 8 Sep 2021 12:08:56 -0400
P1885 presents an "extension point" of sorts with its aliases() interface.
In addition to the aliases listed in the IANA character set registry,
implementations may introduce additional strings.

This raises a question:
What strings should implementations consider as candidates for adding?

On the surface, one possible source for additional aliases would be ICU.
Looking into ICU, however, will lead one to notice that:
- It uses the same converter for csShiftJIS and csWindows31J.
- It has a concept of ambiguous aliases.

At least in some cases, these ambiguous aliases in ICU arise because ICU
collects aliases from various sources (different implementations or
environments; ICU calls these "standards" and an alias for a converter may
be "tagged" as being from zero or more standards) and these sources may
associate an alias with subtly different character sets.
More generally, aliases in ICU are not aliases for all purposes (i.e.,
aliases in the identity or strict validation sense), but instead aliases
meeting the design requirements of ICU.
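
To make the idea of a tagged, possibly ambiguous alias concrete, here is a
minimal C++17 sketch. It is a hypothetical model only: the TaggedAlias type,
the ambiguous_aliases function and the field names are invented for
illustration and are not ICU's data structures or API.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model, not ICU's actual representation.
struct TaggedAlias {
    std::string alias;      // the alias string
    std::string source;     // the "standard" (tag) that contributed the alias
    std::string converter;  // the converter that source associates with it
};

// Returns the aliases that different sources associate with different
// converters, i.e. the aliases that are ambiguous across sources.
std::set<std::string> ambiguous_aliases(const std::vector<TaggedAlias>& table) {
    std::map<std::string, std::set<std::string>> converters_by_alias;
    for (const auto& entry : table)
        converters_by_alias[entry.alias].insert(entry.converter);

    std::set<std::string> result;
    for (const auto& [alias, converters] : converters_by_alias)
        if (converters.size() > 1)  // more than one meaning for one string
            result.insert(alias);
    return result;
}
```

An alias that two sources both map to the same converter is not reported;
only a disagreement between sources makes it ambiguous in this model.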

Since P1885 intends to "reliably identify encoding across implementations
and systems", it would seem that, in practice, aliases must come only from
widely-accepted sources that agree with each other.
More practically, under the current design, the bar for adding an alias to
the implementation-defined list should be very high. I do not think that
the paper makes this obvious.
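
One way to picture such a high bar is a curation filter along the following
lines. This is only a sketch under assumptions of my own: the Sources and
SourceTable types, the curate function and the min_sources threshold are
hypothetical and appear neither in P1885 nor in any implementation.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical model: each source maps an alias to the encoding it means.
using SourceTable = std::map<std::string, std::string>;  // alias -> encoding
using Sources = std::map<std::string, SourceTable>;      // source -> table

// Accepts an alias only if at least `min_sources` sources list it and
// every source that lists it associates it with the same encoding.
std::map<std::string, std::string> curate(const Sources& sources,
                                          std::size_t min_sources) {
    struct Tally {
        std::set<std::string> encodings;  // distinct meanings seen
        std::size_t listed_by = 0;        // number of sources listing it
    };
    std::map<std::string, Tally> tallies;
    for (const auto& source : sources) {
        for (const auto& [alias, encoding] : source.second) {
            Tally& tally = tallies[alias];
            tally.encodings.insert(encoding);
            ++tally.listed_by;
        }
    }

    std::map<std::string, std::string> accepted;
    for (const auto& [alias, tally] : tallies)
        if (tally.listed_by >= min_sources && tally.encodings.size() == 1)
            accepted.emplace(alias, *tally.encodings.begin());
    return accepted;
}
```

Under this rule, an alias known to only one source, or one on which sources
disagree, never reaches the implementation-defined list.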

A design that incorporates tags in the style of ICU might be more flexible
(and it may be possible to add this later if the current aliases are
understood to be a curated subset).

As it is, I think it is worthwhile to revisit whether the generality of the
implementation-defined behaviour is advisable. It seems that, as the paper
evolved, at least one implementation-injected alias was meant to be the
system's "preferred name", that is, the name returned or recognized by
various APIs (e.g., iconv_open). Even that is problematic: there is a tendency in
converter applications to treat a de facto "reigning" extension as being
what is meant when the non-extended standard is requested. In highly
architected environments, the csShiftJIS and csWindows31J "problem" that is
present in ICU would manifest as there being only one API-recognized
"preferred name". The present design intent of P1885 in having
non-overlapping sets of aliases is in conflict with the desire to associate
the "preferred name" as an alias in such situations.
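
The conflict can be sketched with a toy registry that enforces
non-overlapping alias sets. The AliasRegistry class and the "SYSTEM-SJIS"
preferred name are hypothetical, made up for illustration; only csShiftJIS
and csWindows31J are real registry names. The sketch assumes C++17.

```cpp
#include <map>
#include <string>

// Hypothetical registry in which each alias may belong to one encoding only.
class AliasRegistry {
public:
    // Returns true if the alias now belongs to `encoding`; returns false,
    // changing nothing, if it already belongs to a different encoding.
    bool add_alias(const std::string& encoding, const std::string& alias) {
        auto [it, inserted] = owner_.try_emplace(alias, encoding);
        return inserted || it->second == encoding;
    }

    // Returns the encoding that owns the alias, or an empty string if none.
    std::string lookup(const std::string& alias) const {
        auto it = owner_.find(alias);
        return it == owner_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> owner_;  // alias -> owning encoding
};
```

If a system treats both csShiftJIS and csWindows31J as "SYSTEM-SJIS", the
first add_alias call succeeds and the second is rejected, so the preferred
name can be attached to only one of the two encodings.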

Received on 2021-09-08 11:09:27