[SG16] P1885: Naming text encodings: Encodings in the environment versus registered character sets

From: Hubert Tong <hubert.reinterpretcast_at_[hidden]>
Date: Mon, 13 Sep 2021 01:44:04 -0400
In P1885, a registered character set is one that is in (at the point when
the paper was written) the IANA character set registry. P1885 also provides
static functions to query the encoding used in either the translation
or the execution environment. In some cases (involving subsets or
supersets), there are questions of when an implementation should return a
registered character set as the result of such static functions.

The environment-implements-superset case presents itself in relation to
csBig5. The system encodings for "big5" on Windows and AIX contain
characters that are not part of the common base of Big5; however, both are
also missing characters from Big5-2003:
- Big5-2003 has U+7881 as F9 D6 and U+2460 as C6 A1.
- Windows has U+7881 as F9 D6 but not U+2460 as C6 A1.
- AIX does not have U+7881 as F9 D6 but does have U+2460 as C6 A1.
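
To make the divergence concrete, the two data points above can be written down as a small C++ sketch. The `Variant` type, its field names, and the helper `probes_subset_of` are hypothetical names used only for illustration; they are not part of P1885, and the sketch records only the two mappings quoted here, not the full conversion tables.

```cpp
// Which of the two probe mappings from the message an implementation provides.
// Hypothetical type for illustration; it says nothing about the rest of each table.
struct Variant {
    const char* name;
    bool maps_f9d6_to_u7881;  // byte sequence F9 D6 <-> U+7881
    bool maps_c6a1_to_u2460;  // byte sequence C6 A1 <-> U+2460
};

constexpr Variant kBig5_2003{"Big5-2003", true, true};
constexpr Variant kWindows{"Windows big5", true, false};
constexpr Variant kAix{"AIX big5", false, true};

// True if every probe mapping that `a` provides is also provided by `b`.
constexpr bool probes_subset_of(const Variant& a, const Variant& b) {
    return (!a.maps_f9d6_to_u7881 || b.maps_f9d6_to_u7881) &&
           (!a.maps_c6a1_to_u2460 || b.maps_c6a1_to_u2460);
}

// Neither system encoding covers the other on these probes.
static_assert(!probes_subset_of(kWindows, kAix), "Windows is not covered by AIX");
static_assert(!probes_subset_of(kAix, kWindows), "AIX is not covered by Windows");
```

On these two probes alone, each system encoding is covered by Big5-2003 but not by the other system encoding, and Big5-2003 is covered by neither. Text that round-trips through one "big5" may therefore not round-trip through another.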

So, the environment-implements-superset case can, in practical terms, be
generalized as being about divergent implementations of "charsets".
Of course, that generalization could also account for some
environment-implements-subset cases; however, in addition to more mundane
reasons, the environment-implements-subset case also arises from a
technicality: It is questionable whether or not a POSIX environment that
uses a UTF-8 encoding paired with a 2-byte (UCS-2) wchar_t can be said to
have UTF-8 as the environment text encoding because the characters outside
of the BMP cannot (based on wchar_t-representability) be considered members
of the character set associated with the environment.
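
The wchar_t-representability concern can be sketched in a few lines of C++. The helpers `utf8_length`, `fits_in_unit`, and `fits_in_wchar_t` are hypothetical names introduced for illustration, and the sketch assumes that "representable" means "fits in a single wchar_t code unit".

```cpp
#include <climits>

// Number of bytes UTF-8 uses to encode a Unicode scalar value.
constexpr unsigned utf8_length(char32_t cp) {
    return cp < 0x80 ? 1u : cp < 0x800 ? 2u : cp < 0x10000 ? 3u : 4u;
}

// True if `cp` fits in one code unit that is `unit_bits` wide.
// Assumes `cp` is a valid scalar value (at most U+10FFFF, which needs 21 bits).
constexpr bool fits_in_unit(char32_t cp, unsigned unit_bits) {
    return unit_bits >= 21 || cp < (char32_t{1} << unit_bits);
}

// The same check against this platform's wchar_t.
constexpr bool fits_in_wchar_t(char32_t cp) {
    return fits_in_unit(cp, static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT));
}

// U+1F600 lies outside the BMP: valid in UTF-8, but too large for a 16-bit unit.
static_assert(utf8_length(U'\U0001F600') == 4, "encoded as four UTF-8 bytes");
static_assert(!fits_in_unit(U'\U0001F600', 16), "does not fit a 2-byte wchar_t");
```

Every four-byte UTF-8 sequence encodes a code point at or above U+10000, which is exactly the set that a 16-bit, single-unit wchar_t cannot hold. Under the assumption above, such an environment accepts byte sequences whose characters have no wchar_t value.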

So it seems we have some questions:
1. Are the design goals better met by allowing divergent implementations
   of "charsets" to be identified as being the same registered character
   set?
2. When an implementation indicates a specific environment encoding, do the
   design goals require that all associated characters or members of the
   associated code space be wchar_t-representable?

It may be useful to characterize the questions as whether the results of the
static functions are meant to be more of a hint (with few guarantees) or
more of a promise.

Received on 2021-09-13 00:44:36