The following are questions/concerns that came up during the
various SG16 reviews of P1885 that are not strongly
SG16-related and are therefore being delegated to LEWG.
Minutes for prior SG16 reviews of P1885, in chronological order,
are available at:
Additional archived email discussion can be found at:
Questions raised include:
- The text_encoding type represents an encoding name
and/or identifier as opposed to a type that provides encoding
services. Should the name more strongly reflect that intended
use as a name/identifier?
- The id and mib() members of text_encoding
correspond to IANA-specific values and terms. It is
conceivable that mappings to a different/additional registry
could be desired at some time in the future. Should these
names more strongly reflect their IANA association?
- The enumerators of text_encoding::id were obtained
by, for each IANA registered encoding, taking the "cs"
prefixed alias name (of which there is always exactly one),
and dropping the "cs" prefix. A special change was then made
to rename the one that would have been "Unicode" to "UCS2".
Many of the resulting names consist of only capital letters
and may be mistaken for macros. Are these names ok? Or do
they intrude too much on the namespace of user identifiers?
- The literal() and wide_literal() members
of text_encoding return names for what the standard
calls the execution character set and the execution
wide-character set. Are these names ok? (SG16 has discussed
updating the terminology used within the standard, but has not yet
forwarded a paper containing such a proposal.)
- The system() and wide_system() members
of text_encoding return names for the locale-sensitive
run-time encoding that was active at the start of
the process (e.g., before any calls to setlocale()).
Are these names ok? On Windows, system() would
return an encoding corresponding to GetACP().
- The text_encoding type, if implemented as shown
with the exposition data members, would have a minimum size of
68 bytes. SG16 has discussed future use of this type as a tag
type or non-type template parameter to select an encoding at
compile time. Does the size of the type raise any concerns
for such use?
- The max_name_length member of text_encoding
is specified with a length of 63 (not including a string
terminator). The IANA
character set registry introductory text states that
"The character set names may be up to 40 characters taken from
the printable characters of US-ASCII". Should this length be
adjusted to match or should the current length be retained?
Additional encoding names known to ICU that are not registered
with the IANA registry can be browsed with ICU's
Converter Explorer. The longest name there (which
includes IANA names) appears to be 27 characters.
- The proposed design exposes a library solution that is not
accessible to the preprocessor. Is LEWG ok with the (wide)
execution character set continuing to be unknown for
preprocessor directives? (A patch
accepted for GCC 11 to provide this information for the
purposes of implementing this feature will expose the names of
these encodings as string literals via new __GNUC_EXECUTION_CHARSET_NAME
and __GNUC_WIDE_EXECUTION_CHARSET_NAME predefined macros.)
- The interface allows implementations to extend the set of
recognized encodings beyond those registered with IANA in a
way that permits those additional implementation-known
encodings to have associated aliases (e.g., the implementation
could use negative values for additional text_encoding::id
enumerators; RFC 3808 states only positive values will be used).
However, similar extension is not possible for user code (user
code can construct instances of text_encoding with
unrecognized names, but cannot establish alias sets for
them). This means a polyfill will not be possible. Is this ok?
- The interface does not provide an indication of an unrecognized
encoding name other than by querying the mib()
member to see if the name was mapped to id::other (in
which case, it could still correspond to an encoding known to
the implementation that is not registered with IANA). This is
intentional since the application has no other mechanism for
validating names and support for unknown names is an explicit
design goal. Is this ok?
- Equality is defined partially, but not solely, in terms of text_encoding::id
such that the following expressions all evaluate as indicated
(where Foo, Bar, cz123, and CZ-12.3 are all unrecognized
encoding names):
text_encoding("US-ASCII") == text_encoding("ISO646-US")
// True because .mib() returns the same value for each.
text_encoding("Foo") == text_encoding("Bar")
// False despite .mib() returning the same value (id::other) for each.
text_encoding("cz123") == text_encoding("CZ-12.3")
// True because the names match (case-insensitive, ignoring '-' and '.')
// despite .mib() returning id::other for each.
Is this ok?
- Is the ability to compare a text_encoding object
directly with an ID desirable?
text_encoding("US-ASCII") == text_encoding::id::ASCII
as opposed to requiring:
text_encoding("US-ASCII").mib() == text_encoding::id::ASCII
- Is the name comparison algorithm denoted by COMP_NAME()
acceptable? This algorithm corresponds to Unicode UTS#22
which notes that it results in ambiguities for some of the
IANA registered names.
- Are the preconditions for the text_encoding
- Name sources:
- Is dependence solely on the IANA registry acceptable? Some
concerns were noted in the various mailing list discussions.
ICU's Converter Explorer provides a convenient means to browse
encodings known to ICU that are not registered with IANA (make
sure "IANA" is selected along with other desired sources, then
look for rows that have no entry in the IANA column).
- This is more of a question for LWG. The IANA registry is
not versioned, but does contain a last-updated timestamp. No
stability guarantees are provided, nor is there an obvious way
to access older revisions of the registry. Is a reference
ok? Or do we need to include the contents in the standard?
The IANA registry had not been updated for many years until
just a month ago when "UTF-7-IMAP" was added.
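
The size and equality questions raised above can be made concrete with a small model. What follows is a minimal C++17 sketch under stated assumptions, not the proposed implementation: encoding_model, make_encoding, same_encoding, and normalize_name are hypothetical names, the two-entry alias table is illustrative only, and normalize_name follows one reading of the UTS #22 loose-matching rule (keep only ASCII letters and digits, fold case, drop each 0 that is not preceded by a digit), which may differ in detail from COMP_NAME() as specified in P1885.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Illustrative subset of IANA MIBenum values (assumed 32-bit underlying type).
enum class id : std::int32_t { other = 1, unknown = 2, ASCII = 3 };

// Hypothetical model of the exposition-only data members: an id plus a
// buffer for max_name_length (63) characters and a terminator.
struct encoding_model {
    id mib;
    char name[64];
};
static_assert(sizeof(encoding_model) == 68, "4-byte id + 64-byte name buffer");

// Hypothetical stand-in for COMP_NAME(): UTS #22 style loose matching.
std::string normalize_name(std::string_view name) {
    std::string out;
    for (char c : name) {
        const bool digit = c >= '0' && c <= '9';
        const bool lower = c >= 'a' && c <= 'z';
        const bool upper = c >= 'A' && c <= 'Z';
        if (!digit && !lower && !upper) continue;  // ignore '-', '.', '_', ...
        if (upper) c = static_cast<char>(c - 'A' + 'a');
        // Drop a '0' unless the previously kept character is a digit.
        const bool prev_digit =
            !out.empty() && out.back() >= '0' && out.back() <= '9';
        if (c == '0' && !prev_digit) continue;
        out.push_back(c);
    }
    return out;
}

// Builds a model object; only two aliases of one encoding are "recognized".
encoding_model make_encoding(std::string_view name) {
    encoding_model e{id::other, {}};
    name = name.substr(0, 63);  // keep room for the terminator
    std::copy(name.begin(), name.end(), e.name);
    const std::string key = normalize_name(name);
    if (key == normalize_name("US-ASCII") || key == normalize_name("ISO646-US"))
        e.mib = id::ASCII;
    return e;
}

// Equality as described in the list: ids decide, except that two
// id::other objects are compared by loosely matched name.
bool same_encoding(const encoding_model& a, const encoding_model& b) {
    if (a.mib == id::other && b.mib == id::other)
        return normalize_name(a.name) == normalize_name(b.name);
    return a.mib == b.mib;
}
```

In this model, same_encoding(make_encoding("Foo"), make_encoding("Bar")) is false even though both map to id::other, while "cz123" and "CZ-12.3" compare equal because their normalized names match. The static_assert shows where the 68-byte figure comes from when the id is 4 bytes wide.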