sg16: Re: [SG16] [isocpp-lib-ext] Questions for LEWG for P1885: Naming Text Encodings to Demystify Them

From: Jens Maurer <Jens.Maurer_at_[hidden]>
Date: Sun, 7 Feb 2021 21:57:15 +0100

On 07/02/2021 08.47, Tom Honermann via Lib-Ext wrote:
> 1. Naming:
> 1. The text_encoding type represents an encoding name and/or identifier as opposed to a type that provides encoding services. Should the name more strongly reflect that intended use as a name/identifier?

I'd expect to see a list of alternatives and some rationale in
the prose section of the paper.

> 2. The id and mib() members of text_encoding correspond to IANA-specific values and terms. It is conceivable that mappings to a different/additional registry could be desired at some time in the future. Should these names more strongly reflect their IANA association?

The sad part is that these names are internally inconsistent.

We have ISOLatin1 and ISO885913, for instance,
or Windows30Latin1 and windows1250.

See also the next section.

> 3. The enumerators of text_encoding::id were obtained by, for each IANA registered encoding, taking the "cs" prefixed alias name (of which there is always exactly one), and dropping the "cs" prefix. A special change was then made to rename the one that would have been "Unicode" to "UCS2". Many of the resulting names consist of only capital letters and may be mistaken for macros. Are these names ok? Or do they intrude too much on the namespace of user identifiers?

I think we have a naming convention for enum names in the
standard library (lower_snake_case), and these names should
fit in. Some of the names such as OSDEBCDICDF0415 are hard
to pronounce without additional underscores.

Maybe we should just take the numerical values and invent our
own proper names for these encodings.
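
To make the naming point concrete, here is a minimal sketch of what lower_snake_case enumerators keyed on the registry's numeric values might look like. The type name encoding_id, the helper mib_of, and all four enumerator spellings are invented here for illustration and are not taken from P1885; the numeric values are, as far as I know, the MIBenum values the IANA registry assigns to these character sets.

```cpp
#include <cassert>

// Hypothetical spellings: the paper derives its names from the "cs" aliases,
// whereas this sketch keeps only the numeric value and picks its own name.
enum class encoding_id : int {
    ascii        = 3,     // the paper's "ASCII"
    iso_latin_1  = 4,     // the paper's "ISOLatin1"
    utf_8        = 106,   // the paper's "UTF8"
    windows_1250 = 2250   // the paper's "windows1250"
};

// Hypothetical helper: recover the registry number from the enumerator.
constexpr int mib_of(encoding_id e) { return static_cast<int>(e); }

static_assert(mib_of(encoding_id::utf_8) == 106,
              "the numeric value, not the spelling, identifies the encoding");
```

Because the number carries the identity, the spelling can follow the library's own convention without losing the link to the registry.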

> 4. The literal() and wide_literal() members of text_encoding return names for what the standard calls the /execution character set/ and /execution-wide character set/. Are these names ok? (SG16 has discussed updating terminology used within the standard, but has not yet forwarded a paper containing such a proposal).

There's an upcoming paper to rename "execution (wide) character set" to "(wide) literal encoding",
so this seems reasonable. The literal and wide_literal functions are consteval, though,
which (I believe) means they can only be called at compile time (not at run time).
That does not seem helpful; making them "constexpr" should be good enough.

> 5. The system() and wide_system() members of text_encoding return names for the locale sensitive run-time encoding that was active at the start of the process (e.g., before any calls to setlocale()). Are these names ok? On Windows, system() would return an encoding corresponding to GetACP() <https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp>.

The specification "Return the presumed system narrow encoding." uses a lot of
undefined terms: Who presumes what? What system are we talking about?
What's this "narrow" talk?

> 2. Interface:
> 1. The text_encoding type, if implemented as shown with the exposition data members, would have a minimum size of 68 bytes. SG16 has discussed future use of this type as a tag type or non-type template parameter to select an encoding at compile time. Does the size of the type raise any concerns for such use?

I believe that spelling out the underlying type for
the "id" enum is overspecification.

Other than that, maybe we should forego the idea of user-defined
encoding names, and reduce the facility to just naming one of the
predefined encodings. (Noting similarities with enum class endian;
we don't support VAX-endianness, for example.)
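
For reference, the 68-byte figure in the question can be reproduced with a rough sketch of the layout, assuming a 32-bit id followed by a 63-character name plus terminator. Both type names below are hypothetical, and the actual exposition members in the paper may differ. The second type shows how small the facility becomes if it only names predefined encodings.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: numeric id plus storage for a 63-character name and '\0'.
struct named_encoding {
    std::int32_t mib;
    char         name[63 + 1];
};

// Reduced facility: only predefined encodings, no stored name.
// The underlying type is deliberately left unspecified.
enum class predefined_encoding { ascii = 3, utf_8 = 106 };

// 4 + 64 bytes; holds on common ABIs with 8-bit bytes.
static_assert(sizeof(named_encoding) == 68, "id plus name buffer");
static_assert(sizeof(predefined_encoding) == sizeof(int),
              "a scoped enum defaults to int as its underlying type");
```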

> 2. The max_name_length member of text_encoding is specified with a length of 63 (not including a string terminator). The IANA character set registry <https://www.iana.org/assignments/character-sets/character-sets.xhtml> introductory text states that "The character set names may be up to 40 characters taken from the printable characters of US-ASCII". Should this length be adjusted to match or should the current length be retained? Additional encoding names known to ICU that are not registered with the IANA registry can be browsed with ICU's Converter Explorer <https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>. The longest name there (which includes IANA names) appears to be 27 characters.

Smaller is better, if we keep this at all.

> 3. The proposed design exposes a library solution that is not accessible to the preprocessor. Is LEWG ok with the (wide) execution character set continuing to be unknown for preprocessor directives? (a patch <https://github.com/gcc-mirror/gcc/commit/eccec8684142e05f2f92f0f5bd5b47dda3ba1529> accepted for gcc 11 to provide this information for the purposes of implementing this feature will expose the names of these encodings as string literals via new __GNUC_EXECUTION_CHARSET_NAME and __GNUC_WIDE_EXECUTION_CHARSET_NAME predefined macros).

No opinion, but maybe the C liaison group wants to chime in here.

> 4. The interface allows implementations to extend the set of recognized encodings beyond those registered with IANA in a way that permits those additional implementation known encodings to have associated aliases (e.g., the implementation could use negative values for additional text_encoding::id enumerators; RFC 3808 states only positive values will be used <https://tools.ietf.org/html/rfc3808#section-3>). However, similar extension is not possible for user code (User code can construct instances of text_encoding with unrecognized names, but cannot establish alias sets for them). This means polyfill will not be possible. Is this ok?

The whole purpose of this facility is to allow portable programs to adapt to the
surrounding encoding. Implementations that invent their own encodings can't
be reasonably addressed by portable programs.

The proposed wording needs to extend the "normative references" and/or "bibliography"
in the standard for all these RFCs it is talking about.
(I think "Bibliography" is sufficient, given that all the
needed information is presented here.)

> 6. Equality is defined partially, but not solely, in terms of text_encoding::id such that the following expressions all evaluate as indicated (where Foo, Bar, cz123, and CZ-12.3 are all unrecognized encoding names):
> text_encoding("US-ASCII") == text_encoding("ISO646-US") // True because .mib() returns the same value for each.
> text_encoding("Foo") == text_encoding("Bar") // False despite .mib() returning the same value (id::other) for each.
> text_encoding("cz123") == text_encoding("CZ-12.3") // True because the names match (case-insensitive, ignoring '-' and '.') despite .mib() returning id::other for each.
> Is this ok?

I guess this establishes equivalence classes, so it's at least sound.
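
A small sketch of the rule as I read it from the three quoted examples: if either side is recognized, compare ids; if both are unrecognized, compare the folded names. Everything here is a hypothetical stand-in (toy_encoding, fold, make, and the two-entry "registry"), and the value 1 for "other" is my assumption about the paper's enumerator.

```cpp
#include <cassert>
#include <cctype>
#include <string>

const int other_mib = 1;  // assumed value of id::other
const int ascii_mib = 3;  // MIBenum for US-ASCII

// Hypothetical stand-in for text_encoding.
struct toy_encoding {
    int         mib;
    std::string name;
};

// Fold a name as the quoted example describes: drop '-' and '.', ignore case.
inline std::string fold(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        if (c != '-' && c != '.') out += static_cast<char>(std::tolower(c));
    return out;
}

// Toy lookup that recognizes only the two ASCII names from the question.
inline toy_encoding make(const std::string& name) {
    const std::string f = fold(name);
    const bool is_ascii = (f == fold("US-ASCII") || f == fold("ISO646-US"));
    return toy_encoding{is_ascii ? ascii_mib : other_mib, name};
}

inline bool operator==(const toy_encoding& a, const toy_encoding& b) {
    if (a.mib != other_mib || b.mib != other_mib)
        return a.mib == b.mib;            // at least one recognized: ids decide
    return fold(a.name) == fold(b.name);  // both unrecognized: names decide
}
```

In this sketch every recognized id forms one class and every distinct folded name forms one class among the unrecognized encodings, which is why the relation is reflexive, symmetric, and transitive.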

> 7. Is the ability to compare a text_encoding object directly with an ID desirable?
> text_encoding("US-ASCII") == text_encoding::id::ASCII
> as opposed to requiring:
> text_encoding("US-ASCII").mib() == text_encoding::id::ASCII

I don't think this convenience, which merely saves a ".mib()" member function call,
needs to be provided.

> 8. Is the name comparison algorithm denoted by COMP_NAME() acceptable? This algorithm corresponds to Unicode UTS#22 <https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching> which notes that it results in ambiguities for some of the IANA registered names.

It would be nice to see some explanation of this in the design section of the paper, including examples
of the ambiguities. On the face of it, "ambiguity" sounds frightening.
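
To illustrate the kind of explanation I mean, here is a rough sketch of loose matching as I understand the UTS#22 rule: drop everything that is not a letter or digit, ignore case, and drop zeros that do not follow a digit. The function names loose_key and loose_match are made up, the paper's COMP_NAME() may differ in detail, and the two "iso-ir" strings are chosen only to show the shape of a collision.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Build a comparison key: keep only [A-Za-z0-9], lowercase the letters, and
// skip any '0' that is not preceded by a digit in the key built so far.
inline std::string loose_key(const std::string& name) {
    std::string key;
    for (unsigned char c : name) {
        if (!std::isalnum(c)) continue;  // punctuation and spaces vanish
        const bool after_digit =
            !key.empty() && std::isdigit(static_cast<unsigned char>(key.back()));
        if (c == '0' && !after_digit) continue;  // leading zero of a number
        key += static_cast<char>(std::tolower(c));
    }
    return key;
}

inline bool loose_match(const std::string& a, const std::string& b) {
    return loose_key(a) == loose_key(b);
}
```

Once the separators are gone, "iso-ir-9-1" and "iso-ir-91" both reduce to "isoir91", so two names that differ only in where the hyphen sits can no longer be told apart.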

> 9. Are the preconditions for the text_encoding constructors acceptable?

What is "MIBenum" referring to?

> 4. References:
> 1. This is more of a question for LWG. The IANA registry is not versioned, but does contain a last updated time stamp. No stability guarantees are provided, nor is there an obvious way to access older revisions of the registry. Is a reference ok? Or do we need to include the contents in the standard? The IANA registry had not been updated for many years until just a month ago when "UTF-7-IMAP" was added.

We already use IANA timezones. However, those only appear as strings (not as enumerators),
and we don't have a funny name comparison algorithm to cope with.

Which probably means we need to run a registry for these names in the C++ standard,
which is utterly disgusting.

General notes on the proposed wording:

- There is funny text "described by [rfc2978] and [rfc3808]"
This is not how RFCs should be referenced in running standard text.
I think something like "RFC 2978" should be used.

- "struct text_encoding{" missing space before brace

- misspelling "lenght" (more than once)

- There are multiple mentions of "text_encoding::id" within the scope of text_encoding.
Shorten to just "id".

- "registered-character-set" should be italicized only where it is defined
(i.e. once) and should be spelled without hyphens.

- "known of" -> "known to"

- misspelling "precedeed"

- text_encoding constructor from string_view: This is broken.
As worded, for the name of a registered character set (which seems to be
distinct from an alias), I need to return "other".
The note about freestanding lacks backing in normative text.
Excise it or turn it into normative "Remarks" text.

- text_encoding constructor from a mib: This is "noexcept", yet
has a precondition. Is that intentional? It's at least surprising.

- "implementation-defined" should have a hyphen

- "null-terminated string" should be NTBS or NTMBS, whichever is desired.

- "on that platform": Which "platform" are we talking about? This is an
undefined term in C++.

- aliases: An "object" r is not a thing you can apply decltype to.
Suggestion: "implementation-defined object r of type R" and use R instead
of decltype(r).
Making the function "constexpr" makes it hard to use third-party services
for this.

- wide_system(): The narrow system encoding has a clear POSIX call to use;
this one doesn't. Please show POSIX function calls.

- "lifetime" -> "execution" (lifetime is a well-defined term in C++ and
means something else entirely)

- "Returns: Equivalent to" doesn't exist. Use "Effects: Equivalent to: return system() == id_"
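
On the wide_system() note above: the narrow-side POSIX call I have in mind is presumably nl_langinfo(CODESET). A minimal sketch follows, assuming a POSIX system that provides <langinfo.h>; the wrapper name is made up. Note that the sketch itself calls setlocale(), so it reports the encoding of the environment's locale rather than anything captured at program start.

```cpp
#include <cassert>
#include <clocale>
#include <langinfo.h>  // POSIX header, not part of ISO C++
#include <string>

// Hypothetical wrapper: adopt the environment's LC_CTYPE, then ask for the
// codeset name of the now-current locale.
inline std::string narrow_codeset_from_environment() {
    std::setlocale(LC_CTYPE, "");  // side effect: changes the global C locale
    const char* codeset = nl_langinfo(CODESET);
    return codeset ? std::string(codeset) : std::string();
}
```

The returned string is whatever the platform reports and is not necessarily an IANA-registered name. I know of no equally direct POSIX query for the wide encoding, which is why the paper should spell out the calls it has in mind.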

Jens

Received on 2021-02-07 14:57:24