On Tue, Jan 7, 2020, 05:24 Tom Honermann <tom@honermann.net> wrote:

On 1/6/20 8:27 AM, Thiago Macieira via SG16 wrote:
On Monday, 6 January 2020 10:09:26 -03 Corentin Jabot wrote:
And yet that is nonsense. It can't convert a codec name to its MIB number
unless that is in a table somewhere the implementation has access to. So by
definition, the text_encoding_id is limited to the codecs the Standard Library
knows about. Other libraries should deploy their own text_encoding_id
equivalents.
That is a good point, for which I think the solution might be to require
hosted implementations to always provide the entire table (which is really
not that big)?
It's not, but even then you have the problem that the table in the vendor's 
implementation may be out of date compared to what the application expects. 
And are vendors allowed to extend the table with other names, such as WTF-8?

Like I said, if all you wanted was the table, you can get the table. I'll 
write an XSL-T script for you to generate the table....
I think at some point we lost track of what the proposal is about:
It's about answering:
- What is the execution character encoding (which only the implementation
can do)
- What is the environment encoding (which the implementation can do better)
Ok, good points. If we restrict text_encoding_id to those, then 
text_encoding_id has no need to support the full table or unknown codecs. By 
definition, it supports only what the implementation supports.
In Belfast, we discussed the following example in the context of [time.duration.io]p4; printing of the micro units suffix:

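template<class traits, class Rep, class Period>
void print_fancy_suffix(basic_ostream<char, traits>& os, const duration<Rep, Period>& d)
{
  // Emit the micro sign only when ordinary literals are encoded as UTF-8.
  // id::UTF8 is the enumerator spelling adopted in C++26's std::text_encoding.
  if constexpr (text_encoding::literal().mib() == text_encoding::id::UTF8) {
    os << d.count() << "\u00B5s";
  } else {
    os << d.count() << "us";
  }
}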

I see that as one of the primary motivating use cases at present. However, I don't think this represents the extent of use cases well.

I would like to see these encoding identifiers adopted for use in ICU, iconv, Qt, or other encoding providers. I think these encoding identifiers could be useful in the context of P1629.

I don't want to see code doing string comparisons to match encodings.

Yet that is how it has to work. iconv only exposes a name-based interface.

Qt provides both name-based and MIB-based interfaces, and could provide a text_encoding-based interface.

The set of encodings an implementation cares about is hard to determine since it crosses compiler, standard library, and third party boundaries. For example, my understanding is that gcc relies on the host system's iconv() implementation to determine the valid execution character set targets and to transcode from source file encoding, at least for many encodings. So, for gcc, the set of encodings needed to provide complete support (e.g., to avoid .mib() returning other or unknown for a supported encoding) would involve negotiation between the compiler, run-time library, and host system.

Which is not implementable.

Either we force hosted implementations to provide the full database or we accept that the list might be incomplete.

In practice, on a given platform there is a direct relation between the encodings supported by the compiler and those supported by the system on which it runs.

And have that information be consistent across platforms when possible (for
interaction with libraries such as Qt, ICU, iconv) - everything else is
secondary.
Which means an implementation will provide information about the encodings
relevant to the platform.

Now, an encoding id is 3 things:
- A name
- A MIB, when applicable
- Aliases, when applicable
Agreed, though implementations should be wary that the alias list might be 
empty. Portable applications should rely on the MIB and on the official name.
What are you referring to as the "official name"? The IANA character registry lists two names and a set of aliases. One of the names is labeled as "Preferred MIME Name", the other is just "Name". Not all registered character sets have a "Preferred MIME Name". They all do have a "Name". There are cases where neither the "Preferred MIME Name" nor the "Name" are reflected in the list of aliases. All of the registered sets also contain an identifier friendly alias starting with "cs". The "Name" name is not a particularly friendly name (since it includes a version date), nor is it particularly familiar in many cases (e.g., "Extended_UNIX_Code_Packed_Format_for_Japanese" vs "EUC-JP").

If no preferred MIME name is specified, the MIME name is the same as the name.
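To make the name / MIME name / alias / MIB relationship concrete, here is a minimal sketch of what one registry record and a lookup over it could look like. Everything named here (charset_record, sample_registry, lookup_mib) is hypothetical and not part of the proposal. The two rows are illustrative only: the EUC-JP row uses the names quoted above, and the empty preferred-name field on the UTF-8 row is there to exercise the fallback, not to make a statement about the registry. Matching is exact for brevity; a real lookup would normalise labels first.

```cpp
#include <array>
#include <cassert>
#include <string_view>

// IANA charset MIB value for encodings that are not in the table.
constexpr int mib_other = 1;

// Hypothetical record type: one row of a (partial) charset table.
struct charset_record {
    int mib;
    std::string_view name;                    // the registry "Name"
    std::string_view preferred_mime_name;     // empty when none is listed
    std::array<std::string_view, 2> aliases;  // padded with empty strings

    // Fall back to the name when no preferred MIME name is listed.
    constexpr std::string_view mime_name() const {
        return preferred_mime_name.empty() ? name : preferred_mime_name;
    }
};

// Illustrative sample; a real implementation would carry many more rows.
inline constexpr std::array<charset_record, 2> sample_registry{{
    {106, "UTF-8", "", {"csUTF8", ""}},
    {18, "Extended_UNIX_Code_Packed_Format_for_Japanese", "EUC-JP",
     {"csEUCPkdFmtJapanese", "EUC-JP"}},
}};

// Map a label to a MIB value; anything not in the table becomes "other".
constexpr int lookup_mib(std::string_view label) {
    if (label.empty()) {
        return mib_other;
    }
    for (const auto& rec : sample_registry) {
        if (label == rec.name || label == rec.preferred_mime_name) {
            return rec.mib;
        }
        for (std::string_view alias : rec.aliases) {
            if (label == alias) {
                return rec.mib;
            }
        }
    }
    return mib_other;
}

// Small usage example.
inline void charset_record_example() {
    assert(lookup_mib("EUC-JP") == 18);
    assert(lookup_mib("WTF-8") == mib_other);
}
```

The sketch also shows the limitation raised earlier in the thread: a label the table does not contain can only be reported as "other", however well known it is elsewhere.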

And the name used to construct the object is used to look up the extra
optional information.
I think the only reason to differentiate "unknown" and "other" in the way
you suggest is if we need to support aliases for non-registered encodings.
Is that the case?
I think the implementation should strive to never return "unknown", except in 
case of an internal failure to determine what the encoding is. As a matter of 
quality, implementations should be designed not to do that.

And yet providing a list of well-known MIBs is useful in and of itself. In 
that case, mib::unknown is a valid and well-known value.
I think the proposal is leaning too heavily on the IANA registry. For example, operator== is specified in terms of what the .mib() member function returns. In previous emails, Thiago suggested that the text_encoding_id class could be more opaque; e.g., it could have its own internal system for identifying whether two names refer to the same, potentially unregistered, encoding (in which case, .mib() would return other, but this would not impact the behavior of operator==). I strongly agree with this direction.

This direction is not implementable portably.

The IANA registry was always an implementation detail, but it is an important implementation detail nonetheless.

We cannot offer reliable and consistent comparison without it.
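For readers following along, here is a rough sketch of the kind of comparison under discussion. It assumes two things the thread does not spell out: that labels are matched with the charset alias matching rules of Unicode Technical Standard #22 (drop non-alphanumerics, fold case, drop zeros not preceded by a digit), and that equality compares MIB values and falls back to label matching only when both sides are "other". The names normalize_label, labels_match, encoding_id_sketch and same_encoding are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// IANA charset MIB value for encodings without a registered MIB.
constexpr int id_other = 1;

// Normalise a label following UTS #22 charset alias matching (ASCII only).
inline std::string normalize_label(std::string_view label) {
    std::string out;
    for (unsigned char c : label) {
        const bool is_digit = c >= '0' && c <= '9';
        const bool is_upper = c >= 'A' && c <= 'Z';
        const bool is_lower = c >= 'a' && c <= 'z';
        if (!is_digit && !is_upper && !is_lower) {
            continue;  // delete everything except letters and digits
        }
        if (c == '0') {
            // Delete a zero unless the previously kept character is a digit.
            const bool prev_is_digit =
                !out.empty() && out.back() >= '0' && out.back() <= '9';
            if (!prev_is_digit) {
                continue;
            }
        }
        out.push_back(is_upper ? static_cast<char>(c - 'A' + 'a')
                               : static_cast<char>(c));
    }
    return out;
}

inline bool labels_match(std::string_view a, std::string_view b) {
    return normalize_label(a) == normalize_label(b);
}

// Hypothetical stand-in for an encoding identifier: a MIB value plus a label.
struct encoding_id_sketch {
    int mib;
    std::string name;
};

// Registered encodings compare by MIB; two "other" encodings compare by label.
inline bool same_encoding(const encoding_id_sketch& a,
                          const encoding_id_sketch& b) {
    if (a.mib == id_other && b.mib == id_other) {
        return labels_match(a.name, b.name);
    }
    return a.mib == b.mib;
}

// Small usage example with an unregistered encoding.
inline void comparison_example() {
    assert(same_encoding({id_other, "WTF-8"}, {id_other, "wtf8"}));
    assert(!same_encoding({id_other, "WTF-8"}, {106, "UTF-8"}));
}
```

Under these assumptions, two spellings of the same unregistered label compare equal, while two genuinely different names for one unregistered encoding would not. That is the case for which a concrete example is being requested.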

The discussion assumes that there exist unregistered encodings which have many different names on a given platform and I don't see evidence of that.

I would like to see a concrete example of situation in which the provided comparison algorithm is not sufficient.

Tom.