Subject: Re: Bike shedding for Christmas: P1885 Naming Text Encodings
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-01-07 01:34:39

On Tue, Jan 7, 2020, 05:24 Tom Honermann <tom_at_[hidden]> wrote:

> On 1/6/20 8:27 AM, Thiago Macieira via SG16 wrote:
> On Monday, 6 January 2020 10:09:26 -03 Corentin Jabot wrote:
> And yet that is nonsense. It can't convert a codec name to its MIB number
> unless that is in a table somewhere the implementation has access to. So by
> definition, the text_encoding_id is limited to the codecs the Standard Library
> knows about. Other libraries should deploy their own text_encoding_id
> equivalents.
> That is a good point, for which I think the solution might be to force
> hosted implementations to always provide the entire table (which is really
> not that big)?
> It's not, but even then you have the problem that the table in the vendor's
> implementation may be out of date compared to what the application expects.
> And are vendors allowed to extend the table with other names, such as WTF-8?
> Like I said, if all you wanted was the table, you can get the table. I'll
> write an XSL-T script for you to generate the table....
> I think at some point we lost track of what the proposal is about:
> It's about answering:
> - What is the execution character encoding (which only the implementation
> can do)
> - What is the environment encoding (which the implementation can do better)
> Ok, good points. If we restrict text_encoding_id to those, then
> text_encoding_id has no need to support the full table or unknown codecs. By
> definition, it supports only what the implementation supports.
> In Belfast, we discussed the following example in the context of
> [time.duration.io]p4 <http://eel.is/c++draft/time.duration#io-4>;
> printing of the micro units suffix:
>   template<class traits, class Rep, class Period>
>   void print_fancy_suffix(basic_ostream<char, traits>& os,
>                           const duration<Rep, Period>& d)
>   {
>     if constexpr (text_encoding::literal().mib == UTF-8) {
>       os << d.count() << "\u00B5s";
>     } else {
>       os << d.count() << "us";
>     }
>   }
> I see that as one of the primary motivating use cases at present.
> However, I don't think this represents the extent of use cases well.
> I would like to see these encoding identifiers adopted for use in ICU,
> iconv, Qt, or other encoding providers. I think these encoding identifiers
> could be useful in the context of P1629 <https://wg21.link/p1629>.
> I don't want to see code doing string comparisons to match encodings.

Yet that is how it has to work: iconv only exposes a name-based interface.
Qt provides both name-based and MIB-based interfaces, and can provide a
text_encoding-based interface.
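
To make the point concrete, here is a rough sketch of what a name-to-MIB
lookup has to look like when all you get is a name. It is only an
illustration: encoding_entry, known_encodings, and mib_from_name are
hypothetical names, the table is a tiny hand-picked excerpt of the IANA
registry, and the matching is plain ASCII case-insensitive comparison rather
than whatever the proposal ends up specifying.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical table entry: one spelling of an encoding name and its MIBenum.
struct encoding_entry {
    std::string_view name;
    int mib;
};

// Tiny excerpt for illustration; a real table would be generated from the
// IANA character sets registry and would carry every registered alias.
inline constexpr std::array<encoding_entry, 6> known_encodings{{
    {"US-ASCII", 3},
    {"ISO-8859-1", 4},
    {"latin1", 4},
    {"EUC-JP", 18},
    {"UTF-8", 106},
    {"csUTF8", 106},
}};

constexpr char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// ASCII case-insensitive equality (an assumption, not the proposal's rule).
constexpr bool iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
    return true;
}

// Returns the MIBenum for a name, or nullopt if the table does not know it.
constexpr std::optional<int> mib_from_name(std::string_view name) {
    for (const auto& e : known_encodings)
        if (iequals(e.name, name)) return e.mib;
    return std::nullopt;
}
```

A name that is missing from the table (WTF-8, say) simply yields no MIB,
which is the "incomplete list" situation discussed in this thread.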

> The set of encodings an implementation cares about is hard to determine
> since it crosses compiler, standard library, and third party boundaries.
> For example, my understanding is that gcc relies on the host system's
> iconv() implementation to determine the valid execution character set
> targets and to transcode from source file encoding, at least for many
> encodings. So, for gcc, the set of encodings needed to provide complete
> support (e.g., to avoid .mib() returning other or unknown for a supported
> encoding) would involve negotiation between the compiler, run-time library,
> and host system.

Which is not implementable.
Either we force hosted implementations to provide the full database, or we
accept that the list might be incomplete.
In practice, on a given platform there is a direct relation between the
encodings supported by the compiler and those of the system on which it runs.

> And have that information be consistent across platform when possible (for
> interaction with libraries such as Qt, icu, iconv) - everything else is
> secondary.
> Which means an implementation will provide information about the encodings
> relevant to the platform.
> Now, an encoding id is 3 things:
> - A name,
> - A mib when applicable
> - Aliases when applicable
> Agreed, though implementations should be wary that the alias list might be
> empty. Portable applications should rely on the MIB and on the official name.
> What are you referring to as the "official name"? The IANA character registry
> <https://www.iana.org/assignments/character-sets/character-sets.xhtml>
> lists two names and a set of aliases. One of the names is labeled as
> "Preferred MIME Name", the other is just "Name". Not all registered
> character sets have a "Preferred MIME Name". They all do have a "Name".
> There are cases where neither the "Preferred MIME Name" nor the "Name" are
> reflected in the list of aliases. All of the registered sets also contain
> an identifier friendly alias starting with "cs". The "Name" name is not a
> particularly friendly name (since it includes a version date), nor is it
> particularly familiar in many cases (e.g.,
> "Extended_UNIX_Code_Packed_Format_for_Japanese" vs "EUC-JP").

If no preferred MIME name is specified, the MIME name is the same as the name.
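
As a small sketch of that fallback (registry_record and mime_name are
hypothetical names used only for illustration, not part of the proposal):

```cpp
#include <optional>
#include <string_view>

// Hypothetical view of one registry record: every record has a "Name",
// but only some have a "Preferred MIME Name".
struct registry_record {
    std::string_view name;
    std::optional<std::string_view> preferred_mime_name;
};

// Falls back to the plain name when no preferred MIME name is registered.
constexpr std::string_view mime_name(const registry_record& r) {
    return r.preferred_mime_name.value_or(r.name);
}
```

With the EUC-JP record mentioned above, this returns "EUC-JP"; for a record
without a preferred MIME name, it returns the "Name" field unchanged.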

> And the name used to construct the object is used to look up the extra
> optional information.
> I think the only reason to differentiate "unknown" and "other" in the way
> you suggest is if we need to support aliases for non-registered encodings.
> Is that the case?
> I think the implementation should strive to never return "unknown", except in
> case of an internal failure to determine what the encoding is. As a matter of
> quality, implementations should be designed not to do that.
> And yet providing a list of well-known MIBs is useful in and of itself. In
> that case, mib::unknown is a valid and well-known value.
> I think the proposal is leaning too heavily on the IANA registry. For
> example, operator== is specified in terms of what the .mib() member
> function returns. In previous emails, Thiago suggested that the
> text_encoding_id class could be more opaque; e.g., it could have its own
> internal system for identifying whether two names refer to the same,
> potentially unregistered, encoding (in which case, .mib() would return
> other, but this would not impact the behavior of operator==). I strongly
> agree with this direction.

This direction is not implementable portably.

The IANA registry was always an implementation detail, but it is an important
implementation detail nonetheless.
We cannot offer reliable and consistent comparison without it.
The discussion assumes that there exist unregistered encodings that have
many different names on a given platform, and I don't see evidence of that.

I would like to see a concrete example of a situation in which the provided
comparison algorithm is not sufficient.
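
For reference, a MIB-based comparison with a name fallback can be sketched
roughly as below. This is my simplification for the sake of discussion, not
the proposal's wording: encoding_id, mib_other, and loosely_equal are
hypothetical names, and the loose name matching here only ignores ASCII case
and non-alphanumeric characters. The value 1 for "other" is the MIBenum
reserved for that purpose by the IANA charset MIB.

```cpp
#include <cstddef>
#include <string_view>

constexpr int mib_other = 1;  // MIBenum reserved for "other" (unregistered)

// Hypothetical stand-in for the id type: a name plus a MIBenum.
struct encoding_id {
    std::string_view name;
    int mib = mib_other;
};

constexpr bool ascii_alnum(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z');
}

constexpr char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compares names while skipping non-alphanumerics and ignoring ASCII case.
constexpr bool loosely_equal(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    while (true) {
        while (i < a.size() && !ascii_alnum(a[i])) ++i;
        while (j < b.size() && !ascii_alnum(b[j])) ++j;
        if (i == a.size() || j == b.size())
            return i == a.size() && j == b.size();
        if (ascii_lower(a[i]) != ascii_lower(b[j])) return false;
        ++i;
        ++j;
    }
}

// Registered encodings compare by MIB; two unregistered ones compare by name.
constexpr bool operator==(const encoding_id& a, const encoding_id& b) {
    if (a.mib == mib_other && b.mib == mib_other)
        return loosely_equal(a.name, b.name);
    return a.mib == b.mib;
}
```

Under this sketch, "UTF-8" and "csUTF8" compare equal because both carry MIB
106, and two unregistered spellings such as "WTF-8" and "wtf8" compare equal
by name. A counterexample would have to be an unregistered encoding whose
names on one platform differ by more than case and punctuation.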

> Tom.
