sg16: Re: [SG16] Bike shedding for Christmas: P1885 Naming Text Encodings

From: Tom Honermann <tom_at_[hidden]>
Date: Mon, 6 Jan 2020 23:24:23 -0500

On 1/6/20 8:27 AM, Thiago Macieira via SG16 wrote:
> On Monday, 6 January 2020 10:09:26 -03 Corentin Jabot wrote:
>>> And yet that is nonsense. It can't convert a codec name to its MIB number
>>> unless that is in a table somewhere the implementation has access to. So
>>> by
>>> definition, the text_encoding_id is limited to the codecs the Standard
>>> Library
>>> knows about. Other libraries should deploy their own text_encoding_id
>>> equivalents.
>> That is a good point for which i think the solution might be to force
>> hosted implementation to
>> always provide the entire table (which is really not that big) ?
> It's not, but even then you have the problem that the table in the vendor's
> implementation may be out of date compared to what the application expects.
> And are vendors allowed to extend the table with other names, such as WTF-8?
>
> Like I said, if all you wanted was the table, you can get the table. I'll
> write an XSL-T script for you to generate the table....
>
>> I think at some point we lost track of what the proposal is about:
>> It's about answering:
>> - What is the execution character encoding (which only the implementation
>> can do)
>> - What is the environment encoding (which the implementation can do better)
> Ok, good points. If we restrict text_encoding_id to those, then
> text_encoding_id has no need to support the full table or unknown codecs. By
> definition, it supports only what the implementation supports.
In Belfast, we discussed the following example in the context of
[time.duration.io]p4 <http://eel.is/c++draft/time.duration#io-4>;
printing of the micro units suffix:

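    template<class traits, class Rep, class Period>
    void print_fancy_suffix(basic_ostream<char, traits>& os,
                            const duration<Rep, Period>& d)
    {
       // Use the micro sign only when the literal encoding is known to be
       // UTF-8 (enumerator spelling assumed); otherwise fall back to ASCII.
       if constexpr (text_encoding::literal().mib() == text_encoding::id::UTF8) {
         os << d.count() << "\u00B5s";
       } else {
         os << d.count() << "us";
       }
    }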

I see that as one of the primary motivating use cases at present.
However, I don't think this represents the extent of use cases well.

I would like to see these encoding identifiers adopted for use in ICU,
iconv, Qt, or other encoding providers. I think these encoding
identifiers could be useful in the context of P1629
<https://wg21.link/p1629>.

I don't want to see code doing string comparisons to match encodings.
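
To illustrate the concern, here is a minimal sketch contrasting a string
test with an identifier test. The names mib_id, is_utf8_by_name, lookup,
and is_utf8 are hypothetical and not part of P1885; the enumerator values
are the IANA MIBenum numbers, and only a handful of spellings are listed.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-in for the proposed identifier type. The enumerator
// values are the IANA MIBenum numbers for these character sets.
enum class mib_id : int {
    other = 1,
    unknown = 2,
    ascii = 3,
    utf8 = 106,
};

// String matching: every accepted spelling has to be anticipated at each
// call site, and a registered alias such as "csUTF8" is silently missed.
constexpr bool is_utf8_by_name(std::string_view name) {
    return name == "UTF-8";
}

// Hypothetical lookup that resolves a spelling to an identifier once, at the
// boundary with the encoding provider. Only a few spellings are listed here.
constexpr mib_id lookup(std::string_view name) {
    if (name == "UTF-8" || name == "csUTF8") return mib_id::utf8;
    if (name == "US-ASCII" || name == "csASCII") return mib_id::ascii;
    return mib_id::other;
}

// Identifier matching: one comparison, independent of spelling.
constexpr bool is_utf8(mib_id id) {
    return id == mib_id::utf8;
}

inline void example_usage() {
    assert(!is_utf8_by_name("csUTF8"));  // alias missed by the string test
    assert(is_utf8(lookup("csUTF8")));   // alias resolved to the same id
    assert(is_utf8(lookup("UTF-8")));
    assert(!is_utf8(lookup("US-ASCII")));
}
```

The point of the sketch is only that the name-to-identifier mapping lives
in one place, so callers compare identifiers rather than spellings.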

The set of encodings an implementation cares about is hard to determine
since it crosses compiler, standard library, and third-party
boundaries. For example, my understanding is that gcc relies on the
host system's iconv() implementation to determine the valid execution
character set targets and to transcode from source file encoding, at
least for many encodings. So, for gcc, the set of encodings needed to
provide complete support (e.g., to avoid .mib() returning other or
unknown for a supported encoding) would involve negotiation between the
compiler, run-time library, and host system.

>
>> And have that information be consistent across platform when possible (for
>> interaction with libraries such as Qt, icu, iconv) - everything else is
>> secondary.
>> Which means an implementation will provide informations about encoding
>> relevant to the platform.
>>
>> Now, an encoding id is 3 things:
>> - A name,
>> - A mib when applicable
>> - Aliases when applicable
> Agreed, though implementations should be wary that the alias list might be
> empty. Portable applications should rely on the MIB and on the official name.

What are you referring to as the "official name"? The IANA character
registry
<https://www.iana.org/assignments/character-sets/character-sets.xhtml>
lists two names and a set of aliases. One of the names is labeled as
"Preferred MIME Name", the other is just "Name". Not all registered
character sets have a "Preferred MIME Name". They all do have a
"Name". There are cases where neither the "Preferred MIME Name" nor the
"Name" is reflected in the list of aliases. All of the registered sets
also contain an identifier-friendly alias starting with "cs". The
"Name" is not a particularly friendly name (since it often includes a
version date, e.g., "ISO_8859-1:1987"), nor is it particularly familiar
in many cases (e.g., "Extended_UNIX_Code_Packed_Format_for_Japanese" vs
"EUC-JP").

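The shape of a registry record described above can be sketched as a small
data structure. This is an illustration only: registry_entry, display_name,
and has_alias are hypothetical names, and the second sample entry is
invented to show a record without a "Preferred MIME Name". The EUC-JP
values are taken from the IANA registry.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical model of one IANA character set registration.
struct registry_entry {
    int mib;                                              // MIBenum
    std::string_view name;                                // always present
    std::optional<std::string_view> preferred_mime_name;  // may be absent
    std::vector<std::string_view> aliases;                // may be empty
};

// Pick the friendlier spelling when one is registered, else fall back to
// the "Name" field.
inline std::string_view display_name(const registry_entry& e) {
    return e.preferred_mime_name.value_or(e.name);
}

// Exact-match alias test; a real lookup would likely normalize case and
// punctuation first.
inline bool has_alias(const registry_entry& e, std::string_view spelling) {
    for (std::string_view alias : e.aliases) {
        if (alias == spelling) return true;
    }
    return false;
}

// Registered entry for EUC-JP.
inline const registry_entry euc_jp{
    18,
    "Extended_UNIX_Code_Packed_Format_for_Japanese",
    "EUC-JP",
    {"csEUCPkdFmtJapanese", "EUC-JP"},
};

// Invented entry (not in the registry): no preferred MIME name, and the
// "Name" itself does not appear among the aliases.
inline const registry_entry no_mime_example{
    0,
    "X-Example-Name:1999",
    std::nullopt,
    {"csXExample"},
};

inline void example_usage() {
    assert(display_name(euc_jp) == "EUC-JP");
    assert(display_name(no_mime_example) == "X-Example-Name:1999");
    assert(!has_alias(no_mime_example, no_mime_example.name));
}
```

Under this model, "official name" is ambiguous between name and
preferred_mime_name, which is the question raised above.
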
>
>> And the name used to construct the object is used to lookup the extra
>> optional informations.
>> I think the only reason to differentiate "unknown" and "other" in the way
>> you suggest is if
>> we need to support aliases for non registered encodings.
>> Is that the case?
> I think the implementation should strive to never return "unknown", except in
> case of an internal failure to determine what the encoding is. As a matter of
> quality, implementations should be designed not to do that.
>
> And yet providing a list of well-known MIBs is useful in and of itself. In
> that case, mib::unknown is a valid and well-known value.
>
I think the proposal is leaning too heavily on the IANA registry. For
example, operator== is specified in terms of what the .mib() member
function returns. In previous emails, Thiago suggested that the
text_encoding_id class could be more opaque; e.g., it could have its own
internal system for identifying whether two names refer to the same,
potentially unregistered, encoding (in which case, .mib() would return
other, but this would not impact the behavior of operator==). I
strongly agree with this direction.
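
A minimal sketch of what such an opaque identifier could look like,
assuming an internal table that groups spellings into equivalence classes.
Everything here is hypothetical (encoding_id, the table, and the "wtf8"
spelling for the unregistered WTF-8); it is not the design in P1885.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

enum class mib_id : int { other = 1, unknown = 2, utf8 = 106 };

// Hypothetical internal table: spellings that share a key name the same
// encoding, whether or not the encoding has a registered MIB value.
struct known_spelling {
    std::string_view spelling;  // stored in lower case
    int key;                    // internal identity, never exposed
    mib_id mib;
};

inline constexpr known_spelling table[] = {
    {"utf-8", 0, mib_id::utf8},
    {"csutf8", 0, mib_id::utf8},
    {"wtf-8", 1, mib_id::other},  // unregistered encoding
    {"wtf8", 1, mib_id::other},   // assumed alternative spelling
};

inline std::string to_lower(std::string_view s) {
    std::string out(s);
    for (char& c : out) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return out;
}

class encoding_id {
public:
    explicit encoding_id(std::string_view name) : name_(to_lower(name)) {
        for (const known_spelling& k : table) {
            if (k.spelling == name_) {
                key_ = k.key;
                mib_ = k.mib;
                break;
            }
        }
    }

    mib_id mib() const { return mib_; }

    // Equality uses the internal key when either side is in the table and
    // falls back to the normalized name otherwise. It never consults mib().
    friend bool operator==(const encoding_id& a, const encoding_id& b) {
        if (a.key_ >= 0 || b.key_ >= 0) return a.key_ == b.key_;
        return a.name_ == b.name_;
    }
    friend bool operator!=(const encoding_id& a, const encoding_id& b) {
        return !(a == b);
    }

private:
    std::string name_;
    int key_ = -1;                // -1: spelling not in the table
    mib_id mib_ = mib_id::other;  // unrecognized names report "other"
};

inline void example_usage() {
    assert(encoding_id("WTF-8") == encoding_id("wtf8"));
    assert(encoding_id("WTF-8").mib() == mib_id::other);
    // Both report "other", yet they are distinct encodings.
    assert(encoding_id("WTF-8") != encoding_id("my-private-codec"));
}
```

With this shape, two identifiers that both return other from .mib() can
still compare unequal, and two spellings of one unregistered encoding can
compare equal.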

Tom.

Received on 2020-01-06 22:26:54