sg16: Re: [SG16] Bike shedding for Christmas: P1885 Naming Text Encodings

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Tue, 7 Jan 2020 14:37:59 +0100

On Tue, 7 Jan 2020 at 14:24, Thiago Macieira <thiago_at_[hidden]> wrote:

> On Tuesday, 7 January 2020 01:34:39 CST Corentin Jabot wrote:
> > > > I think at some point we lost track of what the proposal is about:
> > > > It's about answering:
> > > > - What is the execution character encoding (which only the
> > > > implementation
> > > > can do)
> > > > - What is the environment encoding (which the implementation can do
> > > > better)
> > >
> > > Ok, good points. If we restrict text_encoding_id to those, then
> > > text_encoding_id has no need to support the full table or unknown
> codecs.
> > > By definition, it supports only what the implementation supports.
>
> Actually, I question the need for a class at all. As Tom's example below
> showed, we need an API to get the character sets of your two points above.
> We
> do not need to overdo this.
>
> > > template<class traits, class Rep, class Period>
> > > void print_fancy_suffix(basic_ostream<char, traits>& os,
> > >                         const duration<Rep, Period>& d)
> > > {
> > >     if constexpr (text_encoding::literal().mib == UTF-8) {
> > >         os << d.count() << "\u00B5s";
> > >     } else {
> > >         os << d.count() << "us";
> > >     }
> > > }
> > >
> > > I see that as one of the primary motivating use cases at present.
> > > However, I don't think this represents the extent of use cases well.
>
> Then let's enumerate the use-cases we think this would be useful for.
>
> > > I would like to see these encoding identifiers adopted for use in ICU,
> > > iconv, Qt, or other encoding providers. I think these encoding
> > > identifiers
> > > could be useful in the context of P1629 <https://wg21.link/p1629>.
> > >
> > > I don't want to see code doing string comparisons to match encodings.
>
> You should ask whether those libraries want to adopt your new class.
> Speaking
> for Qt, the text_encoding_id would be at best a holder for information we
> already have in other forms (namely, MIB number and text name of the
> encoding). If it provides the dashless string comparison function or if it
> carries a MIB and alias database, it's duplicating information we already
> have. Qt needs codecs for two things:
>
> 1) conversion to/from the locale encoding on Windows[*] and UTF-16.
> [*]: As of Qt 6.0, we're declaring all Unix systems that are not using
> UTF-8
> as their locale encoding to be misconfigured; will complain and then
> proceed to use UTF-8 anyway.
>
> 2) arbitrary conversion for legacy decoding of file formats and network
> protocols. For this, we need a library that has a good set of codecs, like
> ICU
> does and like GNU libc's gconv. However, this is easy to make a
> compile-time
> choice too, so the user can choose not to use the support for one or more
> libraries. We do this in QTimezone class, where we have ICU, IANA DB,
> Windows
> and Android backends.
>
> There's a third thing that users ask for and I don't want to provide and I
> don't think should be in the Standard Library's API either:
> detecting/guessing
> the encoding of a given text. QTextCodec has functions like "codecForHtml"
> which simply try to decode the HTML looking for the <meta charset> line
> and
> extract the name from there, but what users want is to do what browsers
> do: be
> given some content and detect whether it's KOI8-R, EUC-JP, Shift-JIS,
> Windows
> 1252, or UTF-8.
>
> Note we will take care of converting from UTF-16 to UTF-8, Latin1 and
> UTF-32
> and vice-versa. We will *not* use the standard library or any other
> library
> for those until I see at least one with an implementation that is better
> than
> mine. (see Daniel Lemire and cpplang's #x86 channel for implementations
> that
> beat mine in some cases)
>
> Finally, note that things can change between now and 2027 (assuming this
> makes it into C++23).
>
> > Yet that is how it has to work. iconv only exposes a name-based interface.
> > Qt does provide both name- and MIB-based interfaces and can provide a
> > text_encoding-based interface.
>
> But what's the point? Aside from holding a MIB-or-text name, what can the
> class do? What's its value?
>

Holding a MIB and/or name :)
That's it (plus giving you aliases and proper name and MIB comparison).

If and when we have encoder/decoder objects in the standard, they would
return a text_encoding_id in an id() or info() function.

Where Qt does (or will do) that in a single class, we would have one class
for the name and one for the actual conversion facility.
It is not meant to be more than a fancy name.
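
To make that split concrete, here is a minimal sketch. All names in it
(mib_id, encoding_id, utf8_decoder) are made up for illustration and are
not the interface proposed in P1885; the only facts it relies on are the
IANA MIBenum values for "other" (1), "unknown" (2) and UTF-8 (106).

```cpp
#include <string_view>

// Hypothetical names throughout; this is not the interface proposed in P1885.
// Numeric values are the IANA MIBenum values for these entries.
enum class mib_id : int { other = 1, unknown = 2, utf8 = 106 };

// A "fancy name": it holds a MIB and/or a name and does no conversion.
class encoding_id {
public:
    constexpr encoding_id() = default;
    constexpr encoding_id(mib_id mib, std::string_view name)
        : mib_(mib), name_(name) {}

    constexpr mib_id mib() const { return mib_; }
    constexpr std::string_view name() const { return name_; }

private:
    mib_id mib_ = mib_id::unknown;
    std::string_view name_{};
};

// A separate, equally hypothetical conversion facility that only reports
// which encoding it handles. The decoding logic itself is omitted.
struct utf8_decoder {
    constexpr encoding_id id() const { return {mib_id::utf8, "UTF-8"}; }
};
```

The identifier type stays a plain value that can be copied and inspected,
while anything that actually converts text lives in a different type.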

> > Which is not implementable.
> > Either we force hosted implementations to provide the full database or we
> > accept that the list might be incomplete.
>
> And then we get to the problem of the database getting updated out of
> cycle
> with the standards. I suspect that having a superset is acceptable, though.
>
> > The mime name is the same as the name if unspecified.
>
> And in the absence of further information, I would call this the official
> name. It's what content creators are supposed to use in their Content-Type
> lines.
>

+1

>
> > > And the name used to construct the object is used to lookup the extra
> > > optional information.
> > > I think the only reason to differentiate "unknown" and "other" in the
> way
> > > you suggest is if
> > > we need to support aliases for non registered encodings.
> > > Is that the case?
>
> unknown = I don't know what this is
> other = I know what this is but it doesn't have a MIB number I know about
>

We agree, I _think_

> > The IANA registry was always an implementation detail but it is an
> important
> > implementation detail nonetheless.
> > We cannot offer reliable and consistent comparison without it.
> > The discussion assumes that there exist unregistered encodings which have
> > many different names on a given platform and I don't see evidence of
> that.
> >
> > I would like to see a concrete example of situation in which the provided
> > comparison algorithm is not sufficient.
>
> I'm not questioning whether it's sufficient. I'm questioning whether it's
> necessary in the first place.
>

Are you questioning whether it should have a comparison operator?
For example, I want to be able to do things like

assert(text_encoding::system() == text_encoding::literal());
assert(text_encoding::system() == text_encoding::utf8);
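
A rough sketch of what such a comparison could look like, under stated
assumptions: registered encodings compare by MIB, and names are only
compared when both sides are "other". The name match here is deliberately
simplified (it ignores ASCII case, '-' and '_' and nothing else), and every
identifier in it is hypothetical rather than taken from the paper.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical names throughout. Values are IANA MIBenum values.
enum class mib_id : int { other = 1, unknown = 2, utf8 = 106 };

struct encoding_id {
    mib_id mib = mib_id::unknown;
    std::string_view name{};
};

// Lower-case an ASCII letter; leave every other character alone.
constexpr char fold(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Characters the simplified match skips over.
constexpr bool skipped(char c) { return c == '-' || c == '_'; }

// Simplified "dashless" comparison: "UTF-8", "utf8" and "Utf_8" all match.
constexpr bool loose_name_equal(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    while (true) {
        while (i < a.size() && skipped(a[i])) ++i;
        while (j < b.size() && skipped(b[j])) ++j;
        if (i == a.size() || j == b.size())
            return i == a.size() && j == b.size();
        if (fold(a[i]) != fold(b[j])) return false;
        ++i;
        ++j;
    }
}

// Assumed rule: fall back to names only when neither side has a usable MIB.
constexpr bool operator==(const encoding_id& a, const encoding_id& b) {
    if (a.mib == mib_id::other && b.mib == mib_id::other)
        return loose_name_equal(a.name, b.name);
    return a.mib == b.mib;
}
```

With something like this, user code compares identifiers and never has to
do string comparisons on encoding names itself.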

>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel System Software Products
>
>
>
>

Received on 2020-01-07 07:40:42