sg16: Re: [SG16] Bike shedding for Christmas: P1885 Naming Text Encodings

From: Thiago Macieira <thiago_at_[hidden]>
Date: Tue, 07 Jan 2020 07:24:15 -0600

On Tuesday, 7 January 2020 01:34:39 CST Corentin Jabot wrote:
> > > I think at some point we lost track of what the proposal is about:
> > > It's about answering:
> > > - What is the execution character encoding (which only the
> > > implementation can do)
> > > - What is the environment encoding (which the implementation can do
> > > better)
> >
> > Ok, good points. If we restrict text_encoding_id to those, then
> > text_encoding_id has no need to support the full table or unknown codecs.
> > By definition, it supports only what the implementation supports.

Actually, I question the need for a class at all. As Tom's example below
showed, we need an API to get the character sets of your two points above. We
do not need to overdo this.

> > template<class traits, class Rep, class Period>
> > void print_fancy_suffix(basic_ostream<char, traits>& os,
> >                         const duration<Rep, Period>& d)
> > {
> >     if constexpr (text_encoding::literal().mib == UTF-8) {
> >         os << d.count() << "\u00B5s";
> >     } else {
> >         os << d.count() << "us";
> >     }
> > }
> >
> > I see that as one of the primary motivating use cases at present.
> > However, I don't think this represents the extent of use cases well.

Then let's enumerate the use-cases we think this would be useful for.

> > I would like to see these encoding identifiers adopted for use in ICU,
> > iconv, Qt, or other encoding providers. I think these encoding
> > identifiers could be useful in the context of P1629
> > <https://wg21.link/p1629>.
> >
> > I don't want to see code doing string comparisons to match encodings.

You should ask whether those libraries want to adopt your new class. Speaking
for Qt, the text_encoding_id would be at best a holder for information we
already have in other forms (namely, MIB number and text name of the
encoding). If it provides the dashless string comparison function or if it
carries a MIB and alias database, it's duplicating information we already
have. Qt needs codecs for two things:

1) conversion to/from the locale encoding on Windows[*] and UTF-16.
[*]: As of Qt 6.0, we're declaring all Unix systems that are not using UTF-8
as their locale encoding to be misconfigured; Qt will complain and then
proceed to use UTF-8 anyway.

2) arbitrary conversion for legacy decoding of file formats and network
protocols. For this, we need a library that has a good set of codecs, like ICU
does and like GNU libc's gconv. However, it is easy to make this a
compile-time choice too, so the user can choose not to build support for one
or more libraries. We do this in the QTimeZone class, where we have ICU, IANA
DB, Windows and Android backends.
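
To make the compile-time choice concrete, here is a minimal sketch. The macro
names, the enum and both functions are made up for illustration; this is not
Qt's actual configuration machinery, just the general shape of it:

```cpp
#include <string_view>

// Hypothetical feature macros: a build system would define at most one of
// DEMO_CODEC_BACKEND_ICU or DEMO_CODEC_BACKEND_ICONV.
enum class CodecBackend { Builtin, Icu, Iconv };

// Resolved entirely at compile time; unused backends are never referenced.
constexpr CodecBackend selected_backend()
{
#if defined(DEMO_CODEC_BACKEND_ICU)
    return CodecBackend::Icu;
#elif defined(DEMO_CODEC_BACKEND_ICONV)
    return CodecBackend::Iconv;
#else
    return CodecBackend::Builtin;   // only the conversions we implement ourselves
#endif
}

constexpr std::string_view backend_name(CodecBackend b)
{
    switch (b) {
    case CodecBackend::Icu:     return "icu";
    case CodecBackend::Iconv:   return "iconv";
    case CodecBackend::Builtin: break;
    }
    return "builtin";
}
```

With neither macro defined, selected_backend() falls back to the built-in
conversions, which matches point 1 above.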

There's a third thing that users ask for and I don't want to provide and I
don't think should be in the Standard Library's API either: detecting/guessing
the encoding of a given text. QTextCodec has functions like "codecForHtml"
which simply try to decode the HTML looking for the <meta charset> line and
extract the name from there, but what users want is to do what browsers do: be
given some content and detect whether it's KOI8-R, EUC-JP, Shift-JIS,
Windows-1252, or UTF-8.
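
For contrast, the "extract the declared name" half is roughly the following.
This is only a sketch under simplifying assumptions (lowercase "charset=",
ASCII-compatible input, no real HTML parsing), and declared_charset is a
made-up name, not the QTextCodec code:

```cpp
#include <cstddef>
#include <string_view>

// Returns the text after the first "charset=" up to the next delimiter,
// or an empty view if no declaration is found.
constexpr std::string_view declared_charset(std::string_view html)
{
    constexpr std::string_view key = "charset=";
    std::size_t pos = html.find(key);
    if (pos == std::string_view::npos)
        return {};
    pos += key.size();
    if (pos < html.size() && (html[pos] == '"' || html[pos] == '\''))
        ++pos;                                  // skip an opening quote
    const std::size_t end = html.find_first_of("\"' ;/>", pos);
    return html.substr(pos, end == std::string_view::npos ? end : end - pos);
}
```

This only reads what the document claims about itself. Guessing the encoding
from the bytes, which is what users actually ask for, is a different problem
and is not attempted here.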

Note we will take care of converting from UTF-16 to UTF-8, Latin1 and UTF-32
and vice-versa. We will *not* use the standard library or any other library
for those until I see at least one with an implementation that is better than
mine. (see Daniel Lemire and cpplang's #x86 channel for implementations that
beat mine in some cases)

Finally, note that things can change between now and 2027 (assuming this makes
it into C++23).

> Yet that is how it has to work. iconv only exposes a name-based interface.
> Qt does provide both name- and MIB-based interfaces and can provide a
> text_encoding-based interface.

But what's the point? Aside from holding a MIB-or-text name, what can the
class do? What's its value?

> Which is not implementable.
> Either we force hosted implementations to provide the full database or we
> accept that the list might be incomplete.

And then we get to the problem of the database getting updated out of cycle
with the standards. I suspect that having a superset is acceptable, though.

> The mime name is the same as the name if unspecified.

And in the absence of further information, I would call this the official
name. It's what content creators are supposed to use in their Content-Type
lines.

> > And the name used to construct the object is used to look up the extra
> > optional information.
> > I think the only reason to differentiate "unknown" and "other" in the way
> > you suggest is if we need to support aliases for non-registered encodings.
> > Is that the case?

unknown = I don't know what this is
other = I know what this is but it doesn't have a MIB number I know about
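
As a sketch of that distinction: the numeric values below are the ones the
IANA character-set MIB assigns (other = 1, unknown = 2, US-ASCII = 3,
ISO-8859-1 = 4, UTF-8 = 106). The enum and classify() are hypothetical names,
and the three-entry lookup is a stand-in for a real table:

```cpp
#include <string_view>

// Values follow the IANA character-set MIB.
enum class Mib : int { Other = 1, Unknown = 2, Ascii = 3, Latin1 = 4, Utf8 = 106 };

// Hypothetical lookup: exact-match only, with a deliberately tiny table.
constexpr Mib classify(std::string_view name)
{
    if (name.empty())         return Mib::Unknown;  // nothing to go on at all
    if (name == "UTF-8")      return Mib::Utf8;
    if (name == "ISO-8859-1") return Mib::Latin1;
    if (name == "US-ASCII")   return Mib::Ascii;
    return Mib::Other;        // we were told a name, but have no MIB number for it
}
```

So classify("") is Unknown, while a name such as "x-my-codec" is Other: the
caller knows what it is even though no MIB number goes with it.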

> The IANA registry was always an implementation detail but it is an important
> implementation detail nonetheless.
> We cannot offer reliable and consistent comparison without it.
> The discussion assumes that there exist unregistered encodings which have
> many different names on a given platform and I don't see evidence of that.
>
> I would like to see a concrete example of a situation in which the provided
> comparison algorithm is not sufficient.

I'm not questioning whether it's sufficient. I'm questioning whether it's
necessary in the first place.
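
For reference, the kind of comparison being discussed can be sketched as
below. This is a simplified, assumed version that ignores ASCII case and every
non-alphanumeric character; it is not the exact algorithm from the paper, and
the function names are made up:

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_alnum_ascii(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

constexpr char to_lower_ascii(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// True if a and b match once case and non-alphanumeric characters are ignored.
constexpr bool loose_equal(std::string_view a, std::string_view b)
{
    std::size_t i = 0, j = 0;
    while (true) {
        while (i < a.size() && !is_alnum_ascii(a[i])) ++i;   // skip '-', '_', ' ', ...
        while (j < b.size() && !is_alnum_ascii(b[j])) ++j;
        if (i == a.size() || j == b.size())
            return i == a.size() && j == b.size();           // both exhausted?
        if (to_lower_ascii(a[i]) != to_lower_ascii(b[j]))
            return false;
        ++i;
        ++j;
    }
}
```

Under these rules "UTF-8" and "utf8" compare equal, and "UTF-8" and "UTF-16"
do not. Anything that already stores a MIB number can compare integers
instead, which is the root of my question.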

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
   Software Architect - Intel System Software Products

Received on 2020-01-07 07:26:58