sg16: Re: [SG16] Bike shedding for Christmas: P1885 Naming Text Encodings

From: Corentin Jabot <corentinjabot_at_[hidden]>
Date: Sun, 29 Dec 2019 00:32:35 +0100

On Sat, Dec 28, 2019, 22:21 Tom Honermann <tom_at_[hidden]> wrote:

> On 12/27/19 6:28 AM, Corentin Jabot via SG16 wrote:
>
> Hello
>
> In P1885, I introduce the name "text_encoding" for the class representing
> the name of a text encoding.
> I wonder whether that might conflict or interfere with actual
> encoding/decoder classes and would like your opinion.
>
> Here are a few possible names:
> * Charset (IANA nomenclature, posix)
> * text_codec (Qt)
> * text_encoding
> * text_encoding_name (encoding is used by posix / python /
>
> Unicode nomenclature would favor encoding (Unicode is a charset of which
> utf-8 and utf-16 are both encodings)
>
> I suggest text_encoding_id. I'd like to preserve text_encoding for a tag
> type (or concept) that can be used at compile time to specify a
> (compile-time) encoding as in a template parameter to std::text.
>
> Tangent 1: the proposed text_encoding is not extensible, at least not in
> a very meaningful way. I suggest we do one of the following:
>
> 1. Remove the text_encoding(const char*) constructor. It doesn't
> allow setting the MIB ID, so is unsatisfactory at present.
> 2. Allow first class extension by, for example, reserving the full
> range of IANA MIB values, defining a "private use" range of values, and
> modifying the text_encoding(const char*) constructor to also accept a
> MIB value (and perhaps make the name parameter optional such that, if
> specified, it would override the internally known name for the provided MIB
> value and if not specified, name() would return a suitable default).
>
>

It is extensible in multiple ways:
- implementations can provide their own aliases for existing MIBs
- the "other" MIB plus a custom name can be used for a non-registered encoding.
Hence the existence of both "unknown" and "other".

I will not support custom MIBs, as that is not in line with the RFC: the MIB
is a way to standardize names. Encodings have names; the MIB is very close to
an implementation detail. For that same reason the name cannot be optional.
The name passed as a parameter _always_ takes precedence over the IANA name,
so it can round-trip to iconv or similar APIs.
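
To make the two rules above concrete, here is a minimal sketch. It is not the P1885 wording: the namespace, the mib_id enum, the same_label helper and the alias table are all hypothetical. Only the MIBenum values (1 = other, 2 = unknown, 3 = US-ASCII, 106 = UTF-8) come from the IANA registry.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical names, not the P1885 wording

// Small subset of the IANA MIBenum values.
enum class mib_id : int {
    other   = 1,    // a real encoding that has no registered MIB
    unknown = 2,    // the encoding could not be determined
    ascii   = 3,
    utf8    = 106,
};

// ASCII case-insensitive comparison of two encoding labels.
inline bool same_label(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) return false;
    }
    return true;
}

class text_encoding {
public:
    // From a MIB: the name defaults to a registered name for that MIB.
    explicit text_encoding(mib_id mib) : mib_(mib), name_(default_name(mib)) {}

    // From a name: the caller's spelling is stored unchanged, so it can be
    // handed back to iconv-like APIs. The MIB is only derived from it.
    explicit text_encoding(const char* name) {
        assert(name != nullptr);
        name_ = name;
        mib_ = lookup(name_);
    }

    mib_id mib() const { return mib_; }
    const std::string& name() const { return name_; }

private:
    static const char* default_name(mib_id mib) {
        switch (mib) {
            case mib_id::ascii: return "US-ASCII";
            case mib_id::utf8:  return "UTF-8";
            default:            return "";
        }
    }

    static mib_id lookup(std::string_view name) {
        // "UTF8" and "ASCII" stand in for implementation-provided aliases.
        if (same_label(name, "UTF-8") || same_label(name, "UTF8")) return mib_id::utf8;
        if (same_label(name, "US-ASCII") || same_label(name, "ASCII")) return mib_id::ascii;
        // Any unrecognized name maps to "other" and keeps its custom name.
        return mib_id::other;
    }

    mib_id mib_ = mib_id::unknown;
    std::string name_;
};

}  // namespace sketch
```

With this sketch, text_encoding("utf8") reports the UTF-8 MIB but still returns "utf8" from name(), and text_encoding("x-my-legacy-codepage") reports "other" while keeping the custom name.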

>
>
> if text_encoding remains the name of that class, encoder/decoder can be
> used for the class doing the actual conversions.
>
>
> I will further rename "system" to "environment" to be more generic and
> aligned with POSIX.
>
> Is text_encoding::system() intended to be equivalent to
> text_encoding::for_locale(std::locale{})? (I think the answer is, and
> should be, no; e.g., on Windows, this would query GetACP()).
>

It would query GetACP(), which is equivalent to querying the user ("") locale at
the start of the program.
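
As a rough illustration of that equivalence, and only as a sketch: the function name environment_encoding_name and the "cp<number>" spelling on Windows are assumptions for this example, not anything the paper specifies. On POSIX it temporarily switches LC_CTYPE to the user ("") locale, asks nl_langinfo(CODESET), and restores the previous locale.

```cpp
#include <string>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <clocale>
#  include <langinfo.h>
#endif

// Hypothetical helper: returns a label for the encoding the environment
// that launched the program is assumed to use.
std::string environment_encoding_name() {
#ifdef _WIN32
    // GetACP() returns the active ANSI code page number.
    // The "cp" prefix is an illustrative spelling, not a registered name.
    return "cp" + std::to_string(GetACP());
#else
    // Remember the program's current LC_CTYPE so it can be restored.
    const char* current = std::setlocale(LC_CTYPE, nullptr);
    const std::string saved = current ? current : "C";

    // Switch to the user ("") locale; if that fails the current locale
    // stays in effect and is queried instead.
    std::setlocale(LC_CTYPE, "");
    const std::string result = nl_langinfo(CODESET);

    // Restore whatever the program had set before the query.
    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
#endif
}
```

Note that setlocale is not thread-safe, so a real implementation would want to perform this query once at startup rather than on demand.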

> (user, environment and system are, for our purposes, synonyms, intended to
> mean "the encoding assumed and expected by whatever launched our program").
> Environment has the added benefit that it implies neither user nor system,
> which makes it more friendly to embedded platforms.
>
> Since locale settings are generally determined by environment (variables),
> use of the term "environment" may be confusing. I prefer system.
>
> Tangent 2: I don't recall if we discussed this in Belfast, but the paper
> identifies three sets of encodings to expose (literals, system, locale). A
> fourth would be terminal/console encoding. This encoding can be easily
> queried on Windows, but not on Linux/UNIX (though terminal encoding rarely
> differs from locale there, so it would be reasonable to just return the
> system encoding).
>
I considered that a few days ago. I am not aware of it being a thing on
platforms other than Windows (I am probably wrong), and I don't believe we
can come up with a nice API to query it in the short term.

> Tom.
>
>
> Thanks for your input,
>
> Corentin
>

Received on 2019-12-28 17:35:15