sg16: Re: [SG16] Bike shedding for Christmas: P1885 Naming Text Encodings

From: Steve Downey <sdowney_at_[hidden]>
Date: Mon, 30 Dec 2019 12:41:57 -0500

If you try to adjust the output of a program based on what the output is
attached to, you run the risk of bad feedback loops, where the terminal
gets adjusted to adapt to bad output, causing the output to change.
While there are many issues with POSIX locales, having separate control of
the interpretation of output and of its production, by varying the current
locale for each program independently, isn't one of them.

Pipelines, tees, remote shells, ssh, sockets, and so on, all make the
notion of output and terminal suspect.

On Sun, Dec 29, 2019, 23:17 Tom Honermann via SG16 <sg16_at_[hidden]>
wrote:

> On 12/28/19 6:38 PM, Corentin Jabot via SG16 wrote:
>
>
>
> On Sun, Dec 29, 2019, 00:32 Corentin Jabot <corentinjabot_at_[hidden]>
> wrote:
>
>>
>>
>> On Sat, Dec 28, 2019, 22:21 Tom Honermann <tom_at_[hidden]> wrote:
>>
>>> On 12/27/19 6:28 AM, Corentin Jabot via SG16 wrote:
>>>
>>> Hello
>>>
>>> In P1885, I introduce the name "text_encoding" for the class
>>> representing the name of a text encoding.
>>> I wonder whether that might conflict or interfere with actual
>>> encoding/decoder classes and would like your opinion.
>>>
>>> Here are a few possible names:
>>> * Charset (IANA nomenclature, posix)
>>> * text_codec (Qt)
>>> * text_encoding
>>> * text_encoding_name (encoding is used by posix / python /
>>>
>>> Unicode nomenclature would favor encoding (Unicode is a charset of which
>>> utf-8 and utf-16 are both encodings)
>>>
>>> I suggest text_encoding_id. I'd like to preserve text_encoding for a
>>> tag type (or concept) that can be used at compile time to specify a
>>> (compile-time) encoding as in a template parameter to std::text.
>>>
>>> Tangent 1: the proposed text_encoding is not extensible, at least not
>>> in a very meaningful way. I suggest we do one of the following:
>>>
>>> 1. Remove the text_encoding(const char*) constructor. It doesn't
>>> allow setting the MIB ID, so is unsatisfactory at present.
>>> 2. Allow first class extension by, for example, reserving the full
>>> range of IANA MIB values, defining a "private use" range of values, and
>>> modifying the text_encoding(const char*) constructor to also accept
>>> a MIB value (and perhaps make the name parameter optional such that, if
>>> specified, it would override the internally known name for the provided MIB
>>> value and if not specified, name() would return a suitable default).
>>>
>>>
>>
>> It is extensible in multiple ways:
>> - implementations can provide their own aliases for an existing MIB
>> - the "other" MIB plus a custom name can be used to refer to a
>> non-registered encoding.
>> Hence the existence of both unknown and other
>>
>> I will not support custom MIBs as that is not in line with the RFC - the
>> MIB being a way to standardize names.
>>
> I'm not sure whether you are referring to RFC 2978 or 3808 here.
> Regardless, the design of both the RFCs and your paper is such that there
> are multiple possible names for any encoding. A program that wishes to
> identify an encoding for which the implementation does not supply a MIB ID
> would have to be expected to recognize all aliases. That doesn't seem
> realistic to me. Additionally, as specified, the proposal only includes
> enumerators for a select subset of the IANA registered names with no
> facility (other than implementation provided extension) for creating a
> text_encoding object with a MIB ID for any other IANA registered encoding.
>
> Refined suggestions:
>
> 1. Include all IANA MIB IDs in the set of enumerators of
> text_encoding::id. This may require either a normative reference to (a
> dated version of) the IANA registry or that we copy from a version of the
> IANA registry and update it for each C++ standard. Note that support for
> names need not imply support for the named encoding; these are just
> identifiers.
> 2. Use the IANA "cs" prefixed names for the names of the enumerators
> of text_encoding::id. They may not be the prettiest names, but using
> these names will better facilitate automation and they are intended to be
> used as identifiers.
> 3. Provide a text_encoding(text_encoding::id) constructor that enables
> creation of an instance with a particular ID. If the implementation
> doesn't have a registered name for the provided ID, then use a name like
> "MIB#42".
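
A minimal sketch of suggestions 2 and 3 above, to make them concrete. The
names encoding_id, registered_name and text_encoding_sketch are hypothetical
and are not taken from P1885; only a handful of enumerators are shown, with
their IANA MIBenum values, and the fallback spelling "MIB#42" follows the
suggestion quoted above.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for text_encoding::id, using the IANA "cs" aliases
// as enumerator names. Values are IANA MIBenum numbers; most are omitted.
enum class encoding_id : int {
    other = 1,
    unknown = 2,
    csASCII = 3,
    csISOLatin1 = 4,
    csUTF8 = 106,
    csUTF16 = 1015,
};

// Names this sketch happens to know. A real implementation would presumably
// generate this table from a dated copy of the IANA registry.
constexpr std::string_view registered_name(encoding_id id) {
    switch (id) {
        case encoding_id::csASCII:     return "US-ASCII";
        case encoding_id::csISOLatin1: return "ISO_8859-1:1987";
        case encoding_id::csUTF8:      return "UTF-8";
        case encoding_id::csUTF16:     return "UTF-16";
        default:                       return {};
    }
}

// Hypothetical class with the suggested constructor from an ID.
class text_encoding_sketch {
public:
    explicit text_encoding_sketch(encoding_id id)
        : id_(id), name_(registered_name(id)) {
        // No registered name known for this ID: synthesize one.
        if (name_.empty())
            name_ = "MIB#" + std::to_string(static_cast<int>(id));
    }
    encoding_id mib() const { return id_; }
    const std::string& name() const { return name_; }

private:
    encoding_id id_;
    std::string name_;
};

// Usage: a known ID yields its registered name, an unknown one a fallback.
inline void usage_example() {
    assert(text_encoding_sketch(encoding_id::csUTF8).name() == "UTF-8");
    assert(text_encoding_sketch(static_cast<encoding_id>(42)).name() == "MIB#42");
}
```

Note that supporting an identifier here says nothing about whether the
implementation can actually transcode the named encoding.
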
>
>> Encodings have names; the MIB is very close to an implementation detail. For
>> that same reason the name cannot be optional. The name passed as a parameter
>> _always_ takes precedence over the IANA name so it can round-trip to iconv or
>> similar APIs
>>
>
>
> To rephrase that:
>
> Different names may map to the same MIB, but two encodings with the same
> name have to compare equal.
>
> I agree with that, but the proposal also specifies that two
> encodings with different names that map to the same MIB ID compare equal.
> The currently proposed design has the following issue: IANA doesn't have a
> registration for the WTF-8 encoding today, but it is conceivable that it
> could be registered in the future. As proposed, text_encoding("WTF8") !=
> text_encoding("WTF-8"), but if the encoding were to be registered in the
> future with both of those names as aliases, then they would compare equal
> despite having different names. It seems to me that either 1)
> text_encoding shouldn't have equality comparison operators (though
> text_encoding::id should) or, 2) text_encoding should be split into two
> facilities, one that stores an integral ID, and another that performs name
> lookups and returns a value of the former. The latter would also
> facilitate dynamic registration of additional names.
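
A rough sketch of the second alternative (an integral ID type plus a separate
name lookup facility), under stated assumptions: encoding_id and
encoding_registry are hypothetical names, lookup here is an exact string match
rather than whatever alias matching the paper would specify, and 3000 is an
arbitrary illustrative value, not an IANA assignment for WTF-8.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical value type: nothing but an integral ID, so equality is
// unambiguous and never depends on which name was used to obtain it.
struct encoding_id {
    int mib;
    friend bool operator==(encoding_id a, encoding_id b) { return a.mib == b.mib; }
    friend bool operator!=(encoding_id a, encoding_id b) { return a.mib != b.mib; }
};

// Hypothetical lookup facility: maps names to IDs and accepts new names at
// run time. Exact-match lookup is a simplification made for this sketch.
class encoding_registry {
public:
    void register_name(std::string name, encoding_id id) {
        names_.insert_or_assign(std::move(name), id);
    }
    std::optional<encoding_id> lookup(const std::string& name) const {
        auto it = names_.find(name);
        if (it == names_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::map<std::string, encoding_id> names_;
};

// Usage: both spellings resolve to one ID only once they are registered.
inline void usage_example() {
    encoding_registry reg;
    reg.register_name("UTF-8", encoding_id{106});  // IANA MIBenum for UTF-8
    assert(!reg.lookup("WTF-8"));                  // unknown until registered

    reg.register_name("WTF-8", encoding_id{3000}); // illustrative value only
    reg.register_name("WTF8", encoding_id{3000});
    assert(reg.lookup("WTF8") == reg.lookup("WTF-8"));
    assert(*reg.lookup("WTF8") != *reg.lookup("UTF-8"));
}
```

With this split, whether "WTF8" and "WTF-8" compare equal depends only on the
state of the registry at lookup time, and the ID type itself has no names to
disagree about.
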
>
>
>
>>>
>>>
>>> If text_encoding remains the name of that class, encoder/decoder can be
>>> used for the class doing the actual conversions.
>>>
>>>
>>> I will further rename "system" to "environment" to be more generic and
>>> aligned with POSIX.
>>>
>>> Is text_encoding::system() intended to be equivalent to
>>> text_encoding::for_locale(std::locale{})? (I think the answer is, and
>>> should be, no; e.g., on Windows, this would query GetACP()).
>>>
>>
>>
>> It would query GetACP(), which is equivalent to querying the user ("")
>> locale at the start of the program.
>>
> Good, that sounds right.
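
For the POSIX side, one way to read "the user ("") locale at the start of the
program" without touching the global locale is sketched below. This is an
assumption about how an implementation might do it, not something P1885
specifies. environment_codeset is a hypothetical helper; newlocale,
nl_langinfo_l and freelocale are POSIX.1-2008 functions, and the headers shown
are the ones glibc needs (some other platforms also want <xlocale.h>).

```cpp
#include <langinfo.h>
#include <locale.h>
#include <string>

// Hypothetical helper: returns the codeset name of the locale selected by
// the environment (LC_ALL / LC_CTYPE / LANG), independent of any later
// setlocale() or std::locale::global() call made by the program.
inline std::string environment_codeset() {
    // "" asks for the locale named by the environment variables.
    locale_t env = newlocale(LC_CTYPE_MASK, "", static_cast<locale_t>(0));
    if (env == static_cast<locale_t>(0)) {
        // The environment names a locale that is not installed; fall back
        // to the codeset of the current global locale ("C" at startup).
        return nl_langinfo(CODESET);
    }
    std::string codeset = nl_langinfo_l(CODESET, env);
    freelocale(env);
    return codeset;
}
```

The returned string is whatever the C library reports for that locale, so it
would still need to be mapped to an IANA name or MIB ID by a lookup step.
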
>
>>> (user, environment and system are, for our purposes, synonyms and intended to
>>> mean "the encoding assumed and expected by whatever launched our program").
>>> Environment has the added benefit that it implies neither user nor
>>> system, which makes it more friendly to embedded platforms
>>>
>>> Since locale settings are generally determined by environment
>>> (variables), use of the term "environment" may be confusing. I prefer
>>> system.
>>>
>>> Tangent 2: I don't recall if we discussed this in Belfast, but the paper
>>> identifies three sets of encodings to expose (literals, system, locale). A
>>> fourth would be terminal/console encoding. This encoding can be easily
>>> queried on Windows, but not on Linux/UNIX (though terminal encoding rarely
>>> differs from locale there, so it would be reasonable to just return the
>>> system encoding).
>>>
>> I considered that a few days ago. I am not aware of it being a thing on
>> platforms other than Windows (I am probably wrong) and I don't believe we
>> can come up with a nice API in the short term to query that.
>>
> It is a thing on other platforms in that terminals (typically) have an
> encoding setting that specifies how characters are translated for
> input/display purposes. What isn't common is a facility for querying the
> terminal for its encoding setting (as far as I can tell, there is no
> terminfo capability specified, nor have I found escape sequences that can
> request it though there are escape sequences for some terminal types to
> provide a character set map).
>
> Tom.
>
>>> Tom.
>>>
>>>
>>> Thanks for your input,
>>>
>>> Corentin
>>>
>>>
>>>
>>>
>>>
>>>
>
> --
> SG16 mailing list
> SG16_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>

Received on 2019-12-30 11:44:37