Subject: Re: Bike shedding for Christmas: P1885 Naming Text Encodings
From: Tom Honermann (tom_at_[hidden])
Date: 2019-12-31 17:52:20

On 12/30/19 12:41 PM, Steve Downey via SG16 wrote:
> If you try to adjust the output of a program based on what the output
> is attached to, you run the risks of bad feedback loops where the
> terminal gets adjusted to adapt to bad output causing the output to
> change.
That is certainly a possibility.  In my experience though,
terminals/consoles are rarely configured for specific applications.
> While there are many issues with posix locale, having separate control
> of interpretation of output and control of output by varying the
> current locale for each program independently isn't one of them.
I agree, the concern is Windows.
> Pipelines, tees, remote shells, ssh, sockets, and so on, all make the
> notion of output and terminal suspect.

I think they make it more complicated.


> On Sun, Dec 29, 2019, 23:17 Tom Honermann via SG16
> <sg16_at_[hidden] <mailto:sg16_at_[hidden]>> wrote:
> On 12/28/19 6:38 PM, Corentin Jabot via SG16 wrote:
>> On Sun, Dec 29, 2019, 00:32 Corentin Jabot
>> <corentinjabot_at_[hidden] <mailto:corentinjabot_at_[hidden]>> wrote:
>> On Sat, Dec 28, 2019, 22:21 Tom Honermann <tom_at_[hidden]
>> <mailto:tom_at_[hidden]>> wrote:
>> On 12/27/19 6:28 AM, Corentin Jabot via SG16 wrote:
>>> Hello
>>> In P1885, I introduce the name "text_encoding"  for the
>>> class representing the name of a text encoding.
>>> I wonder whether that might conflict or interfere with
>>> actual encoding/decoder classes and would like your opinion.
>>> Here are a few possible names:
>>> * Charset (IANA nomenclature, posix)
>>> * text_codec (Qt)
>>> * text_encoding
>>> * text_encoding_name (encoding is used by posix / python /
>>> Unicode nomenclature would favor encoding (Unicode is a
>>> charset of which utf-8 and utf-16 are both encodings)
>> I suggest text_encoding_id. I'd like to preserve
>> text_encoding for a tag type (or concept) that can be
>> used at compile time to specify a (compile-time) encoding
>> as in a template parameter to std::text.
>> Tangent 1: the proposed text_encoding is not extensible,
>> at least not in a very meaningful way.  I suggest we do
>> one of the following:
>> 1. Remove the text_encoding(const char*) constructor. 
>> It doesn't allow setting the MIB ID, so is
>> unsatisfactory at present.
>> 2. Allow first class extension by, for example,
>> reserving the full range of IANA MIB values, defining
>> a "private use" range of values, and modifying the
>> text_encoding(const char*) constructor to also accept
>> a MIB value (and perhaps make the name parameter
>> optional such that, if specified, it would override
>> the internally known name for the provided MIB value
>> and if not specified, name() would return a suitable
>> default).
>> It is extensible in multiple ways:
>> - implementations can provide their own aliases for existing MIBs
>> - the "other" MIB + a custom name can be used to refer to a
>> non-registered encoding. Hence the existence of both unknown and other
>> I will not support custom MIBs as that is not in line with the
>> RFC -  the MIB being a way to standardize names.
> I'm not sure whether you are referring to RFC 2978 or 3808 here. 
> Regardless, the design of both the RFCs and your paper is such
> that there are multiple possible names for any encoding.  A
> program that wishes to identify an encoding for which the
> implementation does not supply a MIB ID would have to be expected
> to recognize all aliases.  That doesn't seem realistic to me.
> Additionally, as specified, the proposal only includes enumerators
> for a select subset of the IANA registered names with no facility
> (other than implementation provided extension) for creating a
> text_encoding object with a MIB ID for any other IANA registered
> encoding.
> Refined suggestions:
> 1. Include all IANA MIB IDs in the set of enumerators of
> text_encoding::id.  This may require either a normative
> reference to (a dated version of) the IANA registry or that we
> copy from a version of the IANA registry and update it for
> each C++ standard.  Note that support for names need not imply
> support for the named encoding; these are just identifiers.
> 2. Use the IANA "cs" prefixed names for the names of the
> enumerators of text_encoding::id.  They may not be the
> prettiest names, but using these names will better facilitate
> automation and they are intended to be used as identifiers.
> 3. Provide a text_encoding(text_encoding::id) constructor that
> enables creation of an instance with a particular ID.  If the
> implementation doesn't have a registered name for the provided
> ID, then use a name like "MIB#42".
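
To make suggestion 3 concrete, here is a minimal sketch. It is not code from P1885: the enumerator subset, the `lookup_name` helper, and the member names are assumptions chosen for illustration. Only the MIB values shown (3, 106, 1015) and the "cs" aliases come from the IANA registry.

```cpp
#include <string>

// Hypothetical sketch of suggestion 3; not the interface proposed in P1885.
class text_encoding {
public:
    // A small subset of IANA MIBenum values, spelled with the IANA "cs" aliases.
    enum class id : int {
        other   = 1,
        unknown = 2,
        csASCII = 3,
        csUTF8  = 106,
        csUTF16 = 1015
    };

    // Construct from an ID alone; the name is looked up or synthesized.
    explicit text_encoding(id mib) : mib_(mib), name_(lookup_name(mib)) {}

    id mib() const { return mib_; }
    const std::string& name() const { return name_; }

private:
    // Assumed lookup table; a real implementation would cover the registry.
    static std::string lookup_name(id mib) {
        switch (mib) {
            case id::csASCII: return "US-ASCII";
            case id::csUTF8:  return "UTF-8";
            case id::csUTF16: return "UTF-16";
            default:
                // No name known for this ID: fall back to "MIB#<n>".
                return "MIB#" + std::to_string(static_cast<int>(mib));
        }
    }

    id mib_;
    std::string name_;
};
```

In this sketch, an ID the implementation has no name for, such as `static_cast<text_encoding::id>(42)`, still produces a usable object whose `name()` is `"MIB#42"`.
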
>> Encodings have names (the MIB is very close to an implementation
>> detail). For that same reason the name cannot be optional.
>> The name passed as a parameter _always_ takes precedence over the IANA
>> name so it can round-trip to iconv or similar APIs.
>> To rephrase that:
>> Different names may map to the same MIB, but two encodings with the
>> same name have to compare equal.
> I agree with that, but the proposal also specifies that
> two encodings with different names that map to the same MIB ID
> compare equal.  The currently proposed design has the following
> issue:  IANA doesn't have a registration for the WTF-8 encoding
> today, but it is conceivable that it could be registered in the
> future.  As proposed, text_encoding("WTF8") !=
> text_encoding("WTF-8"), but if the encoding were to be registered
> in the future with both of those names as aliases, then they would
> compare equal despite having different names.  It seems to me that
> either 1) text_encoding shouldn't have equality comparison
> operators (though text_encoding::id should) or, 2) text_encoding
> should be split into two facilities, one that stores an integral
> ID, and another that performs name lookups and returns a value of
> the former.  The latter would also facilitate dynamic registration
> of additional names.
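
The second option (splitting the facility in two) might look roughly like the sketch below. All names here (`encoding_id`, `encoding_registry`, `add`, `find`) are hypothetical, the lookup is a plain case-sensitive match for brevity, and the value 3000 used in the usage note is an arbitrary placeholder, not an IANA assignment.

```cpp
#include <map>
#include <string>

// Hypothetical: an integral identifier that carries no name, so equality
// is well defined and cannot change when new aliases are registered.
struct encoding_id {
    int mib;  // IANA MIBenum, or an application-chosen value

    friend bool operator==(encoding_id a, encoding_id b) { return a.mib == b.mib; }
    friend bool operator!=(encoding_id a, encoding_id b) { return !(a == b); }
};

// Hypothetical: a separate facility that maps names to IDs and allows
// additional names to be registered at run time.
class encoding_registry {
public:
    encoding_registry() {
        // A few built-in names; a real table would cover the IANA registry.
        add("UTF-8", encoding_id{106});
        add("csUTF8", encoding_id{106});
        add("US-ASCII", encoding_id{3});
    }

    // Register (or overwrite) a name for an ID.
    void add(const std::string& name, encoding_id id) {
        auto result = names_.insert({name, id});
        if (!result.second) result.first->second = id;
    }

    // Returns a pointer to the ID for the name, or nullptr if it is unknown.
    const encoding_id* find(const std::string& name) const {
        auto it = names_.find(name);
        return it == names_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, encoding_id> names_;
};
```

With this split, a program that cares about WTF-8 could call `add("WTF-8", encoding_id{3000})` and `add("WTF8", encoding_id{3000})` itself; both names then resolve to IDs that compare equal, and that result does not depend on what the implementation's built-in table happens to contain.
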
>>> if text_encoding remains the name of that class,
>>> encoder/decoder can be used for the class doing the
>>> actual conversions.
>>> I will further rename "system" to "environment" to be
>>> more generic and aligned with POSIX.
>> Is text_encoding::system() intended to be equivalent to
>> text_encoding::for_locale(std::locale{})? (I think the
>> answer is, and should be, no; e.g., on Windows, this
>> would query GetACP()).
>> It would query GetACP(), which is equivalent to querying the user
>> ("") locale at the start of the program.
> Good, that sounds right.
>>> (user, environment and system are, for our purposes,
>>> synonyms and intended to mean "the encoding assumed and
>>> expected by whatever launched our program").
>>> Environment has the added benefit that it implies
>>> neither user nor system, which makes it more friendly to
>>> embedded platforms.
>> Since locale settings are generally determined by
>> environment (variables), use of the term "environment"
>> may be confusing.  I prefer system.
>> Tangent 2: I don't recall if we discussed this in
>> Belfast, but the paper identifies three sets of encodings
>> to expose (literals, system, locale).  A fourth would be
>> terminal/console encoding.  This encoding can be easily
>> queried on Windows, but not on Linux/UNIX (though
>> terminal encoding rarely differs from locale there, so it
>> would be reasonable to just return the system encoding).
>> I considered that a few days ago. I am not aware of it being
>> a thing on platforms other than Windows (I am probably wrong)
>> and I don't believe we can come up with a nice API in the
>> short term to query it.
> It is a thing on other platforms in that terminals (typically)
> have an encoding setting that specifies how characters are
> translated for input/display purposes. What isn't common is a
> facility for querying the terminal for its encoding setting (as
> far as I can tell, there is no terminfo capability specified, nor
> have I found escape sequences that can request it though there are
> escape sequences for some terminal types to provide a character
> set map).
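
A rough sketch of the fallback described above: ask the console on Windows, and otherwise report the locale's codeset as a stand-in for the terminal encoding. The function name `terminal_encoding_name` and the "cp" prefix on the Windows code page are assumptions for illustration. `GetConsoleOutputCP`, `setlocale`, and `nl_langinfo(CODESET)` are existing platform calls.

```cpp
#include <clocale>
#include <string>

#if defined(_WIN32)
#include <windows.h>
#else
#include <langinfo.h>
#endif

// Hypothetical helper: best-effort name of the terminal/console encoding.
std::string terminal_encoding_name() {
#if defined(_WIN32)
    // The console output code page can be queried directly.
    return "cp" + std::to_string(GetConsoleOutputCP());
#else
    // No portable way to ask the terminal, so assume it matches the
    // environment's locale. Temporarily switch LC_CTYPE to the user ("")
    // locale, read its codeset, then restore the previous setting.
    const char* previous = std::setlocale(LC_CTYPE, nullptr);
    std::string saved = previous ? previous : "C";
    std::setlocale(LC_CTYPE, "");
    std::string name = nl_langinfo(CODESET);
    std::setlocale(LC_CTYPE, saved.c_str());
    return name;
#endif
}
```

The POSIX branch only reports what the environment claims; a terminal configured differently from the locale would not be detected, which is the limitation discussed above.
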
> Tom.
>> Tom.
>>> Thanks for your input,
>>> Corentin
> --
> SG16 mailing list
> SG16_at_[hidden] <mailto:SG16_at_[hidden]>
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
