SG16

Subject: Re: Bike shedding for Christmas: P1885 Naming Text Encodings
From: Tom Honermann (tom_at_[hidden])
Date: 2020-01-03 10:13:41


On 1/1/20 6:50 AM, Corentin Jabot wrote:
>
>
> On Wed, 1 Jan 2020 at 00:49, Tom Honermann <tom_at_[hidden]> wrote:
>
> On 12/30/19 6:11 AM, Corentin Jabot wrote:
>>
>>
>> On Mon, Dec 30, 2019, 05:17 Tom Honermann <tom_at_[hidden]> wrote:
>>
>> On 12/28/19 6:38 PM, Corentin Jabot via SG16 wrote:
>>>
>>>
>>> On Sun, Dec 29, 2019, 00:32 Corentin Jabot
>>> <corentinjabot_at_[hidden]> wrote:
>>>
>>>
>>>
>>> On Sat, Dec 28, 2019, 22:21 Tom Honermann
>>> <tom_at_[hidden]> wrote:
>>>
>>> On 12/27/19 6:28 AM, Corentin Jabot via SG16 wrote:
>>>> Hello
>>>>
>>>> In P1885, I introduce the name "text_encoding"  for
>>>> the class representing the name of a text encoding.
>>>> I wonder whether that might conflict or
>>>> interfere with actual encoding/decoder classes and
>>>> would like your opinion.
>>>>
>>>> Here are a few possible names:
>>>> * Charset (IANA nomenclature, posix)
>>>> * text_codec (Qt)
>>>> * text_encoding
>>>> * text_encoding_name (encoding is used by posix /
>>>> python /
>>>>
>>>> Unicode nomenclature would favor encoding (Unicode
>>>> is a charset of which utf-8 and utf-16 are both
>>>> encodings)
>>>
>>> I suggest text_encoding_id. I'd like to preserve
>>> text_encoding for a tag type (or concept) that can
>>> be used at compile time to specify a (compile-time)
>>> encoding as in a template parameter to std::text.
>>>
>>> Tangent 1: the proposed text_encoding is not
>>> extensible, at least not in a very meaningful way. 
>>> I suggest we do one of the following:
>>>
>>> 1. Remove the text_encoding(const char*)
>>> constructor.  It doesn't allow setting the MIB
>>> ID, so is unsatisfactory at present.
>>> 2. Allow first class extension by, for example,
>>> reserving the full range of IANA MIB values,
>>> defining a "private use" range of values, and
>>> modifying the text_encoding(const char*)
>>> constructor to also accept a MIB value (and
>>> perhaps make the name parameter optional such
>>> that, if specified, it would override the
>>> internally known name for the provided MIB value
>>> and if not specified, name() would return a
>>> suitable default).
>>>
>>>
>>>
>>> It is extensible in multiple ways:
>>> - implementations can provide their own aliases for
>>> existing mibs
>>> - the other mib + custom name can be used for a
>>> non-registered encoding. Hence the existence of both unknown
>>> and other
>>>
>>> I will not support custom mib as it is not in line with
>>> the rfc -  the mib being a way to standardize names.
>>>
>> I'm not sure whether you are referring to RFC 2978 or 3808
>> here.  Regardless, the design of both the RFCs and your paper
>> is such that there are multiple possible names for any
>> encoding.  A program that wishes to identify an encoding for
>> which the implementation does not supply a MIB ID would have
>> to be expected to recognize all aliases.  That doesn't seem
>> realistic to me.  Additionally, as specified, the proposal
>> only includes enumerators for a select subset of the IANA
>> registered names with no facility (other than implementation
>> provided extension) for creating a text_encoding object with
>> a MIB ID for any other IANA registered encoding.
>>
>> Refined suggestions:
>>
>> 1. Include all IANA MIB IDs in the set of enumerators of
>> text_encoding::id.  This may require either a normative
>> reference to (a dated version of) the IANA registry or
>> that we copy from a version of the IANA registry and
>> update it for each C++ standard.  Note that support for
>> names need not imply support for the named encoding;
>> these are just identifiers.
>>
>>
>> I did that but it poses extensibility issues (list has to be
>> maintained) and offers little benefit.
>
> I don't understand how this poses an extensibility issue.  I agree
> it poses a maintenance issue, but that is true regardless (at
> least for implementors that extend the set of enumerators).
>
> Implementors are not allowed to do that.
Why not?
> If the list was complete we would have to provide a way for them to do so.

I don't understand your thought process here.  On one hand you are
saying that the list is incomplete and implementors are not allowed to
extend it.  On the other, you are stating that if the list were
complete, then implementors would need the ability to extend it
further.  This seems contradictory to me.

As currently proposed, what would you expect text_encoding::literal() to
return when compiling with Visual C++ in the common case where the
execution character set is Windows-1252?

>
> The benefit is that including all of them avoids the problem of
> implementors offering extensions with inconsistent or conflicting
> names.  It also doesn't put us in the position of deciding which
> encodings are "important".  IANA provides a good specification to
> follow.  I don't think we should be subsetting, at least not
> without some clear criteria for determining which encodings make
> the cut.  For example, I suspect Shift-JIS gets more use than
> UTF-32, but the former is not included in the proposal and the
> latter is.
>
>
> Again, implementors are not allowed to:
> * provide unregistered mibs
> * not provide all the known aliases for a provided mib
>
> They can only provide their own aliases on top of the existing ones
I'm asking about the rationale for this position.  I believe both GNU
iconv and ICU support encodings that do not appear in the IANA registry.
>
>> 2. Use the IANA "cs" prefixed names for the names of the
>> enumerators of text_encoding::id. They may not be the
>> prettiest names, but using these names will better
>> facilitate automation and they are intended to be used as
>> identifiers.
>> 3. Provide a text_encoding(text_encoding::id) constructor
>> that enables creation of an instance with a particular
>> ID.  If the implementation doesn't have a registered name
>> for the provided ID, then use a name like "MIB#42".
>>
>> What is the use case for that?
> The set of recognized names is not necessarily portable since
> they are implementation-defined.  This ensures that an encoding
> object for the desired encoding can be created regardless of name.
>> I am not opposed to the idea.
>> QTextCodec has a fromMiB function after all.
>> I am however opposed to a constructor that would either accept
>> both a name and a mib or otherwise not check the existence
>> of said mib
> I understand and agree with the goal of ensuring that names are
> appropriately recognized and unique across MIB IDs.  But I also
> recognize the need to use MIB IDs or names that are not known to
> the implementation.  If the implementation were to reject names
> that were known to be associated with a MIB ID other than what was
> provided, I think that would be reasonable (though that would give
> the interface a wider contract than I would prefer).
>
>
> If a name is known to be associated with a mib, the text encoding
> will be associated with that mib - otherwise it is associated with the
> "unknown" mib.
I agree with that.  What I'm after is the ability for the user to create
an encoding ID object that names an encoding not known to the
implementation, but where the programmer does know the MIB ID.
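
A minimal sketch of what such an object could look like, combining the "MIB#42" fallback name from suggestion 3 with the "unknown" MIB for unrecognized names. The class name text_encoding_id, the three-entry table, and exact-match name lookup are assumptions made for illustration; none of this is the interface proposed in P1885. The MIB values shown (3, 106, 2252, 17, and 2 for "unknown") are the ones listed in the IANA character sets registry.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <string_view>

namespace sketch_id {

// A tiny stand-in for the implementation's table of known encodings.
struct known_encoding {
    int mib;
    std::string_view name;
};

inline constexpr std::array<known_encoding, 3> known_encodings{{
    {3, "US-ASCII"},
    {106, "UTF-8"},
    {2252, "windows-1252"},
}};

inline constexpr int mib_unknown = 2;  // "unknown" in the IANA MIB enumeration

// Hypothetical class; not the P1885 interface.
class text_encoding_id {
public:
    // From a MIB ID: use the known name, else a placeholder such as "MIB#17".
    explicit text_encoding_id(int mib)
        : mib_(mib), name_("MIB#" + std::to_string(mib)) {
        for (const auto& e : known_encodings)
            if (e.mib == mib) name_ = std::string(e.name);
    }

    // From a name: exact-match lookup (a simplification of alias matching).
    // Unrecognized names get the "unknown" MIB and keep the caller's name.
    explicit text_encoding_id(std::string_view name)
        : mib_(mib_unknown), name_(name) {
        for (const auto& e : known_encodings)
            if (e.name == name) mib_ = e.mib;
    }

    int mib() const { return mib_; }
    const std::string& name() const { return name_; }

private:
    int mib_;
    std::string name_;
};

// Usage: Shift_JIS (MIB 17) is deliberately absent from the table above,
// yet an ID object for it can still be created from the MIB value alone.
inline void usage_example() {
    const text_encoding_id sjis(17);
    assert(sjis.mib() == 17);
    assert(sjis.name() == "MIB#17");
    assert(text_encoding_id("UTF-8").mib() == 106);
}

}  // namespace sketch_id
```

The point of the sketch is only that the MIB value survives even when the implementation has no name for it; whether the constructor should also validate the value is the open question in this thread.
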
>
>>> Encodings have names (the mib is very close to an
>>> implementation detail). For that same reason the name
>>> cannot be optional. The name in parameter _always_ takes
>>> precedence over the iana name so it can roundtrip to
>>> iconv or similar APIs
>>>
>>
>>>
>>> To rephrase that:
>>>
>>> Different names may map to the same mib but two encodings
>>> with the same name have to compare equal.
>>
>> I agree with that, but the proposal also specifies
>> that two encodings with different names that map to the same
>> MIB ID compare equal.  The currently proposed design has the
>> following issue: IANA doesn't have a registration for the
>> WTF-8 encoding today, but it is conceivable that it could be
>> registered in the future. As proposed, text_encoding("WTF8")
>> != text_encoding("WTF-8"), but if the encoding were to be
>> registered in the future with both of those names as aliases,
>> then they would compare equal despite having different
>> names.  It seems to me that either 1) text_encoding shouldn't
>> have equality comparison operators (though text_encoding::id
>> should) or, 2) text_encoding should be split into two
>> facilities, one that stores an integral ID, and another that
>> performs name lookups and returns a value of the former.  The
>> latter would also facilitate dynamic registration of
>> additional names.
>>
>>
>> Dynamic registration defeats the purpose.
> How so?  I see at least two use cases for dynamic registration. 
> 1) To support encodings that are standardized in newer versions of
> the IANA registry than the implementation is aware of, and 2) to
> support private encodings that are not (yet) registered with IANA.
>> I agree that a name becoming registered does change the behavior of
>> the program.
>> However, consider the case where you have
>> text_encoding("wtf8",  68854);
>> text_encoding("WTF-8");
>>
>> That does not solve anything.
> I don't agree; it still results in IDs being created as
> specified.  I think the recognized names being
> implementation-defined is the real problem here
>
>
> They are not implementation defined?
The wording in the paper is explicit in the specification for
text_encoding::aliases() that the implementation may provide additional
names.
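
To illustrate why the set of recognized names matters for equality, here is a rough sketch in which the alias table is an explicit input. The comparison rule (equal if both map to the same known MIB, otherwise compare names) is a simplification of what the paper describes, and the names alias_table, make_encoding, and same_encoding are made up for this example. The value 68854 is the placeholder used earlier in this thread, not a real IANA registration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>

namespace sketch_alias {

using alias_table = std::map<std::string, int>;  // recognized name -> MIB ID
inline constexpr int mib_unknown = 2;            // "unknown" MIB value

struct encoding {
    std::string name;
    int mib;
};

// Look the name up in whatever table this implementation happens to ship.
inline encoding make_encoding(const alias_table& known, std::string_view name) {
    const auto it = known.find(std::string(name));
    return {std::string(name), it == known.end() ? mib_unknown : it->second};
}

// Simplified comparison: a shared known MIB wins, otherwise compare names.
inline bool same_encoding(const encoding& a, const encoding& b) {
    if (a.mib != mib_unknown && a.mib == b.mib) return true;
    return a.name == b.name;
}

// Usage: the same two names compare differently under two alias tables.
inline void usage_example() {
    const alias_table today{{"UTF-8", 106}};
    alias_table later = today;
    later["WTF8"] = 68854;   // placeholder MIB value, not registered
    later["WTF-8"] = 68854;

    assert(!same_encoding(make_encoding(today, "WTF8"),
                          make_encoding(today, "WTF-8")));
    assert(same_encoding(make_encoding(later, "WTF8"),
                         make_encoding(later, "WTF-8")));
}

}  // namespace sketch_alias
```

The program text is identical in both cases; only the table differs, which is the portability concern with implementation-provided names.
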
>
>> Support for multiple names exist for legacy reasons, people using
>> non registered encodings have to maintain consistency of names
>> regardless.
> I agree; and I think they should be able to use this facility to
> do so.
>>
>> Java, for example, does not expose the mib at all.
>> There is just an isRegistered method that tells you if IANA knows
>> about the encoding
>>
>>>
>>>
>>>> If text_encoding remains the name of that class,
>>>> encoder/decoder can be used for the class doing the
>>>> actual conversions.
>>>>
>>>>
>>>> I will further rename "system" to "environment" to
>>>> be more generic and aligned with POSIX.
>>> Is text_encoding::system() intended to be equivalent
>>> to text_encoding::for_locale(std::locale{})? (I
>>> think the answer is, and should be, no; e.g., on
>>> Windows, this would query GetACP()).
>>>
>>>
>>>
>>> It would query getacp which is equivalent to query the
>>> user ("") locale at the start of the program.
>>>
>> Good, that sounds right.
>>>
>>>> (user, environment and system are, for our purposes,
>>>> synonyms intended to mean "the encoding assumed
>>>> and expected by whatever launched our program").
>>>> Environment has the added benefit that it implies
>>>> neither user nor system, which makes it more
>>>> friendly to embedded platforms
>>>
>>> Since locale settings are generally determined by
>>> environment (variables), use of the term
>>> "environment" may be confusing.  I prefer system.
>>>
>>> Tangent 2: I don't recall if we discussed this in
>>> Belfast, but the paper identifies three sets of
>>> encodings to expose (literals, system, locale).  A
>>> fourth would be terminal/console encoding.  This
>>> encoding can be easily queried on Windows, but not
>>> on Linux/UNIX (though terminal encoding rarely
>>> differs from locale there, so it would be reasonable
>>> to just return the system encoding).
>>>
>>> I considered that a few days ago. I am not aware of it
>>> being a thing on other platforms than windows (I am
>>> probably wrong) and I don't believe we can come up with
>>> a nice API in the short term to query that nicely.
>>>
>> It is a thing on other platforms in that terminals
>> (typically) have an encoding setting that specifies how
>> characters are translated for input/display purposes.  What
>> isn't common is a facility for querying the terminal for its
>> encoding setting (as far as I can tell, there is no terminfo
>> capability specified, nor have I found escape sequences that
>> can request it though there are escape sequences for some
>> terminal types to provide a character set map).
>>
>>
>> My understanding is that on Linux the terminal exists prior to
>> the program and cannot be changed.
> Terminals (as in, their inodes) can technically be created,
> changed, and associated with new processes.  In practice, this is
> almost never done and isn't something I see any reason to spend
> time on.
>> So that implies the terminal and the environment are one and the
>> same.
> In terms of encoding, I agree, the terminal encoding would have to
> be assumed to match the locale.
>> Which is different than the windows situation where the desktop
>> is the environment and a console can be attached and modified at
>> runtime.
> Correct, the goal would be to support the Windows situation in a
> way that is not inconsistent with other platforms.
>>
>> But maybe there is a way to do that on Linux?
>>
>> My point is, the proposal offers a way to query the encoding
>> attached to std::cout, and anything else would imply doing
>> something for consoles specifically, which seems like a larger
>> discussion.
>
> That is the discussion I'm trying to have here.  I think there is
> a legitimate use case for Windows; GetConsoleCP() is used,
> presumably for legitimate reasons.
>
> I would love to discuss consoles but I do not think it should be
> related to the proposal at all if we ever want to ship something :)

I don't see it as being that complicated.  I'm envisioning an additional
text_encoding::terminal() function that behaves similarly (perhaps
identically on some platforms) to text_encoding::system().  I'd like to
hear more about the complications you see here.
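
A rough sketch of the behavior I have in mind, written as free functions rather than text_encoding members. The function names and the "cp1252"-style labels on Windows are assumptions for illustration only. On Windows it uses GetACP() and GetConsoleCP(); elsewhere it assumes POSIX.1-2008 (newlocale, nl_langinfo_l) and simply treats the terminal encoding as the system encoding.

```cpp
#include <cassert>
#include <string>

#if defined(_WIN32)
#  include <windows.h>
#else
#  include <langinfo.h>
#  include <locale.h>
#endif

namespace sketch_term {

// Encoding implied by the environment that launched the program.
inline std::string system_encoding() {
#if defined(_WIN32)
    return "cp" + std::to_string(GetACP());  // label format is made up
#else
    // Query the user ("") locale without changing the global locale.
    std::string name = nl_langinfo(CODESET);  // fallback: current C locale
    if (locale_t loc = newlocale(LC_CTYPE_MASK, "", locale_t{})) {
        name = nl_langinfo_l(CODESET, loc);
        freelocale(loc);
    }
    return name;
#endif
}

// Encoding of the attached terminal/console, where that can be queried.
inline std::string terminal_encoding() {
#if defined(_WIN32)
    // GetConsoleCP() returns 0 on failure, e.g. when no console is attached.
    if (const UINT cp = GetConsoleCP()) return "cp" + std::to_string(cp);
#endif
    // No portable query exists elsewhere, so assume it matches the system.
    return system_encoding();
}

// Usage: on non-Windows platforms the two queries agree by construction.
inline void usage_example() {
#if !defined(_WIN32)
    assert(terminal_encoding() == system_encoding());
#endif
}

}  // namespace sketch_term
```
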

Just to be clear, I'm not strongly of the opinion that we should do
something to address console encoding now; it can be added later.

Tom.

> If we ever have a console object or set of facilities, we can add a
> method to query its encoding
>
> Tom.
>
>> Tom.
>>
>>> Tom.
>>>
>>>>
>>>> Thanks for your input,
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>



SG16 list run by herb.sutter at gmail.com