SG16

Subject: Re: Bike shedding for Christmas: P1885 Naming Text Encodings
From: Corentin Jabot (corentinjabot_at_[hidden])
Date: 2020-01-01 05:50:59


On Wed, 1 Jan 2020 at 00:49, Tom Honermann <tom_at_[hidden]> wrote:

> On 12/30/19 6:11 AM, Corentin Jabot wrote:
>
>
>
> On Mon, Dec 30, 2019, 05:17 Tom Honermann <tom_at_[hidden]> wrote:
>
>> On 12/28/19 6:38 PM, Corentin Jabot via SG16 wrote:
>>
>>
>>
>> On Sun, Dec 29, 2019, 00:32 Corentin Jabot <corentinjabot_at_[hidden]>
>> wrote:
>>
>>>
>>>
>>> On Sat, Dec 28, 2019, 22:21 Tom Honermann <tom_at_[hidden]> wrote:
>>>
>>>> On 12/27/19 6:28 AM, Corentin Jabot via SG16 wrote:
>>>>
>>>> Hello
>>>>
>>>> In P1885, I introduce the name "text_encoding" for the class
>>>> representing the name of a text encoding.
>>>> I wonder whether that might conflict or interfere with actual
>>>> encoding/decoder classes and would like your opinion.
>>>>
>>>> Here are a few possible names:
>>>> * Charset (IANA nomenclature, posix)
>>>> * text_codec (Qt)
>>>> * text_encoding
>>>> * text_encoding_name (encoding is used by posix / python /
>>>>
>>>> Unicode nomenclature would favor encoding (Unicode is a charset of
>>>> which UTF-8 and UTF-16 are both encodings)
>>>>
>>>> I suggest text_encoding_id. I'd like to preserve text_encoding for a
>>>> tag type (or concept) that can be used at compile time to specify a
>>>> (compile-time) encoding as in a template parameter to std::text.
>>>>
>>>> Tangent 1: the proposed text_encoding is not extensible, at least not
>>>> in a very meaningful way. I suggest we do one of the following:
>>>>
>>>> 1. Remove the text_encoding(const char*) constructor. It doesn't
>>>> allow setting the MIB ID, so is unsatisfactory at present.
>>>> 2. Allow first class extension by, for example, reserving the full
>>>> range of IANA MIB values, defining a "private use" range of values, and
>>>> modifying the text_encoding(const char*) constructor to also accept
>>>> a MIB value (and perhaps make the name parameter optional such that, if
>>>> specified, it would override the internally known name for the provided MIB
>>>> value and if not specified, name() would return a suitable default).
>>>>
>>>>
>>>
>>> It is extensible in multiple ways:
>>> - implementations can provide their own aliases for existing MIBs
>>> - the "other" MIB plus a custom name can be used for a non-registered
>>> encoding. Hence the existence of both "unknown" and "other"
>>>
>>> I will not support custom MIBs, as that is not in line with the RFC: the
>>> MIB is a way to standardize names.
>>>
>> I'm not sure whether you are referring to RFC 2978 or 3808 here.
>> Regardless, the design of both the RFCs and your paper is such that there
>> are multiple possible names for any encoding. A program that wishes to
>> identify an encoding for which the implementation does not supply a MIB ID
>> would have to be expected to recognize all aliases. That doesn't seem
>> realistic to me. Additionally, as specified, the proposal only includes
>> enumerators for a select subset of the IANA registered names with no
>> facility (other than implementation provided extension) for creating a
>> text_encoding object with a MIB ID for any other IANA registered encoding.
>>
>> Refined suggestions:
>>
>> 1. Include all IANA MIB IDs in the set of enumerators of
>> text_encoding::id. This may require either a normative reference to (a
>> dated version of) the IANA registry or that we copy from a version of the
>> IANA registry and update it for each C++ standard. Note that support for
>> names need not imply support for the named encoding; these are just
>> identifiers.
>>
>>
> I did that but it poses extensibility issues (the list has to be maintained)
> and offers little benefit.
>
> I don't understand how this poses an extensibility issue. I agree it
> poses a maintenance issue, but that is true regardless (at least for
> implementors that extend the set of enumerators).
>
Implementors are not allowed to do that.
If the list were complete we would have to provide a way for them to do so.

> The benefit is that including all of them avoids the problem of
> implementors offering extensions with inconsistent or conflicting names.
> It also doesn't put us in the position of deciding which encodings are
> "important". IANA provides a good specification to follow. I don't think
> we should be subsetting, at least not without some clear criteria for
> determining which encodings make the cut. For example, I suspect Shift-JIS
> gets more use than UTF-32, but the former is not included in the proposal
> and the latter is.
>

Again, implementors are not allowed to:
* provide unregistered MIBs
* omit any of the known aliases for a provided MIB

They can only provide their own aliases on top of the existing ones.

> Another tangent: I think the enumeration should have an explicit
> underlying type to ensure it is able to hold all IANA assigned MIB IDs.
>
Agreed, that was already suggested by Victor.
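
To illustrate the point about an explicit underlying type, here is a minimal sketch. The name `encoding_id`, the choice of `std::int_least32_t`, and the helper `to_mib` are assumptions made for this example and are not taken from P1885; the enumerator values shown are the IANA MIB numbers for those encodings.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical stand-in for the proposed text_encoding::id enumeration.
// Only a handful of IANA MIB values are listed here for illustration.
enum class encoding_id : std::int_least32_t {
    other       = 1,
    unknown     = 2,
    ASCII       = 3,
    ShiftJIS    = 17,
    UTF8        = 106,
    UTF16       = 1015,
    UTF32       = 1017,
    windows1252 = 2252
};

// The underlying type is fixed by the declaration, not left to the default.
static_assert(std::is_same_v<std::underlying_type_t<encoding_id>,
                             std::int_least32_t>);

// The chosen type can represent the largest value listed above.
static_assert(std::numeric_limits<std::underlying_type_t<encoding_id>>::max()
              >= 2252);

// Convert an enumerator back to its numeric MIB value.
constexpr std::int_least32_t to_mib(encoding_id id) noexcept {
    return static_cast<std::int_least32_t>(id);
}
```

Fixing the type in the declaration makes the representable range part of the interface, so it does not depend on what a scoped enumeration defaults to.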

>> 1.
>> 2. Use the IANA "cs" prefixed names for the names of the enumerators
>> of text_encoding::id. They may not be the prettiest names, but using
>> these names will better facilitate automation and they are intended to be
>> used as identifiers.
>> 3. Provide a text_encoding(text_encoding::id) constructor that
>> enables creation of an instance with a particular ID. If the
>> implementation doesn't have a registered name for the provided ID, then use
>> a name like "MIB#42".
>>
>> What is the use case for that?
>
> The set of recognized names are not necessarily portable since they are
> implementation-defined. This ensures that an encoding object for the
> desired encoding can be created regardless of name.
>
> I am not opposed to the idea.
> QTextCodec has a codecForMib function after all.
> I am however opposed to a constructor that would either accept both a name
> and a MIB or otherwise not check that the MIB exists
>
> I understand and agree with the goal of ensuring that names are
> appropriately recognized and unique across MIB IDs. But I also recognize
> the need to use MIB IDs or names that are not known to the implementation.
> If the implementation were to reject names that were known to be associated
> with a MIB ID other than what was provided, I think that would be
> reasonable (though that would give the interface a wider contract than I
> would prefer).
>

If a name is known to be associated with a MIB, the text encoding will be
associated with that MIB; otherwise it is associated with the "unknown"
MIB.
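
A rough sketch of that lookup rule follows. Everything in it is an assumption made for illustration: the class name `encoding_name`, the tiny alias table, and the case-insensitive matching are not the interface proposed in P1885. It follows the sentence above in mapping unrecognized names to "unknown".

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>
#include <string_view>

// A few IANA MIB values; a real table would be far larger.
enum class mib : int { other = 1, unknown = 2, ASCII = 3, UTF8 = 106 };

struct alias_entry {
    std::string_view name;
    mib id;
};

// Illustrative subset of IANA names and aliases.
inline constexpr std::array<alias_entry, 5> known_aliases{{
    {"US-ASCII", mib::ASCII},
    {"us", mib::ASCII},
    {"csASCII", mib::ASCII},
    {"UTF-8", mib::UTF8},
    {"csUTF8", mib::UTF8},
}};

// Case-insensitive comparison (an assumption of this sketch).
inline bool iequals(std::string_view a, std::string_view b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}

class encoding_name {
public:
    // The name given by the caller is always kept as-is; the MIB is derived.
    explicit encoding_name(std::string_view name)
        : name_(name), id_(lookup(name)) {}

    const std::string& name() const noexcept { return name_; }
    mib id() const noexcept { return id_; }

private:
    // Known alias: return its MIB. Anything else: "unknown".
    static mib lookup(std::string_view name) {
        for (const alias_entry& entry : known_aliases) {
            if (iequals(entry.name, name)) return entry.id;
        }
        return mib::unknown;
    }

    std::string name_;
    mib id_;
};
```

Because the stored name is whatever the caller passed, it can be handed back unchanged to iconv or a similar API regardless of which alias was used.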

>
>> 1.
>>
>> Encodings have names (the MIB is very close to an implementation detail). For
>>> that same reason the name cannot be optional. The name passed as a parameter
>>> _always_ takes precedence over the IANA name so it can round-trip to iconv or
>>> similar APIs
>>>
>>
>>
>> To rephrase that:
>>
>> Different names may map to the same MIB, but two encodings with the same
>> name have to compare equal.
>>
>> I agree with that, but the proposal also specifies that two
>> encodings with different names that map to the same MIB ID compare equal.
>> The currently proposed design has the following issue: IANA doesn't have a
>> registration for the WTF-8 encoding today, but it is conceivable that it
>> could be registered in the future. As proposed, text_encoding("WTF8")
>> != text_encoding("WTF-8"), but if the encoding were to be registered in
>> the future with both of those names as aliases, then they would compare
>> equal despite having different names. It seems to me that either 1)
>> text_encoding shouldn't have equality comparison operators (though
>> text_encoding::id should) or, 2) text_encoding should be split into two
>> facilities, one that stores an integral ID, and another that performs name
>> lookups and returns a value of the former. The latter would also
>> facilitate dynamic registration of additional names.
>>
>
> Dynamic registration defeats the purpose.
>
> How so? I see at least two use cases for dynamic registration. 1) To
> support encodings that are standardized in newer versions of the IANA
> registry than the implementation is aware of, and 2) to support private
> encodings that are not (yet) registered with IANA.
>
> I agree that a name becoming registered does change the behavior of the
> program.
> However, consider the case where you have
> text_encoding("wtf8", 68854);
> text_encoding("WTF-8");
>
> That does not solve anything.
>
> I don't agree; it still results in IDs being created as specified. I
> think the recognized names being implementation-defined is the real problem
> here
>

They are not implementation defined?
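
To make the WTF-8 concern from earlier in this exchange concrete, here is a small sketch of how equality can change once a name becomes registered. The comparison rule (compare MIBs when both are known, otherwise compare names), the type and function names, and the MIB value 9999 used in the usage note are all hypothetical and chosen only for this example.

```cpp
#include <string_view>
#include <vector>

constexpr int unknown_mib = 2;  // IANA "unknown"

// One entry of a (hypothetical) registry snapshot known to an implementation.
struct registered_name {
    std::string_view name;
    int mib;
};

// A name together with the MIB it resolved to.
struct encoding_ref {
    std::string_view name;
    int mib;
};

// Resolve a name against a given registry snapshot.
inline encoding_ref make_encoding(std::string_view name,
                                  const std::vector<registered_name>& registry) {
    for (const registered_name& entry : registry) {
        if (entry.name == name) return {name, entry.mib};
    }
    return {name, unknown_mib};
}

// Assumed rule: known MIBs decide equality; otherwise fall back to names.
inline bool same_encoding(const encoding_ref& a, const encoding_ref& b) {
    if (a.mib != unknown_mib && b.mib != unknown_mib) return a.mib == b.mib;
    return a.name == b.name;
}
```

With a registry that lacks WTF-8, `make_encoding("WTF8", r)` and `make_encoding("WTF-8", r)` compare unequal under this rule. If a later registry listed both spellings under one MIB (say, a made-up 9999), the same two calls would compare equal, which is the behavior change being discussed.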

> Support for multiple names exists for legacy reasons; people using
> non-registered encodings have to maintain consistency of names regardless.
>
> I agree; and I think they should be able to use this facility to do so.
>
>
> Java, for example, does not expose the MIB at all.
> There is just an isRegistered method that tells you whether IANA knows about
> the encoding
>
>>
>>
>>>> 1.
>>>>
>>>>
>>>> if text_encoding remains the name of that class, encoder/decoder can be
>>>> used for the class doing the actual conversions.
>>>>
>>>>
>>>> I will further rename "system" to "environment" to be more generic and
>>>> aligned with POSIX.
>>>>
>>>> Is text_encoding::system() intended to be equivalent to
>>>> text_encoding::for_locale(std::locale{})? (I think the answer is, and
>>>> should be, no; e.g., on Windows, this would query GetACP()).
>>>>
>>>
>>>
>>> It would query GetACP(), which is equivalent to querying the user ("") locale
>>> at the start of the program.
>>>
>> Good, that sounds right.
>>
>> (user, environment and system are, for our purposes, synonyms and intended
>>>> to mean "the encoding assumed and expected by whatever launched our
>>>> program").
>>>> Environment has the added benefit that it implies neither user nor
>>>> system, which makes it more friendly to embedded platforms
>>>>
>>>> Since locale settings are generally determined by environment
>>>> (variables), use of the term "environment" may be confusing. I prefer
>>>> system.
>>>>
>>>> Tangent 2: I don't recall if we discussed this in Belfast, but the
>>>> paper identifies three sets of encodings to expose (literals, system,
>>>> locale). A fourth would be terminal/console encoding. This encoding can
>>>> be easily queried on Windows, but not on Linux/UNIX (though terminal
>>>> encoding rarely differs from locale there, so it would be reasonable to
>>>> just return the system encoding).
>>>>
>>> I considered that a few days ago. I am not aware of it being a thing on
>>> platforms other than Windows (I am probably wrong) and I don't believe we
>>> can come up with a nice API to query it in the short term.
>>>
>> It is a thing on other platforms in that terminals (typically) have an
>> encoding setting that specifies how characters are translated for
>> input/display purposes. What isn't common is a facility for querying the
>> terminal for its encoding setting (as far as I can tell, there is no
>> terminfo capability specified, nor have I found escape sequences that can
>> request it though there are escape sequences for some terminal types to
>> provide a character set map).
>>
>
> My understanding is that on Linux the terminal exists prior to the program
> and cannot be changed.
>
> Terminals (as in, their inodes) can technically be created, changed, and
> associated with new processes. In practice, this is almost never done and
> isn't something I see any reason to spend time on.
>
> So that implies the terminal and the environment are one and the same.
>
> In terms of encoding, I agree, the terminal encoding would have to be
> assumed to match the locale.
>
> Which is different from the Windows situation, where the desktop is the
> environment and a console can be attached and modified at runtime.
>
> Correct, the goal would be to support the Windows situation in a way that
> is not inconsistent with other platforms.
>
>
> But maybe there is a way to do that on Linux?
>
> My point is, the proposal offers a way to query the encoding attached to
> std::cout, and anything else would imply doing something for consoles
> specifically, which seems like a larger discussion.
>
> That is the discussion I'm trying to have here. I think there is a
> legitimate use case for Windows; GetConsoleCP() is used, presumably for
> legitimate reasons.
>
I would love to discuss consoles but I do not think it should be tied to
the proposal at all if we ever want to ship something :)
If we ever have a console object or set of facilities, we can add a method
to query its encoding.

> Tom.
>
> Tom.
>>
>> Tom.
>>>>
>>>>
>>>> Thanks for your input,
>>>>
>>>> Corentin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>


