SG16

Subject: Re: [SG16-Unicode] Questions about some corner cases of proposed std::basic_text encoding implementation
From: Tom Honermann (tom_at_[hidden])
Date: 2019-11-27 07:57:25


Replying so that this message is reflected in the SG16 list archives due
to an archive failure following the transition to the new SG16 mailing list.

Tom.

On 11/22/19 4:25 PM, Ansel Sermersheim via SG16 wrote:
> On 11/8/19 4:21 PM, JeanHeyd Meneide wrote:
>> Dear Lyberta and Ansel,
>>
>>      Thank you for the comments sent here! My responses will be
>> below. Do let me know if I accidentally misunderstood or messed
>> something up in my understanding, as I wasn't exactly sure I captured
>> all of the concerns properly.
>
> I think you have understood a good portion, but see my expanded
> explanation below of some points.
>
>
>> > 3) By a similar construction and often overlapping concerns, the
>> > availability of a standardized way for encodings to declare which
>> > version of unicode they support is quite important. It's also not
>> > clear how some of the round trip encodings can possibly be fully
>> > specified in the type system. For example, how could I properly
>> > encode "UTF-8 Unicode version 10" text containing emoji into
>> > "UTF-16 Unicode version 5" text using the PUA for representation
>> > for display on OS X 10.7?
>>
>> Different versions of Unicode and PUA are a job for
>> std::unicode::character_database.
>>
>> Perhaps it is a job for the database, but I want to be clear: what
>> this proposal wants to deal with are encodings and -- potentially --
>> Normalization Forms. Encodings do not affect the interpretation of
>> Emoji, and Normalization Forms have forward-compatibility guarantees
>> since Unicode version 3.x. If emojis not defined in Unicode Version 5
>> are given to an application that only has a knowledge of Unicode
>> Version 5, then the default answer is {however the application
>> handles error state in its text}. For example, web browsers that do
>> not understand X emoji display the codepoint value boxes. Other
>> applications display "?". Some display the missing-value "�". It is
>> up to each application to handle characters that fall within the 21
>> bits allotted by Unicode, but that it does not understand, however
>> it sees fit. That is on the application, its text renderer, etc.
>
> Let me try to expand on my point above, because there is a subtlety
> here which I think slipped through the cracks.
>
> Suppose I have two sets of logs, which I want to merge into a single
> output. One set contains UTF-8 text which was generated on OS X 10.7.
> This conforms to "Unicode 5 with the softbank PUA interpretation".
> Another set contains UTF-8 text generated on a modern machine, which
> conforms to "Unicode 12.1, no PUA." Assume that I am processing these
> data files in C++ on some third machine. Further, assume I am willing
> to implement on my own any encoding required to solve my problem. I
> always want the output to be in UTF-8, conforming to Unicode v12.1.
>
> Let's look at a specific mapping here, from the Unicode 6+ codepoint
> U+1F604, called "Smiling face with open mouth and smiling eyes." In
> Unicode 5 this has no encoding, but in the Softbank PUA it is U+FB55.
> Suppose I used this emoji as a delimiter in these logs.
>
> 1) I really want to have a single data type that can represent a
> "unicode string" in my program. I'm willing to do any transformation
> at the input and output. What data type do I use?
>
> If I use std::text::u8text, which seems most natural and is likely to
> work best with other libraries, then the delimiter search code is
> horrible. It will be something like:
>
>> #if MY_MACHINE_UNICODE_VERSION >= 6
>>    tokens = split(myString, U'\U0001F604');
>> #elif MY_MACHINE_USES_SOFTBANK_PUA
>>    tokens = split(myString, U'\uFB55');
>> #else
>> #error "This machine can't process log files, upgrade the standard library"
>> #endif
>>
> Depending on the implementation of the transcoding operation, this
> might have to be done at runtime using some metadata:
>
>> if(myStringMetadata.originalUnicodeVersion >= 6.0) {
>>    tokens = split(myString, U'\U0001F604');
>> } else if(myStringMetadata.originalPUAInterpretation ==
>> std::pua::softbank) {
>>    tokens = split(myString, U'\uFB55');
>> } else {
>>    throw "This particular string cannot be parsed.";
>> }
> This is even more painful, because now all the code gets compiled in.
> It's also a testing nightmare. Neither of these approaches scales,
> because if I have a piece of code which needs to use emoji and
> codepoints from GB18030, I get a combinatorial explosion. In order to
> process a string I need to know the provenance. Don't even get me
> started on the regex situation.
>
> 2) Ok, that didn't work. Let's try using a
> std::text::basic_text<my_guaranteed_utf8v12_encoding_no_matter_what_standard_version>
> class. Life is better from an internal text processing point of view.
> However, the "map_XX_codepoint_to_v12_codepoint" function I will have
> to write is just as bad. Even worse, I don't know when to call it. In
> transcoding from or to another encoding, I need to know whether to
> expect or produce U+1F604 or U+FB55. That will vary depending on which
> version of the standard the *other* encoding object observes, which
> *may* involve asking questions about the standard library implementation.
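For illustration only, the kind of input-boundary mapping described in point 2 might be sketched as below. The names pua_scheme, to_standard, and normalize are hypothetical and do not come from any proposal or library; the single code point pair is taken as written in the message above. The sketch assumes the caller already knows the provenance of the text, which is exactly the part the message identifies as unsolved.

```cpp
#include <string>

// Hypothetical provenance tag for a text source; not part of any proposal.
enum class pua_scheme { none, softbank };

// The one pair quoted in this thread, with values taken as written there.
constexpr char32_t softbank_smile = 0xFB55;
constexpr char32_t standard_smile = 0x1F604;

// Map one code point from a source with the given PUA interpretation to
// its standard equivalent. Anything without a known mapping passes through.
constexpr char32_t to_standard(char32_t cp, pua_scheme scheme) {
    return (scheme == pua_scheme::softbank && cp == softbank_smile)
               ? standard_smile
               : cp;
}

// Normalize a whole string once, at the input boundary, so that later code
// (such as the delimiter search) only ever deals with one code point.
inline std::u32string normalize(std::u32string text, pua_scheme scheme) {
    for (char32_t& cp : text) {
        cp = to_standard(cp, scheme);
    }
    return text;
}
```

With text normalized this way, the split call would need only the U+1F604 delimiter, but a real table would have to cover every affected code point and every source convention.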
>
> I don't know what a good answer to this situation would look like. In
> CsString we don't handle this case because we don't use the PUA at all
> and CopperSpice simply doesn't support these encodings. However, that's
> not a very future-proof situation and not a limitation that I think
> the standard should accept. The majority of encoding policy classes
> will not need to know which version of the standard a particular
> string is in. But there are enough high-profile encodings which need
> to know, particularly the whole GB series, that this is a problem which
> will need to be addressed.
>
>> The encoding layer is only to check for valid encodings (e.g., "fits
>> in the 21 bits of Unicode and is not an invalid sequence according to
>> the specification of this encoding"): ascribing meaning to that is
>> a layer above this and something every application has different
>> answers to. Some error and crash, some display replacement text, some
>> ignore the text entirely, some tank the HTTP request, etc.
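As a minimal sketch of the validity check described in the paragraph above, the test at the code point level might look like the following. The function name is_scalar_value is an assumption made for illustration; the range limits are the standard Unicode ones (maximum U+10FFFF, surrogates U+D800 through U+DFFF excluded).

```cpp
// True if cp is a Unicode scalar value: within the code space and not a
// surrogate. This says nothing about what the code point means.
constexpr bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Both code points discussed in this thread pass the check, so a check at
// this layer cannot tell the two conventions apart.
static_assert(is_scalar_value(0x1F604), "U+1F604 is a scalar value");
static_assert(is_scalar_value(0xFB55), "U+FB55 is a scalar value");
static_assert(!is_scalar_value(0xD800), "surrogates are rejected");
static_assert(!is_scalar_value(0x110000), "values past U+10FFFF are rejected");
```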
>
> As the above example shows, this idea presents a nice clean
> abstraction but doesn't necessarily scale. This situation will get
> worse, not better, over time. Most of the cases right now are rather
> obscure but we will need some way to solve this problem in the general
> case, because it seems to be becoming more common in the CJK area than
> it has been in the past.
>
> -Ansel
>
>



SG16 list run by herb.sutter at gmail.com