Replying so that this message is reflected in the SG16 list archives due to an archive failure following the transition to the new SG16 mailing list.

Tom.

On 11/22/19 4:25 PM, Ansel Sermersheim via SG16 wrote:
On 11/8/19 4:21 PM, JeanHeyd Meneide wrote:
Dear Lyberta and Ansel,

     Thank you for the comments sent here! My responses will be below. Do let me know if I accidentally misunderstood or misrepresented anything, as I'm not sure I captured all of the concerns properly.

I think you have understood a good portion, but see my expanded explanation below of some points.


> 3) By a similar construction and often overlapping concerns, the
> availability of a standardized way for encodings to declare which
> version of unicode they support is quite important. It's also not clear
> how some of the round trip encodings can possibly be fully specified in
> the type system. For example, how could I properly encode "UTF-8 Unicode
> version 10" text containing emoji into "UTF-16 Unicode version 5" text
> using the PUA for representation for display on OS X 10.7?

Different versions of Unicode and PUA are a job for
std::unicode::character_database.
 
Perhaps it is a job for the database, but I want to be clear: what this proposal wants to deal with is encodings and -- potentially -- Normalization Forms. Encodings do not affect the interpretation of Emoji, and Normalization Forms have had forward-compatibility guarantees since Unicode version 3.x. If emojis not defined in Unicode Version 5 are given to an application that only has knowledge of Unicode Version 5, then the default answer is {however the application handles error state in its text}. For example, web browsers that do not understand X emoji display the codepoint value boxes. Other applications display "?". Some display the missing-value "�". It is up to the application to process characters that are within the 21 bits allotted by Unicode but that it has no understanding of, however it sees fit. That is on the application, its text renderer, etc.
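
As a minimal sketch of that division of labor (is_known_scalar and its "knows the BMP only" stub are hypothetical stand-ins for whatever knowledge snapshot the application actually has):

// Hypothetical, application-defined predicate: does this application's
// snapshot of Unicode assign this scalar value? Stubbed to "knows the
// BMP only" purely for illustration.
bool is_known_scalar(char32_t cp) {
    return cp < 0x10000;
}

// The code point already passed encoding-level validation; substituting
// a visible fallback such as U+FFFD is application policy, not encoding.
char32_t render_or_replace(char32_t cp) {
    return is_known_scalar(cp) ? cp : U'\uFFFD';
}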

Let me try to expand on my point above, because there is a subtlety here which I think slipped through the cracks.

Suppose I have two sets of logs which I want to merge into a single output. One set contains UTF-8 text which was generated on OS X 10.7; this conforms to "Unicode 5 with the Softbank PUA interpretation". Another set contains UTF-8 text generated on a modern machine, which conforms to "Unicode 12.1, no PUA". Assume that I am processing these data files in C++ on some third machine. Further, assume I am willing to implement on my own any encoding required to solve my problem. I always want the output to be in UTF-8, conforming to Unicode 12.1.

Let's look at a specific mapping here, from the Unicode 6+ codepoint U+1F604, called "Smiling face with open mouth and smiling eyes." In Unicode 5 this has no encoding, but in the Softbank PUA it is U+FB55. Suppose I used this emoji as a delimiter in these logs.

1) I really want to have a single data type that can represent a "unicode string" in my program. I'm willing to do any transformation at the input and output. What data type do I use?

If I use std::text::u8text, which seems most natural and is likely to work best with other libraries, then the delimiter search code is horrible. It will be something like:

#if MY_MACHINE_UNICODE_VERSION >= 6
   tokens = split(myString, U'\U0001F604');
#elif defined(MY_MACHINE_USES_SOFTBANK_PUA)
   tokens = split(myString, U'\uFB55');
#else
#error "This machine can't process log files, upgrade the standard library"
#endif

Depending on the implementation of the transcoding operation, this might have to be done at runtime using some metadata:

if (myStringMetadata.originalUnicodeVersion >= 6.0) {
   tokens = split(myString, U'\U0001F604');
} else if (myStringMetadata.originalPUAInterpretation == std::pua::softbank) {
   tokens = split(myString, U'\uFB55');
} else {
   throw std::runtime_error("This particular string cannot be parsed.");
}
This is even more painful, because now all the code gets compiled in. It's also a testing nightmare. Neither of these approaches scales, because if I have a piece of code which needs to use emoji and codepoints from GB18030, I get a combinatorial explosion. In order to process a string I need to know its provenance. Don't even get me started on the regex situation.
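
To illustrate the explosion (the provenance fields here are hypothetical; the delimiter values are the ones from this example):

#include <stdexcept>

// Hypothetical provenance metadata; every field is another dispatch axis.
struct StringProvenance {
    int  unicode_major;    // e.g. 5, 6, 12
    bool softbank_pua;     // Softbank PUA interpretation in effect?
    bool gb18030_mapped;   // did this text round-trip through GB18030?
};

char32_t delimiter_for(const StringProvenance& meta) {
    // Two axes already force several branches; GB18030 doubles the cases
    // again. Every new provenance dimension multiplies, not adds.
    if (meta.unicode_major >= 6 && !meta.gb18030_mapped)
        return U'\U0001F604';
    if (meta.softbank_pua)
        return U'\uFB55';
    throw std::runtime_error("unhandled provenance combination");
}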

2) Ok, that didn't work. Let's try using a std::text::basic_text<my_guaranteed_utf8v12_encoding_no_matter_what_standard_version> class. Life is better from an internal text processing point of view. However, the "map_XX_codepoint_to_v12_codepoint" function I will have to write is just as bad. Even worse, I don't know when to call it. In transcoding from or to another encoding, I need to know whether to expect or produce U+1F604 or U+FB55. That will vary depending on which version of the Unicode standard the *other* encoding object observes, which *may* involve asking questions about the standard library implementation.
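
For concreteness, a rough sketch of that remapping and the call-site ambiguity (the function name follows the hypothetical one above; the single table entry is the pair from this example, and a real table would be generated from vendor mapping data):

#include <string>

// Sketch of the hypothetical "map_XX_codepoint_to_v12_codepoint" for the
// Softbank case; only the example pair is mapped here.
char32_t map_softbank_codepoint_to_v12_codepoint(char32_t cp) {
    return cp == U'\uFB55' ? U'\U0001F604' : cp;
}

// The real difficulty: whether to apply it depends on what the *other*
// encoding object observes, which the type system does not express.
std::u32string normalize_to_v12(std::u32string s, bool source_uses_softbank_pua) {
    if (source_uses_softbank_pua)      // how do we learn this reliably?
        for (char32_t& c : s)
            c = map_softbank_codepoint_to_v12_codepoint(c);
    return s;
}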

I don't know what a good answer to this situation would look like. In CsString we don't handle this case because we don't use the PUA at all, and CopperSpice simply doesn't support these encodings. However, that's not a very future-proof situation and not a limitation that I think the standard should accept. The majority of encoding policy classes will not need to know which version of the Unicode standard a particular string conforms to. But there are enough high-profile encodings which do need to know, particularly the whole GB series, that this is a problem which will need to be addressed.

The encoding layer is only there to check for valid encodings (e.g., "fits in the 21 bits of Unicode and is not an invalid sequence according to the specification of this encoding"): ascribing meaning to that is a layer above this, and something every application answers differently. Some error and crash, some display replacement text, some ignore the text entirely, some tank the HTTP request, etc.
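
A minimal sketch of that purely mechanical, encoding-level check (the function name is made up; the ranges are the standard Unicode scalar-value rules):

// The encoding layer's job in miniature: the value fits in Unicode's
// 21-bit code space and is not a surrogate. What to *do* with a valid
// but unrecognized scalar is the application's policy, as noted above.
constexpr bool is_unicode_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}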

As the above example shows, this idea presents a nice clean abstraction but doesn't necessarily scale. This situation will get worse, not better, over time. Most of the cases right now are rather obscure but we will need some way to solve this problem in the general case, because it seems to be becoming more common in the CJK area than it has been in the past.

-Ansel