Dear Lyberta and Ansel,

Thank you for the comments sent here! My responses are below. Do let me know if I misunderstood anything, as I am not sure I captured all of the concerns properly.

On Sat, Nov 2, 2019 at 12:11 PM Lyberta <lyberta@lyberta.net> wrote:

Ansel Sermersheim:
> 1) There was some discussion about whether or not char32_t is guaranteed
> to be a Unicode Code Point. JeanHeyd pointed me to
> https://wg21.link/p1041, which makes it clear that for string literals
> at least this is guaranteed.

Yes, char32_t is a bad type along with char8_t and char16_t. For that
reason I'm proposing strong types with proper guarantees:

https://github.com/Lyberta/cpp-unicode/blob/master/Fundamental.md

You can put ill-formed Unicode in string literals via escape codes. This
is also bad.

Responding to both bits at once: you can do bad things with escape codes, but I am okay with that because you've written it using _escape codes_. Non-escaped text and text written with \u and \U is mandated to be well-formed (and people are working on tightening the constraints here). I am perfectly okay with people having back doors and ways out, as long as the back doors and ways out are sufficiently grep-able / easy to identify.
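
To make the distinction concrete, here is a minimal sketch of a strict UTF-8 well-formedness check applied to two literals, one written with \u and one written with \x escapes. The function name is_well_formed_utf8 is hypothetical and is not part of any paper or library; it is only an illustration, assuming a C++17 or later compiler.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from any proposal): strict UTF-8 check over a
// string literal. Templated on the character type so that it accepts both
// char (C++17 u8 literals) and char8_t (C++20 u8 literals).
template <typename Char, std::size_t N>
constexpr bool is_well_formed_utf8(const Char (&literal)[N]) noexcept {
    const std::size_t size = N - 1; // drop the literal's null terminator
    std::size_t i = 0;
    while (i < size) {
        const auto lead = static_cast<unsigned char>(literal[i]);
        std::size_t length = 0;
        char32_t value = 0;
        if (lead < 0x80) { length = 1; value = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; value = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; value = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; value = lead & 0x07; }
        else return false; // stray continuation byte or invalid lead byte
        if (size - i < length) return false; // truncated sequence
        for (std::size_t k = 1; k < length; ++k) {
            const auto next = static_cast<unsigned char>(literal[i + k]);
            if ((next & 0xC0) != 0x80) return false; // not a continuation byte
            value = (value << 6) | (next & 0x3F);
        }
        // Smallest value each sequence length may encode (rejects overlongs).
        constexpr char32_t minimum[] = {0, 0, 0x80, 0x800, 0x10000};
        if (value < minimum[length]) return false;
        if (value >= 0xD800 && value <= 0xDFFF) return false; // surrogate
        if (value > 0x10FFFF) return false;                   // out of range
        i += length;
    }
    return true;
}

inline void usage_example() {
    // \u text is well-formed; the \x back door is easy to grep for.
    assert(is_well_formed_utf8(u8"caf\u00E9"));
    assert(!is_well_formed_utf8(u8"\xC0\x80"));     // overlong encoding of U+0000
    assert(!is_well_formed_utf8(u8"\xED\xA0\x80")); // encoded surrogate U+D800
}
```

The ill-formed cases can only be spelled with \x (or octal) escapes, which is what makes them easy to find in a code base.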

>
> However, this is not sufficiently specified for all cases. For instance,
> a GB 18030 encoding *must* use codepoints in the PUA. If a string
> literal contains a PUA code point, how can you know the interpretation?
> Making this a compile error seems problematic, but the right answer is
> not clear to me.

Can probably be solved by custom instance of
std::unicode::character_database.

I use char32_t solely because the problem has not been fully solved by SG16 at the moment. There was great interest in providing strong types for unicode_code_point; even in my implementation, I use the name unicode_code_point. Right now it defaults to an alias of char32_t since we have not fully decided which direction is worth taking here:

people with field experience and existing codebases want char32_t / uint32_t here;
previous implementations of text_view/text from Tom use an implementation where each encoding gets its own strong code point type;
other implementations just use char32_t and deem that to be fine enough.

I am heavily leaning towards char32_t representing a UTF-32 code point. The caveat is that encodings which use 32-bit types to represent their PUA characters will still be able to define a strong code_point type on their encoding, and then use the various levers present in the paper and implementation to make it clear that their "code point" differs from the typical unicode code point because it carries different semantic meaning. For example, gb18030 and gb18030_web encoding objects would use a gb_code_point type, which is more or less morally equivalent to char32_t save for a different interpretation of PUA characters. This allows the type to clearly differentiate between "gb18030 code points" and "normal unicode code points". This may also be where -- as Lyberta has pointed out -- room for a custom character database for use with algorithms (normalization, collation, etc.) will come in handy.

All in all, Encoding Objects have a code_point type for exactly this reason. I suspect the majority will want to use char32_t for interoperability and cohesion with the ecosystem at large. If someone uses special Private Use Area characters but still wants to present as a normal "unicode code point", they can do that. If they want a far more strict conversion policy, they can put a stronger type in there. The flexibility here is important to serve all use cases, and I think it will work well.
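
As a rough sketch of the idea (not the actual interface from the paper or the implementation), an encoding object could expose its code_point choice as a member type. The struct names, the code_unit member, and the uses_unicode_code_point_v trait below are all hypothetical and exist only for illustration; only the gb_code_point name comes from the discussion above.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical strong type: same 32-bit storage as char32_t, but a
// distinct type, so it never silently mixes with Unicode code points.
struct gb_code_point {
    char32_t value;
    friend constexpr bool operator==(gb_code_point a, gb_code_point b) noexcept {
        return a.value == b.value;
    }
};

// Hypothetical encoding objects. Each one names its own code_point type.
struct utf32_encoding_sketch {
    using code_unit = char32_t;
    using code_point = char32_t; // plain Unicode code point
};

struct gb18030_encoding_sketch {
    using code_unit = char;
    using code_point = gb_code_point; // PUA values carry GB 18030 meaning
};

// Hypothetical trait: does this encoding present plain Unicode code points?
template <typename Encoding>
inline constexpr bool uses_unicode_code_point_v =
    std::is_same_v<typename Encoding::code_point, char32_t>;

static_assert(uses_unicode_code_point_v<utf32_encoding_sketch>);
static_assert(!uses_unicode_code_point_v<gb18030_encoding_sketch>);
// No implicit conversion in either direction: crossing over must be explicit.
static_assert(!std::is_convertible_v<gb_code_point, char32_t>);
static_assert(!std::is_convertible_v<char32_t, gb_code_point>);

inline void usage_example() {
    const gb18030_encoding_sketch::code_point a{0xE000};
    const gb18030_encoding_sketch::code_point b{0xE000};
    assert(a == b);
    assert(sizeof(a) == sizeof(char32_t));
}
```

An encoding that is happy to present its PUA characters as ordinary code points would simply pick char32_t for code_point instead.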

>
> 2) The issue of PUA usage also comes up in the implementation of
> Encoding Objects. It seems likely that the current direction will
> necessitate some third party library to handle encodings other than the
> main UTF ones. That seems reasonable. But without some sort of standard
> mechanism that at least enumerates other common interpretations, and
> allows third party libraries to declare their support for such, there
> will be a combinatorial explosion of mutually incompatible encodings.

I think providing conversions to and from Unicode scalar values is enough.

My long-term plan is that the entire slew of WHATWG encodings should end up in the Standard Library. But that is not feasible or rational as a "Step 0"; it is important to remember that as far as Unicode support goes, C++ and C are at Step -1 in the Standard. utf8/16/32, ascii, and narrow_dynamic_encoding/wide_dynamic_encoding come first, so we can get people from String Literal / Platform Strings -> Unicode, and then we can address the large body of encodings that exist outside of what is currently present and ships with "The System". The design, as pointed out in the last answer, is intentionally open to make sure we are not cutting CJK encodings or PUA encodings off at the legs here and asking them to walk when it comes time to make sure we include a broad suite of encodings.

> 3) By a similar construction and often overlapping concerns, the
> availability of a standardized way for encodings to declare which
> version of unicode they support is quite important. It's also not clear
> how some of the round trip encodings can possibly be fully specified in
> the type system. For example, how could I properly encode "UTF-8 Unicode
> version 10" text containing emoji into "UTF-16 Unicode version 5" text
> using the PUA for representation for display on OS X 10.7?

Different versions of Unicode and PUA are a job for
std::unicode::character_database.

Perhaps it is a job for the database, but I want to be clear: what this proposal wants to deal with are encodings and -- potentially -- Normalization Forms. Encodings do not affect the interpretation of Emoji, and Normalization Forms have forward-compatibility guarantees since Unicode version 3.x. If emojis not defined in Unicode Version 5 are given to an application that only has knowledge of Unicode Version 5, then the default answer is {however the application handles error state in its text}. For example, web browsers that do not understand X emoji display the codepoint value boxes. Other applications display "?". Some display the missing-value "�". It is up to the application to handle characters that fall within the 21 bits allotted by Unicode, but which it does not understand, however it sees fit. That is on the application, its text renderer, etc.

The encoding layer is only there to check for valid encodings (e.g., "fits in the 21 bits of Unicode and is not an invalid sequence according to the specification of this encoding"): ascribing meaning to that is a layer above this and something every application has different answers to. Some error and crash, some display replacement text, some ignore the text entirely, some tank the HTTP request, etc.
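
A minimal sketch of what "valid at the encoding layer" means for a single 32-bit value, assuming the check is the usual Unicode scalar value test. The function name is hypothetical and not taken from the proposal.

```cpp
#include <cassert>

// Hypothetical helper: true if `value` is a Unicode scalar value, i.e. it
// fits in the Unicode code space and is not a surrogate. It says nothing
// about whether the value is assigned in any particular Unicode version.
constexpr bool is_unicode_scalar_value(char32_t value) noexcept {
    constexpr char32_t last_code_point = 0x10FFFF;
    constexpr char32_t first_surrogate = 0xD800;
    constexpr char32_t last_surrogate = 0xDFFF;
    return value <= last_code_point
        && !(value >= first_surrogate && value <= last_surrogate);
}

inline void usage_example() {
    // An emoji added after Unicode 5 is still valid at this layer; whether
    // the application can render it is a separate question.
    assert(is_unicode_scalar_value(U'\U0001F600'));
    // A Private Use Area value is valid too; its meaning is decided above.
    assert(is_unicode_scalar_value(0xE000));
    // Surrogates and values past U+10FFFF are rejected.
    assert(!is_unicode_scalar_value(0xD800));
    assert(!is_unicode_scalar_value(0x110000));
}
```

Everything this check accepts is passed along; what to do with an unknown but valid value is left to the application.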

>
> 4) The behavior of std::basic_text with respect to null termination is
> valid but seems potentially risky. As I understand it, std::basic_text
> will be null terminated if the underlying container is the default
> std::basic_string. This seems likely to result in encoding
> implementations which inadvertently assume null termination on their
> operands. Our work on early versions of the CsString library persuaded
> us that optional null termination is the source of some really obscure
> bugs of the buffer overrun variety, and we eventually elected to force
> null termination for all strings.

I think null termination is just bad design. Pointer + length is the way
to go.

Whether or not Null Termination is bad design, C and C++ are about compatibility with enormous hulking old beasts of existing practice and operating system APIs, many of which work in null termination land.

The current design does optionally null terminate. Many people have problems with the null terminator and are desperately trying to design around it for performance (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1072r5.html). It may be dangerous, but the choice should be given to the user. std::vector<T> is as much a valid backing code unit storage as std::basic_string, and I do not think I should force users to only have basic_string-like semantics.

Absolutely, the default should be basic_string. I plan to make sure that `.c_str()` is a member of a text object if and only if the backing storage is basic_string. Otherwise, there will be no `.c_str()` member, and code that uses `.c_str()` will stop compiling the moment someone changes the underlying container. Calls to `.data()` may still be problematic. I am glad for CsString's experience here; I will need to be very careful and bring this up to the Design Groups.
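
One possible way to get that behavior, sketched here with C++17 SFINAE. The class basic_text_sketch and both traits are hypothetical stand-ins and not the proposed std::basic_text interface; a C++20 requires-clause would express the same constraint more directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Trait: is the container a std::basic_string specialization?
template <typename T>
struct is_basic_string : std::false_type {};
template <typename C, typename Tr, typename A>
struct is_basic_string<std::basic_string<C, Tr, A>> : std::true_type {};

// Hypothetical stand-in for a text type over some code unit container.
template <typename Container = std::string>
class basic_text_sketch {
public:
    explicit basic_text_sketch(Container storage) : storage_(std::move(storage)) {}

    // Always available, but gives no null termination guarantee.
    const typename Container::value_type* data() const noexcept { return storage_.data(); }
    std::size_t size() const noexcept { return storage_.size(); }

    // Only exists when the backing storage is a basic_string.
    template <typename C = Container,
              typename std::enable_if<is_basic_string<C>::value, int>::type = 0>
    const typename C::value_type* c_str() const noexcept { return storage_.c_str(); }

private:
    Container storage_;
};

// Detection trait used to show that c_str() disappears for other containers.
template <typename T, typename = void>
struct has_c_str : std::false_type {};
template <typename T>
struct has_c_str<T, std::void_t<decltype(std::declval<const T&>().c_str())>>
    : std::true_type {};

static_assert(has_c_str<basic_text_sketch<std::string>>::value);
static_assert(!has_c_str<basic_text_sketch<std::vector<char>>>::value);

inline void usage_example() {
    basic_text_sketch<std::string> text(std::string("abc"));
    assert(text.c_str()[text.size()] == '\0'); // terminator is guaranteed here
}
```

With this shape, switching the container from std::string to std::vector<char> turns every `.c_str()` call into a compile error, while `.data()` keeps compiling and still needs care.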