Hi Tom, thanks for sending the docs. I have had only a little time to look over your "SG16 direction" doc.

Some thoughts:
(My email to the list might bounce; please forward if so.)

SG16 direction

> Microsoft does not yet offer full support for UTF-8 as the execution encoding

True, and annoying. It might be useful for app developers to lobby Microsoft to support UTF-8 as a "system code page", making a good technical case for why that's useful in addition to the UTF-16 system APIs. (E.g., easy char* output to a command-line window, portable code using UTF-8.)
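
For example, something like this already gets close for console output (a minimal sketch; assumes a Windows console whose font can actually display the characters):

    #include <cstdio>
    #include <windows.h>

    int main() {
        // Ask the console to interpret our char* output as UTF-8.
        SetConsoleOutputCP(CP_UTF8);
        // Plain char*/UTF-8 output, the same code as on other platforms.
        std::printf("Grüße, 世界\n");
        return 0;
    }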

> On Windows, the execution encoding is determined at program startup based on the current active code page.

True, but mostly irrelevant. Almost all C/C++ code on Windows works with UTF-16 (typed as WCHAR). (Absolutely all of the .NET environment works with UTF-16.)
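
In practice that means converting at the boundary and calling the "W" APIs. A minimal sketch (error handling mostly omitted):

    #include <string>
    #include <windows.h>

    // Convert UTF-8 to the UTF-16 std::wstring that Win32 "W" APIs
    // such as CreateFileW expect.
    std::wstring Utf8ToWide(const std::string &utf8) {
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), (int)utf8.size(), nullptr, 0);
        if (len <= 0) { return std::wstring(); }  // empty or ill-formed input
        std::wstring wide((size_t)len, L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(), (int)utf8.size(), &wide[0], len);
        return wide;
    }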

> Existing programs depend on the ability to dynamically change the execution encoding (within reason) in order for a server process to concurrently serve multiple clients with different locale settings.

This is possible, but it would be very unusual (and probably expensive and error-prone) to switch the process or thread locale on the fly to do so. Servers have been implemented for a long time using explicit conversion to/from explicitly named charsets.
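
With ICU, that per-request conversion is a couple of lines. A sketch (ToUtf8 is just a hypothetical helper name; the charset name would come from the client/request):

    #include <string>
    #include <unicode/unistr.h>

    // Convert a client's bytes from its declared charset to UTF-8,
    // instead of switching the process or thread locale.
    std::string ToUtf8(const char *bytes, const char *clientCharset) {
        icu::UnicodeString text(bytes, clientCharset);  // e.g. "shift_jis", "windows-1252"
        std::string utf8;
        text.toUTF8String(utf8);
        return utf8;
    }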

> Since the char16_t and char32_t encodings are currently implementation defined, they too could vary at run-time. However, as noted earlier, all implementations currently use UTF-16 and UTF-32 and do not support such variance. P1041 will solidify current practice and ensure these encodings are known at compile-time.

Good!

> wchar_t is a portability deadend

Right.

> we'll need to ensure that proposals for new Unicode features are implementable using ICU.

ICU does not need much. UTF-8 and UTF-16 string literals are great.
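
For example, with ICU 59 and later (where UChar is char16_t), a u"..." literal goes straight into UnicodeString:

    #include <unicode/unistr.h>

    // No cast, no conversion function, no invalid-UTF risk at this boundary.
    icu::UnicodeString greeting(u"Grüße, 世界");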

If anything, it might be useful for C++ experts to help ICU become more convenient.

> Avoid gratuitous departure from C

Thank you!

> The ordinary and wide execution encodings are not going away

They aren't, but you can ignore wchar_t except where it is the same as UTF-16 or UTF-32. I would not waste energy making new stuff work with arbitrary wchar_t.

> do we design for a single well known (though possibly implementation defined) internal encoding? Or do we continue the current practice of each program choosing its own internal encoding?

You seem to assume that a "program" generally uses a single internal encoding. In my experience, either UTF-8 or UTF-16 dominates, but both are used, together with some UTF-32, because a nontrivial program has to work with different third-party libraries. It is also common to convert to UTF-16 or UTF-32 temporarily for certain processing, such as dealing with CJK or emoji.
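
For example, pulling code points out of a UTF-8 string just for such a check, without converting the whole buffer, is a few lines with ICU's U8_NEXT macro (sketch):

    #include <cstdint>
    #include <string>
    #include <unicode/utf8.h>

    // Walk the code points of a UTF-8 string, e.g. to test each one
    // against CJK or emoji ranges.
    void ForEachCodePoint(const std::string &utf8) {
        int32_t i = 0;
        int32_t length = (int32_t)utf8.size();
        while (i < length) {
            UChar32 c;
            U8_NEXT(utf8.data(), i, length, c);  // c < 0 on ill-formed input
            // ... use c, a code point / UTF-32 unit ...
            (void)c;
        }
    }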

> ... char8_t ...

In my experience and opinion, a distinct-but-equal primitive type does more harm than good. If you want type safety, you need to define a string class. If you want it to guarantee well-formed UTF-whatever strings, you need to build that validation into the class implementation.

Adding a new type will not change decades of existing code that has been using char* strings and now largely assumes those to be in UTF-8, validated at boundaries.
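
To make the class idea concrete, a hypothetical sketch (names invented here), with well-formedness as a constructor invariant:

    #include <stdexcept>
    #include <string>
    #include <string_view>
    #include <unicode/utf8.h>

    // Type safety and validation come from the class invariant,
    // not from the code unit type.
    class Utf8String {
     public:
        explicit Utf8String(std::string_view bytes) : data_(bytes) {
            int32_t i = 0, length = (int32_t)data_.size();
            while (i < length) {
                UChar32 c;
                U8_NEXT(data_.data(), i, length, c);
                if (c < 0) { throw std::invalid_argument("ill-formed UTF-8"); }
            }
        }
        // Guaranteed well-formed from here on.
        const std::string &bytes() const { return data_; }
     private:
        std::string data_;
    };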

> Or are contiguous iterators and ranges over code units needed to achieve acceptable performance?

Users want Unicode, but they want it as fast as ASCII processing. You get good performance only by implementing at least fast paths for a fixed, known encoding, duplicating similar-but-different code for UTF-8 and UTF-16 where both are needed.
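
As a hypothetical sketch of what such a fast path looks like for UTF-8 (a UTF-16 version would be the same idea with different code, which is exactly the duplication I mean):

    #include <cstdint>
    #include <string_view>
    #include <unicode/utf8.h>

    // Count code points, byte-at-a-time while the input is pure ASCII,
    // falling back to full UTF-8 decoding only when needed.
    int32_t CountCodePoints(std::string_view utf8) {
        int32_t i = 0, length = (int32_t)utf8.size(), count = 0;
        while (i < length) {
            if ((uint8_t)utf8[i] < 0x80) {  // ASCII fast path
                ++i;
            } else {                        // slow path: full decode
                UChar32 c;
                U8_NEXT(utf8.data(), i, length, c);
            }
            ++count;
        }
        return count;
    }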

Many algorithms, even case mappings, require context, rather than working linearly on code points. None are defined on grapheme clusters.
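
ICU's case mapping APIs take the whole string plus a locale for exactly that reason; for example (assumes ICU 59+ for the char16_t literals):

    #include <unicode/locid.h>
    #include <unicode/unistr.h>

    void CaseMappingExamples() {
        // The trailing sigma lowercases to the final form U+03C2 only
        // because it ends the word: "ΟΔΟΣ" -> "οδος".
        icu::UnicodeString greek(u"ΟΔΟΣ");
        greek.toLower(icu::Locale::getRoot());

        // With a Turkish locale, 'I' lowercases to dotless 'ı', not 'i'.
        icu::UnicodeString turkish(u"TITLE");
        turkish.toLower(icu::Locale("tr"));
    }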

Best regards,
markus