sg16: Re: [SG16-Unicode] Draft SG16 direction paper

From: Markus Scherer <markus.icu_at_[hidden]>
Date: Wed, 3 Oct 2018 09:41:18 -0700

Hi Tom, thanks for sending the docs. I have had only a little time to look
over your "SG16 direction" doc.

Some thoughts:
(My email to the list might bounce; please forward if so.)

SG16 direction

> Microsoft does not yet offer full support for UTF-8 as the execution
> encoding

True, and annoying. It might be useful for app developers to lobby
Microsoft to support UTF-8 as a "system codepage", making a good technical
case for why that's useful in addition to the UTF-16 system APIs. (E.g.,
easy char* output to a command line window, portable code using UTF-8.)

> On Windows, the execution encoding is determined at program startup based
> on the current active code page.

True, but mostly irrelevant. Almost all C/C++ code on Windows works with
UTF-16 (typed as WCHAR). (Absolutely all of the .NET environment works with
UTF-16.)

> Existing programs depend on the ability to dynamically change the
> execution encoding (within reason) in order for a server process to
> concurrently serve multiple clients with different locale settings.

This is possible, but it would be very unusual (and probably expensive and
error-prone) to switch the process or thread locale on the fly to do so.
Servers have been implemented for a long time using explicit conversion
to/from explicitly named charsets.
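
To illustrate what "explicit conversion from an explicitly named charset" can look like, here is a minimal sketch. The function name to_utf8 and the two supported charset names are assumptions made for this example; a real server would use a conversion library (ICU, for instance) with a much larger charset table. The point is only that the charset travels with the data as a parameter, and the process or thread locale is never touched.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not taken from any real library: converts bytes in an
// explicitly named charset to UTF-8 without consulting the process locale.
// Only two charsets are handled to keep the sketch short.
std::string to_utf8(const std::string& bytes, const std::string& charset) {
    if (charset == "UTF-8") {
        // Already UTF-8; a real server would validate at this boundary.
        return bytes;
    }
    if (charset == "ISO-8859-1") {
        std::string out;
        out.reserve(bytes.size() * 2);
        for (char ch : bytes) {
            unsigned char b = static_cast<unsigned char>(ch);
            if (b < 0x80) {
                out.push_back(static_cast<char>(b));
            } else {
                // Latin-1 bytes map 1:1 to U+0080..U+00FF, which always
                // encode as two UTF-8 bytes.
                out.push_back(static_cast<char>(0xC0 | (b >> 6)));
                out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
            }
        }
        return out;
    }
    throw std::invalid_argument("unsupported charset: " + charset);
}
```

Each client request can then carry its own charset name, so concurrent clients with different settings need no shared mutable locale state.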

> Since the char16_t and char32_t encodings are currently implementation
> defined, they too could vary at run-time. However, as noted earlier, all
> implementations currently use UTF-16 and UTF-32 and do not support such
> variance. P1041 will solidify current practice and ensure these encodings
> are known at compile-time.

Good!

> wchar_t is a portability dead end

Right.

> we'll need to ensure that proposals for new Unicode features are
> implementable using ICU.

ICU does not need much. UTF-8 and UTF-16 string literals are great.

If anything, it might be useful for C++ experts to help ICU become more
convenient.

> Avoid gratuitous departure from C

Thank you!

> The ordinary and wide execution encodings are not going away

They aren't, but you can ignore wchar_t except where it is the same as
UTF-16 or UTF-32. I would not waste energy making new stuff work with
arbitrary wchar_t.

> do we design for a single well known (though possibly implementation
> defined) internal encoding? Or do we continue the current practice of each
> program choosing its own internal encoding?

You seem to assume that a "program" generally uses a single internal
encoding. In my experience, either UTF-8 or UTF-16 dominates, but both are
used, together with some UTF-32, so that a nontrivial program works with
different third-party libraries. It is also common to convert to UTF-16 or
UTF-32 temporarily for certain processing, such as dealing with CJK or
emoji.
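
As a rough illustration of such a temporary conversion, the sketch below decodes UTF-8 into UTF-32 so that each code point, including supplementary ones such as most emoji, becomes a single unit. The name utf8_to_utf32 and the policy of emitting U+FFFD for each malformed lead byte are assumptions for this example, not a description of any particular library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Malformed input yields U+FFFD and skips one byte, a simplified policy
// chosen for this sketch.
std::u32string utf8_to_utf32(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                       { cp = lead;        len = 1; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; len = 2; }
        else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; len = 3; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }

        bool ok = i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;      // not a trail byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (!ok || cp < min_cp[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After processing, the result would typically be converted back to whatever encoding the surrounding code and its third-party libraries expect.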

> ... char8_t ...

In my experience and opinion, a distinct-but-equal primitive type does more
harm than good. If you want type safety, you need to define a string class.
If you want it to validate for well-formed UTF-whatever strings, you need
to build that into the class implementation.

Adding a new type will not change decades of existing code that has been
using char* strings and now largely assumes those to be in UTF-8, validated
at boundaries.
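
A minimal sketch of what "build validation into the class" could mean follows. The class name Utf8String and its interface are hypothetical; the well-formedness check follows the byte ranges of the Unicode Standard's table of well-formed UTF-8 byte sequences. Existing char* code is unaffected; the class only marks the boundary where validation has happened.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical wrapper class: the invariant "contents are well-formed UTF-8"
// is established once, in the constructor.
class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_well_formed(bytes_)) {
            throw std::invalid_argument("ill-formed UTF-8");
        }
    }

    const std::string& bytes() const { return bytes_; }

    static bool is_well_formed(const std::string& s) {
        std::size_t i = 0;
        const std::size_t n = s.size();
        while (i < n) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            if (b < 0x80) { ++i; continue; }
            std::size_t len = 0;
            unsigned lo = 0x80, hi = 0xBF;  // range allowed for the first trail byte
            if (b >= 0xC2 && b <= 0xDF) {
                len = 2;
            } else if (b >= 0xE0 && b <= 0xEF) {
                len = 3;
                if (b == 0xE0) lo = 0xA0;   // excludes overlong forms
                if (b == 0xED) hi = 0x9F;   // excludes surrogates
            } else if (b >= 0xF0 && b <= 0xF4) {
                len = 4;
                if (b == 0xF0) lo = 0x90;   // excludes overlong forms
                if (b == 0xF4) hi = 0x8F;   // excludes values above U+10FFFF
            } else {
                return false;
            }
            if (i + len > n) return false;  // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                unsigned char t = static_cast<unsigned char>(s[i + k]);
                if (t < lo || t > hi) return false;
                lo = 0x80; hi = 0xBF;       // later trail bytes use the full range
            }
            i += len;
        }
        return true;
    }

private:
    std::string bytes_;
};
```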

> Or are contiguous iterators and ranges over code units needed to achieve
> acceptable performance?

Users want Unicode but want it as fast as ASCII processing. You get good
performance only by implementing at least fastpaths for a fixed, known
encoding, duplicating similar but different code for both UTF-8 and UTF-16
where both are needed.
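
As a small example of such a fastpath, the sketch below counts code points in UTF-8 text and handles runs of ASCII eight bytes at a time. The function name count_code_points is made up for this example, and the input is assumed to be well-formed UTF-8 already (validated at a boundary). The trick depends on contiguous code units in a fixed, known encoding; a UTF-16 version would need its own, similar but different, code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Counts code points in a string assumed to be well-formed UTF-8.
std::size_t count_code_points(const std::string& s) {
    const char* p = s.data();
    const std::size_t n = s.size();
    std::size_t i = 0, count = 0;
    while (i < n) {
        // Fast path: consume 8 bytes at a time while every byte is ASCII
        // (high bit clear), in which case each byte is one code point.
        while (i + 8 <= n) {
            std::uint64_t chunk;
            std::memcpy(&chunk, p + i, 8);
            if (chunk & 0x8080808080808080ULL) break;
            count += 8;
            i += 8;
        }
        if (i >= n) break;
        // Slow path: one byte; every byte except a trail byte (10xxxxxx)
        // starts a new code point.
        if ((static_cast<unsigned char>(p[i]) & 0xC0) != 0x80) ++count;
        ++i;
    }
    return count;
}
```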

Many algorithms, even case mappings, require context, rather than working
linearly on code points. None are defined on grapheme clusters.
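
The Greek capital sigma is a familiar case of context-dependent case mapping: its lowercase form depends on whether it ends a word. The sketch below is a deliberately simplified illustration. The names is_greek_letter and lowercase_sigma are made up here, and "letter" is approximated as a basic Greek letter; real implementations apply Unicode's Final_Sigma condition together with the full case mapping data.

```cpp
#include <cstddef>
#include <string>

// Simplifying assumption for this sketch: only basic Greek letters count
// as letters (U+0391..U+03A9 except unassigned U+03A2, and U+03B1..U+03C9).
static bool is_greek_letter(char32_t c) {
    return (c >= 0x0391 && c <= 0x03A9 && c != 0x03A2) ||
           (c >= 0x03B1 && c <= 0x03C9);
}

// Lowercases only U+03A3 GREEK CAPITAL LETTER SIGMA in a UTF-32 string,
// leaving every other code point unchanged.
std::u32string lowercase_sigma(const std::u32string& s) {
    std::u32string out = s;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] != 0x03A3) continue;
        bool letter_before = i > 0 && is_greek_letter(s[i - 1]);
        bool letter_after = i + 1 < s.size() && is_greek_letter(s[i + 1]);
        // Final form (U+03C2) only at the end of a word: a letter before
        // and no letter after. Otherwise the regular form (U+03C3).
        out[i] = (letter_before && !letter_after) ? 0x03C2 : 0x03C3;
    }
    return out;
}
```

The same input code point maps to two different outputs depending on its neighbors, which is why a per-code-point mapping function is not sufficient.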

Best regards,
markus

Received on 2018-10-03 18:41:34