Subject: Re: [SG16-Unicode] Draft SG16 direction paper
From: Tom Honermann (tom_at_[hidden])
Date: 2018-10-06 02:47:42
Thanks, Markus! I incorporated some of this feedback into an update of
the draft. A few other inline notes below.
On 10/03/2018 12:41 PM, Markus Scherer wrote:
> Hi Tom, thanks for sending the docs. I have had only a little time to
> look over your "SG16 direction" doc.
> Some thoughts:
> (My email to the list might bounce; please forward if so.)
> SG16 direction
> > Microsoft does not yet offer full support for UTF-8 as the execution
> True, and annoying. It might be useful for app developers to lobby
> Microsoft to support UTF-8 as a "system codepage", making a good
> technical case for why that's useful in addition to the UTF-16 system
> APIs. (E.g., easy char* output to a command line window, portable code
> using UTF-8.)
Microsoft seems to be listening as recent Windows 10 builds now support
this as a beta feature. Information on this is hard to find, but here
are a few links that discuss it:
I'm not sure what Microsoft's rollout strategy is here. Enabling the
feature changes the system code page for all programs and, well, some
aren't going to like that :)
> > On Windows, the execution encoding is determined at program startup
> > based on the current active code page.
> True, but mostly irrelevant. Almost all C/C++ code on Windows works
> with UTF-16 (typed as WCHAR). (Absolutely all of the .Net environment
> works with UTF-16.)
I've worked (and continue to work) on (large) projects that run on
Windows, and that use UTF-16 only when calling some Win32 APIs.
Regardless, text files are not typically UTF-16 and are often
interpreted as encoded according to the active code page.
> > Existing programs depend on the ability to dynamically change the
> > execution encoding (within reason) in order for a server process to
> > concurrently serve multiple clients with different locale settings.
> This is possible, but it would be very unusual (and probably expensive
> and error-prone) to switch the process or thread locale on the fly to
> do so. Servers have been implemented for a long time using explicit
> conversion to/from explicitly named charsets.
I agree. I didn't intend to imply this is a common practice.
> > Since the char16_t and char32_t encodings are currently
> > implementation defined, they too could vary at run-time. However, as
> > noted earlier, all implementations currently use UTF-16 and UTF-32 and
> > do not support such variance. P1041 will solidify current practice and
> > ensure these encodings are known at compile-time.
> > wchar_t is a portability dead end
> > we'll need to ensure that proposals for new Unicode features are
> > implementable using ICU.
> ICU does not need much. UTF-8 and UTF-16 string literals are great.
> If anything, it might be useful for C++ experts to help ICU become
> more convenient.
> > Avoid gratuitous departure from C
> Thank you!
> > The ordinary and wide execution encodings are not going away
> They aren't, but you can ignore wchar_t except where it is the same as
> UTF-16 or UTF-32. I would not waste energy making new stuff work with
> arbitrary wchar_t.
> > do we design for a single well known (though possibly implementation
> > defined) internal encoding? Or do we continue the current practice of
> > each program choosing its own internal encoding?
> You seem to assume that a "program" generally uses a single internal
> encoding. In my experience, either UTF-8 or UTF-16 dominates, but both
> are used, together with some UTF-32, so that a nontrivial program
> works with different third-party libraries. It is also common to
> convert to UTF-16 or UTF-32 temporarily for certain processing, such
> as dealing with CJK or emoji.
I didn't intend to convey that assumption. I updated the draft to
state "internal encoding(s)". I agree that, in practice, multiple
encodings get used.
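The "convert to UTF-32 temporarily for certain processing" pattern Markus describes might look like the sketch below. This is an illustrative minimal decoder, not code from ICU or the draft; it assumes well-formed input and does not diagnose overlong or out-of-range sequences:

```cpp
#include <string>
#include <vector>

// Sketch: temporarily convert UTF-8 to UTF-32 code points for processing
// (e.g. CJK or emoji handling). Assumes well-formed UTF-8 input.
std::vector<char32_t> utf8_to_utf32(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 1110xxxx
        else               { cp = b & 0x07; len = 4; }  // 11110xxx
        for (std::size_t j = 1; j < len; ++j)           // continuation bytes
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A caller can then work linearly on code points, e.g. `utf8_to_utf32("a\xC3\xA9")` yields `{U'a', 0x00E9}`.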
> > ... char8_t ...
> In my experience and opinion, a distinct-but-equal primitive type does
> more harm than good. If you want type safety, you need to define a
> string class. If you want it to validate for well-formed UTF-whatever
> strings, you need to build that into the class implementation.
> Adding a new type will not change decades of existing code that has
> been using char* strings and now largely assumes those to be in UTF-8,
> validated at boundaries.
Clearly we'll have to talk more about char8_t :)
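Markus's alternative to a new primitive type — type safety via a string class that validates at construction boundaries — could be sketched roughly as follows. The class and function names here are hypothetical, and the validator is deliberately simplified:

```cpp
#include <stdexcept>
#include <string>

// Simplified structural check: lead byte classes and continuation bytes only.
// A production validator must also reject overlong forms, surrogates, and
// code points above U+10FFFF.
bool looks_like_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80          ? 1
                        : (b >> 5) == 0x06  ? 2
                        : (b >> 4) == 0x0E  ? 3
                        : (b >> 3) == 0x1E  ? 4 : 0;
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t j = 1; j < len; ++j)
            if ((static_cast<unsigned char>(s[i + j]) >> 6) != 0x02)
                return false;
        i += len;
    }
    return true;
}

// Hypothetical string class: validation is built into the type, so the
// stored bytes stay plain char and interoperate with existing char* code.
class u8string_view_like {
public:
    explicit u8string_view_like(std::string s) : data_(std::move(s)) {
        if (!looks_like_utf8(data_))
            throw std::invalid_argument("not well-formed UTF-8");
    }
    const std::string& bytes() const { return data_; }
private:
    std::string data_;
};
```

The design choice Markus argues for is visible here: the invariant lives in the class, not in a distinct character type, so decades of `char*`-based code keep working unchanged.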
> > Or are contiguous iterators and ranges over code units needed to
> > achieve acceptable performance?
> Users want Unicode but want it as fast as ASCII processing. You get
> good performance only by implementing at least fastpaths for a fixed,
> known encoding, duplicating similar but different code for both UTF-8
> and UTF-16 where both are needed.
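A toy example of the fixed-encoding fast path Markus describes: counting code points in UTF-8 reduces to a single cheap test per byte, with no decoding at all. This is an illustrative sketch, not ICU's implementation:

```cpp
#include <string>

// Fast path possible only because the encoding is known to be UTF-8:
// every code point contributes exactly one non-continuation byte, so
// counting reduces to one mask-and-compare per byte (easily vectorized).
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        n += (b & 0xC0) != 0x80;  // skip 10xxxxxx continuation bytes
    return n;
}
```

For pure ASCII input this runs at byte-counting speed; the equivalent routine for UTF-16 (counting non-trailing-surrogate units) is similar but different code, which is the duplication Markus refers to.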
> Many algorithms, even case mappings, require context, rather than
> working linearly on code points. None are defined on grapheme clusters.
> Best regards,
SG16 list run by herb.sutter at gmail.com