sg16: Re: [SG16-Unicode] Draft SG16 direction paper

From: Tom Honermann <tom_at_[hidden]>
Date: Sat, 6 Oct 2018 03:47:42 -0400

Thanks, Markus! I incorporated some of this feedback into an update of
the draft. A few other inline notes below.

On 10/03/2018 12:41 PM, Markus Scherer wrote:
> Hi Tom, thanks for sending the docs. I have had only a little time
> to look over your "SG16 direction" doc.
>
> Some thoughts:
> (My email to the list might bounce; please forward if so.)
>
> SG16 direction
>
> > Microsoft does not yet offer full support for UTF-8 as the execution
> encoding
>
> True, and annoying. It might be useful for app developers to lobby
> Microsoft to support UTF-8 as a "system codepage", making a good
> technical case for why that's useful in addition to the UTF-16 system
> APIs. (E.g., easy char* output to a command line window, portable code
> using UTF-8.)

Microsoft seems to be listening, as recent Windows 10 builds now support
this as a beta feature. Information on this is hard to find, but here
are a few links that discuss it:
- https://news.ycombinator.com/item?id=15710685
- http://support.ricoh.com/html_gen/util/Info/Win10_Apr_2018.html

I'm not sure what Microsoft's rollout strategy is here. Enabling the
feature changes the system code page for all programs and, well, some
aren't going to like that :)

>
> > On Windows, the execution encoding is determined at program startup
> based on the current active code page.
>
> True, but mostly irrelevant. Almost all C/C++ code on Windows works
> with UTF-16 (typed as WCHAR). (Absolutely all of the .NET environment
> works with UTF-16.)

I've worked (and continue to work) on (large) projects that run on
Windows, and that use UTF-16 only when calling some Win32 APIs.
Regardless, text files are not typically UTF-16 and are often
interpreted as encoded according to the active code page.

>
> > Existing programs depend on the ability to dynamically change the
> execution encoding (within reason) in order for a server process to
> concurrently serve multiple clients with different locale settings.
>
> This is possible, but it would be very unusual (and probably expensive
> and error-prone) to switch the process or thread locale on the fly to
> do so. Servers have been implemented for a long time using explicit
> conversion to/from explicitly named charsets.

I agree. I didn't intend to imply this is a common practice.
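
As a minimal sketch of the explicit-conversion approach described above
(convert from a named charset rather than switching the process or
thread locale), the following converts ISO-8859-1 input to UTF-8. The
function name latin1_to_utf8 is hypothetical, and ISO-8859-1 is chosen
only because its bytes map directly to U+0000..U+00FF; a real server
would more likely call a conversion library such as ICU.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// The source charset is named by the caller's choice of function, so no
// global or per-thread locale state is read or modified.
std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII is encoded identically in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF take two UTF-8 code units: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Because the conversion is a pure function of its input, concurrent
requests from clients with different charsets can each call their own
converter without interfering with one another.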

>
> > Since the char16_t and char32_t encodings are currently
> implementation defined, they too could vary at run-time. However, as
> noted earlier, all implementations currently use UTF-16 and UTF-32 and
> do not support such variance. P1041 will solidify current practice and
> ensure these encodings are known at compile-time.
>
> Good!
>
> > wchar_t is a portability dead end
>
> Right.
>
> > we'll need to ensure that proposals for new Unicode features are
> implementable using ICU.
>
> ICU does not need much. UTF-8 and UTF-16 string literals are great.
>
> If anything, it might be useful for C++ experts to help ICU become
> more convenient.
>
> > Avoid gratuitous departure from C
>
> Thank you!
>
> > The ordinary and wide execution encodings are not going away
>
> They aren't, but you can ignore wchar_t except where it is the same as
> UTF-16 or UTF-32. I would not waste energy making new stuff work with
> arbitrary wchar_t.
>
> > do we design for a single well known (though possibly implementation
> defined) internal encoding? Or do we continue the current practice of
> each program choosing its own internal encoding?
>
> You seem to assume that a "program" generally uses a single internal
> encoding. In my experience, either UTF-8 or UTF-16 dominates, but both
> are used, together with some UTF-32, so that a nontrivial program
> works with different third-party libraries. It is also common to
> convert to UTF-16 or UTF-32 temporarily for certain processing, such
> as dealing with CJK or emoji.

I didn't intend to project that assumption. I updated the draft to
state "internal encoding(s)". I agree that, in practice, multiple
encodings get used.
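
To illustrate the temporary conversion mentioned above, here is a
minimal sketch that decodes UTF-8 to UTF-32 so that later processing can
work on whole code points (useful for emoji and other characters outside
the BMP). The name utf8_to_utf32 and the choice to return std::nullopt
on ill-formed input are assumptions made for this example, not something
taken from the draft.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 to UTF-32.
// Returns std::nullopt if the input is not well-formed UTF-8.
std::optional<std::u32string> utf8_to_utf32(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // Classify the lead byte; 0xC0, 0xC1 and 0xF5..0xFF are never valid.
        if (b0 < 0x80)                     { cp = b0;        len = 1; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { cp = b0 & 0x1F; len = 2; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { cp = b0 & 0x0F; len = 3; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { cp = b0 & 0x07; len = 4; }
        else return std::nullopt;

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        // Each trail byte must be 10xxxxxx and contributes six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A caller can decode once, run per-code-point logic on the resulting
std::u32string, and keep the original UTF-8 string as the stored form.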

>
> > ... char8_t ...
>
> In my experience and opinion, a distinct-but-equal primitive type does
> more harm than good. If you want type safety, you need to define a
> string class. If you want it to validate for well-formed UTF-whatever
> strings, you need to build that into the class implementation.
>
> Adding a new type will not change decades of existing code that has
> been using char* strings and now largely assumes those to be in UTF-8,
> validated at boundaries.

Clearly we'll have to talk more about char8_t :)

Tom.

>
> > Or are contiguous iterators and ranges over code units needed to
> achieve acceptable performance?
>
> Users want Unicode but want it as fast as ASCII processing. You get
> good performance only by implementing at least fast paths for a fixed,
> known encoding, duplicating similar but different code for both UTF-8
> and UTF-16 where both are needed.
>
> Many algorithms, even case mappings, require context, rather than
> working linearly on code points. None are defined on grapheme clusters.
>
> Best regards,
> markus
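
The fast-path idea in the last quoted section can be sketched as
follows: a code point counter for UTF-8 that checks eight bytes at a
time and only falls back to per-byte work when a non-ASCII byte is
present. The name count_code_points is hypothetical, and the function
assumes its input was already validated as well-formed UTF-8 at a
boundary; it is an illustration of the technique, not a tuned
implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: counts code points in well-formed UTF-8.
// Assumption: the input has already been validated elsewhere.
std::size_t count_code_points(std::string_view s) {
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        // Fast path: if the next eight bytes all have the high bit clear,
        // they are eight ASCII characters. The mask is the same in every
        // byte, so the check does not depend on byte order.
        if (s.size() - i >= 8) {
            std::uint64_t word;
            std::memcpy(&word, s.data() + i, sizeof word);
            if ((word & 0x8080808080808080ULL) == 0) {
                count += 8;
                i += 8;
                continue;
            }
        }
        // Slow path: every byte that is not a trail byte (10xxxxxx)
        // starts a new code point.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) ++count;
        ++i;
    }
    return count;
}
```

The same structure would have to be written again for UTF-16 with
different constants (for example, testing for surrogate code units),
which is the duplication the quoted text refers to.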

Received on 2018-10-06 09:56:50