sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sat, 27 Apr 2019 12:43:19 -0400

On Sat, Apr 27, 2019 at 9:18 AM Ville Voutilainen <
ville.voutilainen_at_[hidden]> wrote:

> On Sat, 27 Apr 2019 at 15:21, Tom Honermann <tom_at_[hidden]> wrote:
> >
> > On 4/27/19 6:28 AM, Henri Sivonen wrote:
> > > I'm happy to see that so far there has not been opposition to the core
> > > point on my write-up: Not adding new features for non-UTF execution
> > > encodings. With that, let's talk about the details.
> >
> > I see no need to take a strong stance against adding such new features.
> > If there is consensus that a feature is useful (at least to some subset
> > of users), implementors are not opposed, and the feature won't
> > complicate further language evolution, then I see no reason to be
> > opposed to it. There are, and will be for a long time to come, programs
> > that do not require Unicode and that need to operate in non-Unicode
> > environments. We don't need to make them a priority, but we don't need
> > to stand in their way either.
>
> I do think Henri's concerns about supporting non-UTF encodings are
> valid; there's questionable
> bang-for-buck in doing so, and we're not exactly removing something
> users of such encodings had,
> and catering to those encodings is certainly not free and probably not
> cheap.
> There is some value in possibly providing such users with migration
> paths to UTF encodings, though.
>

I agree with Tom and Ville here. The way we support people who use non-UTF
encodings is to give them ways to transcode from their execution encoding
to Unicode (using the UTF encoding of their choice); they then use the
Unicode support that everyone else will be getting.

By now, people who are using non-UTF encodings have already rolled their
own libraries for it: they can continue to use those libraries. The
standard need not promise arbitrary range-based
to_lower/to_upper/casefold/etc. based on wchar_t and char: those are dead
ends.

The most the standard should offer (and what the paper I am working on
proposes) is to accept any type that conforms to the Encoding interface for
transcoding. As long as that Encoding produces Unicode, a user can enjoy
the abstractions of the standard. This allows the many people who currently
work in environments where UTF is not used everywhere to transcode at
program boundaries and keep a well-supported Unicode machine internally.
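
A minimal sketch of what transcoding at a program boundary might look like
follows. It is an illustration under stated assumptions, not the interface
from the paper: the names latin1, decode_one, append_utf8, and to_utf8 are
all hypothetical, and a real Encoding interface would also need to handle
multi-unit sequences and errors.

```cpp
#include <string>
#include <string_view>

// Hypothetical encoding object for ISO-8859-1 (Latin-1). Every code unit
// maps directly to the Unicode scalar value with the same numeric value.
struct latin1 {
    using code_unit = char;

    static char32_t decode_one(code_unit c) {
        return static_cast<char32_t>(static_cast<unsigned char>(c));
    }
};

// Append the UTF-8 encoding of one Unicode scalar value to `out`.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Boundary function: anything that can produce Unicode scalar values can
// feed the Unicode-only internals. Here the internal form is UTF-8.
template <typename Encoding>
std::string to_utf8(std::basic_string_view<typename Encoding::code_unit> in) {
    std::string out;
    for (auto unit : in) {
        append_utf8(out, Encoding::decode_one(unit));
    }
    return out;
}
```

With this, to_utf8<latin1>("caf\xE9") yields the UTF-8 bytes for "café",
and everything past the boundary only ever sees Unicode.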

I am undecided on having strong types for every encoding or for
unicode_code_point / unicode_scalar_value for each encoding (or at all).

I am strongly opposed to ALL encodings taking std::byte as the code unit.
Such an interface means that implementers must be explicitly concerned
with endianness for every encoding whose code units are wider than 8 bits
(UTF-16 and UTF-32). We work with the natural width and endianness of the
machine by using the natural char8_t, char16_t, and char32_t. If someone
wants bytes in / bytes out, we should provide encoding-form wrappers that
produce Little Endian or Big Endian on explicit request:

// a wrapper that makes it so it works on a byte-by-byte basis,
// with the specified endianness
encoding_form<utf16, little_endian> ef{};

(Again, encoding should not be mixed with IO types. Using std::byte is
making IO and endianness a concern of the encoding: this is a failure of
separation of concerns.)
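
To make the separation concrete, here is a small sketch of the byte-level
step such a wrapper could perform. It is only an assumption about how this
might be written: endian_order and utf16_to_bytes are hypothetical names,
not part of any proposal. The encoding itself keeps working on char16_t
code units; byte order only appears when bytes are explicitly requested.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical tag selecting the byte order of the serialized form.
enum class endian_order { little, big };

// Serialize UTF-16 code units into bytes in the requested order. The
// result does not depend on the endianness of the machine running it,
// because each code unit is split with shifts rather than memcpy.
template <endian_order Order>
std::vector<std::byte> utf16_to_bytes(std::u16string_view units) {
    std::vector<std::byte> out;
    out.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto low = static_cast<std::byte>(unit & 0xFF);
        const auto high = static_cast<std::byte>((unit >> 8) & 0xFF);
        if constexpr (Order == endian_order::little) {
            out.push_back(low);
            out.push_back(high);
        } else {
            out.push_back(high);
            out.push_back(low);
        }
    }
    return out;
}
```

For U+20AC (a single code unit, 0x20AC), the little-endian form is the
bytes AC 20 and the big-endian form is 20 AC.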

Finally, exceptions are exactly the wrong mechanism for this and are
explicitly recommended against by the Unicode Technical Reports themselves.
As previously pointed out, Denial of Service from sending bad text is a bad
default to give programs. The replacement character is the chief
error-handling mechanism here: it has industry experience behind it and has
seen extensive wins. However, someone should be able to choose a different
error-handling mechanism by passing it in or putting it in the type system.
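
As a rough sketch of that shape (replacement by default, with the handler
swappable through the type system), consider the following. All names here
(replacement_handler, skip_handler, decode_utf8) are hypothetical. The
decoder is deliberately simplified: on a bad sequence it consults the
handler once and then advances by a single byte, which is not necessarily
the number of replacement characters a production decoder would emit.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Default policy: substitute U+FFFD REPLACEMENT CHARACTER and keep going.
struct replacement_handler {
    std::optional<char32_t> operator()(std::string_view /*input*/,
                                       std::size_t /*position*/) const {
        return U'\uFFFD';
    }
};

// Alternative policy: silently drop the offending byte.
struct skip_handler {
    std::optional<char32_t> operator()(std::string_view /*input*/,
                                       std::size_t /*position*/) const {
        return std::nullopt;
    }
};

// Decode UTF-8 into scalar values. Never throws on malformed input; the
// handler decides what, if anything, is emitted in its place.
template <typename ErrorHandler = replacement_handler>
std::u32string decode_utf8(std::string_view in, ErrorHandler on_error = {}) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;  // stays 0 for an invalid lead byte
        char32_t cp = 0;
        char32_t min = 0;     // smallest value allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;  // must be a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            if (auto replacement = on_error(in, i)) {
                out.push_back(*replacement);
            }
            i += 1;
        }
    }
    return out;
}
```

Calling decode_utf8("a\xFFz") produces U+0061, U+FFFD, U+007A, while
decode_utf8("a\xFFz", skip_handler{}) produces just "az". A caller who
really wants to fail hard can supply a handler that does so, but has to
opt in.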

> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-04-27 18:43:34