On Sat, Apr 27, 2019 at 9:18 AM Ville Voutilainen <ville.voutilainen@gmail.com> wrote:

On Sat, 27 Apr 2019 at 15:21, Tom Honermann <tom@honermann.net> wrote:
>
> On 4/27/19 6:28 AM, Henri Sivonen wrote:
> > I'm happy to see that so far there has not been opposition to the core
> > point on my write-up: Not adding new features for non-UTF execution
> > encodings. With that, let's talk about the details.
>
> I see no need to take a strong stance against adding such new features.
> If there is consensus that a feature is useful (at least to some subset
> of users), implementors are not opposed, and the feature won't
> complicate further language evolution, then I see no reason to be
> opposed to it. There are, and will be for a long time to come, programs
> that do not require Unicode and that need to operate in non-Unicode
> environments. We don't need to make them a priority, but we don't need
> to stand in their way either.

I do think Henri's concerns about supporting non-UTF encodings are valid; there's questionable bang-for-buck in doing so, and we're not exactly removing something users of such encodings had, and catering to those encodings is certainly not free and probably not cheap. There is some value in possibly providing such users with migration paths to UTF encodings, though.

I agree with Tom and Ville here. The way to support people who use non-UTF encodings is to give them ways to transcode from their execution encoding to Unicode (using the UTF encoding of their choice); they can then use the same Unicode support that everyone else will be getting.

By now, people who are using non-UTF encodings have already rolled their own libraries for them, and they can continue to use those libraries. The standard need not promise arbitrary range-based to_lower/to_upper/casefold/etc. based on wchar_t and char: those are dead ends.

The most the standard should offer (and what the paper I am working on proposes) is to accept any type that conforms to the Encoding interface for transcoding. As long as that Encoding produces Unicode, a user can enjoy the abstractions of the standard. This allows the many people who currently work in environments where UTF is not used everywhere to transcode at program boundaries and keep well-supported Unicode machinery internally.
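
As a rough illustration of the "any conforming Encoding can transcode" idea, here is a minimal sketch. Every name in it (latin1_encoding, decode_one, append_utf8, transcode_to_utf8) is hypothetical and is not taken from the paper; it only assumes that an Encoding exposes a code unit type and a way to decode one Unicode code point at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical user-provided encoding: ISO-8859-1 (Latin-1). Every byte maps
// directly to the Unicode code point with the same numeric value.
struct latin1_encoding {
    using code_unit = char;

    // Decodes one code point starting at input[pos] and advances pos.
    static char32_t decode_one(std::basic_string_view<code_unit> input,
                               std::size_t& pos) {
        assert(pos < input.size());
        return static_cast<char32_t>(static_cast<unsigned char>(input[pos++]));
    }
};

// Appends the UTF-8 encoding of a Unicode scalar value to out.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Generic transcoder: works with any type that provides code_unit and
// decode_one, which is the (assumed) shape of the Encoding interface here.
template <typename Encoding>
std::string transcode_to_utf8(
    std::basic_string_view<typename Encoding::code_unit> input) {
    std::string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        append_utf8(out, Encoding::decode_one(input, pos));
    }
    return out;
}
```

With this shape, a program would call something like transcode_to_utf8<latin1_encoding>(legacy_text) at its boundary and work with UTF-8 from there on.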

I am undecided on having strong types for every encoding or for unicode_code_point / unicode_scalar_value for each encoding (or at all).

I am strongly opposed to ALL encodings taking std::byte as the code unit. Such an interface means that implementers must now be explicitly concerned with endianness for every encoding whose code units are wider than 8 bits (UTF-16 and UTF-32). We should work with the natural width and endianness of the machine by using the native char8_t, char16_t, and char32_t. If someone wants bytes in / bytes out, we should provide encoding-form wrappers that use little endian or big endian on explicit request:

encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so it works on a byte-by-byte basis, with the specified endianness

(Again, encoding should not be mixed with IO types. Using std::byte is making IO and endianness a concern of the encoding: this is a failure of separation of concerns.)
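
To make the separation concrete, here is a minimal sketch of the byte-level layer that a wrapper like encoding_form might delegate to. The names byte_order, utf16_to_bytes, and utf16_from_bytes are hypothetical; the only assumption is that the encoding itself works on native char16_t code units and that the byte order is chosen explicitly at the boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical tag for the byte order requested by the caller.
enum class byte_order { little_endian, big_endian };

// Serializes native char16_t code units into bytes in the requested order.
template <byte_order Order>
std::vector<std::byte> utf16_to_bytes(std::u16string_view units) {
    std::vector<std::byte> out;
    out.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto high = static_cast<std::byte>(unit >> 8);
        const auto low = static_cast<std::byte>(unit & 0xFF);
        if constexpr (Order == byte_order::big_endian) {
            out.push_back(high);
            out.push_back(low);
        } else {
            out.push_back(low);
            out.push_back(high);
        }
    }
    return out;
}

// Reassembles native char16_t code units from bytes in the stated order.
template <byte_order Order>
std::u16string utf16_from_bytes(const std::vector<std::byte>& bytes) {
    assert(bytes.size() % 2 == 0 && "UTF-16 byte streams hold whole code units");
    std::u16string units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const auto first = std::to_integer<unsigned>(bytes[i]);
        const auto second = std::to_integer<unsigned>(bytes[i + 1]);
        const unsigned value = (Order == byte_order::big_endian)
                                   ? (first << 8) | second
                                   : (second << 8) | first;
        units.push_back(static_cast<char16_t>(value));
    }
    return units;
}
```

Only these two functions ever look at byte order; everything that encodes or decodes UTF-16 proper keeps working on char16_t.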

Finally, exceptions are exactly the wrong mechanism for this, and the Unicode Technical Reports themselves explicitly recommend against them. As previously pointed out, denial of service from receiving bad text is a bad default to give programs. The replacement character is the chief error-handling mechanism here: it has industry experience behind it and has seen extensive wins. However, users should be able to choose a different error-handling mechanism, either by passing it in or by putting it in the type system.
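
A minimal sketch of that default, with the handler as an overridable parameter, might look like the following. The names replacement_handler and decode_utf8 are hypothetical, and the policy of emitting one U+FFFD per offending byte is just one simple choice made for this sketch, not something the thread prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Default policy: substitute U+FFFD REPLACEMENT CHARACTER and keep going.
struct replacement_handler {
    char32_t operator()(std::string_view, std::size_t) const noexcept {
        return U'\uFFFD';
    }
};

// Decodes UTF-8 into code points. On malformed input the handler is asked
// for a substitute and decoding resumes one byte later. A caller who prefers
// another policy passes a different callable.
template <typename ErrorHandler = replacement_handler>
std::u32string decode_utf8(std::string_view input,
                           ErrorHandler on_error = ErrorHandler{}) {
    std::u32string out;
    std::size_t pos = 0;
    while (pos < input.size()) {
        const auto lead = static_cast<unsigned char>(input[pos]);
        std::size_t length = 0;  // 0 means "invalid lead byte"
        char32_t cp = 0;
        char32_t min = 0;        // smallest value allowed for this length
        if (lead < 0x80) { length = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; min = 0x10000; }

        bool ok = length != 0 && pos + length <= input.size();
        for (std::size_t i = 1; ok && i < length; ++i) {
            const auto trail = static_cast<unsigned char>(input[pos + i]);
            ok = (trail & 0xC0) == 0x80;
            cp = (cp << 6) | (trail & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            assert(length >= 1 && length <= 4);
            out.push_back(cp);
            pos += length;
        } else {
            out.push_back(on_error(input, pos));
            pos += 1;
        }
    }
    return out;
}
```

Calling decode_utf8(text) gives the replacement-character behavior by default, while passing a lambda as the second argument (for example one that returns U'?' or records the error position) selects a different policy without changing the decoder.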

_______________________________________________
SG16 Unicode mailing list
Unicode@isocpp.open-std.org
http://www.open-std.org/mailman/listinfo/unicode