sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Steve Downey <sdowney_at_[hidden]>
Date: Sat, 27 Apr 2019 13:20:16 -0400

Execution encoding is a trap, and has to be avoided or worked around for
server-side programs. Data often arrives with a self-described encoding, and
you cannot use locale mechanisms to deal with it.

This is also why I am asking for byte (octet) options for decoding. UTF-16
does not arrive as char16_t; it arrives as a stream of bytes, and using
char or even unsigned char is misleading. That is the reason we have
std::byte.

There should never be a question of endianness for char16_t. It is in host
order. Any other choice is simply too dangerous. Encodings with a specified
byte order should be read from or written to byte-oriented streams.
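
To make that concrete, here is a rough sketch of what I mean, not a proposed
interface. The names byte_order, utf16_from_bytes, and utf16_to_bytes are
made up for illustration. The byte order is stated only at the byte boundary,
and the char16_t values are always ordinary host-order integers. It does no
validation of surrogates, and it silently drops a trailing odd byte.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names for illustration only; not from any paper or library.
enum class byte_order { little, big };

// Build host-order char16_t code units from bytes whose order is stated
// explicitly. No surrogate validation; a trailing odd byte is ignored.
inline std::u16string utf16_from_bytes(const std::vector<std::byte>& in,
                                       byte_order order) {
    std::u16string out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        const unsigned first = std::to_integer<unsigned>(in[i]);
        const unsigned second = std::to_integer<unsigned>(in[i + 1]);
        const unsigned unit = (order == byte_order::little)
                                  ? (first | (second << 8))
                                  : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}

// Serialize host-order code units to bytes in the requested order.
inline std::vector<std::byte> utf16_to_bytes(const std::u16string& in,
                                             byte_order order) {
    std::vector<std::byte> out;
    out.reserve(in.size() * 2);
    for (const char16_t unit : in) {
        const std::byte low = static_cast<std::byte>(unit & 0xFF);
        const std::byte high = static_cast<std::byte>((unit >> 8) & 0xFF);
        if (order == byte_order::little) {
            out.push_back(low);
            out.push_back(high);
        } else {
            out.push_back(high);
            out.push_back(low);
        }
    }
    return out;
}
```

Because the arithmetic uses shifts on values rather than reinterpreting
memory, the same code gives the same char16_t values on little-endian and
big-endian hosts.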

On Sat, Apr 27, 2019, 12:43 JeanHeyd Meneide <phdofthehouse_at_[hidden]>
wrote:

> On Sat, Apr 27, 2019 at 9:18 AM Ville Voutilainen <
> ville.voutilainen_at_[hidden]> wrote:
>
>> On Sat, 27 Apr 2019 at 15:21, Tom Honermann <tom_at_[hidden]> wrote:
>> >
>> > On 4/27/19 6:28 AM, Henri Sivonen wrote:
>> > > I'm happy to see that so far there has not been opposition to the core
>> > > point on my write-up: Not adding new features for non-UTF execution
>> > > encodings. With that, let's talk about the details.
>> >
>> > I see no need to take a strong stance against adding such new features.
>> > If there is consensus that a feature is useful (at least to some subset
>> > of users), implementors are not opposed, and the feature won't
>> > complicate further language evolution, then I see no reason to be
>> > opposed to it. There are, and will be for a long time to come, programs
>> > that do not require Unicode and that need to operate in non-Unicode
>> > environments. We don't need to make them a priority, but we don't need
>> > to stand in their way either.
>>
>> I do think Henri's concerns about supporting non-UTF encodings are
>> valid; there's questionable
>> bang-for-buck in doing so, and we're not exactly removing something
>> users of such encodings had,
>> and catering to those encodings is certainly not free and probably not
>> cheap.
>> There is some value in possibly providing such users with migration
>> paths to UTF encodings, though.
>>
>
> I agree with Tom and Ville here. The way we support people who use non-UTF
> encodings is to give them ways to transcode from their execution encoding
> to Unicode (by using the UTF encoding of choice), and then they use the
> Unicode Support that everyone else will be getting.
>
> By now, people who are using non-UTF encodings have already rolled their
> own libraries for it: they can continue to use those libraries. The
> standard need not promise arbitrary range-based
> to_lower/to_upper/casefold/etc. based on wchar_t and char: those are dead
> ends.
>
> The most the standard should offer (and the paper I am working on) is that
> we will accept any type that conforms to the Encoding interface to
> transcode. As long as that Encoding spits out Unicode, a user can enjoy the
> abstractions of the standard. This allows many people who currently work in
> environments where UTF is not used everywhere to transcode at program
> boundaries and keep a well-supported Unicode machine internally.
>
> I am undecided on having strong types for every encoding or for
> unicode_code_point / unicode_scalar_value for each encoding (or at all).
>
> I am strongly opposed to ALL encodings taking std::byte as the code unit.
> This interface means that implementers must now be explicitly concerned
> with endianness for any encoding whose code units are wider than 8 bits
> (UTF-16 and UTF-32). We work with the natural width and
> endianness of the machine by using the natural char8_t, char16_t, and
> char32_t. If someone wants bytes in / bytes out, we should provide
> encoding-form wrappers that put it in Little Endian or Big Endian on
> explicit request:
>
> // a wrapper that makes it work on a byte-by-byte basis,
> // with the specified endianness
> encoding_form<utf16, little_endian> ef{};
>
> (Again, encoding should not be mixed with IO types. Using std::byte is
> making IO and endianness a concern of the encoding: this is a failure of
> separation of concerns.)
>
> Finally, exceptions are exactly the wrong mechanism for this and are
> explicitly recommended against by the Unicode Technical Reports themselves.
> As previously pointed out, Denial of Service from sending bad text is a bad
> default to give programs. The replacement character is the chief mechanism
> of error handling here that has industry experience behind it and has seen
> extensive wins. However, someone should be able to choose a different error handling
> mechanism if they pass it in or put it in the type system.
>
>
>> _______________________________________________
>> SG16 Unicode mailing list
>> Unicode_at_[hidden]
>> http://www.open-std.org/mailman/listinfo/unicode
>>

Received on 2019-04-27 19:20:30