Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Corentin <corentin.jabot_at_[hidden]>
Date: Sat, 27 Apr 2019 23:37:38 +0200
On Sat, 27 Apr 2019 at 19:20, Steve Downey <sdowney_at_[hidden]> wrote:

> Execution encoding is a trap, and has to be avoided or worked around for
> server-side programs. Data often comes in with a self-described encoding. You
> cannot use locale mechanisms to deal with it.
> This is also why I am asking for byte, octet, options for decoding. UTF-16
> does not arrive as char16_t, it arrives as a stream of bytes, and using
> char or even unsigned char is misleading. It's the reason we have
> std::byte.
> There should never be a question of endianness for char16_t. It is in host
> order. Any other choice is simply too dangerous. Specified-order encodings
> should be read or written from or to byte-oriented streams.

Modulo the BOM, endianness is a concern for the underlying stream, not the text.
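
To make that split concrete, here is a minimal sketch of a stream-side step that turns UTF-16 bytes into host-order char16_t. The names decode_utf16_bytes and byte_order are made up for illustration and are not proposed API. It assumes a leading BOM, if present, decides the byte order, and that the caller supplies the order otherwise. An odd trailing byte is silently dropped here, which a real implementation would report.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names: nothing here is standard or proposed API.
enum class byte_order { little, big };

// Decode a UTF-16 byte stream into host-order char16_t code units.
// A leading BOM (FF FE or FE FF) overrides `fallback` and is consumed.
// An odd trailing byte is ignored in this sketch.
inline std::u16string decode_utf16_bytes(const std::vector<std::byte>& in,
                                         byte_order fallback = byte_order::big) {
    std::size_t pos = 0;
    byte_order order = fallback;
    if (in.size() >= 2) {
        const auto b0 = std::to_integer<unsigned>(in[0]);
        const auto b1 = std::to_integer<unsigned>(in[1]);
        if (b0 == 0xFF && b1 == 0xFE) { order = byte_order::little; pos = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { order = byte_order::big; pos = 2; }
    }
    std::u16string out;
    for (; pos + 1 < in.size(); pos += 2) {
        const auto first = std::to_integer<unsigned>(in[pos]);
        const auto second = std::to_integer<unsigned>(in[pos + 1]);
        // Assemble the code unit arithmetically, so the result is correct
        // regardless of the endianness of the machine running this code.
        const unsigned unit = (order == byte_order::little)
                                  ? ((second << 8) | first)
                                  : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

Everything after this function only ever sees char16_t in host order, so the text-handling code never has to ask about endianness.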

> On Sat, Apr 27, 2019, 12:43 JeanHeyd Meneide <phdofthehouse_at_[hidden]>
> wrote:
>> On Sat, Apr 27, 2019 at 9:18 AM Ville Voutilainen <
>> ville.voutilainen_at_[hidden]> wrote:
>>> On Sat, 27 Apr 2019 at 15:21, Tom Honermann <tom_at_[hidden]> wrote:
>>> >
>>> > On 4/27/19 6:28 AM, Henri Sivonen wrote:
>>> > > I'm happy to see that so far there has not been opposition to the core
>>> > > point on my write-up: Not adding new features for non-UTF execution
>>> > > encodings. With that, let's talk about the details.
>>> >
>>> > I see no need to take a strong stance against adding such new features.
>>> > If there is consensus that a feature is useful (at least to some subset
>>> > of users), implementors are not opposed, and the feature won't
>>> > complicate further language evolution, then I see no reason to be
>>> > opposed to it. There are, and will be for a long time to come, programs
>>> > that do not require Unicode and that need to operate in non-Unicode
>>> > environments. We don't need to make them a priority, but we don't need
>>> > to stand in their way either.
>>> I do think Henri's concerns about supporting non-UTF encodings are valid;
>>> there's questionable bang-for-buck in doing so, and we're not exactly
>>> removing something users of such encodings had, and catering to those
>>> encodings is certainly not free and probably not cheap.
>>> There is some value in possibly providing such users with migration paths
>>> to UTF encodings, though.
>> I agree with Tom and Ville here. The way we support people who use
>> non-UTF encodings is to give them ways to transcode from their execution
>> encoding to Unicode (by using the UTF encoding of choice), and then they
>> use the Unicode Support that everyone else will be getting.
>> By now, people who are using non-UTF encodings have already rolled their
>> own libraries for it: they can continue to use those libraries. The
>> standard need not promise arbitrary range-based
>> to_lower/to_upper/casefold/etc. based on wchar_t and char_t: those are dead
>> ends.
>> The most the standard should offer (and the paper I am working on) is
>> that we will accept any type that conforms to the Encoding interface to
>> transcode. As long as that Encoding spits out Unicode, a user can enjoy the
>> abstractions of the standard. This allows many people who currently work in
>> environments where UTF is not used everywhere to transcode at program
>> boundaries and keep a well-supported Unicode machine internally.
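
A rough sketch of what "transcode at program boundaries" could look like follows. The latin1 type, its decode_one member, and to_utf8 are all hypothetical; the Encoding interface in the paper mentioned above may look nothing like this. The only assumption is that an encoding object can hand back one Unicode scalar value at a time.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical encoding object for ISO-8859-1, whose byte values map
// one-to-one onto the first 256 Unicode scalar values.
struct latin1 {
    // Decode one scalar value starting at in[pos] and advance pos.
    char32_t decode_one(std::string_view in, std::size_t& pos) const {
        return static_cast<unsigned char>(in[pos++]);
    }
};

// Append the UTF-8 form of a scalar value (assumed to be valid).
inline void append_utf8(std::string& out, char32_t c) {
    if (c < 0x80) {
        out.push_back(static_cast<char>(c));
    } else if (c < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (c >> 6)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else if (c < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (c >> 12)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (c >> 18)));
        out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
}

// Boundary function: anything with a decode_one member can feed it.
template <class Encoding>
std::string to_utf8(std::string_view in, const Encoding& enc) {
    std::string out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        append_utf8(out, enc.decode_one(in, pos));
    }
    return out;
}
```

A call such as to_utf8(legacy_text, latin1{}) at the input boundary would then leave the rest of the program dealing with UTF-8 only.
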
>> I am undecided on having strong types for every encoding or for
>> unicode_code_point / unicode_scalar_value for each encoding (or at all).
>> I am strongly opposed to ALL encodings taking std::byte as the code unit.
>> This interface means that implementers must now be explicitly concerned
>> with endianness for any encoding whose code units are wider than 8 bits
>> (UTF-16 and UTF-32). We work with the natural width and
>> endianness of the machine by using the natural char8_t, char16_t, and
>> char32_t. If someone wants bytes in / bytes out, we should provide
>> encoding-form wrappers that put it in Little Endian or Big Endian on
>> explicit request:
>>     // a wrapper that works on a byte-by-byte basis, with the specified endianness
>>     encoding_form<utf16, little_endian> ef{};
>> (Again, encoding should not be mixed with IO types. Using std::byte is
>> making IO and endianness a concern of the encoding: this is a failure of
>> separation of concerns.)
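
For illustration only, one way such a wrapper might be filled in is sketched below. The tag types, the code_unit alias, and the to_bytes member are guesses made for this sketch; only the spelling encoding_form<utf16, little_endian> comes from the message above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical tag types and encoding descriptor.
struct little_endian {};
struct big_endian {};
struct utf16 { using code_unit = char16_t; };

template <class Encoding, class Endian>
struct encoding_form {
    using code_unit = typename Encoding::code_unit;

    // Serialize host-order code units as bytes in the requested order.
    std::vector<std::byte> to_bytes(std::basic_string_view<code_unit> units) const {
        constexpr std::size_t width = sizeof(code_unit);
        std::vector<std::byte> out;
        out.reserve(units.size() * width);
        for (const code_unit u : units) {
            const auto value = static_cast<std::uint32_t>(u);
            for (std::size_t i = 0; i < width; ++i) {
                // Little endian emits the low byte first, big endian the high byte.
                const std::size_t shift = std::is_same_v<Endian, little_endian>
                                              ? 8 * i
                                              : 8 * (width - 1 - i);
                out.push_back(static_cast<std::byte>((value >> shift) & 0xFF));
            }
        }
        return out;
    }
};
```

The code that works on char16_t never mentions byte order; only the wrapper does, and only when bytes are explicitly requested.
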
>> Finally, exceptions are exactly the wrong mechanism for this and are
>> explicitly recommended against by the Unicode Technical Reports themselves.
>> As previously pointed out, Denial of Service from sending bad text is a bad
>> default to give programs. The replacement character is the chief mechanism
>> of error handling here; it has industry experience behind it and has seen
>> extensive wins. However, someone should be able to choose a different error handling
>> mechanism if they pass it in or put it in the type system.
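
As a sketch of "replacement character by default, something else on request": the decoder below takes its error policy as a template parameter. The names decode_utf8 and replace_errors are hypothetical. On invalid input it resynchronizes one byte at a time, which is a simplification and may emit more U+FFFD characters than the substitution practice the Unicode Standard recommends.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Default policy: substitute U+FFFD and keep going, no exceptions.
struct replace_errors {
    char32_t operator()() const noexcept { return U'\uFFFD'; }
};

template <class ErrorHandler = replace_errors>
std::u32string decode_utf8(std::string_view in, ErrorHandler on_error = {}) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // 0 means "not a valid lead byte"
        char32_t cp = 0;
        char32_t min = 0;      // smallest value allowed for this length
        if (lead < 0x80) { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(on_error()); i += 1; }
    }
    return out;
}
```

Calling decode_utf8(text) gets the replacement-character behavior, while passing a callable such as a lambda returning U'?' selects a different policy without touching the decoder.
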
>>> _______________________________________________
>>> SG16 Unicode mailing list
>>> Unicode_at_[hidden]
>>> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-04-27 23:37:52