Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: JeanHeyd Meneide <phdofthehouse_at_[hidden]>
Date: Sun, 28 Apr 2019 01:42:14 -0400
char16_t and char32_t will always be in host order.

A sequence of std::byte may not be.
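
To make the distinction concrete, here is a minimal sketch of reading UTF-16 that arrives as bytes in a declared byte order and assembling it into host-order char16_t code units. The names `byte_order` and `decode_utf16_bytes` are hypothetical, invented for this illustration; they are not taken from any paper or proposal. No BOM handling or validation is attempted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tag for the byte order declared by the stream (an assumption
// of this sketch, not an existing standard facility).
enum class byte_order { little, big };

// Hypothetical helper: combine pairs of bytes into char16_t code units.
// The result is always in host order, because each value is built
// arithmetically rather than by reinterpreting memory.
// A trailing odd byte, if any, is ignored in this sketch.
std::u16string decode_utf16_bytes(const std::vector<std::byte>& bytes,
                                  byte_order order) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned first = std::to_integer<unsigned>(bytes[i]);
        const unsigned second = std::to_integer<unsigned>(bytes[i + 1]);
        const unsigned value = (order == byte_order::little)
                                   ? (first | (second << 8))
                                   : ((first << 8) | second);
        out.push_back(static_cast<char16_t>(value));
    }
    return out;
}
```

With this sketch, the bytes 41 00 AC 20 read as little endian and the bytes 00 41 20 AC read as big endian both yield the same two code units, U+0041 and U+20AC. Endianness is settled at the byte boundary and never reaches code that works on char16_t.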

On Sat, Apr 27, 2019 at 5:37 PM Corentin <corentin.jabot_at_[hidden]> wrote:

> On Sat, 27 Apr 2019 at 19:20, Steve Downey <sdowney_at_[hidden]> wrote:
>> Execution encoding is a trap, and has to be avoided or worked around for
>> server-side programs. Data often comes in with a self-described encoding.
>> You cannot use locale mechanisms to deal with it.
>> This is also why I am asking for byte (octet) options for decoding.
>> UTF-16 does not arrive as char16_t; it arrives as a stream of bytes, and
>> using char or even unsigned char is misleading. It's the reason we have
>> std::byte.
>> There should never be a question of endianness for char16_t. It is in
>> host order. Any other choice is simply too dangerous. Specified order
>> encoding should be read or written from or to byte oriented streams.
> +1
> Modulo BOM, endianness is a concern for the underlying stream, not the
> text encoder.
>> On Sat, Apr 27, 2019, 12:43 JeanHeyd Meneide <phdofthehouse_at_[hidden]>
>> wrote:
>>> On Sat, Apr 27, 2019 at 9:18 AM Ville Voutilainen <
>>> ville.voutilainen_at_[hidden]> wrote:
>>>> On Sat, 27 Apr 2019 at 15:21, Tom Honermann <tom_at_[hidden]> wrote:
>>>> >
>>>> > On 4/27/19 6:28 AM, Henri Sivonen wrote:
>>>> > > I'm happy to see that so far there has not been opposition to the
>>>> > > core point on my write-up: Not adding new features for non-UTF
>>>> > > execution encodings. With that, let's talk about the details.
>>>> >
>>>> > I see no need to take a strong stance against adding such new
>>>> > features. If there is consensus that a feature is useful (at least to
>>>> > some subset of users), implementors are not opposed, and the feature
>>>> > won't complicate further language evolution, then I see no reason to
>>>> > be opposed to it. There are, and will be for a long time to come,
>>>> > programs that do not require Unicode and that need to operate in
>>>> > non-Unicode environments. We don't need to make them a priority, but
>>>> > we don't need to stand in their way either.
>>>> I do think Henri's concerns about supporting non-UTF encodings are
>>>> valid; there's questionable bang-for-buck in doing so, and we're not
>>>> exactly removing something users of such encodings had, and catering
>>>> to those encodings is certainly not free and probably not cheap.
>>>> There is some value in possibly providing such users with migration
>>>> paths to UTF encodings, though.
>>> I agree with Tom and Ville here. The way we support people who use
>>> non-UTF encodings is to give them ways to transcode from their execution
>>> encoding to Unicode (by using the UTF encoding of choice), and then they
>>> use the Unicode Support that everyone else will be getting.
>>> By now, people who are using non-UTF encodings have already rolled their
>>> own libraries for it: they can continue to use those libraries. The
>>> standard need not promise arbitrary range-based
>>> to_lower/to_upper/casefold/etc. based on wchar_t and char: those are dead
>>> ends.
>>> The most the standard should offer (and the paper I am working on) is
>>> that we will accept any type that conforms to the Encoding interface to
>>> transcode. As long as that Encoding spits out Unicode, a user can enjoy the
>>> abstractions of the standard. This allows many people who currently work in
>>> environments where UTF is not used everywhere to transcode at program
>>> boundaries and keep a well-supported Unicode machine internally.
>>> I am undecided on having strong types for every encoding or for
>>> unicode_code_point / unicode_scalar_value for each encoding (or at all).
>>> I am strongly opposed to ALL encodings taking std::byte as the code
>>> unit. This interface means that implementers must now be explicitly
>>> concerned with endianness for any encoding whose code units are wider
>>> than 8 bits (UTF-16 and UTF-32). We work with the natural
>>> width and endianness of the machine by using the natural char8_t, char16_t,
>>> and char32_t. If someone wants bytes in / bytes out, we should provide
>>> encoding-form wrappers that put it in Little Endian or Big Endian on
>>> explicit request:
>>> encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so
>>> // it works on a byte-by-byte basis, with the specified endianness
>>> (Again, encoding should not be mixed with IO types. Using std::byte is
>>> making IO and endianness a concern of the encoding: this is a failure of
>>> separation of concerns.)
>>> Finally, exceptions are exactly the wrong mechanism for this and are
>>> explicitly recommended against by the Unicode Technical Reports themselves.
>>> As previously pointed out, Denial of Service from sending bad text is a bad
>>> default to give programs. The replacement character is the chief mechanism
>>> of error handling here: it has industry experience behind it and has seen
>>> extensive wins. However, someone should be able to choose a different error handling
>>> mechanism if they pass it in or put it in the type system.
>>>> _______________________________________________
>>>> SG16 Unicode mailing list
>>>> Unicode_at_[hidden]
>>>> http://www.open-std.org/mailman/listinfo/unicode

Received on 2019-04-28 07:42:29