sg16: Re: [SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Corentin <corentin.jabot_at_[hidden]>
Date: Mon, 29 Apr 2019 01:35:15 +0200

On Sun, Apr 28, 2019, 10:01 PM <keld_at_[hidden]> wrote:

> On Sun, Apr 28, 2019 at 11:04:58AM +0300, Henri Sivonen wrote:
> > On Sat, Apr 27, 2019 at 2:15 PM Lyberta <lyberta_at_[hidden]> wrote:
> > > Where is SIMD applicable?
> >
> > The most common use cases are skipping over ASCII in operations where
> > ASCII is neutral and adding leading zeros or removing leading zeros
> > when converting between different code unit widths. However, there are
> > other operations, not all of them a priori obvious, that can benefit
> > from SIMD. For example, I've used SIMD to implement a check for
> > whether text is guaranteed-left-to-right or potentially-bidirectional.
> >
> > > Ranges are a generalization of std::span. Since no major compiler
> > > implements them right now, nobody except authors of ranges is properly
> > > familiar with them.
> >
> > If a function takes a ContiguousRange and is called with two different
> > concrete argument types in two places of the program, does the binary
> > end up with one copy of the function or two copies? That is, do Ranges
> > monomorphize per concrete type?
> >
> > > For transcoding you don't need contiguous memory and
> > > with Ranges you can do transcoding straight from and to I/O using
> > > InputRange and OutputRange. Not sure how useful in practice, but why
> > > prohibit it outright?
> >
> > For the use case I designed for, the converter wasn't allowed to pull
> > from the input stream but instead the I/O subsystem hands the
> > converter buffers and the event loop potentially spins between buffers
> > arriving. At the very least it would be prudent to allow for designs
> > where the conversion is suspended in such a way while the event loop
> > spins. I don't know if this means anything for evaluating Ranges.
> >
> > > From what I know only 8 bit, 16 bit and 32 bit byte systems actually
> > > support modern C++.
> >
> > Do systems with 16-bit or 32-bit bytes need to process text, or are
> > they used for image/video/audio processing only?
> >
> > On Sat, Apr 27, 2019 at 3:01 PM Ville Voutilainen
> > <ville.voutilainen_at_[hidden]> wrote:
> > >
> > > On Sat, 27 Apr 2019 at 13:28, Henri Sivonen <hsivonen_at_[hidden]> wrote:
> > > > Having types that enforce Unicode validity can be very useful when the
> > > > language has good mechanisms for encapsulating the enforcement and for
> > > > clearly marking cases where for performance reasons the responsibility
> > > > of upholding the invariant is transferred from the type
> > > > implementation to the programmer. This kind of thing requires broad
> > > > vision and buy-in from the standard library.
> > > >
> > > > Considering that the committee has recently
> > > > * Added std::u8string without UTF-8 validity enforcement
> > > > * Added std::optional in such a form that the most ergonomic way of
> > > > extracting the value, operator*(), is unchecked
> > > > * Added std::span in a form that, relative to gsl::span, removes
> > > > safety checks from the most ergonomic way of indexing into the span,
> > > > operator[]()
> > > > what reason is there to believe that validity-enforcing Unicode types
> > > > could make it through the committee?
> > >
> > > Both std::optional and std::span provide 'safe' ways for extracting
> > > and indexing.
> > > The fact that the most-ergonomic way of performing those operations is
> > > zero-overhead
> > > rather than 'safe' should be of no surprise to anyone.
> >
> > Indeed, I'm saying that the pattern suggests that unchecked-by-default
> > is what the committee consistently goes with, so I'm not suggesting
> > that anyone be surprised.
> >
> > > The reason to 'believe' that
> > > validity-enforcing Unicode types could make it through the committee depends
> > > on the rationale for such types, not on strawman arguments about
> > > things completely
> > > unrelated to the success of proposals for such types.
> >
> > The pattern of unchecked-by-default suggests that it's unlikely that
> > validity-enforcing Unicode types could gain pervasive buy-in
> > throughout the standard library and that the unchecked types could
> > fall out of use in practice. Having validity-enforcing Unicode types
> > _in addition to_ unchecked Unicode types is considerably less valuable
> > and possibly even anti-useful compared to only having
> > validity-enforcing types or only having unchecked types.
> >
> > For example, consider some function taking a view of guaranteed-valid
> > UTF-8 and what you have is std::u8string_view that you got from
> > somewhere else. That situation does not compose well if you need to
> > pass the possibly-invalid view to an API that takes a guaranteed-valid
> > view. The value of guaranteed-valid views is lost if you end up doing
> > validation in random places instead of UTF-8 validation having been
> > consistently pushed to the I/O boundary such that everything inside
> > the application uses guaranteed-valid views.
> >
> > (Being able to omit the error condition branch when iterating over
> > UTF-8 by scalar value is not the only benefit of guaranteed-valid
> > UTF-8 views. If you can assume UTF-8 to be valid, you can also use
> > SIMD in ways that check for the presence of lead bytes in certain
> > ranges without having to worry about invalid sequences fooling such
> > checks. Either way, if you often end up validating the whole view
> > immediately before performing such an operation, the validation
> > operation followed by the optimized operation is probably less
> > efficient than just performing a single-pass operation that can deal
> > with invalid sequences.)
> >
> > On Sat, Apr 27, 2019 at 3:13 PM Tom Honermann <tom_at_[hidden]> wrote:
> > >
> > > On 4/27/19 6:28 AM, Henri Sivonen wrote:
> > > > I'm happy to see that so far there has not been opposition to the core
> > > > point of my write-up: Not adding new features for non-UTF execution
> > > > encodings. With that, let's talk about the details.
> > >
> > > I see no need to take a strong stance against adding such new features.
> > > If there is consensus that a feature is useful (at least to some subset
> > > of users), implementors are not opposed,
> >
> > On the flip side are there implementors who have expressed interest in
> > implementing _new_ text algorithms that are not in terms of Unicode?
> >
> > > and the feature won't
> > > complicate further language evolution, then I see no reason to be
> > > opposed to it.
> >
> > Text_view as proposed complicates language evolution for the sake of
> > non-Unicode numberings of abstract characters by making the "character
> > type" abstract.
> >
> > > There are, and will be for a long time to come, programs
> > > that do not require Unicode and that need to operate in non-Unicode
> > > environments.
> >
> > How seriously do such programs need _new_ text processing facilities
> > from the standard library?
> >
> > On Sat, Apr 27, 2019 at 7:43 PM JeanHeyd Meneide
> > <phdofthehouse_at_[hidden]> wrote:
> > > By now, people who are using non-UTF encodings have already rolled
> > > their own libraries for it: they can continue to use those libraries. The
> > > standard need not promise arbitrary range-based
> > > to_lower/to_upper/casefold/etc. based on wchar_t and char_t: those are dead
> > > ends.
> >
> > Indeed.
> >
> > > I am strongly opposed to ALL encodings taking std::byte as the code
> > > unit. This interface means that implementers must now be explicitly
> > > concerned with endianness for anything that uses code units wider than 8
> > > bits and is a multiple of 2 (UTF16 and UTF32). We work with the natural
> > > width and endianness of the machine by using the natural char8_t, char16_t,
> > > and char32_t. If someone wants bytes in / bytes out, we should provide
> > > encoding-form wrappers that put it in Little Endian or Big Endian on
> > > explicit request:
> > >
> > > encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so it
> > > // works on a byte-by-byte basis, with the specified endianness
> >
> > I think it is a design error to try to accommodate UTF-16 or UTF-32 as
> > Unicode Encoding Forms in the same API position as Unicode Encoding
> > Schemes and other encodings. Converting to/from byte-oriented I/O or
> > narrow execution encoding is a distinct concern from converting
> > between Unicode Encoding Forms within the application. Notably, the
> > latter operation is less likely to need streaming.
> >
> > Providing a conversion API for non-UTF wchar_t makes the distinction
> > less clear, though. Again, that's the case of z/OS causing abstraction
> > obfuscation for everyone else. :-(
> >
> > On Sat, Apr 27, 2019 at 2:59 PM <keld_at_[hidden]> wrote:
> > >
> > > well, I am much against leaving the principle of character set
> > > neutrality in c++,
> > > and I am working to enhance character set features in a pan-character
> > > set way
> >
> > But why? Do you foresee a replacement for Unicode for which
> > non-commitment to Unicode needs to be kept alive? What value is there
> > from pretending, on principle, that Unicode didn't win with no
> > realistic avenue for getting replaced--especially when other
> > programming languages, major GUI toolkits, and the Web Platform have
> > committed to the model where all text is conceptually (and
> > implementation-wise internally) Unicode but may be interchanged in
> > legacy _encodings_?
>
> I believe there are a number of encodings in East Asia that will still be
> targeted by new development for quite some time.
>
> major languages and toolkits and operating systems are still character set
> independent.
> some people believe that unicode has not won

Some people are wrong.

> and some people are not happy with
> the unicode consortium.

Some people will never be happy. Yet it is incredibly unlikely that someone
would come up with a set of characters which is a strict superset of what
is offered by Unicode, and nothing short of that would make it suitable to
handle text.

Operating systems that are encoding-independent are mostly a myth at this
point, and probably always were. Linux is mostly UTF-8, macOS is Unicode,
Windows is slowly getting there, etc.

All of that is driven by market forces. Users don't tolerate mojibake and
the _only_ way to avoid that is to use Unicode.

This in no way means that C++ wouldn't be able to transcode input from all
kinds of encodings at the I/O boundary.
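
As a rough illustration of transcoding at the I/O boundary, here is a minimal
sketch that converts ISO-8859-1 (Latin-1) input to UTF-8 so that everything
past the boundary can treat text as Unicode. The function name
latin1_to_utf8 is hypothetical and not part of any proposal discussed in this
thread; Latin-1 is chosen only because each of its bytes maps directly to the
Unicode scalar value of the same number, which keeps the example short.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (an assumption for illustration, not a proposed API):
// transcode ISO-8859-1 bytes to UTF-8.
// Every Latin-1 byte maps to the scalar value U+0000..U+00FF of equal number.
std::string latin1_to_utf8(std::string_view input) {
    std::string out;
    out.reserve(input.size() * 2);  // worst case: two UTF-8 bytes per input byte
    for (char ch : input) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            // ASCII is encoded identically in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF becomes a two-byte sequence: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

A real boundary layer would select the source encoding from metadata or
configuration and cover far more than Latin-1; the point of the sketch is only
that legacy encodings are handled once, on the way in.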

> why abandon a model that still delivers for all?

> keld
>
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
>

Received on 2019-04-29 01:35:34