sg16: Re: [SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: keld_at <keld_at_[hidden]>
Date: Sun, 28 Apr 2019 22:01:12 +0200

On Sun, Apr 28, 2019 at 11:04:58AM +0300, Henri Sivonen wrote:
> On Sat, Apr 27, 2019 at 2:15 PM Lyberta <lyberta_at_[hidden]> wrote:
> > Where is SIMD applicable?
>
> The most common use cases are skipping over ASCII in operations where
> ASCII is neutral and adding leading zeros or removing leading zeros
> when converting between different code unit widths. However, there are
> other operations, not all of them a priori obvious, that can benefit
> from SIMD. For example, I've used SIMD to implement a check for
> whether text is guaranteed-left-to-right or potentially-bidirectional.
>
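
To illustrate the "skipping over ASCII" case mentioned above, here is a
minimal sketch. It uses plain 64-bit word operations as a portable stand-in
for real SIMD instructions, and the name ascii_prefix_length is hypothetical,
not taken from any library discussed in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the length of the leading all-ASCII run.
// Eight bytes are tested per step instead of one.
inline std::size_t ascii_prefix_length(const unsigned char* data, std::size_t len) {
    std::size_t i = 0;
    // Wide step: a byte is non-ASCII exactly when its high bit is set.
    while (len - i >= 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof word);  // alignment-safe load
        if (word & 0x8080808080808080ULL) break;    // some byte is >= 0x80
        i += 8;
    }
    // Byte-wise tail, which also pinpoints the first non-ASCII byte.
    while (i < len && data[i] < 0x80) ++i;
    return i;
}
```

A converter could call this first and fall back to a slower per-sequence
path only from the returned offset onward.
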
> > Ranges are generalization of std::span. Since no major compiler
> > implements them right now, nobody except authors of ranges is properly
> > familiar with them.
>
> If a function takes a ContiguousRange and is called with two different
> concrete argument types in two places of the program, does the binary
> end up with one copy of the function or two copies? That is, do Ranges
> monomorphize per concrete type?
>
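
On the monomorphization question: a function template, constrained by a
concept or not, is instantiated once per concrete argument type. Whether the
final binary keeps both copies can still depend on inlining and linker code
folding. The sketch below makes the separate instantiations observable
through function-local statics. It uses an unconstrained template as a
stand-in for a ContiguousRange parameter, and the names calls_generic and
calls_view are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Each distinct Range type gets its own instantiation of this template,
// and therefore its own function-local static counter.
template <class Range>
std::size_t calls_generic(const Range&) {
    static std::size_t calls = 0;
    return ++calls;
}

// A non-template function taking one concrete view type exists once,
// whatever container the caller started from.
inline std::size_t calls_view(std::string_view) {
    static std::size_t calls = 0;
    return ++calls;
}
```

Calling calls_generic with a std::string and then with a std::vector<char>
returns 1 both times, because the two calls hit different instantiations.
Calling calls_view twice returns 1 and then 2.
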
> > For transcoding you don't need contiguous memory and
> > with Ranges you can do transcoding straight from and to I/O using
> > InputRange and OutputRange. Not sure how useful in practice, but why
> > prohibiting it outright?
>
> For the use case I designed for, the converter wasn't allowed to pull
> from the input stream but instead the I/O subsystem hands the
> converter buffers and the event loop potentially spins between buffers
> arriving. At the very least it would be prudent to allow for designs
> where the conversion is suspended in such a way while the event loop
> spins. I don't know if this means anything for evaluating Ranges.
>
> > From what I know only 8 bit, 16 bit and 32 bit byte systems actually
> > support modern C++.
>
> Do systems with 16-bit or 32-bit bytes need to process text, or are
> they used for image/video/audio processing only?
>
> On Sat, Apr 27, 2019 at 3:01 PM Ville Voutilainen
> <ville.voutilainen_at_[hidden]> wrote:
> >
> > On Sat, 27 Apr 2019 at 13:28, Henri Sivonen <hsivonen_at_[hidden]> wrote:
> > > Having types that enforce Unicode validity can be very useful when the
> > > language has good mechanisms for encapsulating the enforcement and for
> > > clearly marking cases where for performance reasons the responsibility
> > > of upholding the invariant is transferred from the type
> > > implementation to the programmer. This kind of thing requires broad
> > > vision and buy-in from the standard library.
> > >
> > > Considering that the committee has recently
> > > * Added std::u8string without UTF-8 validity enforcement
> > > * Added std::optional in such a form that the most ergonomic way of
> > > extracting the value, operator*(), is unchecked
> > > * Added std::span in a form that, relative to gsl::span, removes
> > > safety checks from the most ergonomic way of indexing into the span,
> > > operator[]()
> > > what reason is there to believe that validity-enforcing Unicode types
> > > could make it through the committee?
> >
> > Both std::optional and std::span provide 'safe' ways for extracting
> > and indexing.
> > The fact that the most-ergonomic way of performing those operations is
> > zero-overhead
> > rather than 'safe' should be of no surprise to anyone.
>
> Indeed, I'm saying that the pattern suggests that unchecked-by-default
> is what the committee consistently goes with, so I'm not suggesting
> that anyone be surprised.
>
> > The reason to
> > 'believe' that
> > validity-enforcing Unicode types could make it through the committee depends
> > on the rationale for such types, not on strawman arguments about
> > things completely
> > unrelated to the success of proposals for such types.
>
> The pattern of unchecked-by-default suggests that it's unlikely that
> validity-enforcing Unicode types could gain pervasive buy-in
> throughout the standard library and that the unchecked types could
> fall out of use in practice. Having validity-enforcing Unicode types
> _in addition to_ unchecked Unicode types is considerably less valuable
> and possibly even anti-useful compared to only having
> validity-enforcing types or only having unchecked types.
>
> For example, consider some function taking a view of guaranteed-valid
> UTF-8 and what you have is std::u8string_view that you got from
> somewhere else. That situation does not compose well if you need to
> pass the possibly-invalid view to an API that takes a guaranteed-valid
> view. The value of guaranteed-valid views is lost if you end up doing
> validation in random places instead of UTF-8 validation having been
> consistently pushed to the I/O boundary such that everything inside
> the application uses guaranteed-valid views.
>
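
The idea of pushing validation to the boundary can be sketched as a view
type whose only public way in is a checking factory. This is a minimal
sketch under stated assumptions, not a proposed design: the names
valid_utf8_view and from_bytes are hypothetical, and the check follows the
well-formed byte ranges of the Unicode Standard.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical type: obtainable only through from_bytes(), so every
// instance is known to hold well-formed UTF-8.
class valid_utf8_view {
public:
    // Validates once; returns std::nullopt for ill-formed input.
    static std::optional<valid_utf8_view> from_bytes(std::string_view bytes) {
        if (!is_well_formed(bytes)) return std::nullopt;
        return valid_utf8_view(bytes);
    }
    std::string_view bytes() const { return bytes_; }

private:
    explicit valid_utf8_view(std::string_view b) : bytes_(b) {}

    static bool is_well_formed(std::string_view s) {
        const std::size_t n = s.size();
        std::size_t i = 0;
        while (i < n) {
            const unsigned char b0 = static_cast<unsigned char>(s[i]);
            if (b0 < 0x80) { ++i; continue; }     // ASCII
            std::size_t trail = 0;                // continuation byte count
            unsigned char lo = 0x80, hi = 0xBF;   // allowed range of 2nd byte
            if (b0 >= 0xC2 && b0 <= 0xDF) {
                trail = 1;
            } else if (b0 >= 0xE0 && b0 <= 0xEF) {
                trail = 2;
                if (b0 == 0xE0) lo = 0xA0;        // reject overlong forms
                if (b0 == 0xED) hi = 0x9F;        // reject surrogates
            } else if (b0 >= 0xF0 && b0 <= 0xF4) {
                trail = 3;
                if (b0 == 0xF0) lo = 0x90;        // reject overlong forms
                if (b0 == 0xF4) hi = 0x8F;        // reject above U+10FFFF
            } else {
                return false;                     // 0x80..0xC1 or 0xF5..0xFF
            }
            if (n - i <= trail) return false;     // truncated sequence
            const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
            if (b1 < lo || b1 > hi) return false;
            for (std::size_t k = 2; k <= trail; ++k) {
                const unsigned char bk = static_cast<unsigned char>(s[i + k]);
                if (bk < 0x80 || bk > 0xBF) return false;
            }
            i += trail + 1;
        }
        return true;
    }

    std::string_view bytes_;
};
```

A function that takes valid_utf8_view cannot be handed an unchecked
std::u8string_view directly, which is the composition problem described
above: the caller has to validate at that point unless the check already
happened at the I/O boundary.
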
> (Being able to omit the error condition branch when iterating over
> UTF-8 by scalar value is not the only benefit of guaranteed-valid
> UTF-8 views. If you can assume UTF-8 to be valid, you can also use
> SIMD in ways that check for the presence of lead bytes in certain
> ranges without having to worry about invalid sequences fooling such
> checks. Either way, if you often end up validating the whole view
> immediately before performing such an operation, the validation
> operation followed by the optimized operation is probably less
> efficient than just performing a single-pass operation that can deal
> with invalid sequences.)
>
> On Sat, Apr 27, 2019 at 3:13 PM Tom Honermann <tom_at_[hidden]> wrote:
> >
> > On 4/27/19 6:28 AM, Henri Sivonen wrote:
> > > I'm happy to see that so far there has not been opposition to the core
> > > point on my write-up: Not adding new features for non-UTF execution
> > > encodings. With that, let's talk about the details.
> >
> > I see no need to take a strong stance against adding such new features.
> > If there is consensus that a feature is useful (at least to some subset
> > of users), implementors are not opposed,
>
> On the flip side are there implementors who have expressed interest in
> implementing _new_ text algorithms that are not in terms of Unicode?
>
> > and the feature won't
> > complicate further language evolution, then I see no reason to be
> > opposed to it.
>
> Text_view as proposed complicates language evolution for the sake of
> non-Unicode numberings of abstract characters by making the "character
> type" abstract.
>
> > There are, and will be for a long time to come, programs
> > that do not require Unicode and that need to operate in non-Unicode
> > environments.
>
> How seriously do such programs need _new_ text processing facilities
> from the standard library?
>
> On Sat, Apr 27, 2019 at 7:43 PM JeanHeyd Meneide
> <phdofthehouse_at_[hidden]> wrote:
> > By now, people who are using non-UTF encodings have already rolled their own libraries for it: they can continue to use those libraries. The standard need not promise arbitrary range-based to_lower/to_upper/casefold/etc. based on wchar_t and char_t: those are dead ends.
>
> Indeed.
>
> > I am strongly opposed to ALL encodings taking std::byte as the code unit. This interface means that implementers must now be explicitly concerned with endianness for anything that uses code units wider than 8 bits and is a multiple of 2 (UTF16 and UTF32). We work with the natural width and endianness of the machine by using the natural char8_t, char16_t, and char32_t. If someone wants bytes in / bytes out, we should provide encoding-form wrappers that put it in Little Endian or Big Endian on explicit request:
> >
> > encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so it works on a byte-by-byte basis, with the specified endianness
>
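
What a byte-oriented wrapper with explicit endianness has to do can be
sketched as a plain function. This is only an illustration: it does not
reproduce the encoding_form interface quoted above, and the names
byte_order and utf16_to_bytes are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

enum class byte_order { little, big };  // hypothetical name

// Serializes UTF-16 code units (an encoding form) into bytes
// (an encoding scheme) in an explicitly chosen byte order.
inline std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& units,
                                                byte_order order) {
    std::vector<std::uint8_t> out;
    out.reserve(units.size() * 2);
    for (char16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == byte_order::little) {
            out.push_back(lo);
            out.push_back(hi);
        } else {
            out.push_back(hi);
            out.push_back(lo);
        }
    }
    return out;
}
```

Because the bytes are computed with shifts rather than by reinterpreting
memory, the result does not depend on the endianness of the host machine.
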
> I think it is a design error to try to accommodate UTF-16 or UTF-32 as
> Unicode Encoding Forms in the same API position as Unicode Encoding
> Schemes and other encodings. Converting to/from byte-oriented I/O or
> narrow execution encoding is a distinct concern from converting
> between Unicode Encoding Forms within the application. Notably, the
> latter operation is less likely to need streaming.
>
> Providing a conversion API for non-UTF wchar_t makes the distinction
> less clear, though. Again, that's the case of z/OS causing abstraction
> obfuscation for everyone else. :-(
>
> On Sat, Apr 27, 2019 at 2:59 PM <keld_at_[hidden]> wrote:
> >
> > well, I am much against abandoning the principle of character set neutrality in C++,
> > and I am working to enhance character set features in a pan-character set way
>
> But why? Do you foresee a replacement for Unicode for which
> non-commitment to Unicode needs to be kept alive? What value is there
> from pretending, on principle, that Unicode didn't win with no
> realistic avenue for getting replaced--especially when other
> programming languages, major GUI toolkits, and the Web Platform have
> committed to the model where all text is conceptually (and
> implementation-wise internally) Unicode but may be interchanged in
> legacy _encodings_?

I believe there are a number of encodings in East Asia for which software will
still be developed for quite some time.

Major languages, toolkits, and operating systems are still character set independent.
Some people believe that Unicode has not won, and some people are not happy with
the Unicode Consortium. Why abandon a model that still delivers for all?

keld

Received on 2019-04-28 22:01:12