Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sun, 28 Apr 2019 11:04:58 +0300
On Sat, Apr 27, 2019 at 2:15 PM Lyberta <lyberta_at_[hidden]> wrote:
> Where is SIMD applicable?

The most common use cases are skipping over ASCII in operations where
ASCII is neutral and adding leading zeros or removing leading zeros
when converting between different code unit widths. However, there are
other operations, not all of them a priori obvious, that can benefit
from SIMD. For example, I've used SIMD to implement a check for
whether text is guaranteed-left-to-right or potentially-bidirectional.
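To make the first use case concrete, here is a portable sketch of an ASCII fast path. It uses 64-bit SWAR (SIMD-within-a-register) rather than real vector intrinsics, but the shape of an actual SIMD implementation is the same: OR the lanes together and test the high bit of each byte. All names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Returns true if every byte in [data, data + len) is ASCII (< 0x80).
// Processes eight bytes at a time by OR-ing them into an accumulator
// and testing the high bit of each byte lane; a real implementation
// would do the same with 16- or 32-byte vector registers.
bool is_ascii(const unsigned char* data, std::size_t len) {
    std::uint64_t acc = 0;
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t word;
        std::memcpy(&word, data + i, 8);  // alignment-safe load
        acc |= word;
    }
    if (acc & UINT64_C(0x8080808080808080)) {
        return false;
    }
    for (; i < len; ++i) {  // remaining tail bytes
        if (data[i] >= 0x80) {
            return false;
        }
    }
    return true;
}
```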

> Ranges are generalization of std::span. Since no major compiler
> implements them right now, nobody except authors of ranges is properly
> familiar with them.

If a function takes a ContiguousRange and is called with two different
concrete argument types in two places of the program, does the binary
end up with one copy of the function or two copies? That is, do Ranges
monomorphize per concrete type?
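For reference, an ordinary (or concept-constrained) function template instantiates once per distinct argument type, so absent type erasure the answer for plain templates is two copies. A minimal sketch, with an unconstrained template standing in for a ContiguousRange-constrained one:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// A function template over contiguous ranges (a plain template here,
// standing in for one constrained by ContiguousRange). Each distinct
// concrete argument type produces a separate instantiation in the
// binary: calling it with std::vector<int> and std::array<int, 3>
// yields two copies of the function, i.e. templates monomorphize.
template <typename ContiguousRange>
std::size_t count_zeros(const ContiguousRange& r) {
    std::size_t n = 0;
    for (const auto& v : r) {
        if (v == 0) {
            ++n;
        }
    }
    return n;
}
```

Avoiding the duplication requires type erasure at the call boundary, e.g. converting both arguments to a span-like view first.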

> For transcoding you don't need contiguous memory and
> with Ranges you can do transcoding straight from and to I/O using
> InputRange and OutputRange. Not sure how useful in practice, but why
> prohibiting it outright?

For the use case I designed for, the converter wasn't allowed to pull
from the input stream but instead the I/O subsystem hands the
converter buffers and the event loop potentially spins between buffers
arriving. At the very least it would be prudent to allow for designs
where the conversion is suspended in such a way while the event loop
spins. I don't know if this means anything for evaluating Ranges.
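The buffer-handing design can be sketched as a push-style decoder that keeps the state of an incomplete sequence across calls, so the event loop can spin between buffers. Everything below is a simplified illustration (it does not reject overlong forms or surrogates), not an API I'm proposing:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical push-style decoder: the I/O subsystem hands it buffers
// as they arrive; state for an incomplete multi-byte sequence is kept
// across calls. Decodes UTF-8 to scalar values, mapping invalid
// sequences to U+FFFD (simplified: overlongs/surrogates not rejected).
class PushUtf8Decoder {
    std::uint32_t cp_ = 0;  // code point being assembled
    int pending_ = 0;       // continuation bytes still expected
public:
    void feed(const unsigned char* buf, std::size_t len,
              std::vector<std::uint32_t>& out) {
        for (std::size_t i = 0; i < len; ++i) {
            unsigned char b = buf[i];
            if (pending_ == 0) {
                if (b < 0x80)              { out.push_back(b); }
                else if ((b >> 5) == 0x6)  { cp_ = b & 0x1F; pending_ = 1; }
                else if ((b >> 4) == 0xE)  { cp_ = b & 0x0F; pending_ = 2; }
                else if ((b >> 3) == 0x1E) { cp_ = b & 0x07; pending_ = 3; }
                else                       { out.push_back(0xFFFD); }
            } else if ((b >> 6) == 0x2) {   // continuation byte
                cp_ = (cp_ << 6) | (b & 0x3F);
                if (--pending_ == 0) { out.push_back(cp_); }
            } else {                        // sequence cut short
                pending_ = 0;
                out.push_back(0xFFFD);
                --i;  // reprocess this byte as a new lead byte
            }
        }
    }
};
```

A real converter would additionally need an end-of-stream signal so that a final truncated sequence can be reported as an error.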

> From what I know only 8 bit, 16 bit and 32 bit byte systems actually
> support modern C++.

Do systems with 16-bit or 32-bit bytes need to process text, or are
they used for image/video/audio processing only?

On Sat, Apr 27, 2019 at 3:01 PM Ville Voutilainen
<ville.voutilainen_at_[hidden]> wrote:
> On Sat, 27 Apr 2019 at 13:28, Henri Sivonen <hsivonen_at_[hidden]> wrote:
> > Having types that enforce Unicode validity can be very useful when the
> > language has good mechanisms for encapsulating the enforcement and for
> > clearly marking cases where for performance reasons the responsibility
> > of upholding the invariant is transferred from the type
> > implementation to the programmer. This kind of thing requires broad
> > vision and buy-in from the standard library.
> >
> > Considering that the committee has recently
> > * Added std::u8string without UTF-8 validity enforcement
> > * Added std::optional in such a form that the most ergonomic way of
> > extracting the value, operator*(), is unchecked
> > * Added std::span in a form that, relative to gsl::span, removes
> > safety checks from the most ergonomic way of indexing into the span,
> > operator[]()
> > what reason is there to believe that validity-enforcing Unicode types
> > could make it through the committee?
> Both std::optional and std::span provide 'safe' ways for extracting
> and indexing.
> The fact that the most-ergonomic way of performing those operations is
> zero-overhead
> rather than 'safe' should be of no surprise to anyone.

Indeed, I'm saying that the pattern suggests that unchecked-by-default
is what the committee consistently goes with, so I'm not suggesting
that anyone be surprised.

> The reason to
> 'believe' that
> validity-enforcing Unicode types could make it through the committee depends
> on the rationale for such types, not on strawman arguments about
> things completely
> unrelated to the success of proposals for such types.

The pattern of unchecked-by-default suggests that it's unlikely that
validity-enforcing Unicode types could gain pervasive buy-in
throughout the standard library or that the unchecked types would
fall out of use in practice. Having validity-enforcing Unicode types
_in addition to_ unchecked Unicode types is considerably less
valuable, and possibly even anti-useful, compared to having only
validity-enforcing types or only unchecked types.

For example, consider some function taking a view of guaranteed-valid
UTF-8 and what you have is std::u8string_view that you got from
somewhere else. That situation does not compose well if you need to
pass the possibly-invalid view to an API that takes a guaranteed-valid
view. The value of guaranteed-valid views is lost if you end up doing
validation in random places instead of UTF-8 validation having been
consistently pushed to the I/O boundary such that everything inside
the application uses guaranteed-valid views.

(Being able to omit the error-condition branch when iterating over
UTF-8 by scalar value is not the only benefit of guaranteed-valid
UTF-8 views. If you can assume UTF-8 to be valid, you can also use
SIMD in ways that check for the presence of lead bytes in certain
ranges without having to worry about invalid sequences fooling such
checks. Either way, if you often end up validating the whole view
immediately before performing such an operation, the validation
operation followed by the optimized operation is probably less
efficient than just performing a single-pass operation that can deal
with invalid sequences.)
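A hypothetical shape for such a type, with validation at construction (ideally done once, at the I/O boundary) and an explicitly named escape hatch marking the places where responsibility for the invariant transfers to the programmer. All names here are made up for illustration:

```cpp
#include <stdexcept>
#include <string_view>

// Hypothetical wrapper: a view whose invariant is "the bytes are
// valid UTF-8". The checked constructor validates; the explicitly
// named unchecked factory marks the caller's promise.
class valid_u8_view {
    std::string_view v_;
    valid_u8_view(std::string_view v, int) : v_(v) {}  // unchecked
public:
    explicit valid_u8_view(std::string_view v) : v_(v) {
        if (!is_valid_utf8(v)) {
            throw std::invalid_argument("invalid UTF-8");
        }
    }
    static valid_u8_view assume_valid(std::string_view v) {
        return valid_u8_view(v, 0);  // caller upholds the invariant
    }
    std::string_view bytes() const { return v_; }

    // Minimal validator (checks lead/continuation structure only; a
    // real one would also reject overlongs and surrogates).
    static bool is_valid_utf8(std::string_view s) {
        int pending = 0;
        for (unsigned char b : s) {
            if (pending == 0) {
                if (b < 0x80)              { continue; }
                else if ((b >> 5) == 0x6)  { pending = 1; }
                else if ((b >> 4) == 0xE)  { pending = 2; }
                else if ((b >> 3) == 0x1E) { pending = 3; }
                else                       { return false; }
            } else if ((b >> 6) == 0x2) {
                --pending;
            } else {
                return false;
            }
        }
        return pending == 0;
    }
};
```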

On Sat, Apr 27, 2019 at 3:13 PM Tom Honermann <tom_at_[hidden]> wrote:
> On 4/27/19 6:28 AM, Henri Sivonen wrote:
> > I'm happy to see that so far there has not been opposition to the core
> > point on my write-up: Not adding new features for non-UTF execution
> > encodings. With that, let's talk about the details.
> I see no need to take a strong stance against adding such new features.
> If there is consensus that a feature is useful (at least to some subset
> of users), implementors are not opposed,

On the flip side, are there implementors who have expressed interest
in implementing _new_ text algorithms that are not in terms of Unicode?

> and the feature won't
> complicate further language evolution, then I see no reason to be
> opposed to it.

Text_view as proposed complicates language evolution for the sake of
non-Unicode numberings of abstract characters by making the "character
type" abstract.

>There are, and will be for a long time to come, programs
> that do not require Unicode and that need to operate in non-Unicode
> environments.

How seriously do such programs need _new_ text processing facilities
from the standard library?

On Sat, Apr 27, 2019 at 7:43 PM JeanHeyd Meneide
<phdofthehouse_at_[hidden]> wrote:
> By now, people who are using non-UTF encodings have already rolled their own libraries for it: they can continue to use those libraries. The standard need not promise arbitrary range-based to_lower/to_upper/casefold/etc. based on wchar_t and char_t: those are dead ends.


> I am strongly opposed to ALL encodings taking std::byte as the code unit. This interface means that implementers must now be explicitly concerned with endianness for anything that uses code units wider than 8 bits and is a multiple of 2 (UTF16 and UTF32). We work with the natural width and endianness of the machine by using the natural char8_t, char16_t, and char32_t. If someone wants bytes in / bytes out, we should provide encoding-form wrappers that put it in Little Endian or Big Endian on explicit request:
> encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so it works on a byte-by-byte basis, with the specified endianness

I think it is a design error to try to accommodate UTF-16 or UTF-32 as
Unicode Encoding Forms in the same API position as Unicode Encoding
Schemes and other encodings. Converting to/from byte-oriented I/O or
narrow execution encoding is a distinct concern from converting
between Unicode Encoding Forms within the application. Notably, the
latter operation is less likely to need streaming.
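To illustrate the distinction with UTF-16: the encoding form deals in char16_t code units in the machine's natural byte order, while an encoding scheme is a byte sequence with a specified byte order. A toy conversion (function names are illustrative, not proposed API):

```cpp
#include <vector>

// Serializes UTF-16 code units (encoding form: char16_t, native byte
// order) into the UTF-16LE encoding scheme (a byte sequence with a
// fixed, explicitly little-endian byte order).
std::vector<unsigned char>
utf16_form_to_le_scheme(const std::vector<char16_t>& units) {
    std::vector<unsigned char> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t u : units) {
        bytes.push_back(static_cast<unsigned char>(u & 0xFF));        // low byte
        bytes.push_back(static_cast<unsigned char>((u >> 8) & 0xFF)); // high byte
    }
    return bytes;
}
```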

Providing a conversion API for non-UTF wchar_t makes the distinction
less clear, though. Again, that's the case of z/OS causing abstraction
obfuscation for everyone else. :-(

On Sat, Apr 27, 2019 at 2:59 PM <keld_at_[hidden]> wrote:
> well, I am much against leaving the principle of character set neutrality in c++,
> and I am working to enhance character set features in a pan-character set way

But why? Do you foresee a replacement for Unicode for which
non-commitment to Unicode needs to be kept alive? Unicode has won,
with no realistic avenue for being replaced; other programming
languages, major GUI toolkits, and the Web Platform have committed to
the model where all text is conceptually (and, internally,
implementation-wise) Unicode but may be interchanged in legacy
_encodings_. What value is there in pretending otherwise on
principle?

Henri Sivonen

Received on 2019-04-28 10:05:13