sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Henri Sivonen <hsivonen_at_[hidden]>
Date: Sat, 27 Apr 2019 13:28:14 +0300

I'm happy to see that so far there has not been opposition to the core
point of my write-up: not adding new features for non-UTF execution
encodings. With that, let's talk about the details.

On Fri, Apr 26, 2019 at 3:42 AM Lyberta <lyberta_at_[hidden]> wrote:
> char32_t is too dumb. It can hold surrogate code points or values >
> 10FFFF. We are not C, we can use strong types that can't hold those
> values.

Having types that enforce Unicode validity can be very useful when the
language has good mechanisms for encapsulating the enforcement and for
clearly marking cases where for performance reasons the responsibility
of upholding the invariant is transferred from the type
implementation to the programmer. This kind of thing requires broad
vision and buy-in from the standard library.
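
As a rough sketch of what such encapsulation could look like: the type
below can only be constructed through a checked factory or through a
factory whose name clearly marks that the caller takes over the
invariant. The member names (from_u32, from_u32_unchecked, value) are
hypothetical and not taken from any proposal; this assumes C++17.

```cpp
#include <cassert>
#include <optional>

// Illustrative only: a scalar value type whose invariant is
// "not a surrogate and not above U+10FFFF".
class scalar_value {
public:
    // Checked construction: yields std::nullopt for surrogates and for
    // values above U+10FFFF.
    static constexpr std::optional<scalar_value> from_u32(char32_t c) noexcept {
        if (!is_valid(c)) {
            return std::nullopt;
        }
        return scalar_value(c);
    }

    // Unchecked construction: the name marks the point where the caller,
    // not the type, is responsible for the invariant. The assert only
    // fires in builds without NDEBUG.
    static scalar_value from_u32_unchecked(char32_t c) noexcept {
        assert(is_valid(c));
        return scalar_value(c);
    }

    constexpr char32_t value() const noexcept { return value_; }

private:
    explicit constexpr scalar_value(char32_t c) noexcept : value_(c) {}

    static constexpr bool is_valid(char32_t c) noexcept {
        return c <= 0x10FFFF && (c < 0xD800 || c > 0xDFFF);
    }

    char32_t value_;
};
```

Because the converting constructor is private, code outside the class
cannot create an invalid scalar_value without spelling out "unchecked".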

Considering that the committee has recently
* Added std::u8string without UTF-8 validity enforcement
* Added std::optional in such a form that the most ergonomic way of
extracting the value, operator*(), is unchecked
* Added std::span in a form that, relative to gsl::span, removes
safety checks from the most ergonomic way of indexing into the span,
operator[]()
what reason is there to believe that validity-enforcing Unicode types
could make it through the committee?

I might not disagree with you on what an ideal from-scratch design
would look like, but when trying to assess where the committee's
overall Overton Window is, it seems to me that what you're saying is
further outside it. I'm trying to keep what I said in my write-up
within, or at most baby steps outside, the boundaries of the Overton
Window as I perceive it from outside the committee.

> UTF-8, UTF-16 and UTF-32 are equally easy to implement because they are
> on the code unit level. The rest of the code work on higher levels and
> is totally abstracted.

This makes sense in theory, but in practice there can be performance
benefits from SIMD by breaking this abstraction within the
implementation.
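
To illustrate the kind of shortcut meant here, the sketch below skips
over ASCII by looking at eight code units at a time instead of decoding
one scalar value at a time. It uses plain 64-bit word operations (SWAR)
as a portable stand-in for real SIMD instructions, and the function name
ascii_prefix_length is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Returns the number of leading bytes of `s` that are ASCII (< 0x80).
inline std::size_t ascii_prefix_length(std::string_view s) {
    std::size_t i = 0;

    // Fast path: test the high bit of eight bytes with a single mask.
    while (s.size() - i >= 8) {
        std::uint64_t word;
        std::memcpy(&word, s.data() + i, 8);  // avoids unaligned access UB
        if ((word & 0x8080808080808080ULL) != 0) {
            break;  // at least one non-ASCII byte in this word
        }
        i += 8;
    }

    // Slow path: finish byte by byte, including the word that failed.
    while (i < s.size() && static_cast<unsigned char>(s[i]) < 0x80) {
        ++i;
    }
    return i;
}
```

A fast path like this only works because the implementation knows it is
looking at UTF-8 code units; it cannot be written against an interface
that exposes nothing but scalar values.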

> Ranges are better than std::span.

Can you expand on this a bit? I have implementation experience with
span (in non-std:: form) but I'm not properly familiar with Ranges.
Considering that both std::span and Ranges are being added to C++ at
the same time, presumably neither is unambiguously better than the
other for all use cases. What should I read to evaluate your statement
about Ranges?

> I think default error handling should be throwing exception

I tried to avoid talking about exceptions in my write-up in order to
avoid ratholing on the prominent point of contention between the C++
committee and a substantial portion of C++ usage. While the committee
is of the opinion that exceptions are part of the language, in
practice there is substantial real-world C++ usage, including the
major Web engines, that rejects exceptions, so throwing an exception
aborts the program.

Therefore, exceptions are an acceptable error signaling mechanism for
the program itself being incorrect, on the same level as a
release-enabled assertion failing, but they are not an acceptable error
signaling mechanism for errors in the input of the program. Consequently,
in validity-checking APIs where the values potentially come from
outside the program, there should be a way to check for errors in a
non-exception-throwing manner, for example by returning std::optional.
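
A minimal sketch of such an API, assuming C++17: the function below
decodes UTF-8 into UTF-32 and reports malformed input through the
return value. The name decode_utf8 is hypothetical, and the sketch
rejects the whole input on the first error rather than emitting
U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Returns the decoded text, or std::nullopt if `in` is not valid UTF-8.
inline std::optional<std::u32string> decode_utf8(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto lead = static_cast<std::uint8_t>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest value allowed for this length
        if (lead < 0x80) {
            len = 1; cp = lead; min = 0;
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; cp = lead & 0x07; min = 0x10000;
        } else {
            return std::nullopt;  // stray continuation or invalid lead byte
        }
        if (in.size() - i < len) {
            return std::nullopt;  // truncated sequence
        }
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<std::uint8_t>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                return std::nullopt;  // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return std::nullopt;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A caller in a code base built without exceptions can then branch on the
result, for example `if (auto text = decode_utf8(input)) { ... }`.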

> std::basic_string and std::basic_string_view are a temporary hack and
> will be replaced by std::unicode::code_unit_sequence in my proposal.
>
> See the first part of my design:
> https://github.com/Lyberta/cpp-unicode-fundamental

This repo seems to contain only a README with some high-level notes,
but I don't see a design that could be evaluated. Specifically, I
don't see how the assumed UTF-8 validity enforcement maintains
encapsulation.

It does seem a bit odd to simultaneously take the position that
charN_t is on its way out and to design for non-8 CHAR_BIT.

On Fri, Apr 26, 2019 at 4:21 AM Steve Downey <sdowney_at_[hidden]> wrote:
>
> I agree that a scalar_value type would be useful for checked code, but I think contracts are the right mechanism for stating the pre and post conditions. Exceptions, particularly in the middle of text processing, would mostly insert unnecessary checks. Contracts would at least make that optional.

If std::scalar_value is less ergonomic to type than char32_t and
there's a good probability that its benefit gets turned off, what
reason is there to believe that the ecosystem as a whole would be
motivated to move to std::scalar_value?

That is, I'm skeptical of designs whose safety can be toggled on or
off on a very coarse-grained level instead of the safety properties
being reliable at the point of the programmer using a given facility.

> I think Range and specializing on ContiguousRange will work, and span should be a model of CR.

What does CR mean in this context?

> Encode, Decode, and Transcode should really be in terms of std::byte sequences, but with additional overloads for ergonomics.

I'm getting off-topic here, but I don't understand the utility of
std::byte. What problem does std::byte solve compared to uint8_t?
Taking away the ability to ergonomically treat bytes as unsigned
eight-bit integers seems like an anti-feature from the perspective of
what programmers want to do with bytes. How well does std::byte
compose with byte-oriented IO facilities these days?
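
For concreteness, here is the ergonomic difference in question,
assuming C++17. The function names are hypothetical. std::byte provides
bitwise operators but no arithmetic, so summing bytes needs
std::to_integer, whereas std::uint8_t behaves like any other unsigned
integer.

```cpp
#include <cstddef>
#include <cstdint>

// With std::uint8_t, a byte takes part in arithmetic directly.
inline unsigned sum_u8(const std::uint8_t* p, std::size_t n) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}

// With std::byte, arithmetic requires an explicit conversion.
inline unsigned sum_byte(const std::byte* p, std::size_t n) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::to_integer<unsigned>(p[i]);
    }
    return sum;
}

// Bitwise tests work on std::byte, but the constants must be std::byte too.
inline bool is_utf8_continuation(std::byte b) {
    return (b & std::byte{0xC0}) == std::byte{0x80};
}
```

Neither type avoids a cast at the IO boundary: std::istream::read takes
a char*, so a buffer of std::uint8_t or std::byte both need a
reinterpret_cast there.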

-- 
Henri Sivonen
hsivonen_at_[hidden]
https://hsivonen.fi/

Received on 2019-04-27 12:28:30