sg16: Re: [SG16-Unicode] It's Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: keld_at <keld_at_[hidden]>
Date: Sat, 27 Apr 2019 13:59:14 +0200

well, I am very much against abandoning the principle of character set neutrality in c++,
and I am working to enhance character set features in a pan-character set way, including some
stuff for unicode compatibility, just as much of the unicode work was done to be compatible with posix/c
and other iso i18n features. Doing unicode-only stuff that could easily be extended to other
character sets would be against the spirit of c++ imho.

keld

On Sat, Apr 27, 2019 at 01:28:14PM +0300, Henri Sivonen wrote:
> I'm happy to see that so far there has not been opposition to the core
> point on my write-up: Not adding new features for non-UTF execution
> encodings. With that, let's talk about the details.
>
> On Fri, Apr 26, 2019 at 3:42 AM Lyberta <lyberta_at_[hidden]> wrote:
> > char32_t is too dumb. It can hold surrogate code points or values >
> > 10FFFF. We are not C, we can use strong types that can't hold those
> > values.
>
> Having types that enforce Unicode validity can be very useful when the
> language has good mechanisms for encapsulating the enforcement and for
> clearly marking cases where for performance reasons the responsibility
> of upholding the invariant is transferred from the type
> implementation to the programmer. This kind of thing requires broad
> vision and buy-in from the standard library.
>
> Considering that the committee has recently
> * Added std::u8string without UTF-8 validity enforcement
> * Added std::optional in such a form that the most ergonomic way of
> extracting the value, operator*(), is unchecked
> * Added std::span in a form that, relative to gsl::span, removes
> safety checks from the most ergonomic way of indexing into the span,
> operator[]()
> what reason is there to believe that validity-enforcing Unicode types
> could make it through the committee?
>
> I might not disagree with you on what an ideal from-scratch design
> would look like, but when trying to assess where the overall committee
> Overton Window is, it seems to me that what you're saying is further
> outside the Overton Window, and I'm trying to position what I said in
> my write-up to at most within baby steps from the boundaries of the
> Overton Window as I perceive it from outside the committee.
>
> > UTF-8, UTF-16 and UTF-32 are equally easy to implement because they are
> > on the code unit level. The rest of the code works on higher levels and
> > is totally abstracted.
>
> This makes sense in theory, but in practice there can be performance
> benefits from SIMD by breaking this abstraction within the
> implementation.
>
> > Ranges are better than std::span.
>
> Can you expand on this a bit? I have implementation experience with
> span (in non-std:: form) but I'm not properly familiar with Ranges.
> Considering that both std::span and Ranges are being added to C++ at
> the same time, presumably neither is unambiguously better than the
> other for all use cases. What should I read to evaluate your statement
> about Ranges?
>
> > I think default error handling should be throwing exception
>
> I tried to avoid talking about exceptions in my write-up in order to
> avoid ratholing on the prominent point of contention between the C++
> committee and a substantial portion of C++ usage. While the committee
> is of the opinion that exceptions are part of the language, in
> practice there is substantial real-world C++ usage, including the
> major Web engines, that rejects exceptions, so throwing an exception
> aborts the program.
>
> Therefore, exceptions are an acceptable error signaling mechanism for
> the program itself being incorrect on the same level as a
> release-enabled assertion failing, but it's not an acceptable error
> signaling mechanism for errors in the input of the program. Therefore,
> in validity-checking APIs where the values potentially come from
> outside the program, there should be a way to check for errors in a
> non-exception-throwing manner, for example by returning std::optional.
>
> > std::basic_string and std::basic_string_view are a temporary hack and
> > will be replaced by std::unicode::code_unit_sequence in my proposal.
> >
> > See the first part of my design:
> > https://github.com/Lyberta/cpp-unicode-fundamental
>
> This repo seems to contain only a README with some high-level notes,
> but I don't see a design that could be evaluated. Specifically, I
> don't see how the assumed UTF-8 validity enforcement maintains
> encapsulation.
>
> It does seem a bit odd to simultaneously take the position that
> charN_t is on its way out and to design for non-8 CHAR_BIT.
>
> On Fri, Apr 26, 2019 at 4:21 AM Steve Downey <sdowney_at_[hidden]> wrote:
> >
> > I agree that a scalar_value type would be useful for checked code, but I think contracts are the right mechanism for stating the pre and post conditions. Exceptions, particularly in the middle of text processing, would mostly insert unnecessary checks. Contracts would at least make that optional.
>
> If std::scalar_value is less ergonomic to type than char32_t and
> there's a good probability that its benefit gets turned off, what
> reason is there to believe that the ecosystem as a whole would be
> motivated to move to std::scalar_value?
>
> That is, I'm skeptical of designs whose safety can be toggled on or
> off on a very coarse-grained level instead of the safety properties
> being reliable at the point of the programmer using a given facility.
>
> > I think Range and specializing on ContiguousRange will work, and span should be a model of CR.
>
> What does CR mean in this context?
>
> > Encode, Decode, and Transcode should really be in terms of std::byte sequences, but with additional overloads for ergonomics.
>
> I'm getting off-topic here, but I don't understand the utility of
> std::byte. What problem does std::byte solve compared to uint8_t?
> Taking away the ability to ergonomically treat bytes as unsigned
> eight-bit integers seems like an anti-feature from the perspective of
> what programmers want to do with bytes. How well does std::byte
> compose with byte-oriented IO facilities these days?
>
> --
> Henri Sivonen
> hsivonen_at_[hidden]
> https://hsivonen.fi/
> _______________________________________________
> SG16 Unicode mailing list
> Unicode_at_[hidden]
> http://www.open-std.org/mailman/listinfo/unicode
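
[Illustration] The quoted message suggests that validity-checking APIs should offer a non-throwing way to report errors, for example by returning std::optional, and that any unchecked path should be clearly marked. A minimal sketch of that idea follows. The name scalar_value and both factory functions are hypothetical, chosen for this example only; they are not taken from any proposal discussed in the thread.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical type for illustration only; not a standard or proposed API.
// Invariant: holds a Unicode scalar value, i.e. a value in
// [0, 0x10FFFF] that is not a surrogate (0xD800..0xDFFF).
class scalar_value {
public:
    // Checked factory: reports invalid input through std::nullopt
    // instead of throwing, so it also works in builds without exceptions.
    static std::optional<scalar_value> from_u32(std::uint32_t v) noexcept {
        if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
            return std::nullopt;
        }
        return scalar_value(v);
    }

    // Unchecked factory: the name marks the spot where the caller, not the
    // type, becomes responsible for upholding the invariant.
    static scalar_value from_u32_unchecked(std::uint32_t v) noexcept {
        return scalar_value(v);
    }

    std::uint32_t value() const noexcept { return value_; }

private:
    explicit scalar_value(std::uint32_t v) noexcept : value_(v) {}
    std::uint32_t value_;
};
```

With this shape, input from outside the program goes through from_u32 and the caller tests the returned optional; from_u32_unchecked is reserved for values the program has already validated.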

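[Illustration] On the std::byte question in the quoted message: std::byte is a scoped enumeration that provides bitwise operators but no arithmetic, so treating bytes as numbers needs an explicit conversion such as std::to_integer. The sketch below contrasts the two spellings. The function names checksum_bytes and checksum_u8 are hypothetical and exist only for this comparison.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums a buffer of std::byte values.
inline std::uint32_t checksum_bytes(const std::byte* data, std::size_t len) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // std::byte has no operator+; the conversion must be spelled out.
        sum += std::to_integer<std::uint32_t>(data[i]);
    }
    return sum;
}

// The same helper over std::uint8_t: ordinary integer arithmetic applies.
inline std::uint32_t checksum_u8(const std::uint8_t* data, std::size_t len) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        sum += data[i];
    }
    return sum;
}
```

Whether the explicit conversion is a safeguard or a nuisance is the point under discussion; the sketch only shows what each choice looks like at the call site.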
Received on 2019-04-27 13:59:15