sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Lyberta <lyberta_at_[hidden]>
Date: Sat, 27 Apr 2019 11:15:00 +0000

> I might not disagree with you on what an ideal from-scratch design
> would look like, but when trying to assess where the overall committee
> Overton Window is, it seems to me that what you're saying is further
> outside the Overton Window, and I'm trying to position what I said in
> my write-up to at most within baby steps from the boundaries of the
> Overton Window as I perceive it from the outside the committee.

I guess I'm a bit of a purist and don't like compromises much in terms
of design. I value correctness and safety the most. And we still have at
least 3 years to implement and write proposals. I hope the Overton
Window will have moved into saner territory by then.

>
>> UTF-8, UTF-16 and UTF-32 are equally easy to implement because they are
>> on the code unit level. The rest of the code works on higher levels and
>> is totally abstracted.
>
> This makes sense in theory, but in practice there can be performance
> benefits from SIMD by breaking this abstraction within the
> implementation.

Where is SIMD applicable? I assume in going from code units to scalar
values? Since transcoding in my design uses ranges, I think it will be
possible to use SIMD in the transcoding code.

>
>> Ranges are better than std::span.
>
> Can you expand on this a bit? I have implementation experience with
> span (in non-std:: form) but I'm not properly familiar with Ranges.
> Considering that both std::span and Ranges are being added to C++ at
> the same time, presumably neither is unambiguously better than the
> other for all use cases. What should I read to evaluate your statement
> about Ranges?

Ranges are a generalization of std::span. Since no major compiler
implements them right now, nobody except the authors of Ranges is
properly familiar with them. For transcoding you don't need contiguous
memory, and with Ranges you can transcode straight from and to I/O using
InputRange and OutputRange. Not sure how useful that is in practice, but
why prohibit it outright?
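
As a rough illustration of transcoding without contiguous memory, here is
a minimal C++17 sketch that decodes UTF-8 straight from a stream through
single-pass iterators. It uses plain iterator pairs rather than the
Ranges concepts, and decode_utf8 and decode_stream are hypothetical names
made up for this example, not part of any proposal. Replacing each
ill-formed sequence with U+FFFD is likewise an assumption of the sketch.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: decodes UTF-8 code units from any input iterator
// pair and writes Unicode scalar values to any output iterator.
// Each ill-formed sequence produces one U+FFFD.
template <typename InputIt, typename OutputIt>
OutputIt decode_utf8(InputIt first, InputIt last, OutputIt out)
{
    const char32_t replacement = 0xFFFD;
    while (first != last) {
        const unsigned char lead = static_cast<unsigned char>(*first);
        ++first;
        char32_t value = 0;
        int trailing = 0;
        unsigned char lo = 0x80, hi = 0xBF; // allowed range of the next unit
        if (lead < 0x80) {
            *out++ = static_cast<char32_t>(lead);
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            value = lead & 0x1F; trailing = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            value = lead & 0x0F; trailing = 2;
            if (lead == 0xE0) lo = 0xA0; // rejects overlong forms
            if (lead == 0xED) hi = 0x9F; // rejects surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            value = lead & 0x07; trailing = 3;
            if (lead == 0xF0) lo = 0x90; // rejects overlong forms
            if (lead == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
        } else {
            *out++ = replacement;        // not a valid lead unit
            continue;
        }
        bool ok = true;
        for (; trailing > 0; --trailing) {
            if (first == last) { ok = false; break; }
            const unsigned char unit = static_cast<unsigned char>(*first);
            // An unexpected unit is left unconsumed so that the outer
            // loop can examine it as a possible lead unit.
            if (unit < lo || unit > hi) { ok = false; break; }
            ++first;
            value = static_cast<char32_t>((value << 6) | (unit & 0x3F));
            lo = 0x80; hi = 0xBF;
        }
        *out++ = ok ? value : replacement;
    }
    return out;
}

// Reads code units directly from a stream; no intermediate buffer of
// code units is needed.
std::u32string decode_stream(std::istream& in)
{
    std::u32string result;
    decode_utf8(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>(),
                std::back_inserter(result));
    return result;
}
```

The same decode_utf8 template also accepts pointers or container
iterators, so a contiguous fast path could be added later as a
specialization without changing callers.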

> I tried to avoid talking about exceptions in my write-up in order to
> avoid ratholing on the prominent point of contention between the C++
> committee and a substantial portion of C++ usage. While the committee
> is of the opinion that exceptions are part of the language, in
> practice there is substantial real-world C++ usage, including the
> major Web engines, that rejects exceptions, so throwing an exception
> aborts the program.

In my Overton Window herbceptions are already in the standard and
everybody is happy about their performance. Wishful thinking, I know.
But we'll see.

>
> Therefore, exceptions are an acceptable error signaling mechanism for
> the program itself being incorrect on the same level as a
> release-enabled assertion failing, but it's not an acceptable error
> signaling mechanism for errors in the input of the program. Therefore,
> in validity-checking APIs where the values potentially come from
> outside the program, there should be a way to check for errors in a
> non-exception-throwing manner, for example by returning std::optional.

Worst case scenario, it is possible to go the std::filesystem route,
where non-throwing overloads take std::error_code. But that's worst case.
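
For reference, the std::filesystem pattern applied to scalar value
validation might look roughly like the C++17 sketch below. The names
is_scalar_value and checked_scalar_value are hypothetical, and the choice
of std::errc::illegal_byte_sequence as the error value is an assumption
made for the example.

```cpp
#include <system_error>

// Hypothetical predicate: a Unicode scalar value is any code point in
// U+0000..U+10FFFF that is not a surrogate (U+D800..U+DFFF).
constexpr bool is_scalar_value(char32_t v) noexcept
{
    return v < 0xD800 || (v > 0xDFFF && v <= 0x10FFFF);
}

// Non-throwing overload: reports failure through ec, in the style of
// the std::filesystem overloads that take std::error_code&.
char32_t checked_scalar_value(char32_t v, std::error_code& ec) noexcept
{
    if (!is_scalar_value(v)) {
        ec = std::make_error_code(std::errc::illegal_byte_sequence);
        return 0xFFFD; // U+FFFD REPLACEMENT CHARACTER
    }
    ec.clear();
    return v;
}

// Throwing overload, implemented on top of the non-throwing one.
char32_t checked_scalar_value(char32_t v)
{
    std::error_code ec;
    const char32_t result = checked_scalar_value(v, ec);
    if (ec) {
        throw std::system_error(ec, "not a Unicode scalar value");
    }
    return result;
}
```

Code built without exceptions would call only the overload taking
std::error_code&, which is what makes this a workable fallback.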

> This repo seems to contain only a README with some high-level notes,
> but I don't see a design that could be evaluated. Specifically, I
> don't see how the assumed UTF-8 validity enforcement maintains
> encapsulation.

I only have wording for the base types and outlined plans for the higher
layers. In my own code I have most of the code unit and scalar value
levels done but haven't tested them yet.

The idea is to have different types for different levels of encapsulation.

>
> It does seem a bit odd to simultaneously take the position that
> charN_t is on its way out and to design for non-8 CHAR_BIT.

As far as I know, only systems with 8-bit, 16-bit and 32-bit bytes
actually support modern C++.

My opinion is that the charN_t types are unneeded because standard
library types are better than them. I think u8'A' should have the type
std::utf8_code_unit and not char8_t.

>
> On Fri, Apr 26, 2019 at 4:21 AM Steve Downey <sdowney_at_[hidden]> wrote:
>>
>> I agree that a scalar_value type would be useful for checked code, but I think contracts are the right mechanism for stating the pre and post conditions. Exceptions, particularly in the middle of text processing, would mostly insert unnecessary checks. Contracts would at least make that optional.
>
> If std::scalar_value is less ergonomic to type than char32_t and
> there's a good probability that its benefit gets turned off, what
> reason is there to believe that the ecosystem as a whole would be
> motivated to move to std::scalar_value?
>
> That is, I'm skeptical of designs whose safety can be toggled on or
> off on a very coarse-grained level instead of the safety properties
> being reliable at the point of the programmer using a given facility.

I'm not even sure how to write this code without checks... It seems a
bit pointless. If unchecked behavior is wanted, it seems there will need
to be a separate version of transcoding, not some compiler switch.

I'm currently focused on producing correctness-first classes. Other
customizations will come later.

>
>> I think Range and specializing on ContiguousRange will work, and span should be a model of CR.
>
> What does CR mean in this context?

ContiguousRange. Afaik std::span satisfies the concept of
ContiguousRange. If it doesn't, this should be fixed ASAP.

>
>> Encode, Decode, and Transcode should really be in terms of std::byte sequences, but with additional overloads for ergonomics.
>
> I'm getting off-topic here, but I don't understand the utility of
> std::byte. What problem does std::byte solve compared to uint8_t?
> Taking away the ability to ergonomically treat bytes as unsigned
> eight-bit integers seems like an anti-feature from the perspective of
> what programmers want to do with bytes. How well does std::byte
> compose with byte-oriented IO facilities these days?
>

std::uint8_t doesn't exist on systems with non-8-bit bytes. And there
are no sane byte-oriented I/O facilities in the standard library. I'm
working on a proposal to fix this:

https://github.com/Lyberta/cpp-io

Other than that, I don't see how std::byte would make transcoding
easier. I think the consensus right now is to use scalar values as
intermediate objects, so you only need to define functions that convert
from and to scalar values to get your custom encoding supported. And for
that I guess you only need a custom code unit type, not std::byte.
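
To make the "scalar values as intermediate objects" idea concrete, here
is a minimal C++17 sketch. The encoding types latin1, utf32 and utf8, the
member names decode_one and encode_one, and the transcode template are
all hypothetical and exist only for this illustration. The sketch assumes
well-formed input and does no validation.

```cpp
#include <iterator>
#include <string>

// Hypothetical custom encoding: each Latin-1 code unit maps directly to
// the scalar value U+0000..U+00FF.
struct latin1 {
    template <typename InputIt>
    static char32_t decode_one(InputIt& it, InputIt /*last*/)
    {
        return static_cast<char32_t>(static_cast<unsigned char>(*it++));
    }
};

// UTF-32: one code unit per scalar value.
struct utf32 {
    template <typename InputIt>
    static char32_t decode_one(InputIt& it, InputIt /*last*/)
    {
        return *it++;
    }
};

// UTF-8 encoder. Assumes v is a valid scalar value.
struct utf8 {
    template <typename OutputIt>
    static OutputIt encode_one(char32_t v, OutputIt out)
    {
        if (v < 0x80) {
            *out++ = static_cast<char>(v);
        } else if (v < 0x800) {
            *out++ = static_cast<char>(0xC0 | (v >> 6));
            *out++ = static_cast<char>(0x80 | (v & 0x3F));
        } else if (v < 0x10000) {
            *out++ = static_cast<char>(0xE0 | (v >> 12));
            *out++ = static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (v & 0x3F));
        } else {
            *out++ = static_cast<char>(0xF0 | (v >> 18));
            *out++ = static_cast<char>(0x80 | ((v >> 12) & 0x3F));
            *out++ = static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            *out++ = static_cast<char>(0x80 | (v & 0x3F));
        }
        return out;
    }
};

// Generic transcoding: every source encoding only has to produce scalar
// values, and every target encoding only has to consume them.
template <typename From, typename To, typename InputIt, typename OutputIt>
OutputIt transcode(InputIt first, InputIt last, OutputIt out)
{
    while (first != last) {
        out = To::encode_one(From::decode_one(first, last), out);
    }
    return out;
}
```

With this shape, supporting a new encoding means writing one decode_one
and one encode_one; for example,
transcode<latin1, utf8>(in.begin(), in.end(), std::back_inserter(out))
needs no Latin-1-to-UTF-8 specific code.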

Received on 2019-04-27 13:15:50