sg16: Re: [SG16-Unicode] It’s Time to Stop Adding New Features for Non-Unicode Execution Encodings in C++

From: Lyberta <lyberta_at_[hidden]>
Date: Fri, 26 Apr 2019 00:42:00 +0000

Henri Sivonen:
> At the turn of the year, I commented on Text_view on Slack, and Tom
> Honermann asked me to write my comments up in long form to this
> mailing list. I have now written my comments at
> https://hsivonen.fi/non-unicode-in-cpp/ (also pasted below for on-list
> quotability.) My apologies for it taking me this long to get this
> written.
I have a few comments:

char32_t is too dumb. It can hold surrogate code points or values
greater than 0x10FFFF. We are not C; we can use strong types that can't
hold those values. I proposed std::unicode::scalar_value, which will
throw an exception on an invalid value.
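
As a rough illustration of the idea (not the proposal's actual definition), a strong type that refuses surrogates and values above 0x10FFFF might look like the sketch below. The namespace `sketch`, the member names, the helper `rejects` and the choice of `std::out_of_range` are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

namespace sketch {

// Hypothetical strong type: it can only hold Unicode scalar values,
// i.e. U+0000..U+D7FF and U+E000..U+10FFFF.
class scalar_value {
public:
    constexpr explicit scalar_value(std::uint32_t v) : value_(v) {
        if (!is_valid(v)) {
            throw std::out_of_range("not a Unicode scalar value");
        }
    }

    static constexpr bool is_valid(std::uint32_t v) noexcept {
        return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
    }

    constexpr std::uint32_t value() const noexcept { return value_; }

private:
    std::uint32_t value_;
};

// Helper for the usage below: true if construction from v throws.
inline bool rejects(std::uint32_t v) {
    try {
        scalar_value s(v);
        (void)s;
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

inline void usage() {
    assert(scalar_value(0x1F600).value() == 0x1F600);  // valid scalar value
    assert(rejects(0xD800));    // surrogate code point
    assert(rejects(0x110000));  // beyond U+10FFFF
}

}  // namespace sketch
```

Unlike char32_t, an object of this type that exists at all is known to hold a valid scalar value, so code further up the stack does not need to re-check it.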

UTF-8, UTF-16 and UTF-32 are equally easy to implement because they are
on the code unit level. The rest of the code works on higher levels and
is totally abstracted.

Ranges are better than std::span.

I think the default error handling should be throwing an exception. In
fact, my design for std::unicode::scalar_value_sequence will ensure that
a given sequence is well-formed inside the constructor and all mutating
operations. There will be an option to skip validation, though.

std::basic_string and std::basic_string_view are a temporary hack and
will be replaced by std::unicode::code_unit_sequence in my proposal.

See the first part of my design:
https://github.com/Lyberta/cpp-unicode-fundamental

I do agree that char and wchar_t must die, but I think that char8_t,
char16_t and char32_t are also hacks. We don't need dumb fundamental
types when we can define strong user-defined types.

I hope to see [w]char[NN_t] completely removed by 2050.

For now, I have std::unicode::utf32_code_unit and
std::unicode::scalar_value as different types, mostly because they are
on different levels of abstraction. After the GCC 9 release I'm gonna
port my entire codebase to be char8_t based and then I'm gonna have
proper usage experience with my proposed design.

The way from the code unit level to the scalar value level is giving
std::unicode::code_unit_sequence[_view] to the constructor of
std::unicode::scalar_value_sequence[_view]. It throws an exception by
default on an ill-formed sequence. But I expect most users to use
std::text (that will be different in my proposal) and never go down to
those low levels in practice.
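
A minimal sketch of that transition, assuming UTF-8 input: the constructor decodes the code units and throws if they are ill-formed. Everything here is hypothetical (the namespace `sketch`, `decode_utf8`, the use of std::vector as a stand-in for code_unit_sequence, and std::invalid_argument as the exception type), and the skip-validation option is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace sketch {

// Decodes UTF-8 code units into scalar values. Returns false on any
// ill-formed sequence (overlong forms, encoded surrogates, values above
// U+10FFFF, stray or missing continuation units).
inline bool decode_utf8(const std::vector<std::uint8_t>& in,
                        std::vector<char32_t>& out) {
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t lead = in[i];
        if (lead <= 0x7F) {  // single code unit (ASCII)
            out.push_back(lead);
            ++i;
            continue;
        }
        std::size_t len = 0;
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the 2nd unit
        char32_t cp = 0;
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2; cp = static_cast<char32_t>(lead & 0x1F);
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3; cp = static_cast<char32_t>(lead & 0x0F);
            if (lead == 0xE0) lo = 0xA0;  // reject overlong forms
            if (lead == 0xED) hi = 0x9F;  // reject encoded surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4; cp = static_cast<char32_t>(lead & 0x07);
            if (lead == 0xF0) lo = 0x90;  // reject overlong forms
            if (lead == 0xF4) hi = 0x8F;  // reject values above U+10FFFF
        } else {
            return false;  // invalid lead unit
        }
        if (n - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const std::uint8_t b = in[i + k];
            if (b < lo || b > hi) return false;
            cp = (cp << 6) | static_cast<char32_t>(b & 0x3F);
            lo = 0x80; hi = 0xBF;  // later units use the full range
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Hypothetical scalar-value-level type: constructing it from code units
// validates them and throws by default on ill-formed input.
class scalar_value_sequence {
public:
    explicit scalar_value_sequence(const std::vector<std::uint8_t>& utf8) {
        if (!decode_utf8(utf8, values_)) {
            throw std::invalid_argument("ill-formed UTF-8");
        }
    }
    const std::vector<char32_t>& values() const noexcept { return values_; }

private:
    std::vector<char32_t> values_;
};

inline void usage() {
    const std::vector<std::uint8_t> euro{0xE2, 0x82, 0xAC};       // U+20AC
    const std::vector<std::uint8_t> surrogate{0xED, 0xA0, 0x80};  // U+D800
    assert(scalar_value_sequence(euro).values().front() == U'\u20AC');
    bool threw = false;
    try {
        scalar_value_sequence bad(surrogate);
        (void)bad;
    } catch (const std::invalid_argument&) {
        threw = true;
    }
    assert(threw);
}

}  // namespace sketch
```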

C++ constexpr can statically prevent programs from putting invalid
values into a UTF-32 buffer. That's the whole point of strong types.
Forget about char32_t.
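
To make the constexpr point concrete, here is a small sketch under assumed names (`sketch::scalar_value`, `hi_buffer`). When the constructor runs during constant evaluation, reaching the `throw` makes the program ill-formed, so an invalid constant is rejected by the compiler. Values that are only known at run time would still go through the same check and get the exception instead.

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>

namespace sketch {

// Hypothetical checked type, reduced to the part relevant to constexpr.
class scalar_value {
public:
    constexpr explicit scalar_value(std::uint32_t v) : value_(v) {
        if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
            throw std::out_of_range("not a Unicode scalar value");
        }
    }
    constexpr std::uint32_t value() const noexcept { return value_; }

private:
    std::uint32_t value_;
};

// A UTF-32 buffer whose element type cannot represent invalid values.
// Every element is checked while the compiler evaluates the initializer.
constexpr std::array<scalar_value, 3> hi_buffer{{
    scalar_value(0x48), scalar_value(0x69), scalar_value(0x1F600)}};

static_assert(hi_buffer[2].value() == 0x1F600, "checked at compile time");

// Uncommenting the next line is a compile error, because the throw is
// reached during constant evaluation:
// constexpr scalar_value bad(0xD800);

}  // namespace sketch
```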

Transcoding will be range-based as soon as I get ranges in libstdc++ so
I can finish my design. All the gains of std::span will be preserved by
having an implementation-defined overload that takes a ContiguousRange.

Received on 2019-04-26 02:42:31